AngellusMortis closed this issue 10 months ago.
@AngellusMortis, thanks for providing the logs to my colleague. I have updated the action; please re-run and let us know the outcome.
Still happening.
2023-09-22 18:30:36,280 WARNING [opentelemetry.exporter.otlp.proto.grpc.exporter] [exporter.py:293] [trace_id=0 span_id=0 resource.service.name= trace_sampled=False] - Transient error StatusCode.DEADLINE_EXCEEDED encountered while exporting logs to otlp.nr-data.net:4318, retrying in 1s.
2023-09-22 18:30:36,280 INFO [job_logger] [exporter.py:137] [trace_id=029411f4cb7e2b77d11e31fe7144fbab span_id=14828838b8f2ac46 resource.service.name= trace_sampled=True] -
[... ~40 identical job_logger INFO lines elided ...]
2023-09-22 18:30:36,290 WARNING [opentelemetry.exporter.otlp.proto.grpc.exporter] [exporter.py:293] [trace_id=0 span_id=0 resource.service.name= trace_sampled=False] - Transient error StatusCode.UNAVAILABLE encountered while exporting logs to otlp.nr-data.net:4318, retrying in 8s.
2023-09-22 18:30:36,310 WARNING [opentelemetry.exporter.otlp.proto.grpc.exporter] [exporter.py:293] [trace_id=0 span_id=0 resource.service.name= trace_sampled=False] - Transient error StatusCode.DEADLINE_EXCEEDED encountered while exporting logs to otlp.nr-data.net:4318, retrying in 1s.
2023-09-22 18:30:36,317 WARNING [opentelemetry.exporter.otlp.proto.grpc.exporter] [exporter.py:293] [trace_id=0 span_id=0 resource.service.name= trace_sampled=False] - Transient error StatusCode.UNAVAILABLE encountered while exporting logs to otlp.nr-data.net:4318, retrying in 4s.
2023-09-22 18:30:36,320 WARNING [opentelemetry.exporter.otlp.proto.grpc.exporter] [exporter.py:293] [trace_id=0 span_id=0 resource.service.name= trace_sampled=False] - Transient error StatusCode.UNAVAILABLE encountered while exporting logs to otlp.nr-data.net:4318, retrying in 8s.
[... 7 further identical UNAVAILABLE warnings elided ...]
I do not know if they provided both sets of logs. One set was for this issue (the logs never getting exported and eventually timing out after 30 minutes) and the other was for the syntax error in #5.
EDIT: I gave Bashayir some more logs from the latest run of it.
Yes, the syntax error was fixed. For the timeout, I added more logging, but it seems to still be going into a retry loop without erroring out. Are you able to share the full set of logs and your CI.yaml config, as well as the logs (stdout) from it? I will try to replicate the setup and see if we get the same error.
The workflow is literally the example one in the README. The only things I added were `continue-on-error: true` on the main job, to suppress failure notifications, and `timeout-minutes: 30` on the export step, so that it does not run forever.
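For concreteness, that setup would look roughly like this. This is only a sketch: the trigger and action reference follow the README's pattern, and the actual key/token wiring is elided.

```yaml
# Sketch only: the README-style workflow plus the two additions described above.
name: new-relic-exporter

on:
  workflow_run:
    workflows: ['*']
    types: [completed]

jobs:
  new-relic-exporter:
    runs-on: ubuntu-latest
    continue-on-error: true        # suppress failure notifications on this job
    steps:
      - name: new-relic-exporter
        uses: newrelic-experimental/gha-new-relic-exporter@latest
        timeout-minutes: 30        # stop the export step if it hangs
        # (API key / token configuration as shown in the README goes here)
```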
Unfortunately, I cannot share the full log dumps from our internal CI publicly. Bashayir said he would pass along the new logs, and I recommended trying to get you into our Slack Connect channel so we can troubleshoot in real time if you want.
I have improved the log batching process; please have another try. If you are still having issues, I will need access to live data/config, and we will arrange a debug session via Bashayir.
Is this still an issue on the new version?
It looks like it might be fixed. I do not see any errors from sampling. @Eleeleth, can you remove `continue-on-error` so we can start getting alerted on errors again?
The original timeout issue seems to be solved, but rarely (~1% of runs) we are seeing this error:
Traceback (most recent call last):
File "/app/exporter.py", line 140, in <module>
child_1.end(end_time=do_time(step['completed_at']))
File "/app/custom_parser/__init__.py", line 13, in do_time
return (int(round(time.mktime(parse(string).timetuple())) * 1000000000))
File "/usr/local/lib/python3.10/site-packages/pyrfc3339/parser.py", line 51, in parse
match = parse_re.match(timestamp)
TypeError: expected string or bytes-like object
I can make it a separate issue if you would rather for tracking purposes.
This is working for me. Can I add extra logging and have you re-run it, to see if I can catch the error on your end? I would also need to see your workflow.yaml configuration structure, to check whether anything differs from the standard structure the exporter expects.
Use `newrelic-experimental/gha-new-relic-exporter@debug1`.
Have you managed to give this a go? Is it working for you now?
DEBUG: {'id': ####, 'run_id': ####, 'workflow_name': 'A workflow', 'head_branch': 'master', 'run_url': 'https://api.github.com/repos/org/repo/actions/runs/####,', 'run_attempt': 1, 'node_id': 'XXXXX', 'head_sha': 'XXXXX', 'url': 'https://api.github.com/repos/org/repo/actions/jobs/####', 'html_url': 'https://github.com/org/repo/actions/runs/####/job/####', 'status': 'completed', 'conclusion': 'cancelled', 'created_at': '2023-11-04T07:22:40Z', 'started_at': '2023-11-04T07:22:44Z', 'completed_at': '2023-11-04T07:45:39Z', 'name': 'A job', 'steps': [{'name': 'step 1', 'status': 'completed', 'conclusion': 'success', 'number': 1, 'started_at': '2023-11-04T07:22:43.000Z', 'completed_at': '2023-11-04T07:22:52.000Z'}, {'name': 'step 2', 'status': 'queued', 'conclusion': None, 'number': 2, 'started_at': None, 'completed_at': None}, {'name': 'step 3', 'status': 'queued', 'conclusion': None, 'number': 4, 'started_at': None, 'completed_at': None}], 'check_run_url': 'https://api.github.com/repos/org/repo/check-runs/####', 'labels': ['self-hosted', 'linux', 'x64', 'common'], 'runner_id': ####, 'runner_name': 'XXXXX', 'runner_group_id': ####, 'runner_group_name': 'XXXX'}
Traceback (most recent call last):
File "/app/exporter.py", line 112, in <module>
child_1 = step_tracer.start_span(name=str(step['name']),start_time=do_time(step['started_at']),context=p_sub_context,kind=trace.SpanKind.CONSUMER)
File "/app/custom_parser/__init__.py", line 13, in do_time
return (int(round(time.mktime(parse(string).timetuple())) * 1000000000))
File "/usr/local/lib/python3.10/site-packages/pyrfc3339/parser.py", line 51, in parse
match = parse_re.match(timestamp)
TypeError: expected string or bytes-like object
It is a workflow that was cancelled due to a concurrency limit (a higher-priority job came in). But I imagine it would happen with any workflow that has steps that were not run for some reason, leaving `started_at` null.
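A minimal guard in the timestamp helper would avoid that crash. This is only a sketch: `do_time_safe` is a hypothetical name, and it uses the standard library instead of the `pyrfc3339` parser from the traceback so the example is self-contained.

```python
from datetime import datetime, timezone

def do_time_safe(value, fallback_ns=0):
    """Convert a GitHub RFC 3339 timestamp to epoch nanoseconds.

    Steps that never ran (skipped, or cancelled while still queued) have
    started_at/completed_at set to None, which is exactly what crashes the
    real parser with "expected string or bytes-like object".
    """
    if not isinstance(value, str):
        return fallback_ns  # the step never started; caller decides the default
    # GitHub emits e.g. '2023-11-04T07:22:43.000Z'; fromisoformat on
    # Python 3.10 needs the 'Z' rewritten as an explicit UTC offset.
    dt = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return int(dt.timestamp() * 1_000_000_000)
```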
Good point. I am guessing any skipped step's start time will have to be set to the start_time of the workflow itself. Both start and end times will be the workflow's start_time, which means they will appear as zero-duration spans with status skipped in NR.
I have added logic for skipped steps to the latest version, let me know if this is working for you
This is the only error we have gotten since I switched back over to latest, about 16 hours ago. Devs have not really started developing today yet though, so we may still get more.
Traceback (most recent call last):
File "/app/exporter.py", line 121, in <module>
child_1 = step_tracer.start_span(name=str(step['name']),start_time=do_time(step_started_at),context=p_sub_context,kind=trace.SpanKind.CONSUMER)
File "/app/custom_parser/__init__.py", line 13, in do_time
return (int(round(time.mktime(parse(string).timetuple())) * 1000000000))
File "/usr/local/lib/python3.10/site-packages/pyrfc3339/parser.py", line 51, in parse
match = parse_re.match(timestamp)
TypeError: expected string or bytes-like object
(Also, for future reference: you should let me switch off the temp branch before deleting it. We had a bunch of failures from debug1 being deleted.)
I was not aware you were still using the debug1 tag, sorry about that. Do you have any information about the step that caused that failure/error? I assume it is due to `step_started_at`. The latest version currently ignores all steps with `step['conclusion']` = ignored.
It looks like it was another canceled step. Same reason, canceled due to higher priority job.
It does look like there were some steps that were skipped completely.
OK, I have added the cancelled status to be ignored, the same as skipped. Let me know if that works the next time it runs with the latest release.
Well, cancelled jobs/steps could still have started. Shouldn't you try to ingest them and then safely bail out if they have not started?
A good example of a case where cancelled jobs actually still run is during a matrix with fail-fast enabled.
They will still be ingested, but the start time will be the same as the workflow_run start time. I will review tomorrow the example you provided above
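That defaulting rule could be sketched like this. The helper name and shapes are illustrative, not the exporter's actual code: a step that never started gets the workflow's start time for both ends (a zero-duration span), while a cancelled step that did start keeps its real `started_at` and falls back only for a missing `completed_at`.

```python
def step_span_times(step, workflow_started_at):
    """Pick start/end timestamps for a step's span.

    - never started (skipped, or cancelled while queued): both ends default
      to the workflow's own start time, yielding a zero-duration span
    - started but cancelled mid-run: keep started_at; if completed_at is
      missing, close the span at its own start time
    """
    started = step.get("started_at") or workflow_started_at
    completed = step.get("completed_at") or started
    return started, completed
```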
Ah, okay. That sounds great then. I just wanted to make sure things were not being skipped. I will probably let you know again in a few days if there have been any new errors.
Two more, both look like a similar issue, just slightly different line numbers.
Traceback (most recent call last):
File "/app/exporter.py", line 108, in <module>
resource_attributes.update(create_resource_attributes(parse_attributes(step,""),GHA_SERVICE_NAME))
File "/app/custom_parser/__init__.py", line 56, in parse_attributes
obj_atts[new_Att_name]=do_time_ms(obj[attribute])
File "/app/custom_parser/__init__.py", line 16, in do_time_ms
return (int(round(time.mktime(parse(string).timetuple())) * 1000))
File "/usr/local/lib/python3.10/site-packages/pyrfc3339/parser.py", line 51, in parse
match = parse_re.match(timestamp)
TypeError: expected string or bytes-like object
and
Traceback (most recent call last):
File "/app/exporter.py", line 98, in <module>
child_0.set_attributes(create_resource_attributes(parse_attributes(job,"steps"),GHA_SERVICE_NAME))
File "/app/custom_parser/__init__.py", line 56, in parse_attributes
obj_atts[new_Att_name]=do_time_ms(obj[attribute])
File "/app/custom_parser/__init__.py", line 16, in do_time_ms
return (int(round(time.mktime(parse(string).timetuple())) * 1000))
File "/usr/local/lib/python3.10/site-packages/pyrfc3339/parser.py", line 51, in parse
match = parse_re.match(timestamp)
TypeError: expected string or bytes-like object
Nothing stands out about either workflow run. One of the failures is a workflow with a single job with only two steps, both of which succeeded. It would be nice if there were some logging in the output about which workflow/job/step is currently being processed, to help isolate what is causing the issue. Output that is always there, rather than just temporary debug logging.
I have added permanent logging entries as well as a DEBUG option -> https://github.com/newrelic-experimental/gha-new-relic-exporter#troubleshooting. I have also updated the parsing logic for attributes of cancelled/skipped steps, which should fix the issue you reported.
Closing this out, as there have been no notable issues in our workflows except the rate-limiting ones discussed in the other issue, which are unlikely to be related to this project.
I had to put a timeout (30 minutes) on the export action in our workflow. It looks like it runs into issues uploading the GHA logs.