Closed: XSAM closed this issue 1 month ago.
Now, the runner seems to work again. https://github.com/open-telemetry/opentelemetry-go/actions/runs/10715454343/job/29710949026
I am curious whether someone fixed the issue or the runner healed itself.
We haven't encountered any issue like this recently. I will close this for now.
Feel free to re-open if other people encounter similar issues.
It happened again:
https://github.com/open-telemetry/opentelemetry-go/actions/runs/10825818975
System.IO.IOException: No space left on device
https://github.com/open-telemetry/opentelemetry-go/actions/runs/10818056607/job/30012831798
Warning: Failed to restore: ENOSPC: no space left on device, write
cc @tylerbenson
also see https://cloud-native.slack.com/archives/C01NJ7V1KRC/p1725475267605189
Some job is generating a lot of 1 GB+ logs in the /tmp directory:
...
2024-09-09 INFO3 Load Generator Counter #0 batch_index=batch_5408 item_index=item_5408 a=test b=5 c=3 d=true
2024-09-09 INFO3 Load Generator Counter #0 batch_index=batch_5536 item_index=item_5536 a=test b=5 c=3 d=true
2024-09-09 INFO3 Load Generator Counter #0 batch_index=batch_5537 item_index=item_5537 a=test b=5 c=3 d=true
2024-09-09 INFO3 Load Generator Counter #0 batch_index=batch_5538 item_index=item_5538 a=test b=5 c=3 d=true
2024-09-09 INFO3 Load Generator Counter #0 batch_index=batch_5444 item_index=item_5444 a=test b=5 c=3 d=true
2024-09-09 INFO3 Load Generator Counter #0 batch_index=batch_5539 item_index=item_5539 a=test b=5 c=3 d=true
2024-09-09 INFO3 Load Generator Counter #0 batch_index=batch_5544 item_index=item_5544 a=test b=5 c=3 d=true
2024-09-09 INFO3 Load Generator Counter #0 batch_index=batch_5520 item_index=item_5520 a=test b=5 c=3 d=true
...
Perhaps the collector @codeboten?
Each job should really clean up the /tmp directory before or after executing. I'm not really sure how to enforce this better.
Alternatively, the TC could schedule a restart every week to ensure the /tmp directory is cleaned, perhaps on Sunday to reduce the risk of interrupting an active test.
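If the TC went that route, the weekly restart could be a single crontab entry on the runner host. This is only a sketch: the time, the shutdown command path, and the assumption that /tmp is cleared on boot (tmpfs or systemd-tmpfiles) all depend on how the machine is actually configured.

```shell
# Illustrative root crontab entry: reboot the runner host at 04:00 UTC every
# Sunday, clearing /tmp as a side effect (assumes /tmp is tmpfs or is cleared
# on boot by systemd-tmpfiles).
0 4 * * 0 /sbin/shutdown -r now "weekly maintenance reboot"
```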
For the time being, I followed this guide and added a script that executes find /tmp -user "ghrunner" -delete at the end of each job execution. We'll see if that helps.
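As a sketch, such a cleanup routine might look like the following. The ghrunner user name is taken from the command quoted above; the function name and the hook mechanism (self-hosted runners can run a script after every job via the ACTIONS_RUNNER_HOOK_JOB_COMPLETED setting) are assumptions about how this would be wired up, not the actual script on the machine.

```shell
#!/usr/bin/env bash
# Sketch of a post-job cleanup routine for the shared runner (names assumed).
set -u

# clean_tmp DIR OWNER: delete everything under DIR owned by OWNER, then report
# remaining disk usage for the job log. -mindepth 1 keeps DIR itself in place;
# errors from paths we cannot read are ignored rather than failing the job.
clean_tmp() {
  local dir="${1:-/tmp}" owner="${2:-ghrunner}"
  find "$dir" -mindepth 1 -user "$owner" -delete 2>/dev/null || true
  df -h "$dir"
}

# On the runner this would be invoked after every job, e.g. from the script
# pointed to by ACTIONS_RUNNER_HOOK_JOB_COMPLETED:
#   clean_tmp /tmp ghrunner
```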
@XSAM It should be fixed now, but please reconsider running your performance job so frequently. It looks like your job takes over an hour to run, which is entirely too long to run on every merge to main. Remember, this is a single instance shared by all OTel projects. You should either make it run in under 15 minutes or limit it to running only daily.
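One way to honor that request would be to move the benchmark workflow from a push trigger to a nightly schedule. The fragment below is illustrative only; the workflow name, cron time, and file path are placeholders, not the repository's actual configuration.

```yaml
# .github/workflows/benchmark.yml (illustrative fragment)
name: Benchmark
on:
  # Run once a day instead of on every merge to main.
  schedule:
    - cron: "0 3 * * *"   # 03:00 UTC daily
  # Keep a manual escape hatch for on-demand runs.
  workflow_dispatch:
```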
Affected Repository
https://github.com/open-telemetry/opentelemetry-go
Requested changes
Need to investigate the "Error: No space left on device" issue on this runner while initiating jobs: https://github.com/open-telemetry/opentelemetry-go/actions/runs/10705102088/job/29682643790
The runner fails the job before doing any tasks, and the Go SIG cannot resolve such a situation, as we lack context about the running environment and don't have access to the bare-metal machine.
Purpose
https://github.com/open-telemetry/opentelemetry-go needs a runnable benchmark runner to run benchmarks.
Repository Maintainers