open-telemetry / community

OpenTelemetry community content
https://opentelemetry.io

REQUEST: Repository maintenance on `Benchmark Bare Metal Runners` #2331

Closed. XSAM closed this issue 1 month ago

XSAM commented 2 months ago

Affected Repository

https://github.com/open-telemetry/opentelemetry-go

Requested changes

We need to investigate the "Error: No space left on device" failure this runner hits while initializing jobs: https://github.com/open-telemetry/opentelemetry-go/actions/runs/10705102088/job/29682643790

The runner fails the job before running any tasks, and the Go SIG cannot resolve this on its own: we lack context about the running environment and do not have access to the bare-metal machine.
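For anyone who does have access to the machine, a minimal diagnostic sketch (assuming a standard Linux host; the focus on /tmp is an assumption based on where runner jobs usually write scratch data):

# Overall filesystem usage on the runner host
df -h

# Largest files and directories under /tmp
du -ah /tmp 2>/dev/null | sort -rh | head -n 20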

Purpose

https://github.com/open-telemetry/opentelemetry-go needs a working benchmark runner on which to run its benchmarks.

Repository Maintainers

XSAM commented 1 month ago

Now, the runner seems to work again. https://github.com/open-telemetry/opentelemetry-go/actions/runs/10715454343/job/29710949026

I am curious whether someone fixed the issue or the runner healed itself.

XSAM commented 1 month ago

We haven't encountered any issue like this recently. I will close this for now.

Feel free to re-open if other people encounter similar issues.

XSAM commented 1 month ago

It happened again:

trask commented 1 month ago

cc @tylerbenson

also see https://cloud-native.slack.com/archives/C01NJ7V1KRC/p1725475267605189

tylerbenson commented 1 month ago

Some job is generating a lot of 1GB+ logs in the /tmp directory:

...
2024-09-09 INFO3 Load Generator Counter #0 batch_index=batch_5408 item_index=item_5408 a=test b=5 c=3 d=true
2024-09-09 INFO3 Load Generator Counter #0 batch_index=batch_5536 item_index=item_5536 a=test b=5 c=3 d=true
2024-09-09 INFO3 Load Generator Counter #0 batch_index=batch_5537 item_index=item_5537 a=test b=5 c=3 d=true
2024-09-09 INFO3 Load Generator Counter #0 batch_index=batch_5538 item_index=item_5538 a=test b=5 c=3 d=true
2024-09-09 INFO3 Load Generator Counter #0 batch_index=batch_5444 item_index=item_5444 a=test b=5 c=3 d=true
2024-09-09 INFO3 Load Generator Counter #0 batch_index=batch_5539 item_index=item_5539 a=test b=5 c=3 d=true
2024-09-09 INFO3 Load Generator Counter #0 batch_index=batch_5544 item_index=item_5544 a=test b=5 c=3 d=true
2024-09-09 INFO3 Load Generator Counter #0 batch_index=batch_5520 item_index=item_5520 a=test b=5 c=3 d=true
...

Perhaps the collector (@codeboten)? Each job should really clean up the /tmp directory before or after executing; I'm not really sure how to enforce this better (a rough sketch follows).
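One per-job option would be a final cleanup step in each workflow, something like the line below. This is only a sketch, and the "ghrunner" user name is an assumption about the account the runner service runs as:

# Remove temp files created by the runner's user account during this job
find /tmp -user "ghrunner" -type f -delete 2>/dev/null || true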

tylerbenson commented 1 month ago

Looks like this can be centralized with the runner's pre/post-job script hooks: https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/running-scripts-before-or-after-a-job#triggering-the-scripts
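Per that page, a hook is registered machine-wide by pointing an environment variable in the runner's .env file at a script; a sketch, with the script path assumed:

# .env in the self-hosted runner's root directory
ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/opt/actions-runner/cleanup-tmp.sh
# ACTIONS_RUNNER_HOOK_JOB_STARTED can point at a pre-job script instead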

tylerbenson commented 1 month ago

Alternatively, the TC could schedule a restart every week to ensure the /tmp directory is cleaned, perhaps on Sunday to reduce the risk of interrupting an active test.
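That could be a root crontab entry along these lines (a sketch; the exact time is arbitrary):

# Reboot the runner host every Sunday at 03:00 to clear /tmp
0 3 * * 0 /sbin/shutdown -r now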

tylerbenson commented 1 month ago

For the time being, I followed the guide above and added a script that executes find /tmp -user "ghrunner" -delete at the end of each job execution. We'll see if that helps.

tylerbenson commented 1 month ago

@XSAM It should be fixed now, but please reconsider running your performance job so frequently. It looks like your job takes over an hour, which is entirely too long to run on every merge to main. Remember, this is a single instance shared by all OTel projects. You should either make it run in under 15 minutes or limit it to running only once daily.