mlcommons / logging

MLPerf™ logging library
https://mlcommons.org/en/groups/best-practices-benchmark-infra
Apache License 2.0

Handle cases where it takes too long to process logs. #242

Closed emizan76 closed 2 years ago

emizan76 commented 2 years ago

I believe this happens mainly when the submission_division field appears many times in the log. It can appear as many times as there are accelerators.

This can be fixed by enforcing submission_* log lines to appear exactly once, or by processing them just once.
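A minimal sketch of the "process them just once" option, not the library's actual implementation. It assumes each log entry is a line starting with `:::MLLOG` followed by a JSON object with `key` and `value` fields; the function name `dedupe_submission_lines` is hypothetical. Repeats with the same value are dropped, and repeats with a different value are reported instead of being silently ignored.

```python
import json

MLLOG_PREFIX = ":::MLLOG"  # assumed prefix of structured log lines


def dedupe_submission_lines(lines):
    """Keep the first occurrence of each submission_* key and drop repeats.

    Returns (kept_lines, conflicts), where conflicts lists
    (key, first_value, later_value) for repeats whose value differs.
    """
    seen = {}       # submission_* key -> first value observed
    kept = []
    conflicts = []
    for line in lines:
        stripped = line.strip()
        if not stripped.startswith(MLLOG_PREFIX):
            kept.append(line)  # not a structured log line; pass through
            continue
        try:
            record = json.loads(stripped[len(MLLOG_PREFIX):])
        except json.JSONDecodeError:
            kept.append(line)  # malformed payload; leave it for other checks
            continue
        key = str(record.get("key", "")) if isinstance(record, dict) else ""
        if not key.startswith("submission_"):
            kept.append(line)
            continue
        value = record.get("value")
        if key not in seen:
            seen[key] = value
            kept.append(line)
        elif seen[key] != value:
            conflicts.append((key, seen[key], value))
        # identical repeat: dropped, so it is processed only once
    return kept, conflicts
```

Because the filtering happens in a single pass, the cost stays linear in the number of log lines regardless of how many accelerators or nodes emitted the same submission_* entry.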

We could also add a timer to the logging infra to flag cases where processing the submission package takes too long. The submission-ui has a timeout, so we should probably flag a slow package before it is submitted.
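The timer idea could look roughly like the sketch below. The names `run_with_budget` and `check_fn` and the budget value are all hypothetical; the actual submission-ui timeout is not stated here. The wrapper only measures and flags after the fact; it does not interrupt a slow check.

```python
import time


def run_with_budget(check_fn, path, budget_seconds, clock=time.monotonic):
    """Run check_fn(path) and report whether it exceeded budget_seconds.

    Returns (result, elapsed_seconds, over_budget). The check is never
    interrupted; the caller decides what to do with a slow package.
    """
    start = clock()
    result = check_fn(path)
    elapsed = clock() - start
    return result, elapsed, elapsed > budget_seconds


if __name__ == "__main__":
    # Stand-in for the real package check, which is not shown in this thread.
    def fake_check(path):
        time.sleep(0.01)
        return "ok"

    result, elapsed, over = run_with_budget(fake_check, "submission/", 60.0)
    if over:
        print(f"WARNING: processing took {elapsed:.1f}s, above the budget")
```

Setting the budget somewhat below the submission-ui timeout would let a submitter see the warning locally before uploading.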

This is quite important: one submitter's (Dell) package caused the submission-ui to time out, so they got no answer for their submission and did not know what to do. The problem was fixed by increasing the submission-ui timeout to a large value.

hanyunfan commented 2 years ago

> This can be fixed by enforcing submission_* log lines to appear exactly once, or by processing them just once.

"By processing them just once" is what we prefer. When we run with Singularity on multiple nodes, submission_division is printed by each node. If submission_division were required to appear exactly once, we would have to manually edit the logs to remove the extra copies. We can do that, but we want to avoid manually touching the logs when possible.