mantidproject / mantid-monitor-dashboard


Error during build when failed-test storage exceeds 100MB (GitHub single file limit) #3

Open DonaldChung-HK opened 1 year ago

DonaldChung-HK commented 1 year ago

This is the likely cause of a build failure when the dashboard builds successfully but the auto-commit (saving the new JSON file) fails or times out.

If any individual file exceeds 100MB, GitHub will refuse the push. This can happen either when there is a large influx of new test failures or when the total number of unique test failures builds up over time. It is especially likely when the tests are Python-based, as they tend to output larger stack traces.

Here are some ways to deal with this situation.

If it is just a one-off large influx of test failures:

  1. If they are only in the pull request pipeline, the system will return to normal after 1-2 days as the search range moves past the problematic builds.
  2. You can also manually delete the problematic builds.
  3. Manually mark is_completed as true in history/{pipeline_name}/{pipeline_name}_by_build_fail_pickle.json so that the search algorithm skips over the problematic builds.
  4. Add a safeguard that stops saving/updating stack traces once the number of failed tests exceeds a threshold (a large number of failures, say > 30, would indicate a catastrophic failure that will be dealt with immediately anyway); see the sketch after this list.
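A minimal sketch of what the safeguard in point 4 could look like; the function name, the threshold constant, and the record layout are illustrative assumptions, not the dashboard's actual schema:

```python
# Hypothetical safeguard: cap how much stack-trace data a single build can add.
MAX_FAILS_TO_STORE = 30  # assumed threshold; beyond this it is a catastrophic failure

def record_build_failures(history: dict, build_id: str, failures: list) -> None:
    """Store a build's failures, dropping stack traces when flooded."""
    if len(failures) > MAX_FAILS_TO_STORE:
        # Keep the test names but drop the bulky stack traces so the history
        # file cannot balloon past GitHub's 100MB single-file limit.
        failures = [
            {"name": f["name"], "stack_trace": "<omitted: catastrophic failure>"}
            for f in failures
        ]
    history[build_id] = failures
```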

If the test failures build up over time:

  1. Periodically clean up the history files, since old test failures are irrelevant.
  2. Create a mechanism where each test is checked against its latest detected fail date and removed once it has not been detected for a certain period (~60-180 days); see the sketch after this list.
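One way this retention mechanism could look, assuming the history file maps test names to records carrying an ISO-format latest_fail_date field (an assumption about the schema, not the actual layout):

```python
import json
from datetime import datetime, timedelta

RETENTION_DAYS = 90  # somewhere in the ~60-180 day window suggested above

def prune_stale_failures(path: str) -> None:
    """Remove tests whose latest detected fail date is older than the cutoff."""
    cutoff = datetime.now() - timedelta(days=RETENTION_DAYS)
    with open(path) as f:
        history = json.load(f)  # assumed shape: {test_name: {"latest_fail_date": ...}}
    history = {
        test: record
        for test, record in history.items()
        if datetime.fromisoformat(record["latest_fail_date"]) >= cutoff
    }
    with open(path, "w") as f:
        json.dump(history, f)
```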

Finally, we could add a process that splits files exceeding 100MB and joins them back together before processing (sketched below). However, you might run out of memory if the files get too big, as Python is inefficient at reading JSON and uses roughly 7-8x the memory of the JSON file size.
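A byte-level split/join sketch (illustrative only; the chunk size, part naming, and function names are assumptions). Because it never parses the JSON, the split/join step itself sidesteps the 7-8x parsing overhead:

```python
from pathlib import Path

CHUNK_SIZE = 90 * 1024 * 1024  # stay safely below GitHub's 100MB hard limit

def split_file(path: str) -> None:
    """Split a large file into numbered .partNN chunks and delete the original."""
    src = Path(path)
    with src.open("rb") as f:
        for i, chunk in enumerate(iter(lambda: f.read(CHUNK_SIZE), b"")):
            (src.parent / f"{src.name}.part{i:02d}").write_bytes(chunk)
    src.unlink()

def join_file(path: str) -> None:
    """Reassemble the original file from its .partNN chunks before processing."""
    dst = Path(path)
    with dst.open("wb") as f:
        for part in sorted(dst.parent.glob(dst.name + ".part*")):
            f.write(part.read_bytes())
            part.unlink()
```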

DonaldChung-HK commented 1 year ago

I think we should create some sort of cleaning Python script, e.g. one that trims a JSON object based on size, build number, and latest detected fail date, so that we can keep the file size in check and retire old data.
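For the size/build-number angle, such a script might look like the following (a sketch; the size cap and the assumption that the file maps build numbers to failure lists are mine, not the actual format):

```python
import json

MAX_FILE_BYTES = 50 * 1024 * 1024  # assumed cap, well under GitHub's 100MB limit

def trim_history_by_size(path: str) -> None:
    """Drop the oldest builds until the serialised history fits the cap."""
    with open(path) as f:
        history = json.load(f)  # assumed shape: {build_number_str: [failure records]}
    builds = sorted(history, key=int)  # oldest build numbers first
    # Re-serialising on each pass is slow but simple; fine for an offline cleanup job.
    while builds and len(json.dumps(history)) > MAX_FILE_BYTES:
        history.pop(builds.pop(0))
    with open(path, "w") as f:
        json.dump(history, f)
```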

DonaldChung-HK commented 1 year ago

So the problem with the high influx of failures is that, for the by-build history files, we save all the stack traces instead of just the latest one (to prevent actual loss of data, since the updating mechanism is not very robust). We were flooded with 10 × 120 failures per pipeline, which blew the file size up to over 150MB, when usually it is around 2MB per file.

I have implemented two bash scripts in #5 that break the files down if they are too big (~90MB) and join them back together when the GitHub Actions run.