Closed: jwokaty closed this issue 1 year ago.
@sdgamboa I think export_bugphyzz.R might be using more memory than is allocated to the default GitHub-hosted runner. The script finishes on supermicro, but the GitHub Action exits with code 137: https://github.com/waldronlab/bugphyzzExports/actions/runs/5931082162. The default GitHub runners are limited to 7 GB of RAM: https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources.
I haven't been successful in determining the actual memory requirement for the script, but I thought I should ask first if there was an issue with the script that was causing the problem. If there is no problem, maybe we can reduce the memory requirement or look for another solution.
@jwokaty, I'm not aware of any issue with the script yet (it runs and I get the expected results). I'm not entirely sure how memory works with multiple threads, or even whether the GH runner has several threads, but maybe adjusting the number of workers would reduce memory use? I've been using bplapply with a large object, which I think is the part that uses the most memory.
Or maybe, we could divide the script into several ones?
In GitHub Actions, it is using 2 cores. Could it use 1 core and finish within 6 hours, which I think is the limit for a GHA?
Do you have an idea of how to divide the script?
If this isn't easy to do, we have a few options:
"Do you have an idea of how to divide the script?" R: I think I could write a generic Rscript that would accept an argument from the command line. So it would run a single physiology at a time from the command line. Physiology terms could be stored in a text file. I think this way, the bash session would close the R session and start a new one for each physiology. Although. I would need an extra script to merge the outputs into a single dump file and generate the GMT files.
As for the options you suggested, I'm inclined toward the last one ("We could also run another script on supermicro to run the script and push the files to the repo"). It sounds like the easiest to implement, and it would be free.
Would the "divided script" work? Or should we test it to determine if it will be within the constraints of the default GitHub runner? For example, if the physiology with the most values required the most memory, could I run the current script with only that physiology to determine if the divided script is a viable option?
The script by default uses bplapply to use all available cores on a multicore machine - on GHA this is two cores, which may double the amount of memory in use. Unless the script runs too long for GHA (more than 6 hours?), simply using one core might solve an out-of-memory problem.
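If the parallelism comes from BiocParallel, a minimal sketch of forcing a single worker (assuming the script's bplapply calls either use the registered default backend or accept a BPPARAM argument):

```r
library(BiocParallel)

## Option 1: make the registered default backend serial
## (one worker, no forking), so existing bplapply calls pick it up.
register(SerialParam())

## Option 2: pass the backend explicitly at each call site.
res <- bplapply(1:3, function(i) i^2, BPPARAM = SerialParam())
unlist(res)
```

With `SerialParam()` the loop runs in the main R process, so memory is not duplicated across workers.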
Running this line, it appears that these calls to checkTaxonName do not consume any meaningful CPU cycles; the slowness seems to come from the response time of taxid2rank (presumably API calls, download time, or disk access). If there is little CPU use, then bplapply won't provide any benefit and may have drawbacks like increased memory usage. Can you clarify what advantage the use of parallel jobs provides here?
In my own benchmarking (just using system.time()), I found that the first two bplapply instances run almost instantly with a simple for loop, and the third takes up to a minute per iteration, but almost none of that is system (CPU) time, so bplapply won't help. You should avoid parallelism unless it provides a clear advantage, because it increases memory usage and makes troubleshooting more difficult. Also, I think a for loop in these instances is much preferable to lapply or bplapply because you can print intermediate progress messages, and it should be more memory-efficient too: memory is cleaned up at each iteration rather than all elements having to be collected before the list is constructed. I'll provide a pull request with my suggested changes.
[1] "Propagating acetate producing at 2023-09-04 12:25:49.591593"
user system elapsed
15.102 0.310 15.412
[1] "Propagating aerophilicity at 2023-09-04 12:26:05.998356"
user system elapsed
61.203 1.166 62.373
[1] "Propagating animal pathogen at 2023-09-04 12:27:09.372295"
user system elapsed
37.954 0.444 38.415
[1] "Propagating antimicrobial resistance at 2023-09-04 12:27:48.795001"
user system elapsed
53.264 1.150 54.432
[1] "Propagating antimicrobial sensitivity at 2023-09-04 12:28:44.309908"
user system elapsed
28.179 0.575 28.775
[1] "Propagating arrangement at 2023-09-04 12:29:14.083124"
user system elapsed
36.336 0.703 37.058
[1] "Propagating biofilm forming at 2023-09-04 12:29:52.20434"
user system elapsed
24.357 0.493 24.856
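A minimal sketch of the suggested for-loop pattern, matching the log style above. `propagate()` and the physiology vector here are placeholders for the real function and data:

```r
## Placeholders for the real propagation function and physiology names.
propagate <- function(x) toupper(x)
physiologies <- c("aerophilicity", "biofilm forming")

results <- vector("list", length(physiologies))
names(results) <- physiologies
for (phys in physiologies) {
  ## Intermediate progress message, something bplapply cannot print per item.
  print(paste("Propagating", phys, "at", Sys.time()))
  t <- system.time(results[[phys]] <- propagate(phys))
  print(t)
}
```

Each iteration's temporaries become eligible for garbage collection before the next iteration starts, which is the memory advantage over collecting everything through bplapply at once.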
@lwaldron I see it's no longer failing with an error 137. Thanks! (I'll fix the new issue in another issue.)
GitHub Action workflow fails with exit code 137, which may be an out-of-memory error.
We need to confirm this is the problem. If it is, we need to reduce the script's memory requirement. See https://github.com/waldronlab/bugphyzzExports/actions/runs/5931082162/job/16082181204 and https://www.airplane.dev/blog/exit-code-137.
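For reference, exit code 137 is 128 + 9, i.e. the process was terminated by SIGKILL, which is the signal the Linux OOM killer sends. A quick way to reproduce the code locally:

```shell
# 137 = 128 + SIGKILL(9): the exit code of a process killed with SIGKILL,
# e.g. by the kernel OOM killer.
sh -c 'kill -9 $$' || status=$?
echo "exit status: ${status}"  # prints "exit status: 137"
```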