Closed: jwokaty closed this issue 1 year ago.
@sdgamboa I think export_bugphyzz.R might be using more memory than is allocated to the default GitHub-hosted runner. The script finishes on supermicro, but the GitHub Action exits with code 137: https://github.com/waldronlab/bugphyzzExports/actions/runs/5931082162. The default GitHub runners are limited to 7 GB of RAM: https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources.
I haven't been successful in determining the actual memory requirement for the script, but I thought I should ask first if there was an issue with the script that was causing the problem. If there is no problem, maybe we can reduce the memory requirement or look for another solution.
@jwokaty, I'm not aware of any issue with the script yet (it runs and I get the expected results). I'm not entirely sure how memory works with multiple threads, or even whether the GH runner has several threads, but maybe adjusting the number of workers would reduce memory use? I've been using bplapply with a large object, which I think is the part that uses the most memory.
Or maybe, we could divide the script into several ones?
In GitHub Actions, it is using 2 cores. Could it use 1 core and finish within 6 hours, which I think is the limit for a GHA?
Do you have an idea of how to divide the script?
If this isn't easy to do, we have a few options:
"Do you have an idea of how to divide the script?" R: I think I could write a generic Rscript that would accept an argument from the command line. So it would run a single physiology at a time from the command line. Physiology terms could be stored in a text file. I think this way, the bash session would close the R session and start a new one for each physiology. Although. I would need an extra script to merge the outputs into a single dump file and generate the GMT files.
As for the options you suggested, I'm inclined toward the last one ("We could also run another script on supermicro to run the script and push the files to the repo"). It sounds like the easiest to implement, and it would be free.
Would the "divided script" work? Or should we test it to determine if it will be within the constraints of the default GitHub runner? For example, if the physiology with the most values required the most memory, could I run the current script with only that physiology to determine if the divided script is a viable option?
The script by default uses bplapply to use all available cores on a multicore machine - on GHA this is two cores, which may double the amount of memory in use. Unless the script runs too long for GHA (more than 6 hours?), simply using one core might solve an out-of-memory problem.
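If the parallelism comes from BiocParallel, a minimal sketch of forcing a single worker (assuming the script's bplapply calls either use the registered default backend or accept a BPPARAM argument):

```r
library(BiocParallel)

## Option 1: make the registered default backend serial
## (one worker, no forking), so existing bplapply calls pick it up.
register(SerialParam())

## Option 2: pass the backend explicitly at each call site.
res <- bplapply(1:3, function(i) i^2, BPPARAM = SerialParam())
unlist(res)
```

With `SerialParam()` the loop runs in the main R process, so memory is not duplicated across workers.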
Running this line, it appears that these calls to checkTaxonName do not consume any meaningful CPU cycles; the slowness seems to come from the response time of taxid2rank (presumably API calls, download time, or disk access). If there is little CPU use, then bplapply won't provide any benefit and may have drawbacks like increased memory usage. Can you clarify what advantage the use of parallel jobs provides here?
In my own benchmarking (just using system.time()), I found that the first two bplapply instances run almost instantly with a simple for loop, and the third takes up to a minute per iteration, but almost none of that is system (CPU) time, so bplapply won't help. You should avoid parallelism unless it provides a clear advantage, because it increases memory usage and makes troubleshooting more difficult. Also, I think a for loop in these instances is much preferable to lapply or bplapply because you can print intermediate progress messages, and it should be more memory-efficient too: memory is cleaned up at each iteration rather than all elements having to be collected before the list is constructed. I'll provide a pull request with my suggested changes.
[1] "Propagating acetate producing at 2023-09-04 12:25:49.591593"
user system elapsed
15.102 0.310 15.412
[1] "Propagating aerophilicity at 2023-09-04 12:26:05.998356"
user system elapsed
61.203 1.166 62.373
[1] "Propagating animal pathogen at 2023-09-04 12:27:09.372295"
user system elapsed
37.954 0.444 38.415
[1] "Propagating antimicrobial resistance at 2023-09-04 12:27:48.795001"
user system elapsed
53.264 1.150 54.432
[1] "Propagating antimicrobial sensitivity at 2023-09-04 12:28:44.309908"
user system elapsed
28.179 0.575 28.775
[1] "Propagating arrangement at 2023-09-04 12:29:14.083124"
user system elapsed
36.336 0.703 37.058
[1] "Propagating biofilm forming at 2023-09-04 12:29:52.20434"
user system elapsed
24.357 0.493 24.856
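A minimal sketch of the suggested for-loop pattern, matching the log style above. `propagate()` and the physiology vector here are placeholders for the real function and data:

```r
## Placeholders for the real propagation function and physiology names.
propagate <- function(x) toupper(x)
physiologies <- c("aerophilicity", "biofilm forming")

results <- vector("list", length(physiologies))
names(results) <- physiologies
for (phys in physiologies) {
  ## Intermediate progress message, something bplapply cannot print per item.
  print(paste("Propagating", phys, "at", Sys.time()))
  t <- system.time(results[[phys]] <- propagate(phys))
  print(t)
}
```

Each iteration's temporaries become eligible for garbage collection before the next iteration starts, which is the memory advantage over collecting everything through bplapply at once.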
@lwaldron I see it's no longer failing with an error 137. Thanks! (I'll fix the new issue in another issue.)
GitHub Action workflow fails with exit code 137, which may be an out-of-memory error.
We need to confirm this is the problem. If it is, we need to reduce the script's memory requirement. See https://github.com/waldronlab/bugphyzzExports/actions/runs/5931082162/job/16082181204 and https://www.airplane.dev/blog/exit-code-137.
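For reference, exit code 137 is 128 + 9, i.e. the process was terminated by SIGKILL, which is the signal the Linux OOM killer sends. A quick way to reproduce the code locally:

```shell
# 137 = 128 + SIGKILL(9): the exit code of a process killed with SIGKILL,
# e.g. by the kernel OOM killer.
sh -c 'kill -9 $$' || status=$?
echo "exit status: ${status}"  # prints "exit status: 137"
```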