waldronlab / bugphyzzExports

Improve memory efficiency #25

Closed lwaldron closed 8 months ago

lwaldron commented 1 year ago

This particular chain, which contains only dplyr::mutate calls, seems to spike memory usage temporarily to over 30 GB (this might be much lower on a lower-memory machine; it is just what I observed in top). The whole script needs to be more memory-efficient to work on GHA.

https://github.com/waldronlab/bugphyzzExports/blob/a9fc18914cb3b1d9ea3a3d1c0121ccac5c8d482a/inst/scripts/export_bugphyzz.R#L281

lwaldron commented 1 year ago

Additionally:

> print(object.size(propagated), units = "Gb")
2.9 Gb

and as soon as full_dump_with_0 is created:

> print(object.size(full_dump_with_0), units = "Gb")
2.7 Gb

so we need to do some cleanup of large objects sitting in memory.
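
A minimal sketch of that kind of cleanup in base R. The names propagated and full_dump_with_0 come from the script, but the data and the derivation step below are small placeholders, not the real code:

```r
# Stand-in for the large intermediate object (the real one is ~2.9 Gb).
propagated <- data.frame(id = seq_len(1e5), score = runif(1e5))
print(object.size(propagated), units = "Mb")

# Placeholder for whatever step derives the next object from it.
full_dump_with_0 <- propagated[propagated$score > 0, ]

# Once the intermediate is no longer needed, drop the binding and
# trigger a garbage collection so the memory can be released.
rm(propagated)
invisible(gc())

print(exists("propagated"))
```

R collects garbage automatically, so the explicit gc() call is not strictly required; it mainly makes the release happen at a known point before the next large allocation.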

lwaldron commented 1 year ago

https://github.com/waldronlab/bugphyzzExports/blob/a9fc18914cb3b1d9ea3a3d1c0121ccac5c8d482a/inst/scripts/export_bugphyzz.R#L313

is a memory-intensive calculation that has already been done for full_dump_with_0

lwaldron commented 1 year ago

Multiple changes made for memory efficiency in https://github.com/waldronlab/bugphyzzExports/pull/26/commits. I will go ahead and merge so that it will be tested in GHA.

lwaldron commented 1 year ago

Keep an eye on https://github.com/waldronlab/bugphyzzExports/actions/runs/6076545052 but this should finish within the GHA time limit. Note the outputs of system.time() like this:

   user  system elapsed 
113.827   0.104 113.949 

show that very little of the time spent in propagation is system time (CPU time spent in the kernel), in this case 0.104s out of 113.949s elapsed; nearly all of the rest is reported as user time. I assume that most of the time is spent doing NCBI lookups or something similar, and that if that bottleneck could be eliminated, propagation would take a fraction of a second per attribute. The current runtime is still feasible though, and the priority is on implementing a "real" ASR (ancestral state reconstruction) method that provides probabilities or confidence intervals.
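
For reference, a small sketch of how the three system.time() columns behave. "user" is CPU time spent in the R process's own code, "system" is CPU time spent in the kernel on its behalf, and "elapsed" is wall-clock time. The two workloads are arbitrary stand-ins, not the propagation code:

```r
# Waiting (a stand-in for e.g. a network call): elapsed time grows,
# but almost no CPU time is charged to the process.
waiting <- system.time(Sys.sleep(0.5))

# Computing: CPU-bound work in R code is charged mostly to "user" time.
computing <- system.time(for (i in 1:2e6) sqrt(i))

print(waiting)
print(computing)
```

A step dominated by waiting shows elapsed time well above user + system, while a CPU-bound step shows user time close to elapsed.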

lwaldron commented 1 year ago

It made it almost to the end and then died with error code 137 (out of memory) while writing to disk. A little more cleanup should probably be enough.

https://github.com/waldronlab/bugphyzzExports/actions/runs/6076545052/job/16484731212

lwaldron commented 1 year ago

Just putting pryr::mem_change statements around each line in the loop where error 137 is still occurring, I see that the following line is the one that requires a lot of memory:

https://github.com/waldronlab/bugphyzzExports/blob/79ad45812da975e0a9ed206835eca7c5bcd42f80/inst/scripts/export_bugphyzz.R#L294

So I am trying it without that line in c0f1166.
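
A rough sketch of that measuring approach, assuming the pryr package is installed. The statements being measured are hypothetical stand-ins for the lines in the loop:

```r
# pryr::mem_change() evaluates an expression and reports the net change
# in memory used by R. The guard skips the example if pryr is missing.
if (requireNamespace("pryr", quietly = TRUE)) {
  # Hypothetical loop body, measured one statement at a time.
  delta_alloc <- pryr::mem_change(x <- numeric(1e6))  # allocates about 8 MB
  delta_free  <- pryr::mem_change(rm(x))              # releases it again
  print(delta_alloc)
  print(delta_free)
}
```

Note that mem_change() reports the net difference before and after the expression, so memory that is allocated and freed again within a single statement will not show up in the result.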