Closed: rob-p closed this issue 2 years ago
Hi @rob-p,
I've found out that using /usr/bin/time to get the max resident set size is misleading for programs using external memory (but it is fine for programs running entirely in internal memory). Put simply, the reason is that max RSS mixes the memory actually used by the process with cached memory written to files. The proper way to get a measurement is via the free command (which displays the same information as the graphical htop command). An example of free -mh:
total used free shared buff/cache available
Mem: 62Gi 1.8Gi 18Gi 1.0Mi 43Gi 60Gi
which shows the total RAM on my machine (62Gi) and the RAM used (1.8Gi) at some point during the sshash construction. Note that used + available is roughly equal to total at any time, where available is the memory that can be given to any other process if needed. Now, we see that the system is caching 43Gi (buff/cache).
In our case, part of this cache is due to writing files to disk. Note, again, that used + free + buff/cache also adds up to total. In fact, the system may decide to evict some cached files to reclaim memory for another process. (Disk caching is good to speed up subsequent loading of files from disk, and it seems it can't be disabled easily.)
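(If one wants a clean measurement between runs, the page cache can at least be dropped manually; a minimal sketch, assuming root privileges on a Linux machine. Note this only clears the cache between runs, it does not prevent caching during a run.)
sync                                          # flush dirty pages to disk first
echo 3 | sudo tee /proc/sys/vm/drop_caches    # drop page cache, dentries and inodes
free -mh                                      # buff/cache should now be much smaller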
The pages written to disk during the lifetime of a program will be counted as part of the max resident set size reported by the time command.
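(For reference, this is the kind of measurement being discussed; a sketch assuming GNU time installed at /usr/bin/time, not the shell built-in, with the input file and parameters as placeholders.)
/usr/bin/time -v ./build input.fa.gz 31 20 -d tmp_dir/ 2>&1 | grep "Maximum resident set size"
# GNU time prints "Maximum resident set size (kbytes)"; resident file-backed
# pages (e.g., memory-mapped files) are included, not only anonymous memory.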
So, to conclude, the used column is what counts and is the real internal memory consumed. If you re-run the same construction and inspect the memory usage with free (or htop), you will get the expected result.
To be extra sure, I just re-built an sshash index with
./build /data2/DNA/Homo_sapiens.GRCh38.dna.toplevel.ust_k31.fa.gz 31 20 -d tmp_dir/ -o human.sshash
and the used memory is at most 3.5GiB.
Hope my explanation helps!
This is also a good reference: http://www.linfo.org/free.html.
Thanks so much @jermp! I'll close this now and let you know if I run into anything else :).
One challenge with free seems to be that it gives a global "top-level" view of the current memory usage, rather than tracking a specific process. Do you normally do this by just stopping everything else on your system and measuring the delta with free during program execution? Is there any way to use something similar but "track" a specific process / command-line invocation rather than globally monitoring this resource system-wide?
Yes, you're right. So far I've used it as you said ("taking" the delta on an unloaded machine; the system always takes some of the available RAM anyway). I've read there are some ways to get a periodic sample from free but I'm not sure if it can be "attached" to a program. I need to do some research about it.
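(As a first attempt, something like the following should work; a rough sketch assuming a bash shell on Linux, with the ./build arguments as placeholders. It samples both the per-process resident set from /proc and the system-wide view from free once per second while the program runs.)
./build input.fa.gz 31 20 -d tmp_dir/ &   # start the construction in the background
pid=$!
while kill -0 "$pid" 2>/dev/null; do      # loop until the process exits
    grep VmRSS /proc/"$pid"/status        # resident set size of this specific process
    free -mh | grep Mem:                  # system-wide view, for the "used" column
    sleep 1
done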
Hi @jermp,
So I've not had the ability to run this on an isolated quiet machine yet, because we're running it on a server. However, I did try the following. I created a cgroup with a memory limit of 4G and ran the process there. After running for a bit, the process was terminated with signal 9 (SIGKILL) because it exceeded the memory limit of the cgroup.
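(For reference, a sketch of the kind of setup I mean; it assumes a cgroup-v2 system with systemd available, and the limit and ./build arguments are placeholders rather than my exact invocation. Depending on the machine it may need root or the --user flag.)
# run the command in a transient scope with a hard memory limit of 4G;
# if the limit is exceeded and memory cannot be reclaimed, the kernel sends SIGKILL
systemd-run --scope -p MemoryMax=4G ./build input.fa.gz 31 20 -d tmp_dir/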
I tried to use another tool to look at how the memory usage changes over time, and ended up trying procpath, pointed to from this StackExchange thread. I can't guarantee it doesn't do the same thing as /usr/bin/time, but it at least gives me a report over time. The process report looks like this:
What you can see is that it mostly sticks under 5G, but there is a short window where it peaks at ~8G. Given the behavior under cgroups, I think that effect is real (i.e., at some point it requires more than 4G of resident memory), but short-lived.
If this is something you think might be real, I'm happy to share the exact data I'm running on and do some investigation to figure out exactly where in the execution this allocation spike is happening. Since it's just at a single point in execution, it's probably a peak that can be mitigated. Let me know what you think.
Best, Rob
Hi Rob, and thank you for this further experiment. Yes, I'm interested in the data you're trying this on. In general, I can tell you that the only part of the execution where I'm not imposing an upper bound on the memory is during the construction of the skew index. (And this is something I can change, of course.) That is, I'm assuming it takes only a small % of the space and keeping the temporary storage in memory.
Also, the compressed index is always kept in internal memory. What is the final index space in your case? Because maybe the peak is due to the compressed index itself.
The CPU% spike seems to indicate that the memory is higher when building the MPHF over the minimizers. That seems quite odd, since I'm using the external-memory construction of PTHash and allowing 2GB of working RAM. Please share your data so that I can run some experiments on my own. Thanks!
Hi @jermp,
Sure thing! My final index is 2.5G, so I don't think that's the issue. Let me throw the raw input fasta up on a Box link and then post it back here. Ok, you can use this link to grab my exact input file.
--Rob
Hi @rob-p and thanks for sharing the data. I confirm that by running
./build grch38_cuttlefish_k31_seg.fa.gz 31 20 -d tmp_dir/
I built an index in 8 minutes and the RAM usage was never higher than 4GB, for a final index of 3GB.
The (much) higher memory footprint reported by /usr/bin/time must be due to disk caching and the use of memory-mapped files (e.g., during the merging of sorted files on disk). I can remove the use of mmap from the codebase, although it will require some refactoring.
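(One way to check this hypothesis; a sketch assuming a reasonably recent Linux kernel that exposes /proc/<pid>/smaps_rollup, with $pid being the PID of the running ./build process.)
grep -E '^(Rss|Anonymous):' /proc/"$pid"/smaps_rollup
# Rss is the total resident set; Anonymous is the part not backed by files.
# The difference is (roughly) the resident file-backed memory, e.g. memory-mapped
# files, which /usr/bin/time also counts towards the max resident set size.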
Thanks for testing this out @jermp! In this case, I'm not certain how to explain the cgroups result. Perhaps there's a cgroup property that isn't being set that changes how mmapped pages are counted. I'll have to look into this, or see if there is another tool that can track a specific process and account for this. Out of curiosity, do you know if there's any difference in how these files are handled on an HDD as opposed to an SSD? I'm running on a server with an HDD.
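(One thing I might check is the per-cgroup memory breakdown; a sketch assuming cgroup v2, with the cgroup path below a placeholder for whatever group the process actually runs in. If I understand correctly, the v2 memory controller charges both anonymous memory and page-cache/file pages against memory.max, and memory.stat shows the split.)
cat /proc/"$pid"/cgroup                                    # print the cgroup the process belongs to
grep -E '^(anon|file) ' /sys/fs/cgroup/<cgroup-path>/memory.stat
# "anon" is anonymous memory, "file" is page cache (including mapped file pages);
# both are charged against the cgroup's memory.max limit.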
You're very welcome, and thanks for the data. I actually do not know the behaviour of cgroups, but I ran the construction using a tmp_dir on both a classic HDD and a faster SSD and obtained the same RSS of 13.5GB as reported by time.
Hi @jermp,
I'm trying out the new external memory construction feature of sshash (super-exciting!). I've noticed that the memory usage seems surprisingly high. For example, I ran on the compacted dBG (not the UST) of the human genome with:
and I am getting ~12G of memory usage. The time is ~16m, but I'd expect this given it's running on spinning rust rather than SSD.
Any idea why the memory usage might be like this? I noticed that you initially mentioned it was ~4G for the human genome with k=31. Is it because of my minimizer size? When I use a minimizer size of 15, this goes down to ~9G, which is better but still larger than I'd have anticipated. Any thoughts on what might be going on here?
Thanks! Rob