jermp / sshash

A compressed, associative, exact, and weighted dictionary for k-mers.
MIT License

External memory construction seems too high #10

Closed rob-p closed 2 years ago

rob-p commented 2 years ago

Hi @jermp,

I'm trying out the new external memory construction feature of sshash (super-exciting!). I've noticed that the memory usage seems surprisingly high. For example, I ran on the compacted dBG (not the UST) of the human genome with:

/usr/bin/time ./build  unitigs.fa 31 19 -d /scratch/shash_tmp -o grch38_k31_seg.sshash --verbose

and I am getting ~12G of memory usage. The time is ~16m, which I'd expect given that it's running on spinning rust rather than an SSD.

2316.51user 47.52system 16:27.23elapsed 239%CPU (0avgtext+0avgdata 12690096maxresident)k
8155944inputs+50695376outputs (1major+8309647minor)pagefaults 0swaps

Any idea why the memory usage might be like this? I noticed you initially mentioned ~4G for the human genome with k=31. Is it because of my minimizer size? With a minimizer length of 15 this goes down to ~9G, which is better, but still larger than I'd have anticipated. Any thoughts on what might be going on here?

Thanks! Rob

jermp commented 2 years ago

Hi @rob-p, I've found out that using /usr/bin/time to get the max resident set size is misleading for programs using external memory (but it is fine for programs running entirely in internal memory).

Put simply, the reason is that max RSS mixes the memory actually used by the process with memory that the kernel caches for data written to files.

The proper way to get a measurement is via the free command (which will display the same information as the graphical htop command).

An example of free -mh:

           total        used        free      shared  buff/cache   available
 Mem:       62Gi       1.8Gi        18Gi       1.0Mi        43Gi        60Gi

which shows the total ram on my machine (62Gi) and the ram used (1.8Gi) at some point of execution of the sshash construction.

Note that available+used always equals total, where available is the memory that can be used by any other process if needed. Now, we see that the system is caching 43Gi (buff/cache). In our case, some of these gibibytes are due to writing files to disk.

Note, again, that free+buff/cache also equals total: the system may decide to uncache some files to reclaim memory for another process.

(Disk caching is good to speed up further loading of files from disk and it seems it can't be disabled easily.)
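(If you want a clean slate between measurements, the cache that has already accumulated can be dropped manually, although this does not disable caching for subsequent I/O; it needs root:)

sync                                          # flush pending writes first
echo 3 | sudo tee /proc/sys/vm/drop_caches    # ask the kernel to drop clean page cache, dentries and inodes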

The pages written to disk during the lifetime of a program are counted as part of the max resident set size reported by the time command.

So, to conclude, the used column is what matters: it is the real internal memory consumed. If you re-run the same construction and inspect the memory usage with free (or htop), you will get the expected result.
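(free can also repeat its report by itself; a minimal example using the standard procps options:)

free -h -s 5 -c 120    # print the memory report every 5 seconds, 120 times, while the build runs in another shell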

To be extra sure, I just re-built an sshash index with

./build /data2/DNA/Homo_sapiens.GRCh38.dna.toplevel.ust_k31.fa.gz 31 20 -d tmp_dir/ -o human.sshash

and memory used is at most 3.5GiB.

Hope my explanation helps!

jermp commented 2 years ago

This is also a good reference: http://www.linfo.org/free.html.

rob-p commented 2 years ago

Thanks so much @jermp! I'll close this now and let you know if I run into anything else :).

One challenge with free seems to be that it gives a global "top-level" view of the current memory usage, rather than tracking a specific process. Do you normally do this by just stopping everything else on your system and measuring the delta with free during program execution? Is there any way to use something similar but "track" a specific process / command-line invocation rather than globally monitoring this resource system-wide?

jermp commented 2 years ago

Yes, you're right. So far I've used it as you said ("taking" the delta on an unloaded machine; the system always takes some available RAM anyway). I've read there are some ways to get a periodic sample from free, but I'm not sure if it can be "attached" to a program. I need to do some research about it.
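One rough way to tie the sampling to a specific invocation, just as a sketch (on an otherwise quiet machine, so that the delta in the used column is attributable to the build), would be something like:

./build unitigs.fa 31 20 -d tmp_dir/ &        # the command under test (arguments here are only an example)
pid=$!
while kill -0 "$pid" 2>/dev/null; do          # loop for as long as the process is alive
    echo "$(date +%T)  $(free -m | awk 'NR==2 {print "used: " $3 " MiB"}')"
    sleep 5
done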

rob-p commented 2 years ago

Hi @jermp,

So I've not had the chance to run this on an isolated, quiet machine yet, because we're running it on a shared server. However, I did try the following: I created a cgroup with a memory limit of 4G and ran the process there. After running for a bit, the process was killed with signal 9 because it exceeded the cgroup's memory limit.
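Roughly, the setup looks like this (sketched with cgroup v1 tools; the group name is arbitrary and the paths differ under cgroup v2):

sudo cgcreate -g memory:sshash_test                                          # create a cgroup (hypothetical name)
echo 4G | sudo tee /sys/fs/cgroup/memory/sshash_test/memory.limit_in_bytes   # 4G hard memory limit (cgroup v1 interface)
sudo cgexec -g memory:sshash_test ./build unitigs.fa 31 19 -d /scratch/shash_tmp -o grch38_k31_seg.sshash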

I tried to use another tool to look at how the memory usage changes over time, and ended up trying procpath, which I found via a StackExchange thread. I can't guarantee it doesn't have the same problem as /usr/bin/time, but it at least gives me a report over time. The process report looks like this:

[procpath plot of the process's memory usage over time]

What you can see is that it mostly sticks under 5G, but there is a short window where it peaks at ~8G. Given the behavior under cgroups, I think that effect is real (i.e. at some point it is requiring more than 4G of resident memory), but short-lived.
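For completeness, this kind of report comes from an invocation roughly like the one below; I'm recalling the flags from procpath's docs, so double-check them, and the pid is of course whatever the build process actually gets:

procpath record -i 1 -d build.sqlite '$..children[?(@.stat.pid == 12345)]'   # sample /proc every second for pid 12345 and its children
procpath plot -d build.sqlite -q rss -p 12345                                # plot the recorded RSS over time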

If this is something you think might be real, I'm happy to share the exact data I'm running on and do some investigation to figure out exactly where in the execution this allocation spike is happening. Since it's just at a single point in execution, it's probably a peak that can be mitigated. Let me know what you think.

Best, Rob

jermp commented 2 years ago

Hi Rob, and thank you for this further experiment. Yes, I'm interested in the data you're trying this on. In general, I can tell you that the only part of the execution where I'm not imposing an upper bound on the memory is the construction of the skew index. (And this is something I can change, of course.) That is, I'm assuming it takes only a small percentage of the space, so I keep its temporary storage in memory.

Also, the compressed index is always kept in internal memory. What is the final index space in your case? Because maybe the peak is due to the compressed index itself.

jermp commented 2 years ago

The CPU% spike seems to indicate that the memory peak happens while building the MPHF over the minimizers. That seems quite odd, since I'm using the external memory construction of PTHash and allowing 2GB of working RAM. Please share your data so that I can run some experiments on my own. Thanks!

rob-p commented 2 years ago

Hi @jermp,

Sure thing! My final index is 2.5G, so I don't think that's the issue. Let me throw the raw input fasta up on a Box link and then post it back here. Ok, you can use this link to grab my exact input file.

--Rob

jermp commented 2 years ago

Hi @rob-p and thanks for sharing the data. I confirm that by running

./build grch38_cuttlefish_k31_seg.fa.gz 31 20 -d tmp_dir/

I built the index in 8 minutes and RAM usage was never higher than 4GB, for a final index of 3GB. The (much) higher memory footprint reported by /usr/bin/time must be due to disk caching and the use of memory-mapped files (e.g., during the merging of sorted files on disk).

I can remove the use of mmap from the codebase, although it will require some refactoring.

rob-p commented 2 years ago

Thanks for testing this out @jermp! In this case, I'm not certain how to explain the cgroups result. Perhaps there's a cgroup property that isn't being set that changes how mmapped pages are counted. I'll have to look into this, or see if there is another tool that can track a specific process and account for this. Out of curiosity, do you know if there's any difference in how the handling of these files might work on an HDD as opposed to an SSD? I'm running on a server with an HDD.
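One thing I can check is the cgroup's own accounting: the v1 memory controller breaks usage down into anonymous rss, page cache, and mapped file pages in memory.stat, which might reveal whether mmapped pages are what push the group over its limit. A hypothetical check, reusing the group name from above:

grep -E '^(cache|rss|mapped_file) ' /sys/fs/cgroup/memory/sshash_test/memory.stat   # cache / rss / mapped_file usage charged to the group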

jermp commented 2 years ago

You're very welcome, and thanks for the data. I actually do not know the behaviour of cgroups, but I ran the construction with the tmp_dir on both a classic HDD and a faster SSD and obtained the same RSS of 13.5GB as reported by time.