psathyrella / partis

B- and T-cell receptor sequence annotation, simulation, clonal family and germline inference, and affinity prediction
GNU General Public License v3.0

Exorbitantly high memory usage during full/naive hamming partitioning #192

Closed — krdav closed this issue 6 years ago

krdav commented 8 years ago

During my runs on large data files I have observed the following: partitioning fails for one of two reasons:

  1. The scheduler kills it, presumably because of overuse of memory
  2. partis kills itself, presumably because of running out of free memory

The interesting thing is that vsearch works fine, and very fast too, but neither naive hamming nor full clustering works. They both fail in the initial phase of clustering (according to the partis output).

Here is an example of a job trace I get:

05/25/2016 17:21:24  A    queue=batch
05/26/2016 13:13:31  A    user=krdav group=cu_10049 account=cu_10049 jobname=krdav_job queue=batch ctime=1464189684 qtime=1464189684
                          etime=1464189684 start=1464261211 owner=krdav@computerome02.cm.cluster exec_host=risoe-r12-cn490/0-27
                          Resource_List.neednodes=1:ppn=28:thinnode Resource_List.nodect=1 Resource_List.nodes=1:ppn=28:thinnode
                          Resource_List.walltime=200:00:00
05/26/2016 21:44:23  A    user=krdav group=cu_10049 account=cu_10049 jobname=krdav_job queue=batch ctime=1464189684 qtime=1464189684
                          etime=1464189684 start=1464261211 owner=krdav@computerome02.cm.cluster exec_host=risoe-r12-cn490/0-27
                          Resource_List.neednodes=1:ppn=28:thinnode Resource_List.nodect=1 Resource_List.nodes=1:ppn=28:thinnode
                          Resource_List.walltime=200:00:00 session=29635 total_execution_slots=28 unique_node_count=1 end=1464291863
                          Exit_status=0 resources_used.cput=180:19:13 resources_used.energy_used=0 resources_used.mem=128785880kb
                          resources_used.vmem=135766724kb resources_used.walltime=08:30:50

Notice this: resources_used.mem=128785880kb

Which is around the 128 GB of memory our HPC thinnodes have. I am now running it on a fatnode with 1500 GB of memory, so let's see if that works. However, I don't think this is a very sustainable approach: clustering a 150 MB sequence file should be possible with a couple of GB of memory per core.

My point is that you might have a memory leak that should be addressed before partis can run on large sequence files.

psathyrella commented 8 years ago

Hm, thanks. Well it's always possible that a memory leak has snuck in, and I'll go through and check, but I think there are a few other things going on. The main one is that most everything about clustering (as opposed to annotation) scales quadratically with the number of sequences -- so while annotating a file should only use memory of order the input file size, for clustering that stops being true. vsearch uses much less memory for the same reason it is faster -- it applies a bunch of heuristics to avoid all-against-all comparison; but this is also why it's much less accurate. Both of the partis methods have a ton of optimizations to reduce the number of comparisons, and to make each comparison as fast as possible, but they are still close to all-against-all.

So assuming there isn't a new bug, the main memory consumer is that we cache likelihood ratios and naive sequences and other things that we've already calculated. Without this caching, the problem would be totally intractable. But! that said, there definitely is scope for more memory optimization (see #160), and I'll definitely bump that up on my todo list.
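For concreteness, here's a minimal sketch (not the actual partis code; all names and values are placeholders) of why caching pairwise results during agglomeration grows roughly quadratically with the number of sequences:

```python
# Minimal sketch (not the actual partis code) of why caching pairwise results
# during agglomerative clustering uses memory that grows roughly quadratically.
from itertools import combinations

def log_likelihood_ratio(cluster_a, cluster_b):
    """Stand-in for the expensive HMM-based calculation; hypothetical."""
    return -float(len(cluster_a) + len(cluster_b))  # placeholder value

def cache_all_pairs(clusters):
    cache = {}
    # Every pair of current clusters gets a cached entry, so with n starting
    # clusters the cache holds ~n*(n-1)/2 values before any merging happens.
    for a, b in combinations(clusters, 2):
        key = (tuple(sorted(a)), tuple(sorted(b)))
        if key not in cache:
            cache[key] = log_likelihood_ratio(a, b)
    return cache

clusters = [[f"seq_{i}"] for i in range(1000)]
cache = cache_all_pairs(clusters)
print(len(cache))  # 499500 entries for 1000 sequences -- quadratic growth
```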

Extrapolating from a fasta here, I'm guessing your 150MB file is around 700 thousand sequences? That's certainly up around the upper end of what I've ever run with the full clustering, and when I was running at that scale, managing memory consumption was required (e.g. running on a node with a lot of memory, only running one job at a time), but the memory usage was pretty much as I expected -- it spikes when cache files are being read and merged, and whatnot.

krdav commented 8 years ago

Yes, it is around 500-600 thousand sequences, so let's see if it runs through now that I have it running alone on a fatnode.

So I, maybe naively, thought that clustering methods like single linkage were O(n^2) in time and O(n) in space with the right algorithms, but if we are talking about O(n^2) space complexity, I better understand why I am running out of memory.

In that case I hope you will be able to find some optimizations that bring down memory usage, because this could be a big obstacle to using partis' partitioning on rep-seq samples.

psathyrella commented 8 years ago

Well, I certainly know very little about algorithms, but I think the reasons that we're in a different regime are: 1) our most expensive operation, like by several orders of magnitude, is calculating the "distance" metric, 2) the likelihood ratios aren't transitive -- i.e. if A and B are close/far, and B and C are close/far, that doesn't really tell us about A and C, and 3) because of 1) we're switching back and forth pretty dynamically between comparing naive hamming distances and likelihood ratios. Also, this is hierarchical agglomeration, not single linkage, so I'm not sure how that would affect the theoretical best scaling.
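As a rough illustration of point 3 (not partis' actual logic -- the cutoffs and functions below are invented), the cheap hamming fraction settles the clear-cut pairs and the expensive likelihood ratio is only paid for the ambiguous middle ground:

```python
# Rough illustration of switching between a cheap hamming prefilter and an
# expensive likelihood calculation; thresholds are made up for illustration.

def hamming_frac(a, b):
    """Fraction of differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def expensive_likelihood_ratio(a, b):
    """Stand-in for the HMM likelihood-ratio calculation; hypothetical."""
    return 1.0 - hamming_frac(a, b)  # placeholder

def same_clone(a, b, lo=0.02, hi=0.08, llr_cut=0.95):
    hfrac = hamming_frac(a, b)
    if hfrac < lo:
        return True            # obviously close: skip the expensive step
    if hfrac > hi:
        return False           # obviously far: skip the expensive step
    return expensive_likelihood_ratio(a, b) > llr_cut  # ambiguous: pay the cost

print(same_clone("ACGTACGT", "ACGTACGA"))
```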

matsen commented 8 years ago

👍 to Duncan's comments.

I should also say that he's done a quite clever job of caching partial results from previous computations in the agglomeration process. This is inherently a tradeoff of more memory usage for less time.

In our hands, time is more constrained than RAM, so it's a win.

psathyrella commented 8 years ago

oh, and for the sake of posterity (i.e. dear future duncan who has forgotten), we do a lot of dp table caching, which speeds things up by a factor of 15 or so, but could contribute a lot to the memory usage.
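A toy sketch of the general idea behind that kind of caching (not the bcrham implementation; the function and gene names are placeholders), trading memory for time with an unbounded memoization cache:

```python
# Toy sketch of DP-table ("chunk") caching: memoize sub-results keyed by the
# exact inputs, so identical chunks seen across queries are computed once.
from functools import lru_cache

@lru_cache(maxsize=None)  # unbounded cache: faster, but memory only grows
def score_chunk(gene_name, seq_chunk):
    """Stand-in for filling a DP table for one gene against one subsequence."""
    return sum(ord(c) for c in seq_chunk) % 97  # placeholder "score"

# Repeated chunks (common in clonally related sequences) hit the cache.
print(score_chunk("IGHV1-2*02", "ACGTACGT"))
print(score_chunk("IGHV1-2*02", "ACGTACGT"))  # cached, no recomputation
print(score_chunk.cache_info())
```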

krdav commented 8 years ago

Well, I agree that a 15x speedup is worth a lot, and most of the time worth much more than the high memory usage, but I think the problem should not be ignored.

The idea behind the likelihood-based partitioning is soooo much better and more robust than clustering based on a more or less arbitrary hamming distance cutoff, so it would have great value if the method could be used on big datasets.

One idea pops into my mind: could you use the amino acid sequence as a constraint in clustering? E.g. translate all the input sequences and start the clustering from sequences having the same amino acid sequence? The step could simply be implemented by translating each read and storing the unique amino acid sequences as dictionary keys, with all the fasta headers producing a given amino acid sequence as the values for that key. Then the dictionary values would be your initial clusters, and you would never have to compare sequences within these initial clusters. Of course I am assuming here that the same protein sequence cannot come from two different clones, but I think that for most practical purposes this is a good assumption.
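A minimal sketch of this suggestion, assuming Biopython is available (the file name and function are hypothetical):

```python
# Group reads whose translations are identical, so sequences within a group
# never need to be compared during clustering.
from collections import defaultdict
from Bio import SeqIO

def group_by_amino_acid(fasta_path):
    groups = defaultdict(list)  # amino acid sequence -> list of fasta headers
    for record in SeqIO.parse(fasta_path, "fasta"):
        aa = str(record.seq.translate(to_stop=True))
        groups[aa].append(record.id)
    return groups

# Each value is an "initial cluster" that can be collapsed to one representative
# before the expensive clustering step.
# groups = group_by_amino_acid("reads.fasta")
```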

matsen commented 8 years ago

Thanks for your enthusiasm and thoughts, @krdav !

First, I should say that Duncan's thought a whole lot about clustering heuristics and has come up with a system by which we do the expensive likelihood calculation only in a relatively narrow range of situations defined by nucleotide distances. It's true that we could probably do something more efficient for point partis, but that's why we've been using vsearch.

Second, in large repertoires we definitely see significant repetition of sequences on the amino acid level. Despite the remarkable stochasticity of BCRs, the biases in gene usage, trimming, and mutation lead to sequences that are nearly identical on the nucleotide level. You can see this in the supplementary plots in the paper.

This did get me thinking about heuristics again, resulting in #193. Still lots of room for better development, and so keep those ideas coming!

psathyrella commented 8 years ago

A back-of-the-envelope calculation would have shown that my claim that most of the memory goes to pairwise comparisons was a bit implausible, since doubles and strings aren't that big. Indeed the "upper level" of caching -- of log probs, naive hfracs, and naive seqs -- is a trivial portion. It's mostly chunk caching, and only really in certain circumstances: the space is roughly proportional to the number of V genes we're using times the number of simultaneous seqs, so for instance simply running viterbi simultaneously on 100 sequences (in this case non-clonally related, which ends up using 34 V genes) peaks at a couple of GB.
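To make that proportionality concrete, here is the back-of-the-envelope arithmetic with invented per-table dimensions (only the gene and sequence counts come from the example above):

```python
# Back-of-the-envelope arithmetic for the proportionality claim above.
# The per-table dimensions are guesses purely to show the calculation.
n_v_genes = 34          # from the example above
n_simultaneous = 100    # from the example above
states = 300            # hypothetical number of HMM states per gene
positions = 250         # hypothetical sequence length covered by the table
bytes_per_entry = 8     # one double per DP-table entry

table_bytes = states * positions * bytes_per_entry
total_gb = n_v_genes * n_simultaneous * table_bytes / 1024**3
print(f"{total_gb:.1f} GB")  # ~1.9 GB, i.e. "a couple of GB" territory
```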

Now, to think about whether that can be reduced. Note that #162 should have reduced this a lot from what it could have been.

psathyrella commented 8 years ago

OK, I've done a bunch of clean-up/profiling stuff. I haven't really reduced the amount of memory, but I've removed almost all the dynamic allocation (which, I swear, wasn't mine), so partly because of that and partly because it's valgrind clean, I'm pretty sure there are no leaks. It's still not entirely clear what all the memory is going to, though: as far as I can tell it should bounce around a bit from query to query rather than increase monotonically, but it does increase monotonically. I suspect this could be a fragmentation thing -- a lot of memory goes into hash maps storing cached values, and while they get cleared before each new query, they might not be as memory-efficient as one would hope.

In any case, while I've made it a lot faster with some trivial optimizations, I haven't made a dent in the memory. It's clear where to go next with memory, but I need to put this aside for a bit to deal with light chain handling.

psathyrella commented 8 years ago

oh, also note, this is mostly talking about the bcrham processes. The memory usage of the driving python process can also get large, but its memory usage is from different things.

jotwin commented 8 years ago

I tried to partition 30k sequences and ran out of memory (24GB + 24GB swap). The python process occupied around 90% of it when it failed.

psathyrella commented 8 years ago

yeah, that sounds about like the memory consumption I remember. Next I will put some more time into memory optimization on the python side, but I need to work on getting light chain working.

krdav commented 8 years ago

Hi jotwin. Yes, it is very unfortunate, but I also agree with Duncan that light chain support is more important work at this stage. In the meantime, let me share my experience of working around the problem.

  1. First do clustering with vsearch. This is very fast and will not break your budget with respect to memory.
  2. Decrease the size of your input dataset and then run the full partitioning and compare the output.

How can you decrease the number of sequences in your dataset, then? Obviously remove duplicates; then you can require a read to be seen more than once (a minimum observation cutoff); next you can remove/merge duplicates in amino acid space, etc. Be very strict with your cutoffs, and if you still have too many sequences and really don't want to cut away more, take a random sample.
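A rough sketch of those filtering steps, assuming Biopython (the thresholds and names are hypothetical, and the deduplication is exact-match only):

```python
# Pre-filter a fasta: collapse duplicates, apply a minimum observation cutoff,
# merge amino-acid duplicates, then randomly subsample if still too large.
import random
from collections import Counter
from Bio import SeqIO
from Bio.Seq import Seq

def prefilter(fasta_path, min_count=2, max_seqs=100_000, seed=1):
    counts = Counter(str(r.seq) for r in SeqIO.parse(fasta_path, "fasta"))
    # collapse exact nucleotide duplicates and apply the observation cutoff
    kept = [s for s, n in counts.items() if n >= min_count]
    # merge duplicates in amino acid space, keeping one representative each
    by_aa = {}
    for s in kept:
        by_aa.setdefault(str(Seq(s).translate(to_stop=True)), s)
    kept = list(by_aa.values())
    # if still too many, take a random sample
    if len(kept) > max_seqs:
        random.seed(seed)
        kept = random.sample(kept, max_seqs)
    return kept
```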

Usually I find that vsearch makes slightly more partitions, on the order of 20% (+/- 10%) more. Moreover, I have tried running a big sequence file with approx. 500,000 sequences on a 1 terabyte memory fatnode, and the results do not change much when I raise my cutoffs so that only 100,000 sequences are left. I have even raised the bar further so that only 30,000 sequences were left, and still the number of clusters did not change much.

scharch commented 8 years ago

I don't know how much work this would be for you, Duncan, but one option is to offer the user a choice in the trade-off between speed and memory. (For instance, USearch does this with the cluster_fast and cluster_smallmem programs.) Secondly, if I understand this correctly:

the space is roughly proportional to the number of V genes we're using times the number of simultaneous seqs

wouldn't it make sense to cap the number of simultaneous sequences? Thus, even though I only have access to 8 cores at a time right now, the program could still break up the input the same way it would with nprocs=80; it would just end up running those threads in consecutive batches. This could perhaps be implemented with a --max-mem parameter and a rough heuristic on memory usage.
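A sketch of what that batching might look like (the --max-mem flag and the per-sequence memory estimate below are hypothetical, not existing partis options):

```python
# Split the input into consecutive batches sized by a rough memory heuristic,
# rather than running everything simultaneously.
def process(batch):
    print(f"processing {len(batch)} sequences")  # stand-in for one subprocess

def run_in_batches(seqs, max_mem_gb=8.0, mem_per_seq_gb=0.02):
    batch_size = max(1, int(max_mem_gb / mem_per_seq_gb))  # rough heuristic
    for start in range(0, len(seqs), batch_size):
        process(seqs[start:start + batch_size])

run_in_batches([f"seq_{i}" for i in range(2000)], max_mem_gb=8.0)
```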

psathyrella commented 8 years ago

yeah, the latter would be possible, if I were more sure about my previous statement about memory usage proportionality. I mean, the bit of memory that I thought was taking up most of the space is proportional to those two things, but at the end of my memory optimization extravaganza I was a lot less sure that that structure was actually the one using most of the memory. If there are good tools to tell you what is taking most of the memory in c++, I'd like to hear about them, because everything I've used is terrible at digging down through the layers of, say, a map<string, map<vector<string>, string> > type of thing, so it's not always obvious how to account for whatever top or ps is reporting. And then maybe top and ps are reporting larger numbers because your allocations are fragment-ey, and you're not actually using that much...

We do, actually, have a maximum number of simultaneous sequences, when clustering -- for speed optimization, we subsample large clusters down to seven or so (which is all it turns out you really need to have a good idea of the germline rearrangement, which in turn is all you need in order to decide whether a new sequence is a member. I have plots!). Which, I think because of chunk caching, didn't actually help speed that much in most cases, sigh. But if you're running around comparing a three-thousand-sequence cluster to everybody, it certainly helps.
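A tiny sketch of that subsampling idea (not the partis implementation; only the cap of seven comes from the comment above):

```python
# When a cluster gets big, use a small random subset to characterize it
# instead of all its members.
import random

def representative_subset(cluster, cap=7, seed=0):
    if len(cluster) <= cap:
        return list(cluster)
    random.seed(seed)
    return random.sample(list(cluster), cap)

big_cluster = [f"seq_{i}" for i in range(3000)]
print(len(representative_subset(big_cluster)))  # 7
```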

krdav commented 7 years ago

Adding a follow-up on this: the memory usage has actually gotten better, by orders of magnitude in the runs I have tried. The CPU time used for partitioning has also decreased by orders of magnitude.