iqbal-lab-org / pandora

Pan-genome inference and genotyping with long noisy or short accurate reads
MIT License

Tracking pandora compare RAM usage and runtime #148

Closed: leoisl closed this issue 3 years ago

leoisl commented 5 years ago

This is not really an issue - just a way of documenting pandora compare RAM usage across commits and of knowing whether we need to improve it.

Dataset for massif profiling: downsampled klebs dataset with 8000 PRGs and 50 samples at 45x coverage (for the PRG, reads, command-line, etc, see /hps/nobackup/iqbal/leandro/compare_test/profiling_test_8000_PRGs_50_samples on yoda).

Massif profiling:

Git commit: https://github.com/leoisl/pandora/commit/5b329a78d89eaa0f85592258a8fb378677f71587

Massif output: massif.out.DEBUG_5b329a.txt

Peak snapshot screenshot (attached image): massif.out.DEBUG_5b329a

Comments on the three allocation sites that account for most of the RAM:

  1. https://github.com/leoisl/pandora/blob/5b329a78d89eaa0f85592258a8fb378677f71587/include/kmergraph.h#L91

Takes 413MB / 1.5GB (26.4% of the RAM): this is the current RAM bottleneck. The RAM usage of this data structure increases with the number of PRGs, the number of nodes in the PRGs and the number of samples. It might be tricky to improve on this.

  2. https://github.com/leoisl/pandora/blob/5b329a78d89eaa0f85592258a8fb378677f71587/src/kmergraph.cpp#L299

Takes 226MB / 1.5GB (14.45% of RAM). This increases only with the number of PRGs and the number of nodes in the PRGs; it does not increase with the number of samples, so I don't think we need to worry about improving the memory usage here.

  3. https://github.com/leoisl/pandora/blob/5b329a78d89eaa0f85592258a8fb378677f71587/src/pangenome/panread.cpp#L37

Takes 167MB / 1.5GB (10.68% of RAM). This could be improved (see the TODOs in https://github.com/leoisl/pandora/blob/5b329a78d89eaa0f85592258a8fb378677f71587/include/minihit.h ), but it does not seem to be a bottleneck.

leoisl commented 5 years ago

Heavy dataset just to measure maximum RAM usage (no massif profiling):

Full klebs dataset with 61340 PRGs and 151 samples at 45x coverage (see /hps/nobackup/iqbal/leandro/compare_test/150_samples_test for details).

Git commit: https://github.com/leoisl/pandora/commit/888fff3f82a0051c78d2528d74c1f6298f7ab965

Maximum RAM usage: 17.1 GB

Info about time performance:

- Number of requested threads: 16
- Percent of CPU this job got: 811% (on average, 8.1 threads working simultaneously)
- Wall clock time: 5h 21min

leoisl commented 5 years ago

Please tell me:

  1. Should we track pandora compare RAM usage in another dataset?

  2. Should we improve RAM usage?

  3. Should we improve runtime?

Suggestions for datasets heavier than the full klebs dataset are very welcome.

iqbal-lab commented 5 years ago

For that first one, surely we don't need a uint32 for the number of samples. More tricky to organise would be getting rid of the floats and quantizing the likelihoods.

rmcolq commented 5 years ago

In that first one we don't have a uint32 for the number of samples; we have a pair of uint32 to represent the forward and reverse coverage for each sample on each node in the kmergraph. We still don't need uint32. I remember that at some point we updated almost all uints in pandora to uint32 on Robyn's advice, to reduce the possibility of uint conversions. This may be the time to rethink that.
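
For concreteness, the shape of what we store is roughly this (a simplified sketch with made-up names, not the actual kmergraph code):

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Simplified sketch (hypothetical names) of the coverage storage being
// discussed: for every kmer node we keep, per sample, a (forward, reverse)
// pair of uint32 counts.
struct KmerNodeCoverageSketch {
    std::vector<std::pair<uint32_t, uint32_t>> covg_per_sample;

    explicit KmerNodeCoverageSketch(std::size_t num_samples)
        : covg_per_sample(num_samples, { 0u, 0u }) // 2 x 4 bytes per sample
    {
    }
};

// Rough footprint of the pairs alone (ignoring container overhead):
// #kmer-nodes x #samples x 2 x sizeof(uint32_t) bytes, which is why this
// term grows with both the size of the PRGs and the number of samples.
inline std::size_t coverage_bytes(std::size_t num_nodes, std::size_t num_samples)
{
    return num_nodes * num_samples * 2 * sizeof(uint32_t);
}
```

Switching the pair elements to 16-bit counts would halve exactly this term.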

rmcolq commented 5 years ago

I see no reason to track pandora compare RAM in another dataset just yet. This is a 150X improvement in RAM use over what I had before, and means that we can in theory scale to a dataset of 1000 samples with 45X coverage on yoda already (assuming linear growth, which isn't quite accurate?).

If there are obvious things that would reduce RAM like changing the coverage to a uint16 (and making sure we had some catch just in case there were some bizarrely common kmer) we should do those.

Is runtime still linear? My gut feeling is that maybe that could be improved a bit next?

leoisl commented 5 years ago

Agreed. This is how quantizing coverage is done in GATB: https://github.com/GATB/gatb-core/issues/19#issuecomment-360741787 ... there are probably many ways to do this... then we would not need a uint32 for the coverage, but we would have some error... I think even a very simple encoding using uint16 in pandora is enough: coverages < 65535 are encoded exactly, coverages >= 65535 are all encoded as 65535. I don't think we will have many of the latter cases anyway, assuming uniform read coverage, so the errors would be close to 0 and it would cut the RAM bottleneck by half.
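
Something like the following is all I have in mind (a minimal sketch with made-up names, not actual pandora code):

```cpp
#include <algorithm>
#include <cstdint>

// Proposed uint16 encoding: coverages below 65535 are stored exactly,
// anything that would reach or exceed 65535 saturates at 65535.
inline uint16_t encode_covg_u16(uint32_t covg)
{
    return static_cast<uint16_t>(std::min<uint32_t>(covg, UINT16_MAX));
}

// Saturating increment -- the "catch" for a bizarrely common kmer, so the
// counter can never wrap around.
inline void increment_covg_u16(uint16_t& covg)
{
    if (covg < UINT16_MAX) {
        ++covg;
    }
}
```

With a (forward, reverse) pair per sample per node, this would take the coverage term from 8 down to 4 bytes per sample per node.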

If we want to decrease the RAM usage even further, we could go for uint8, accepting some error, using e.g. the GATB encoding.

What do you think?

rmcolq commented 5 years ago

That's a neat method! My only hesitation is the idea of using any more GATB...

leoisl commented 5 years ago

> I see no reason to track pandora compare RAM in another dataset just yet. This is a 150X improvement in RAM use over what I had before, and means that we can in theory scale to a dataset of 1000 samples with 45X coverage on yoda already (assuming linear growth, which isn't quite accurate?).
>
> If there are obvious things that would reduce RAM like changing the coverage to a uint16 (and making sure we had some catch just in case there were some bizarrely common kmer) we should do those.
>
> Is runtime still linear? My gut feeling is that maybe that could be improved a bit next?

My guess is that the RAM scales fine to thousands of samples. The peak memory usage shown here occurs while pandora is mapping a single sample (creating the pangraph from that sample). This does not change with the number of samples, since we process the samples one by one. If we look strictly after all the reads are mapped, we have 1.3 GB of RAM usage. The 2nd and 3rd largest RAM consumers are the KmerGraphs and the Index, and these do not depend on the number of samples, but rather on the PRGs. The largest RAM consumer is the coverage info on all KmerNodes, and that is 415MB for 150 samples. I think it is linear, so for 1000 samples we would have 2.7GB of coverage information (which can be decreased by using uint16 or uint8). So it seems to scale fine, but I could be totally wrong; the best thing is to simply run and check, but we need such a dataset.
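
Just spelling out the linear extrapolation I am assuming here:

$$\frac{415\ \text{MB}}{150\ \text{samples}} \times 1000\ \text{samples} \approx 2.7\ \text{GB}$$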

And you are right about runtime - this will be the bottleneck... I'd guess from a few days to a week or more for 1000 samples, which is not very nice...

rmcolq commented 5 years ago

That is really great!! Runtime it is then...

leoisl commented 5 years ago

> That's a neat method! My only hesitation is the idea of using any more GATB...

GATB is already a dependency due to the de novo module, so I think there is no issue in using GATB stuff then... If you prefer, we could replicate what they do for the coverage encoding: https://github.com/GATB/gatb-core/blob/2332064bf74032e801537d43dce8f87f018cea4a/gatb-core/src/gatb/tools/collections/impl/MapMPHF.hpp#L96-L157
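
Roughly, the idea there (as I understand it; the bin boundaries below are made up for illustration and are not GATB's actual values) is to map coverages onto a small set of bins so that a uint8 is enough: small values stay exact and larger values get increasingly coarse.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustration of a non-linear 8-bit coverage quantization in the spirit of
// the GATB scheme linked above (bin boundaries here are invented): small
// coverages are stored exactly, larger ones fall into geometrically growing
// bins, so decoding introduces a bounded error only for high values.
class QuantizedCovg {
public:
    QuantizedCovg()
    {
        // exact up to 100, then bins growing by roughly 20% per step
        for (uint32_t v = 0; v <= 100; ++v) {
            boundaries_.push_back(v);
        }
        uint64_t bound = 100;
        while (boundaries_.size() < 256) {
            bound += bound / 5 + 1;
            boundaries_.push_back(static_cast<uint32_t>(
                std::min<uint64_t>(bound, UINT32_MAX)));
        }
    }

    // encode: index of the largest bin lower-bound <= covg (a real
    // implementation would binary-search; a linear scan keeps the sketch short)
    uint8_t encode(uint32_t covg) const
    {
        uint8_t code = 0;
        while (code + 1u < boundaries_.size() && boundaries_[code + 1] <= covg) {
            ++code;
        }
        return code;
    }

    // decode: the bin's lower bound, i.e. an approximation of the original
    uint32_t decode(uint8_t code) const { return boundaries_[code]; }

private:
    std::vector<uint32_t> boundaries_; // 256 increasing bin lower bounds
};
```

Decoding returns the bin's lower bound, so the error only affects coverages above the exact range.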

leoisl commented 5 years ago

> That is really great!! Runtime it is then...

Yeah... but these are just guesses. I'd prefer to work with data, though, so we can be sure we are working on the true bottlenecks...

Do we have any dataset with thousands of samples? If not, can we simply duplicate the samples in the klebs dataset until we reach thousands? If exact duplication is not realistic, maybe simulate reads or something like that?

rmcolq commented 5 years ago

@iqbal-lab said he had some with 1000s to process in the near future. Probably doesn't have them right now.

Ha, let's not reimplement - ignore what I said and either use GATB or use uint16.

leoisl commented 5 years ago

> I see no reason to track pandora compare RAM in another dataset just yet. This is a 150X improvement in RAM use over what I had before, and means that we can in theory scale to a dataset of 1000 samples with 45X coverage on yoda already (assuming linear growth, which isn't quite accurate?). If there are obvious things that would reduce RAM like changing the coverage to a uint16 (and making sure we had some catch just in case there were some bizarrely common kmer) we should do those. Is runtime still linear? My gut feeling is that maybe that could be improved a bit next?
>
> My guess is that it scales in RAM for thousands of samples. The peak memory usage shown here is when pandora is mapping a single sample (creating the pangraph from the sample). This does not change with the number of samples, since we process each sample one by one. If we look strictly after the reads are all mapped, we have 1.3 GB of RAM usage. The 2nd and 3rd most RAM consumers are the KmerGraphs and the Index, and this does not depend on number of samples, but rather on the PRGs. The 1st most RAM consumer is the coverage info on all KmerNodes, and that is 415MB for 150 samples. I think it is linear, so for 1000 samples we would have 2.7GB of coverage information (which can be decreased if using uint16 or uint8). So it seems to scale fine, but I could be totally wrong, best thing is to simply run and check, but we need such dataset.
>
> And you are right about runtime - this will be the bottleneck... I'd guess from some days to 1 week or more for 1000 samples, which is not very nice...

Ooooops, I am totally mistaken here.... I computed these stats for the downsampled dataset used for massif profiling, and used them as if they were for the full dataset... I think linear RAM growth is an upper bound, so this seems runnable on yoda, but the best way to know is to run on such a dataset itself.

leoisl commented 5 years ago

Ok, let's switch to uint16 for the coverage information then - almost no error and half the RAM usage for recording coverage. Let's go for uint8 only if RAM becomes a bottleneck.

iqbal-lab commented 5 years ago

Yes we do have thousands now! 3000 klebs and 8000 e coli. I think it's runtime that's the performance barrier now and we need to move to parallel across the cluster. We could do huge numbers very fast.

But also in terms of priorities, we have bigger holes to fill about how we run Pandora smoothly, integrating output of de novo back in, make prg etc.

iqbal-lab commented 5 years ago

Re coverage

  1. Now Rachel is potentially doing viruses, beware v high covg samples
  2. Illumina is a bigger issue than Nanopore in terms of repeats causing v high coverages?

leoisl commented 5 years ago

> Yes we do have thousands now! 3000 klebs and 8000 e coli. I think it's runtime that's the performance barrier now and we need to move to parallel across the cluster. We could do huge numbers very fast.
>
> But also in terms of priorities, we have bigger holes to fill about how we run Pandora smoothly, integrating output of de novo back in, make prg etc.

Agreed, I will then postpone performance improvements in pandora compare for later. But I am interested to know how the current version performs on these huge datasets... Could you point me to them on yoda or ebi-cli?

> Re coverage
>
> 1. Now Rachel is potentially doing viruses, beware v high covg samples
> 2. Illumina is a bigger issue than Nanopore in terms of repeats causing v high coverages?

We will encode with uint16: coverages < 65535 are encoded exactly, coverages >= 65535 are all encoded as 65535. Do you think this is enough?

iqbal-lab commented 5 years ago

Sounds good to me!

leoisl commented 3 years ago

Closing as outdated. We now have better performance tracking with the plots generated to evaluate pandora's performance for the paper.