iqbal-lab-org / pandora

Pan-genome inference and genotyping with long noisy or short accurate reads
MIT License
107 stars 14 forks source link

Runtime for pandora compare #323

Open AmayAgrawal opened 1 year ago

AmayAgrawal commented 1 year ago

Hi,

I am using pandora compare to call out the variants after building the pan-genome reference graphs and indexing them using pandora index. Although I am using multiple threads, it works with one strain at a time and takes a lot of time to genotype single strain and I have more than 2000 strains to genotype.

Is there any way so that I can parallise it to run on multiple strains at a time? or is there any other way such that I can genotype all samples quickly?

leoisl commented 1 year ago

Hey @AmayAgrawal , sorry for my delay.

pandora compare basically runs in two steps, each using multiple threads:

  1. The first step is to map reads from the samples to the PanRG and infer the most likely path for the samples in each PRG. The multithreading here enters just when mapping the reads to the PanRG, which is the most costly operation. However, one sample is mapped at a time. The log messages you get while doing this are sth like:

    [2023-01-27 15:29:49.315710] [0x00007f8a5ad8d7c0] [info]    Constructing pangenome::Graph from read file reads/toy_sample_1/toy_sample_1.100x.random.illumina.fastq (this will take a while)
    <...>
    [2023-01-27 15:29:49.726961] [0x00007f8a5ad8d7c0] [info]    Processed 700 reads
    <...>
    [2023-01-27 15:29:49.785260] [0x00007f8a5ad8d7c0] [info]    Find max likelihood PRG paths
  2. The second step is to actually do a multi-sample genotyping of each PRG based on the maximum likelihood paths and minimiser coverage gathered from the first step. Here, all PRGs are genotyped multithreadely, with each thread getting one PRG, genotyping it, and getting the next and so on. The log messages you get while doing this are sth like:

    [2023-01-27 15:29:50.218415] [0x00007f8a5ad8d7c0] [info]    Multi-sample pangraph has 2 nodes
    [2023-01-27 15:29:50.227535] [0x00007f8a5ad8d7c0] [info]    Infer VCF reference path
    [2023-01-27 15:29:50.241175] [0x00007f8a5ad8d7c0] [info]    Genotyping VCF...
    [2023-01-27 15:29:50.241479] [0x00007f8a5ad8d7c0] [info]    Finished genotyping VCF
    [2023-01-27 15:29:50.246743] [0x00007f8a5ad8d7c0] [info]    Infer VCF reference path
    [2023-01-27 15:29:50.259194] [0x00007f8a5ad8d7c0] [info]    Genotyping VCF...

I think from reading your messages, you are on step 1. Could you please confirm? If yes, you'd know how many samples you've mapped with a command like grep "Constructing pangenome::Graph from read file" pandora.log | wc -l. Just wondering if it is finishing this step or still far from finishing.

The maximum number of strains I believe we have used pandora compare was 114 illumina samples, but it is a special use case as our PRG contained only plasmid genes, so most of the reads don't map. That took 2.5h, but in our paper to run on 20 e-coli samples, it took ~10h, both runs with 16 threads. So I could see 100 samples taking at least a couple of days, possibly more, and 2000 samples possibly 1000h.

The current ways around this is to increase the number of threads, if possible. We are working to scale pandora up, but unfortunately our most pressing need is to scale pandora to handle millions of PRGs, and not yet thousands of samples. Handling thousands of samples is in the roadmap though, and we might start working on this within some months...

AmayAgrawal commented 1 year ago

Hi @leoisl,

Thanks for the explanation. As you predicted, currently I am in the first step where the reads from the samples are being mapped to the PanRG to infer the most likely path for the samples in each PRG. I am running the analysis with 64 threads and I am half way through the samples in 2 weeks time and I guess that I have to wait same amount of time for it to finish.

Yes, I think it would be nice in the future to have the functionality where it can handle thousands of samples quickly.

iqbal-lab commented 1 year ago

Sorry it is currently too slow! Pandora is inherently parallelisable by sample when mapping, and then by gene when doing 'compare', and could then in principle run very fast on a cluster. I don't know if you have access to one. But we need to make a number of updates to enable this, and we have to finish our current changes first. As leandro said, right now we're modifying it so even with enormous pan genomes, the ram use is controlled. This will take several weeks to finish. Apologies for the delay.

AmayAgrawal commented 1 year ago

Hi, I have a small question regarding this analysis. So as I mentioned earlier, I started the pandora compare analysis for around 2000 strains. As mentioned above by @leoisl, pandora works in two steps: first is to map the reads from the samples to the PRG and second is to do multi-sample genotyping of each PRG. I think that I was towards the end of the second step and then the process was automatically killed today. Below are the log messages that I got

[2023-02-11 12:40:45.847849] [0x00007fe77ce1a700] [info]    Genotyping VCF...
[2023-02-11 12:40:47.344218] [0x00007fe77ce1a700] [info]    Finished genotyping VCF
[2023-02-11 12:58:17.154721] [0x00007fe76da79700] [info]    Genotyping VCF...
[2023-02-11 12:58:17.831671] [0x00007fe76da79700] [info]    Finished genotyping VCF
[2023-02-13 03:17:11.856584] [0x00007fe75f25c700] [info]    Genotyping VCF...
[2023-02-13 03:17:20.521788] [0x00007fe75f25c700] [info]    Finished genotyping VCF
[2023-02-13 09:47:52.348881] [0x00007fe77027e700] [info]    Genotyping VCF...
[2023-02-13 09:47:52.938793] [0x00007fe77027e700] [info]    Finished genotyping VCF
Killed

From the above log, it can be seen that for almost two weeks, there was no output log written (last one was written on 13th Feb) and also nothing was written in the output folder. So I think that pandora was in the last step of making one final multi sample vcf file when it was killed automatically. First is do you think that it was in this step? Next question is now that it was killed and I do not have one final vcf file, I saw that there is VCFs_genotyped folder in which I can see the vcf file for each PRG. So I was thinking to combine the output from all these individual PRGs vcf files into one final vcf file. Will this work? Or is there any other way that pandora stores checkpoints and I can resume the analysis again from this particular step because I can't run the whole thing again as it took over a month to reach at this stage of the analysis.

leoisl commented 1 year ago

Hey @AmayAgrawal , sorry for the huge delay on answering you.

So I was thinking to combine the output from all these individual PRGs vcf files into one final vcf file. Will this work?

Maybe but I don't think the resulting files would be complete... I would not recommend that... Pandora also does not store checkpoints for resuming analysis later.

Bottom line is that pandora is still not scalable to thousands of samples. We are working right now to scale it to millions of genes, and next step is to scale to thousands of samples. We will let you know when this feature is implemented.

iqbal-lab commented 1 year ago

Sorry for this. I have one hack that would allow you to progress for now.

  1. Collect a subset of your genomes that runs in a reasonable time - say 50, and run pandora compare on them, and identify the inferred reference it generates for the VCFs. Then run another 2000/50 -1 jobs each one running pandora compare on 50 samples, using that same reference. This is parallelisable by you, can be run on a cluster, and will give you VCFs which are all using the same reference. It's not perfect, but it would allow you to move forwards for now.