How to improve performance for Illumina-based analysis of viral pangenome

dandaman commented 2 years ago

Hi,

I'd like to use pandora to study a viral pangenome on the basis of Illumina data. Altogether I'm looking at 13 input MSAs, each comprising ~300 high-quality reference sequences. I'd like to study new samples using a pangenome graph.

As I am also looking at gramtools in parallel and wanted to assess the performance of both, I used two Sanger sequenced strains: one as reference and the other to simulate an Illumina sample at 100x and 300x coverage.
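For context on the 100x and 300x figures: the number of read pairs to simulate follows from coverage = (pairs × 2 × read length) / genome length. The sketch below only does that arithmetic. The genome length (10 kb) and read length (150 bp) are placeholder assumptions, not values given in this thread, and `read_pairs_for_coverage` is a hypothetical helper rather than part of any simulator.

```python
def read_pairs_for_coverage(genome_length: int, coverage: int, read_length: int) -> int:
    """Return the number of paired-end read pairs needed to reach `coverage`.

    Each pair contributes 2 * read_length sequenced bases, so
    pairs = ceil(coverage * genome_length / (2 * read_length)).
    """
    if genome_length <= 0 or coverage <= 0 or read_length <= 0:
        raise ValueError("all arguments must be positive")
    total_bases = coverage * genome_length
    bases_per_pair = 2 * read_length
    # Integer ceiling division, avoids floating-point rounding.
    return -(-total_bases // bases_per_pair)


if __name__ == "__main__":
    GENOME_LENGTH = 10_000  # assumed genome size in bp (placeholder)
    READ_LENGTH = 150       # assumed Illumina read length in bp (placeholder)
    for depth in (100, 300):
        pairs = read_pairs_for_coverage(GENOME_LENGTH, depth, READ_LENGTH)
        print(f"{depth}x -> {pairs} read pairs")
```

The resulting pair count can then be passed to whichever read simulator is in use.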

As it's a single sample, I've used pandora map --vcf-refs filename --kg --loci-vcf -M -I --clean --genotype

Following the suggested workflow (make_prg=0.1.1/bioconda with max_nesting: 5 and min_match_length: 7; pandora=0.9.1/bioconda), I was surprised to see that the returned personalised reference/pandora.consensus.fq.gz diverges substantially from the Sanger ref. In 7 of the 13 prgs, the primary "allele" diverges from it by an edit distance of up to 2. 4 of the prgs even generate secondary "alleles" (e.g. prg_name.12) with much higher edit distances (15-54).
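For readers who want to reproduce this kind of comparison, here is a minimal sketch of an edit distance (Levenshtein) calculation in plain Python. It is not necessarily how the numbers above were obtained. It assumes the consensus sequence for a locus and the matching region of the Sanger reference have already been extracted as strings, and `edit_distance` is a hypothetical helper name.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: substitutions, insertions and deletions each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # make b the shorter sequence so each DP row stays small
    previous = list(range(len(b) + 1))  # distance from the empty prefix of a to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca from a
                current[j - 1] + 1,            # insert cb into a
                previous[j - 1] + (ca != cb),  # match (cost 0) or substitution (cost 1)
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    # Toy sequences, not real data: one substitution apart.
    print(edit_distance("ACGTACGT", "ACGAACGT"))
```

This runs in time proportional to the product of the two lengths, which is fine for individual viral loci. For longer sequences a dedicated alignment library would be a better fit.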

Is this to be expected? Or am I doing something wrong? If not, what can I do to improve the performance?

Best, Daniel

iqbal-lab commented 2 years ago

This is super interesting. Would it be acceptable to share the data so we could take a look at what is going on? We could do this via email if you'd prefer not to share it on GitHub.

dandaman commented 2 years ago

Thanks for your super-fast response! Of course, gladly - where can I find your email address?

leoisl commented 2 years ago

Hello @dandaman ,

could you send it to leandro [at] ebi [dot] ac [dot] uk? I will debug the execution with your data and try to understand it. If the data is too large to be sent by email, please tell me and I will provide you with a link to upload it.

Cheers

dandaman commented 2 years ago

Hi @leoisl , did you receive my email? Best, Daniel

leoisl commented 2 years ago

Hey @dandaman ,

yes, I did receive it and just replied! Sorry for the delay, I did not manage to access my mail during the day as I was focusing on finishing a PR!

Cheers

dandaman commented 2 years ago

Hi @leoisl, have you had a chance to look at the data I sent yet? Best, Daniel

leoisl commented 2 years ago

Hello @dandaman ,

Yes, but I was only able to take a quick look. I had to switch priorities to an urgent task last week, but will be able to take a detailed look by the end of next week. Sorry for the delay :(

Cheers

dandaman commented 2 years ago

Hi @leoisl ,

have you had time to look into this issue yet?

Best, Daniel

bricoletc commented 2 years ago

Hi @dandaman , can't comment on the pandora side, but was wondering if you were able to run gramtools and if so how close its personalised ref was to your Sanger ref?

dandaman commented 2 years ago

Hi @bricoletc,

yes, I've used gramtools as well in the same simulation experiment and it worked perfectly! I'd have to look up the details, but if I remember correctly it was 100% identical to the simulated reference :-)

So for the time being, with viruses I've continued with gramtools only. But I am eager to work with pandora too, as I'd like to apply this to eukaryotic genomes as well. I'm not sure gramtools would scale to that natively. Or do you have experience with that?

Best, Daniel

bricoletc commented 2 years ago

Good to hear! gramtools doesn't scale well to large eukaryotic genomes (e.g. human). I use it on P. falciparum, a small eukaryotic genome (23 Mbp), and that's fine (e.g. 1-5 hours runtime); however, I'm not sure how it would fare on more intermediate-size genomes, e.g. on the order of 100s of Mbp.

I also can't comment on size scaling for pandora, though I think it's mostly been tested on bacterial genomes on the order of 1-10 Mbp (@leoisl )

leoisl commented 2 years ago

Pandora has been extensively tested on E. coli, and some big plasmid databases, but IDK how much this second one would amount to...

Very sorry for my lack of updates @dandaman , the "quick" 2-week high-priority task I had to do became a 1-month task, and right after it I got allocated to another one that is still ongoing. I will take a look at this this week though; otherwise I would have to postpone it even further, and I don't think I should.

Cheers and thanks for the reminder

iqbal-lab-org / pandora

How to improve performance for Illumina-based analysis of viral pangenome #291