Open dandaman opened 2 years ago
This is super interesting. Would it be acceptable to share the data so we could take a look at what is going on? Could do via email if you prefer not on github.
Thanks for your super-fast response! Of course, gladly - where can I find your email address?
Hello @dandaman ,
could you send it to leandro [at] ebi [dot] ac [dot] uk? I will debug the execution with your data and try to understand it. If the data is too large to be sent by email, please tell me and I will provide you with a link to upload it.
Cheers
Hi @leoisl , did you receive my email? Best, Daniel
Hey @dandaman ,
yes, I did receive it and just replied! Sorry for the delay, I did not manage to access my mail during the day as I was focusing on finishing a PR!
Cheers
Hi @leoisl, did you have a chance to look at the data I sent yet? Best, Daniel
Hello @dandaman ,
Yes, but I was only able to take a quick look. I had to switch priorities to an urgent task last week, but will be able to take a detailed look by the end of next week. Sorry for the delay :(
Cheers
Hi @leoisl ,
did you have time yet to look into this issue?
Best, Daniel
Hi @dandaman , can't comment on the pandora side, but was wondering if you were able to run gramtools and if so how close its personalised ref was to your Sanger ref?
Hi @bricoletc,
yes I've used gramtools as well in the same simulation experiment and it worked perfectly! I'd have to look up the details, but if I remember correctly it was 100% id to the simulated reference :-)
So for the time being with viruses I've continued with gramtools only. But I am eager to work with pandora too, as I'd like to apply this to eukaryotic genomes as well. I'm not sure gramtools would scale to that natively. Or do you have experience with that?
Best, Daniel
Good to hear! gramtools doesn't scale well to large eukaryotic genomes (e.g. human). I use it on P. falciparum, a small eukaryotic genome of 23 Mbp, and that's fine (e.g. 1-5 hours runtime); however, I'm not sure how it would fare on more intermediate-size genomes, e.g. on the order of 100s of Mbp.
Also can't comment on size scaling for pandora, though I think it's mostly been tested on bacterial genomes on the order of 1-10 Mbp (@leoisl )
Pandora has been extensively tested on E. coli, and some big plasmid databases, but IDK how much this second one would amount to...
Very sorry for my lack of updates @dandaman , the "quick" 2-week high-priority task I had to do became a 1-month task, and right after it I got allocated to another one that is still ongoing. Will take a look at this this week though, because otherwise I will have to postpone it even further, but I don't think I should postpone more.
Cheers and thanks for the reminder
Hi,
I'd like to use pandora to study a viral pangenome on the basis of Illumina data. Altogether I'm looking at 13 input MSAs. Each comprises ~300 high-quality reference sequences. I'd like to study new samples using a pangenome graph.
As I am also looking at gramtools in parallel and wanted to assess the performance of both, I used two Sanger-sequenced strains: one as reference and the other to simulate an Illumina sample at 100x and 300x coverage. As it's a single sample, I've used
pandora map --vcf-refs filename --kg --loci-vcf -M -I --clean --genotype
Following the suggested workflow (make_prg=0.1.1/bioconda with max_nesting: 5 and min_match_length: 7; pandora=0.9.1/bioconda) I was surprised to see that the returned personalised reference/pandora.consensus.fq.gz diverges substantially from the Sanger ref. The primary "alleles" diverge by up to 2 in edit distance in 7 of the 13 prgs. 4 of the prgs even generate secondary "alleles" (e.g. prg_name.12) with much higher edit distances (15-54).
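For anyone wanting to reproduce this kind of comparison, here is a minimal sketch of how per-locus edit distances could be computed. It is not the exact script used for the numbers above. It assumes the consensus file is a plain four-line-per-record FASTQ and that each record name can be matched to a truth sequence; the helper names (read_fastq_sequences, edit_distance) and the example sequences are made up for illustration.

```python
import gzip
import io


def read_fastq_sequences(handle):
    """Yield (name, sequence) pairs from a four-line-per-record FASTQ text stream."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # skip stray blank lines
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        handle.readline()  # quality line
        # Drop the leading '@' and anything after the first whitespace.
        yield header.strip()[1:].split()[0], seq


def edit_distance(a, b):
    """Levenshtein distance, computed row by row to keep memory linear in len(b)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Hypothetical in-memory stand-in for a consensus FASTQ; for a real file,
    # open it with gzip.open(path, "rt") instead (path is up to you).
    consensus = io.StringIO("@locus1\nACGTACGT\n+\nIIIIIIII\n")
    truth = {"locus1": "ACGTACGA"}  # hypothetical truth sequence for the same locus
    for name, seq in read_fastq_sequences(consensus):
        print(name, edit_distance(seq, truth[name]))
```

The pure-Python distance is quadratic in sequence length, so this is only practical for short loci; for longer sequences a dedicated alignment tool would be the better choice.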
Is this to be expected? Or am I doing something wrong? If not what can I do to improve the performance?
Best, Daniel