neherlab / pangraph

A bioinformatic toolkit to align genome assemblies into pangenome graphs
https://neherlab.github.io/pangraph
MIT License

Incorrect HIV Sequences being represented by PanGraph #62

Closed TheHarshShow closed 10 months ago

TheHarshShow commented 10 months ago

Hi there,

Our lab has been working with PanGraphs for a while now and we've found them to be very useful. We believe that we've recently noticed some sequences being incorrectly represented by the PanGraph. The simplest dataset that we've found the issue on is a dataset of 2000 HIV sequences. We have attached a Google Drive link with our files. The HIV_2000.fa file stores the true sequences. The hiv_2000_pangraph.fa file consists of the sequences that we believe the PanGraph represents.

We've found seven sequences of the presumed PanGraph output to not match with the raw sequences. These are: B.RU.2004.04RU128005.AY682547, B.US.2000.14302_1.DQ853450, B.US.2000.14294_1.DQ853436, B.US.2000.14303_1.DQ853451, B.US.1998.15388_1.DQ853456, B.US.1998.15385_1.DQ853464 and B.US.1998.15386_1.DQ853460.

One thing to note is that most of these mismatches occur towards the ends of the sequences. The Google Drive link also contains the PanGraph that these sequences were derived from.
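For anyone who wants to reproduce this kind of check, here is a minimal sketch of comparing the true sequences against the reconstructed ones, record by record. It is not part of PanGraph: the helper names `read_fasta`, `first_mismatch` and `compare_fasta` are hypothetical, and the parser assumes plain FASTA where the record name is the first whitespace-delimited token of the header.

```python
def read_fasta(lines):
    """Parse an iterable of FASTA lines into a dict: record name -> sequence."""
    records, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            # Assumption: the name is the first token after '>'.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def first_mismatch(a, b):
    """Return the 0-based index of the first difference, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # No difference in the shared part: either equal or one is a prefix.
    return None if len(a) == len(b) else min(len(a), len(b))


def compare_fasta(truth, reconstructed):
    """Map each mismatching record name to its first differing position."""
    report = {}
    for name, seq in truth.items():
        other = reconstructed.get(name)
        if other is None:
            report[name] = "missing"
        elif seq != other:
            report[name] = first_mismatch(seq, other)
    return report
```

With the files from this issue, the call would look like `compare_fasta(read_fasta(open("HIV_2000.fa")), read_fasta(open("hiv_2000_pangraph.fa")))`. Comparing the reported position with the sequence length shows whether a mismatch sits near the end of a sequence.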

We will also soon upload the data for 20000 HIV sequences, in which 88 of the 20000 sequences don't match. You can use those for testing.

Drive

Thanks, Harsh Motwani, Turakhia Lab, UC San Diego

mmolari commented 10 months ago

Dear @TheHarshShow, very happy to hear that you're finding PanGraph useful! Thank you for the feedback, this is very helpful for us. And thank you for sharing the files. I'll look into this and let you know if I can reproduce and correct the issue. Take care! Marco

mmolari commented 10 months ago

Hi @TheHarshShow,

I am investigating the issue. In the meantime, I observed that it seems to be linked to the mix of uppercase and lowercase characters in your input sequences. If I run pangraph with standard parameters and the --test flag (to test automatically for correct sequence reconstruction), I can reproduce the error that you mentioned. When inspecting the merge at which the algorithm fails, I saw that the two merged graphs include both uppercase and lowercase characters. I tried re-running pangraph with the same parameters plus the --upper-case flag, which forces uppercase conversion of all input nucleotide characters, and in this case I do not detect the error. This holds for the 2'000 sequences dataset; I haven't tested the 20'000 sequences case yet. I will investigate further, but in the meantime, if this issue is blocking your work, you could try adding the --upper-case flag and see if this solves it on your side.
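A quick way to check whether an input is affected by mixed case, sketched under the assumption that sequences are already loaded as Python strings. The helper names `lowercase_positions` and `differs_only_by_case` are hypothetical and unrelated to PanGraph's own code.

```python
def lowercase_positions(seq):
    """Return the 0-based positions of lowercase characters in a sequence."""
    return [i for i, c in enumerate(seq) if c.islower()]


def differs_only_by_case(a, b):
    """True when two sequences differ, but only in upper/lower case."""
    return a != b and a.upper() == b.upper()
```

If a raw sequence and its reconstruction differ only by case, the mismatch disappears after uppercasing both. Any mismatch that survives uppercasing points to a different cause.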

Cheers! Marco

TheHarshShow commented 10 months ago

Hi Marco,

Thanks a lot for looking into this issue, and for letting us know about the --upper-case and --test flags. We also found one sequence mismatch in an E. coli dataset of 100 sequences. Since that dataset might be hard to work with, I initially provided only the HIV sequences. However, if the problem here relates to lowercase characters, I believe that the E. coli dataset has a different problem, since it doesn't contain any lowercase characters.

I am also adding the E. coli dataset. We believe the sequence NZ_CP006834.2 isn't represented correctly. In fact, we have pinpointed that the sequence is missing an insertion of two nucleotides at position 873,274.
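A small discrepancy like this can be localized by trimming the shared prefix and suffix of the original and reconstructed sequences. The sketch below is one way to do that; the function name `isolate_difference` is hypothetical, and it assumes the two sequences differ in a single contiguous region.

```python
def isolate_difference(original, reconstructed):
    """Trim the shared prefix and suffix and return what is left.

    Returns (start, original_part, reconstructed_part), where start is the
    0-based offset at which the two strings first differ.
    """
    limit = min(len(original), len(reconstructed))
    # Length of the common prefix.
    start = 0
    while start < limit and original[start] == reconstructed[start]:
        start += 1
    # Length of the common suffix, not allowed to overlap the prefix.
    end = 0
    while end < limit - start and original[-1 - end] == reconstructed[-1 - end]:
        end += 1
    return (
        start,
        original[start:len(original) - end],
        reconstructed[start:len(reconstructed) - end],
    )
```

For a missing insertion, the reconstructed part comes back empty and the original part holds the missing bases. Inside a repeat the exact offset is ambiguous, so the reported start may be shifted by a few bases relative to where an aligner would place the insertion.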

Thanks, Harsh

mmolari commented 10 months ago

Hi @TheHarshShow,

After your last comment (thanks for that!) I started looking deeper into what was causing the issue in the small virus dataset, assuming that the lowercase nucleotides were not the problem. I found that during block merging there was a particular edge case, an insertion adjacent to a deletion, that would cause small inconsistencies in the alignment.

I created a branch with a fix for that problem: [#63]. If possible could you test this version of pangraph on your datasets with the --test flag to see if this solves those issues as well? If so I will merge the PR and release a new version.

Thanks again!

Marco

mmolari commented 10 months ago

Hi @TheHarshShow,

Another small update: I tested it on the 100 E. coli sequences. I could reproduce the error with the original version of pangraph, and the error was gone with the bug fix. I will merge the PR and consider this issue closed, but feel free to re-open it if you encounter the error again.

Thanks again for all the feedback! Marco

TheHarshShow commented 10 months ago

Hi Marco,

Thanks a lot for looking into and fixing the issue! Since it's working for you, I agree that this issue can be closed. Our lab will use the latest version of PanGraph, and if something doesn't work, we'll let you know.

Thanks, Harsh