neherlab / pangraph

A bioinformatic toolkit to align genome assemblies into pangenome graphs
https://neherlab.github.io/pangraph
MIT License

Incorrect E. coli sequences being represented by PanGraph (large dataset) #68

Open TheHarshShow opened 5 months ago

TheHarshShow commented 5 months ago

Hi there,

We want to report an issue with a PanGraph that we generated from a dataset of 1000 E. coli sequences. We believe that 64 of these sequences are not represented correctly by the PanGraph.
Since the sequence lengths also appear to be wrong, we verified the issue manually by computing the length of one of the mismatching sequences. We summed the lengths of the consensus sequences of the blocks on its path, added the lengths of the insertions in the sequence, and subtracted the lengths of the deletions on the path.
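
The length check described here can be sketched in Python. This is only an illustration under stated assumptions: the field names (`paths`, `blocks`, `sequence`, `insert`, `delete`), the way edits are keyed by a `name`/`number`/`strand` occurrence, and the toy graph are all assumed for the example and may not match the actual JSON file exactly.

```python
def occurrence_key(occ):
    """Identify one occurrence of a block on a path (assumed keying scheme)."""
    return (occ["name"], occ["number"], occ["strand"])


def path_length(graph, path_name):
    """Sum consensus lengths plus insertions minus deletions along one path."""
    blocks = {block["id"]: block for block in graph["blocks"]}
    path = next(p for p in graph["paths"] if p["name"] == path_name)
    total = 0
    for node in path["blocks"]:
        block = blocks[node["id"]]
        key = occurrence_key(node)
        # Every occurrence starts from the block's consensus sequence.
        total += len(block["sequence"])
        # Insertions are assumed to be stored as (position, inserted sequence).
        for occ, edits in block["insert"]:
            if occurrence_key(occ) == key:
                total += sum(len(seq) for _pos, seq in edits)
        # Deletions are assumed to be stored as (position, deleted length).
        for occ, edits in block["delete"]:
            if occurrence_key(occ) == key:
                total -= sum(length for _pos, length in edits)
    return total


# Hypothetical toy graph: two blocks, one insertion of 2 bp, one deletion of 2 bp.
toy = {
    "paths": [{"name": "iso1", "blocks": [
        {"id": "A", "name": "iso1", "number": 1, "strand": True},
        {"id": "B", "name": "iso1", "number": 1, "strand": False},
    ]}],
    "blocks": [
        {"id": "A", "sequence": "ACGTACGTAC",
         "insert": [[{"name": "iso1", "number": 1, "strand": True}, [[[3, 0], "GG"]]]],
         "delete": []},
        {"id": "B", "sequence": "TTTTT",
         "insert": [],
         "delete": [[{"name": "iso1", "number": 1, "strand": False}, [[1, 2]]]]},
    ],
}

print(path_length(toy, "iso1"))  # 10 + 2 + 5 - 2 = 15
```

For a real graph, the same function would be called on the dictionary returned by `json.load` and the result compared with the length of the corresponding record in the input FASTA file.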

We find that the sequence length of the sequence ‘NZ_AP019856.1’ is computed by the PanGraph to be 4800017 bases. However, its true length is 4800098 bases.

We have uploaded the three relevant files to the following folder: https://drive.google.com/drive/folders/1JAliSaWokYX2i5KaUjQiOPnCdL_uyZqG?usp=sharing

We believe the mismatching sequences are: NZ_AP019856.1, NZ_CP054407.1, NZ_CP010219.1, NZ_CP036202.1, NZ_CP014583.1, NZ_CP027587.1, NZ_CP027325.1, NZ_CP013029.1, NZ_CP027459.1, NZ_CP050865.1, NZ_CP050862.1, NZ_CP027534.1, NZ_CP014316.1, NZ_CP015085.1, NZ_CP018970.1, NZ_CP023826.1, NZ_CP032201.1, NZ_CP023844.1, NZ_CP015138.1, NZ_CP018983.1, NZ_CP018991.1, NZ_CP049077.2, NZ_CP010876.1, NZ_CP036245.1, NZ_CP049085.2, NZ_CP035476.1, NZ_CP035477.1, NZ_CP014522.1, NZ_CP014495.1, NZ_CP024720.1, NZ_CP024717.1, NZ_CP021207.1, NZ_CP019008.1, NZ_CP019020.1, NZ_CP035498.1, NZ_CP053245.1, NZ_CP037449.1, NZ_CP048304.1, NZ_CP048920.1, NZ_CP040456.1, NZ_CP024886.1, NZ_CP051700.1, NZ_CP030111.1, NZ_AP022650.1, NZ_CP053251.2, NZ_CP051688.1, NZ_CP033762.1, NZ_CP019273.1, NZ_AP017610.1, NZ_CP033850.1, NZ_CP019029.1, NZ_CP015834.1, NZ_CP009859.1, NZ_CP040919.1, NZ_CP023366.1, NZ_CP041300.1, NZ_CP033605.1, NZ_CP041452.1, NZ_CP041448.1, NZ_CP028166.1, NZ_AP021896.1, NZ_CP031833.1

Thanks, Harsh

mmolari commented 5 months ago

Hi Harsh, thanks for flagging the issue! I'll look into it. Could you also write which version of PanGraph was used to generate the graph and the exact command? By any chance could you also reproduce the issue with a smaller dataset? This would greatly help in debugging. Cheers! Marco

TheHarshShow commented 5 months ago

Hi Marco,

Thanks for the quick response. We used version 0.7.3, and the command was `pangraph build --circular --upper-case -a 200 -b 30 input.fa > output.json`.

We understand that this dataset is very big, and we can try looking for issues in other datasets. Before that, however, could you confirm that we have identified the issue correctly? It is possible that we are misinterpreting some part of the JSON file. Could you compute the length of the sequence NZ_AP019856.1 from the PanGraph JSON file that we provided and tell us whether you agree with our analysis? The length that we found from the PanGraph was 4800017 bases.

Thanks, Harsh

mmolari commented 5 months ago

Hi Harsh, sorry for the delay. Today we're having a bit of trouble with the university cluster, and the graph is too big for me to open on my laptop.

I checked the full sequence reconstruction for all isolates in the graph. It looks like 64/1000 isolates have minor problems in their sequence. In particular I agree that isolate NZ_AP019856.1 should be 4800098 bp long but it is 4800017 bp in the graph.

Here is a full list of the sequences containing small inconsistencies:

```
--> isolate 'NZ_CP054407.1' incorrectly reconstructed length of graph seq: 4954286 length of ref: 4954362
--> isolate 'NZ_CP027534.1' incorrectly reconstructed length of graph seq: 5022404 length of ref: 5022408
--> isolate 'NZ_CP014316.1' incorrectly reconstructed length of graph seq: 5081057 length of ref: 5081061
--> isolate 'NZ_CP014522.1' incorrectly reconstructed length of graph seq: 5033278 length of ref: 5033359
--> isolate 'NZ_CP051688.1' incorrectly reconstructed length of graph seq: 5328979 length of ref: 5329017
--> isolate 'NZ_CP019008.1' incorrectly reconstructed length of graph seq: 4926068 length of ref: 4926149
--> isolate 'NZ_CP027325.1' incorrectly reconstructed length of graph seq: 5135671 length of ref: 5135675
--> isolate 'NZ_CP015085.1' incorrectly reconstructed length of graph seq: 5289894 length of ref: 5289898
--> isolate 'NZ_CP053245.1' incorrectly reconstructed length of graph seq: 4675458 length of ref: 4675501
--> isolate 'NZ_CP024886.1' incorrectly reconstructed length of graph seq: 5036886 length of ref: 5036925
--> isolate 'NZ_CP019020.1' incorrectly reconstructed length of graph seq: 4913178 length of ref: 4913259
--> isolate 'NZ_AP022650.1' incorrectly reconstructed length of graph seq: 5075871 length of ref: 5075911
--> isolate 'NZ_CP033850.1' incorrectly reconstructed length of graph seq: 5231412 length of ref: 5231450
--> isolate 'NZ_CP018970.1' incorrectly reconstructed length of graph seq: 5259383 length of ref: 5259387
--> isolate 'NZ_CP051700.1' incorrectly reconstructed length of graph seq: 5053498 length of ref: 5053537
--> isolate 'NZ_CP014495.1' incorrectly reconstructed length of graph seq: 5061740 length of ref: 5061821
--> isolate 'NZ_CP040919.1' incorrectly reconstructed length of graph seq: 5209400 length of ref: 5209476
--> isolate 'NZ_CP035476.1' incorrectly reconstructed length of graph seq: 5018238 length of ref: 5018242
--> isolate 'NZ_CP032201.1' incorrectly reconstructed length of graph seq: 5107211 length of ref: 5107215
--> isolate 'NZ_AP021896.1' incorrectly reconstructed length of graph seq: 4574662 length of ref: 4574715
--> isolate 'NZ_CP015138.1' incorrectly reconstructed length of graph seq: 5009896 length of ref: 5009900
--> isolate 'NZ_CP018983.1' incorrectly reconstructed length of graph seq: 4947528 length of ref: 4947532
--> isolate 'NZ_AP019856.1' incorrectly reconstructed length of graph seq: 4800017 length of ref: 4800098
--> isolate 'NZ_CP023826.1' incorrectly reconstructed length of graph seq: 5129464 length of ref: 5129468
--> isolate 'NZ_CP040456.1' incorrectly reconstructed length of graph seq: 5234111 length of ref: 5234468
--> isolate 'NZ_CP015834.1' incorrectly reconstructed length of graph seq: 5176712 length of ref: 5176750
--> isolate 'NZ_CP010219.1' incorrectly reconstructed length of graph seq: 5102478 length of ref: 5102554
--> isolate 'NZ_CP050865.1' incorrectly reconstructed length of graph seq: 4899865 length of ref: 4899869
--> isolate 'NZ_CP018991.1' incorrectly reconstructed length of graph seq: 5434741 length of ref: 5434745
--> isolate 'NZ_CP023366.1' incorrectly reconstructed length of graph seq: 4986674 length of ref: 4986712
--> isolate 'NZ_CP027445.1' incorrectly reconstructed length of graph seq: 5196101 length of ref: 5196105
--> isolate 'NZ_CP019273.1' incorrectly reconstructed length of graph seq: 5050946 length of ref: 5050984
--> isolate 'NZ_CP027587.1' incorrectly reconstructed length of graph seq: 5235556 length of ref: 5235560
--> isolate 'NZ_CP033605.1' incorrectly reconstructed length of graph seq: 5569767 length of ref: 5569804
--> isolate 'NZ_CP036202.1' incorrectly reconstructed length of graph seq: 4834237 length of ref: 4834354
--> isolate 'NZ_CP041300.1' incorrectly reconstructed length of graph seq: 5083034 length of ref: 5083072
--> isolate 'NZ_CP010876.1' incorrectly reconstructed length of graph seq: 5010880 length of ref: 5010884
--> isolate 'NZ_CP031833.1' incorrectly reconstructed length of graph seq: 4854454 length of ref: 4854459
--> isolate 'NZ_CP023844.1' incorrectly reconstructed length of graph seq: 5144476 length of ref: 5144480
--> isolate 'NZ_CP019029.1' incorrectly reconstructed length of graph seq: 5262936 length of ref: 5262974
--> isolate 'NZ_CP048304.1' incorrectly reconstructed length of graph seq: 4959702 length of ref: 4959978
--> isolate 'NZ_CP049077.2' incorrectly reconstructed length of graph seq: 5295147 length of ref: 5295151
--> isolate 'NZ_CP036245.1' incorrectly reconstructed length of graph seq: 5187765 length of ref: 5187769
--> isolate 'NZ_AP017610.1' incorrectly reconstructed length of graph seq: 4920790 length of ref: 4920828
--> isolate 'NZ_CP035477.1' incorrectly reconstructed length of graph seq: 5061269 length of ref: 5061273
--> isolate 'NZ_CP013029.1' incorrectly reconstructed length of graph seq: 5202846 length of ref: 5202850
--> isolate 'NZ_CP027459.1' incorrectly reconstructed length of graph seq: 5253708 length of ref: 5253712
--> isolate 'NZ_CP014583.1' incorrectly reconstructed length of graph seq: 5193732 length of ref: 5193734
--> isolate 'NZ_CP048920.1' incorrectly reconstructed length of graph seq: 5356778 length of ref: 5357129
--> isolate 'NZ_CP010183.1' incorrectly reconstructed length of graph seq: 4940358 length of ref: 4940434
--> isolate 'NZ_CP009859.1' incorrectly reconstructed length of graph seq: 5310473 length of ref: 5310511
--> isolate 'NZ_CP030111.1' incorrectly reconstructed length of graph seq: 4939419 length of ref: 4939457
--> isolate 'NZ_CP049085.2' incorrectly reconstructed length of graph seq: 5255745 length of ref: 5255749
--> isolate 'NZ_CP035498.1' incorrectly reconstructed length of graph seq: 5349746 length of ref: 5349824
--> isolate 'NZ_CP033762.1' incorrectly reconstructed length of graph seq: 4977685 length of ref: 4977723
--> isolate 'NZ_CP021207.1' incorrectly reconstructed length of graph seq: 5013732 length of ref: 5013813
--> isolate 'NZ_CP050862.1' incorrectly reconstructed length of graph seq: 4901394 length of ref: 4901398
--> isolate 'NZ_CP041452.1' incorrectly reconstructed length of graph seq: 4662326 length of ref: 4662393
--> isolate 'NZ_CP041448.1' incorrectly reconstructed length of graph seq: 4860466 length of ref: 4860533
--> isolate 'NZ_CP024720.1' incorrectly reconstructed length of graph seq: 5196876 length of ref: 5196957
--> isolate 'NZ_CP028166.1' incorrectly reconstructed length of graph seq: 4815061 length of ref: 4815114
--> isolate 'NZ_CP024717.1' incorrectly reconstructed length of graph seq: 5196875 length of ref: 5196956
--> isolate 'NZ_CP037449.1' incorrectly reconstructed length of graph seq: 5080445 length of ref: 5080721
--> isolate 'NZ_CP053251.2' incorrectly reconstructed length of graph seq: 5121194 length of ref: 5121232
```

It looks like in these cases there are a few tens of bp of mismatches. I'll investigate this further, but it might take some time, since these inconsistencies seem to appear in complicated edge cases that only arise when graphs are big and complex enough. We're working on a more robust re-implementation of some of the core functions of pangraph that will hopefully remove all of these inconsistencies once and for all. I'll keep you posted.
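
To see how many bases each isolate is missing, the reported lengths can be subtracted per entry. The sketch below is one possible way to do that in Python. The regular expression and the function name `length_deficits` are made up for this example, and the excerpt contains only three entries copied from the list above.

```python
import re

# Matches one log entry; \s+ also tolerates line breaks between the fields.
ENTRY = re.compile(
    r"isolate '([^']+)' incorrectly reconstructed\s+"
    r"length of graph seq:\s*(\d+)\s+"
    r"length of ref:\s*(\d+)"
)


def length_deficits(log_text):
    """Map each isolate name to (reference length - graph length) in bp."""
    return {
        name: int(ref_len) - int(graph_len)
        for name, graph_len, ref_len in ENTRY.findall(log_text)
    }


excerpt = """
--> isolate 'NZ_CP054407.1' incorrectly reconstructed length of graph seq: 4954286 length of ref: 4954362
--> isolate 'NZ_AP019856.1' incorrectly reconstructed length of graph seq: 4800017 length of ref: 4800098
--> isolate 'NZ_CP014583.1' incorrectly reconstructed length of graph seq: 5193732 length of ref: 5193734
"""

for name, missing in length_deficits(excerpt).items():
    print(f"{name}: {missing} bp missing from the graph sequence")
```

Sorting or grouping the resulting dictionary by value could help spot whether several isolates share the same deficit.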

In the meantime thanks again for your feedback!

Marco

TheHarshShow commented 5 months ago

Hi Marco,

Thanks for confirming this issue. We will check other datasets for mismatches and let you know if we find any, in case that helps with debugging.

Thanks, Harsh

mmolari commented 5 months ago

In case it is useful, we added the command-line option --test to the build command. With this flag, the program checks the consistency of the graph by verifying that the input genomes can be exactly reconstructed from the output graph, and fails if they cannot. If the build succeeds with this option, you can be sure that the graph is consistent. However, no graph is output if the consistency checks fail.
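
As a rough sketch, the build command quoted earlier in this thread could be rerun with `--test` from a small Python wrapper. The helper names `build_command` and `build_with_test` are hypothetical. The sketch also assumes that `pangraph` is on the `PATH`, that `--test` can be placed alongside the other build options, and that a failed consistency check shows up as a non-zero exit code.

```python
import subprocess


def build_command(fasta_path):
    """Assemble the build command from this thread with --test added."""
    return [
        "pangraph", "build", "--test",
        "--circular", "--upper-case",
        "-a", "200", "-b", "30",
        fasta_path,
    ]


def build_with_test(fasta_path, output_path):
    """Run the build and report whether it exited successfully.

    Assumption: a failed consistency check yields a non-zero exit code.
    """
    with open(output_path, "w") as out:
        result = subprocess.run(build_command(fasta_path), stdout=out)
    return result.returncode == 0
```

Calling `build_with_test("input.fa", "output.json")` would then return `False` when the build fails, in which case `output.json` should not be treated as a valid graph.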

Thank you for all of the feedback!

Marco