Open TheHarshShow opened 5 months ago
Hi Harsh, thanks for flagging the issue! I'll look into it. Could you also tell me which version of PanGraph was used to generate the graph, and the exact command? And could you, by any chance, reproduce the issue with a smaller dataset? This would greatly help in debugging. Cheers! Marco
Hi Marco,
Thanks for the quick response. We used version 0.7.3, and the command was:
pangraph build --circular --upper-case -a 200 -b 30 input.fa > output.json
We understand that this dataset is very big, and we can try looking for issues in other datasets. Before that, however, can you confirm whether we have identified the issue with the PanGraph correctly? It is possible that we aren't interpreting some part of the JSON file properly. Could you compute the length of the sequence NZ_AP019856.1 from the PanGraph JSON file that we provided and tell us whether you agree with our analysis? The length that we found from the PanGraph was 4800017 bases.
Thanks, Harsh
Hi Harsh, sorry for the delay. Today we're having a bit of trouble with the university cluster, and the graph is too big for me to open on my laptop.
I checked the full sequence reconstruction for all isolates in the graph. It looks like 64/1000 isolates have minor problems in their sequence. In particular, I agree that isolate NZ_AP019856.1 should be 4800098 bp long but is 4800017 bp in the graph.
It looks like in these cases there are a few tens of bp of mismatches. I'll be investigating this further, but it might take some time, since these inconsistencies seem to appear in complicated edge cases that only happen when graphs are big and complex enough. We're working on a more robust re-implementation of some of the core functions of pangraph that will hopefully remove all of these inconsistencies once and for all. I'll keep you posted.
In the meantime thanks again for your feedback!
Marco
Hi Marco,
Thanks for confirming this issue. We will check other datasets for mismatches and let you know if we find any, in case that helps debug the issue.
Thanks, Harsh
In case it is useful, we added the command line option --test for the build command. With this flag the program tests the consistency of the graph, verifying that the input genomes can be exactly reconstructed from the output graph, and fails if they cannot. If the build succeeds with this option, you can be sure that the graph is consistent. However, it does not output a graph if the consistency checks fail.
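For anyone who wants to run a similar check by hand, here is a minimal Python sketch of the same idea: compare each input record with the sequence rebuilt from the graph and list the records that differ. This is not how --test is implemented. The function name find_mismatches and both dictionaries are hypothetical, and obtaining the reconstructed sequences from the graph is assumed to happen elsewhere.

```python
def find_mismatches(original, reconstructed):
    """Return {name: (expected_len, observed_len)} for records that differ.

    Both arguments are hypothetical dicts mapping a record name to its
    sequence string. Case is ignored, since the build used --upper-case.
    """
    mismatches = {}
    for name, seq in original.items():
        rebuilt = reconstructed.get(name)
        if rebuilt is None:
            # Record is missing from the reconstruction entirely.
            mismatches[name] = (len(seq), None)
        elif rebuilt.upper() != seq.upper():
            mismatches[name] = (len(seq), len(rebuilt))
    return mismatches


# Toy data, not taken from the dataset in this thread.
original = {"a": "ACGT", "b": "ACGTT"}
reconstructed = {"a": "acgt", "b": "ACGT"}
print(find_mismatches(original, reconstructed))
```

Reporting both lengths makes it easy to see whether a record differs only in content or also in length, as in the case discussed in this thread.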
Thank you for all of the feedback!
Marco
Hi there,
We want to report an issue with a PanGraph that we generated on a dataset of 1000 E. coli sequences. We believe that 64 of these sequences are not represented correctly by the PanGraph. Since we think the sequence lengths are also wrong, we verified the issue manually by computing the length of one of the mismatching sequences. We did this by summing the lengths of the consensus sequences of the blocks on its path, adding the lengths of the insertions on the path, and subtracting the lengths of the deletions on the path.
We find that the sequence length of the sequence ‘NZ_AP019856.1’ is computed by the PanGraph to be 4800017 bases. However, its true length is 4800098 bases.
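The length calculation described above can be sketched in Python as follows. This is only an illustration under assumptions: the function name path_length and the data layout (one dict per block occurrence on the path, with lists of insertion and deletion lengths) are hypothetical and do not mirror the actual PanGraph JSON schema.

```python
def path_length(path, consensus_len):
    """Length of a sequence implied by its path through the graph.

    path: list of dicts, one per block occurrence on the path, each with
          "block" (block id), "insertions" and "deletions" (lists of
          lengths in bp). Hypothetical layout, not the PanGraph schema.
    consensus_len: dict mapping block id to consensus length in bp.
    """
    total = 0
    for node in path:
        total += consensus_len[node["block"]]  # consensus of the block
        total += sum(node["insertions"])       # bases added on this path
        total -= sum(node["deletions"])        # bases removed on this path
    return total


# Toy example: block A is visited twice with different indels.
consensus_len = {"A": 100, "B": 50}
path = [
    {"block": "A", "insertions": [3], "deletions": [10]},
    {"block": "B", "insertions": [], "deletions": []},
    {"block": "A", "insertions": [], "deletions": [1]},
]
print(path_length(path, consensus_len))  # 93 + 50 + 99 = 242
```

Keeping insertions and deletions per occurrence rather than per block matters because the same block can appear more than once on a path with different edits each time.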
We have uploaded the three relevant files to the following folder: https://drive.google.com/drive/folders/1JAliSaWokYX2i5KaUjQiOPnCdL_uyZqG?usp=sharing
We believe the mismatching sequences are:
NZ_AP019856.1
NZ_CP054407.1
NZ_CP010219.1
NZ_CP036202.1
NZ_CP014583.1
NZ_CP027587.1
NZ_CP027325.1
NZ_CP013029.1
NZ_CP027459.1
NZ_CP050865.1
NZ_CP050862.1
NZ_CP027534.1
NZ_CP014316.1
NZ_CP015085.1
NZ_CP018970.1
NZ_CP023826.1
NZ_CP032201.1
NZ_CP023844.1
NZ_CP015138.1
NZ_CP018983.1
NZ_CP018991.1
NZ_CP049077.2
NZ_CP010876.1
NZ_CP036245.1
NZ_CP049085.2
NZ_CP035476.1
NZ_CP035477.1
NZ_CP014522.1
NZ_CP014495.1
NZ_CP024720.1
NZ_CP024717.1
NZ_CP021207.1
NZ_CP019008.1
NZ_CP019020.1
NZ_CP035498.1
NZ_CP053245.1
NZ_CP037449.1
NZ_CP048304.1
NZ_CP048920.1
NZ_CP040456.1
NZ_CP024886.1
NZ_CP051700.1
NZ_CP030111.1
NZ_AP022650.1
NZ_CP053251.2
NZ_CP051688.1
NZ_CP033762.1
NZ_CP019273.1
NZ_AP017610.1
NZ_CP033850.1
NZ_CP019029.1
NZ_CP015834.1
NZ_CP009859.1
NZ_CP040919.1
NZ_CP023366.1
NZ_CP041300.1
NZ_CP033605.1
NZ_CP041452.1
NZ_CP041448.1
NZ_CP028166.1
NZ_AP021896.1
NZ_CP031833.1
Thanks, Harsh