marbl / ModDotPlot

Mismatch sequence names in output #22

Closed dirkjanvw closed 3 months ago

dirkjanvw commented 3 months ago

Hi! Thanks for the interesting and fast tool!

I was testing it on the GCF_000001735.4 A. thaliana assembly and I noticed a mismatch between the sequence names in the output files.

This is the FASTA index (.fai) of the input genome after extracting only the five chromosomes:

NC_003070.9 30427671    56  60  61
NC_003071.7 19698289    30934920    60  61
NC_003074.8 23459830    50961579    60  61
NC_003075.7 18585056    74812472    60  61
NC_003076.8 26975502    93707344    60  61
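
For readers less familiar with the format: in the samtools faidx convention, the five columns of a `.fai` line are sequence name, length, byte offset, bases per line, and bytes per line. Below is a minimal sketch that parses the index above and compares file order with length order. The helper name `parse_fai` is made up for illustration, and the index text is pasted inline rather than read from disk.

```python
# Index lines copied from above; whitespace between columns is normalised.
FAI_TEXT = """\
NC_003070.9 30427671 56 60 61
NC_003071.7 19698289 30934920 60 61
NC_003074.8 23459830 50961579 60 61
NC_003075.7 18585056 74812472 60 61
NC_003076.8 26975502 93707344 60 61
"""


def parse_fai(text):
    """Return (name, length) pairs in file order; split() accepts tabs or spaces."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        records.append((fields[0], int(fields[1])))
    return records


records = parse_fai(FAI_TEXT)

# Order in which the sequences appear in the FASTA file.
entry_order = [name for name, _ in records]

# Order of the same sequences from longest to shortest.
size_order = [name for name, _ in sorted(records, key=lambda r: r[1], reverse=True)]

print("entry order:", entry_order)
print("size order: ", size_order)
```

For this assembly the two orders differ, because NC_003076.8 is the second-longest chromosome but the last entry in the file.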

With ModDotPlot (at commit c1388eba02c4d8f4d34b99b5d8f6c250a455bd28), I ran the following command:

moddotplot static -f GCF_000001735.4_TAIR10.1_genomic.chrs.fna

Unexpectedly, this was the output. Notice that in some files the sequence names in the content do not match the filename. Which one is correct?

head NC_*bed
==> NC_003070.9.bed <==
#query_name query_start query_end   reference_name  reference_start reference_end   perID_by_events
NC_003070.9 1   30428   NC_003070.9 1   30428   100.0
NC_003070.9 1   30428   NC_003070.9 30429   60856   96.89179977979816
NC_003070.9 30429   60856   NC_003070.9 30429   60856   100.0
NC_003070.9 30429   60856   NC_003070.9 60857   91284   96.86273970525768
NC_003070.9 60857   91284   NC_003070.9 60857   91284   100.0
NC_003070.9 60857   91284   NC_003070.9 91285   121712  96.8066448041969
NC_003070.9 91285   121712  NC_003070.9 91285   121712  100.0
NC_003070.9 91285   121712  NC_003070.9 121713  152140  96.88899802419402
NC_003070.9 121713  152140  NC_003070.9 121713  152140  100.0

==> NC_003071.7.bed <==
#query_name query_start query_end   reference_name  reference_start reference_end   perID_by_events
NC_003075.7 1   19699   NC_003075.7 1   19699   100.0
NC_003075.7 1   19699   NC_003075.7 19700   39398   96.34241311605368
NC_003075.7 1   19699   NC_003075.7 39399   59097   95.02161679397038
NC_003075.7 1   19699   NC_003075.7 59098   78796   92.86765095697956
NC_003075.7 1   19699   NC_003075.7 1615319 1635017 88.19960026756749
NC_003075.7 1   19699   NC_003075.7 1635018 1654716 88.19960026756749
NC_003075.7 1   19699   NC_003075.7 2541172 2560870 91.12687268394012
NC_003075.7 1   19699   NC_003075.7 2560871 2580569 93.41097733509187
NC_003075.7 1   19699   NC_003075.7 3210938 3230636 92.00089775836454

==> NC_003074.8.bed <==
#query_name query_start query_end   reference_name  reference_start reference_end   perID_by_events
NC_003074.8 1   23460   NC_003074.8 1   23460   100.0
NC_003074.8 1   23460   NC_003074.8 23461   46920   96.83259585797633
NC_003074.8 23461   46920   NC_003074.8 23461   46920   100.0
NC_003074.8 23461   46920   NC_003074.8 46921   70380   96.68315315308905
NC_003074.8 46921   70380   NC_003074.8 46921   70380   100.0
NC_003074.8 46921   70380   NC_003074.8 70381   93840   96.97521648973213
NC_003074.8 70381   93840   NC_003074.8 70381   93840   100.0
NC_003074.8 70381   93840   NC_003074.8 93841   117300  96.70998860321123
NC_003074.8 93841   117300  NC_003074.8 93841   117300  100.0

==> NC_003075.7.bed <==
#query_name query_start query_end   reference_name  reference_start reference_end   perID_by_events
NC_003076.8 1   18586   NC_003076.8 1   18586   100.0
NC_003076.8 1   18586   NC_003076.8 18587   37172   96.95608358007378
NC_003076.8 18587   37172   NC_003076.8 18587   37172   100.0
NC_003076.8 18587   37172   NC_003076.8 37173   55758   96.87871811713131
NC_003076.8 37173   55758   NC_003076.8 37173   55758   100.0
NC_003076.8 37173   55758   NC_003076.8 55759   74344   96.87827825075271
NC_003076.8 55759   74344   NC_003076.8 55759   74344   100.0
NC_003076.8 55759   74344   NC_003076.8 74345   92930   96.91004434964874
NC_003076.8 74345   92930   NC_003076.8 74345   92930   100.0

==> NC_003076.8.bed <==
#query_name query_start query_end   reference_name  reference_start reference_end   perID_by_events
NC_003071.7 1   26976   NC_003071.7 1   26976   100.0
NC_003071.7 1   26976   NC_003071.7 26977   53952   96.89687811961811
NC_003071.7 26977   53952   NC_003071.7 26977   53952   100.0
NC_003071.7 26977   53952   NC_003071.7 53953   80928   96.7788487007729
NC_003071.7 53953   80928   NC_003071.7 53953   80928   100.0
NC_003071.7 53953   80928   NC_003071.7 80929   107904  96.86344796571336
NC_003071.7 53953   80928   NC_003071.7 8902081 8929056 90.20730812484314
NC_003071.7 53953   80928   NC_003071.7 8929057 8956032 90.20730812484314
NC_003071.7 53953   80928   NC_003071.7 12085249    12112224    91.56204318214218
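
One way to approach the "which one is correct?" question is to compare the window size in each file with the sequence lengths from the `.fai`. The sketch below rests on an assumption read off the output above, not on documented ModDotPlot behaviour: that the end coordinate of the first window equals `ceil(length / 1000)`. The helper `sequence_for_window` and the hard-coded first rows are illustrative only.

```python
import math

# Sequence lengths taken from the .fai listing in this issue.
LENGTHS = {
    "NC_003070.9": 30427671,
    "NC_003071.7": 19698289,
    "NC_003074.8": 23459830,
    "NC_003075.7": 18585056,
    "NC_003076.8": 26975502,
}

# First data row of each BED file above: (query_name, query_end).
FIRST_ROWS = {
    "NC_003070.9.bed": ("NC_003070.9", 30428),
    "NC_003071.7.bed": ("NC_003075.7", 19699),
    "NC_003074.8.bed": ("NC_003074.8", 23460),
    "NC_003075.7.bed": ("NC_003076.8", 18586),
    "NC_003076.8.bed": ("NC_003071.7", 26976),
}


def sequence_for_window(window, lengths, bins=1000):
    """Guess the sequence for a window size, ASSUMING window == ceil(length / bins)."""
    matches = [n for n, length in lengths.items() if math.ceil(length / bins) == window]
    return matches[0] if len(matches) == 1 else None


report = {}
for filename, (name_in_file, window) in FIRST_ROWS.items():
    stem = filename[: -len(".bed")]
    inferred = sequence_for_window(window, LENGTHS)
    # Store whether the names agree, and which sequence the window size fits.
    report[filename] = (stem == name_in_file, inferred)
    print(f"{filename}: names match={stem == name_in_file}, window fits {inferred}")
```

Under that assumption, the window size in every file fits the sequence named in the filename, which suggests the filenames are right and the names written inside three of the files are wrong.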

alexsweeten commented 3 months ago

Thanks @dirkjanvw for catching this! This is a bug where the sequences of a multi-FASTA file were sorted by sequence length, while their names stayed in order of entry.
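
The kind of bug described above can be illustrated with a short sketch. This is not ModDotPlot's actual code. It only shows the general pattern of sorting one list while a parallel list of names keeps its original order, using the lengths from the `.fai` in this issue. Sorting from longest to shortest is an assumption.

```python
# Names and lengths in order of entry in the FASTA file (from the .fai above).
names = ["NC_003070.9", "NC_003071.7", "NC_003074.8", "NC_003075.7", "NC_003076.8"]
lengths = [30427671, 19698289, 23459830, 18585056, 26975502]

# Buggy pattern: the lengths are sorted on their own, the names are not.
sorted_lengths = sorted(lengths, reverse=True)
buggy = list(zip(names, sorted_lengths))

# Safer pattern: sort (name, length) pairs together so each name
# travels with its own sequence.
fixed = sorted(zip(names, lengths), key=lambda pair: pair[1], reverse=True)

# Map each length back to its true name to see which labels went wrong.
true_name = {length: name for name, length in zip(names, lengths)}
mislabeled = {
    true_name[length]: label
    for label, length in buggy
    if true_name[length] != label
}

for actual, label in mislabeled.items():
    print(f"{actual} is labelled as {label}")
```

With these lengths, the sketch mislabels NC_003071.7 as NC_003075.7, NC_003075.7 as NC_003076.8, and NC_003076.8 as NC_003071.7, the same three swaps visible in the BED files reported in this issue.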

This should be patched in the latest commit, c1d945a.

dirkjanvw commented 3 months ago

Thank you! I reran it with v0.8.2 and it looks perfect now :)