phac-nml / mob-suite

MOB-suite: Software tools for clustering, reconstruction and typing of plasmids from draft assemblies
Apache License 2.0
111 stars 31 forks source link

stably sort contig link counts #159

Open jchorl opened 3 months ago

jchorl commented 3 months ago

We noticed an issue whereby the order of a fasta impacted the plasmid assignment. I.e. moving a contig from the start of a fasta to the end moved it from plasmid to chromosome.

The cause was a lack of stable sorting for contig link counts.

When assigning clusters based on low-linkage contigs, the contig link counts are sorted ascending based on number of links. This means that contigs with less links come first:

OrderedDict([('93_8c4ba764a297754de0bcf081764fadef_circular=false', 1), 
             ('142_494caf9c7c04ae64d92f9c52294a8928_circular=false', 1),
             ('141_114a1e3e2d432957a53f2f55b3976d00_circular=false', 1), 
             ('161_07c8c980bc4cb398f48683b1c1ebddff_circular=false', 1),
...

This makes sense. However, given that many contigs share a link count, the sorting was dependent on the order of the sequences in the input fasta. contig_clust_assoc was already defined in the sorting/ordering function, so I assume it was meant to be used as a tie-breaker for sorting. However, it didn't appear to be used for anything.

By incorporating the cluster scores, we can sort the contigs first by link count, and secondarily by score (with highest scores coming first). This brings stable sort ordering regardless of the order of the fasta file.

kbessonov1984 commented 1 month ago

I had submitted similar PR #164 with addresses this issue by first ordering input contigs based on length from longest to shortest and then by md5 checksum, in addition for contigs with identical number of links to plasmid primary clusters, I order by contig length. Since md5cheksum is unique, this double ordering both brings consistency to contig_report.txt and mob_typer.txt and is biologically and practically sound as longer contigs usually are more information rich. Ordering by primary plasmid cluster score seems to me a bit less stable as is done anyways later in the code where those highly covered plasmid clusters are assigned contigs first anyways. James will review the PR and will then probably push a new release and conda build