phac-nml / mob-suite

MOB-suite: Software tools for clustering, reconstruction and typing of plasmids from draft assemblies
Apache License 2.0

MOB-typer failed to detect plasmid replicons and relaxases for fasta files with certain headers. #87

Closed. ryotag closed this issue 2 years ago

ryotag commented 3 years ago

This issue is related to #85, but I opened this new issue because the title of #85 does not reflect the true problem.

I found that MOB-typer failed to detect plasmid replicons and relaxases for FASTA files with certain headers. Here is the file I analyzed (the file extension is .txt because GitHub does not allow .fasta attachments): KX912253_RC.txt

Running mob_typer --multi --infile KX912253_RC.fasta --out_file results.tsv returned the following TSV file:

sample_id num_contigs size gc md5 rep_type(s) rep_type_accession(s) relaxase_type(s) relaxase_type_accession(s) mpf_type mpf_type_accession(s) orit_type(s) orit_accession(s) predicted_mobility mash_nearest_neighbor mash_neighbor_distance mash_neighbor_identification primary_cluster_id secondary_cluster_id predicted_host_range_overall_rank predicted_host_range_overall_name observed_host_range_ncbi_rank observed_host_range_ncbi_name reported_host_range_lit_rank reported_host_range_lit_name associated_pmid(s)
FRI-2_plasmid_KX912253-RC_Enterobacter_asburiae_strain_H162620587_plasmid_pJF-587__complete_sequence. 1 108672 51.1842977 5bd1577e5eae2824bbb7eb4e9ed6c126 - - - - - - - - non-mobilizable KX912253 0 Enterobacter asburiae AA414 AI467 genus Enterobacter genus Enterobacter - - -

I changed the header of the FASTA file from >FRI-2_plasmid_KX912253-RC Enterobacter asburiae strain H162620587 plasmid pJF-587, complete sequence. to >FRI-2_plasmid_KX912253-RC. With the shortened header, I got the following results:

sample_id num_contigs size gc md5 rep_type(s) rep_type_accession(s) relaxase_type(s) relaxase_type_accession(s) mpf_type mpf_type_accession(s) orit_type(s) orit_accession(s) predicted_mobility mash_nearest_neighbor mash_neighbor_distance mash_neighbor_identification primary_cluster_id secondary_cluster_id predicted_host_range_overall_rank predicted_host_range_overall_name observed_host_range_ncbi_rank observed_host_range_ncbi_name reported_host_range_lit_rank reported_host_range_lit_name associated_pmid(s)
FRI-2_plasmid_KX912253-RC 1 108672 51.1842977 5bd1577e5eae2824bbb7eb4e9ed6c126 IncFII,IncR CP019890_00139,000207__CP025517 MOBF NC_014107_00160 MPF_F NC_014107_00125,NC_014107_00126,NC_014107_00127,NC_009425_00108,NC_014107_00135,NC_014107_00139,NC_014107_00145,NC_014107_00146,NC_014107_00154,NC_014107_00155,NC_014107_00137,NC_014107_00159 - - conjugative KX912253 0 Enterobacter asburiae AA414 AI467 order Enterobacterales order Enterobacterales family Enterobacteriaceae 20851899; 23711894

The two results are quite different: the first run reports no replicons or relaxases, while the second detects both. I'm not sure whether this is a bug, but any help or thoughts would be appreciated.
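For anyone hitting the same problem on an older version, the manual header edit described above can be scripted. The following is only a minimal sketch, assuming that keeping the first whitespace-delimited token of each header is enough to avoid the issue (as it was for this file); the function name shorten_fasta_headers is hypothetical and not part of MOB-suite.

```python
def shorten_fasta_headers(lines):
    """Yield FASTA lines with each header cut at the first whitespace.

    Hypothetical helper, not part of MOB-suite. Assumes the first token
    of every header is already unique within the file.
    """
    for line in lines:
        if line.startswith(">"):
            # Keep only the identifier and drop the free-text description.
            tokens = line[1:].split()
            yield ">" + (tokens[0] if tokens else "") + "\n"
        else:
            # Sequence lines pass through unchanged (newline normalised).
            yield line if line.endswith("\n") else line + "\n"


example = [
    ">FRI-2_plasmid_KX912253-RC Enterobacter asburiae strain H162620587 "
    "plasmid pJF-587, complete sequence.\n",
    "ACGT\n",
]
shortened = list(shorten_fasta_headers(example))
print(shortened[0].strip())
```

To apply it to a file, read the lines from the original FASTA and write the yielded lines to a new file, then pass that new file to mob_typer.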

Thank you for your time,

jrober84 commented 2 years ago

There seem to be some issues with BLAST and the length of headers. I have implemented a fix in 3.1.0 where all sequences are renamed internally for all of the BLAST and search calls, and the results are then reported back under the original sequence identifiers.
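The general idea of renaming sequences internally and mapping results back can be illustrated as follows. This is a sketch of the technique only, not the actual code in MOB-suite 3.1.0; the names rename_headers and restore_ids, the seq_N placeholder scheme, and the (query_id, subject_id) hit format are all assumptions made for illustration.

```python
def rename_headers(lines, prefix="seq"):
    """Replace every FASTA header with a short placeholder ID.

    Returns (renamed_lines, id_map), where id_map maps each placeholder
    to the original header text. Hypothetical helper, not MOB-suite code.
    """
    renamed = []
    id_map = {}
    for line in lines:
        if line.startswith(">"):
            placeholder = f"{prefix}_{len(id_map)}"
            id_map[placeholder] = line[1:].rstrip("\n")
            renamed.append(f">{placeholder}\n")
        else:
            renamed.append(line)
    return renamed, id_map


def restore_ids(hits, id_map):
    """Swap placeholder query IDs in (query_id, subject_id) tuples back
    to the original headers; unknown IDs are left unchanged."""
    return [(id_map.get(query, query), subject) for query, subject in hits]


original_header = (
    "FRI-2_plasmid_KX912253-RC Enterobacter asburiae strain H162620587 "
    "plasmid pJF-587, complete sequence."
)
renamed, id_map = rename_headers([">" + original_header + "\n", "ACGT\n"])

# Pretend a search tool reported a hit against the placeholder ID.
hits = [("seq_0", "NC_014107_00160")]
restored = restore_ids(hits, id_map)
print(restored[0][0] == original_header)
```

Because the search tools only ever see the short placeholders, header length and punctuation no longer affect the search, while the final report still shows the identifiers the user supplied.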