phac-nml / mob-suite

MOB-suite: Software tools for clustering, reconstruction and typing of plasmids from draft assemblies
Apache License 2.0

Mob-typer returns different results when plasmids were rotated. #85

Closed ryotag closed 2 years ago

ryotag commented 3 years ago

Hi, thanks for developing this nice tool!

I got different results for the same plasmid using MOB-typer. When I ran mob_typer --multi --infile KX912253_RC.fasta --out_file KX912253_RC.tsv, I got the results below (no plasmid replicons, relaxases, or MPFs detected):

sample_id   num_contigs size    gc  md5 rep_type(s) rep_type_accession(s)   relaxase_type(s)    relaxase_type_accession(s)  mpf_type    mpf_type_accession(s)   orit_type(s)    orit_accession(s)   predicted_mobility  mash_nearest_neighbor   mash_neighbor_distance  mash_neighbor_identification    primary_cluster_id  secondary_cluster_id    predicted_host_range_overall_rank   predicted_host_range_overall_name   observed_host_range_ncbi_rank   observed_host_range_ncbi_name   reported_host_range_lit_rank    reported_host_range_lit_name    associated_pmid(s)
FRI-2_plasmid_KX912253-RC_Enterobacter_asburiae_strain_H162620587_plasmid_pJF-587__complete_sequence.   1   108672  51.18429770318021   5bd1577e5eae2824bbb7eb4e9ed6c126    -   -   -   non-mobilizable KX912253    0.0 Enterobacter asburiae   AA414   AI467   genus   Enterobacter    genus   Enterobacter    -   -   -

However, when I rotated the plasmid and ran mob_typer --multi --infile KX912253_RC_rotated.fasta --out_file KX912253_RC_rotated.tsv, I got the following results (now plasmid replicons, a relaxase, and an MPF were detected):

sample_id   num_contigs size    gc  md5 rep_type(s) rep_type_accession(s)   relaxase_type(s)    relaxase_type_accession(s)  mpf_type    mpf_type_accession(s)   orit_type(s)    orit_accession(s)   predicted_mobility  mash_nearest_neighbor   mash_neighbor_distance  mash_neighbor_identification    primary_cluster_id  secondary_cluster_id    predicted_host_range_overall_rank   predicted_host_range_overall_name   observed_host_range_ncbi_rank   observed_host_range_ncbi_name   reported_host_range_lit_rank    reported_host_range_lit_name    associated_pmid(s)
FRI-2_plasmid_KX912253-RC_concatenated  1   108672  51.18429770318021   4278a4e947e4d787148b957d35f4c27d    IncFII,IncR CP019890_00139,000207__CP025517 MOBF    NC_014107_00160 MPF_F   NC_014107_00125,NC_014107_00126,NC_014107_00127,NC_009425_00108,NC_014107_00135,NC_014107_00139,NC_014107_00145,NC_014107_00146,NC_014107_00154,NC_014107_00155,NC_014107_00137,NC_014107_00159 -   -   conjugative KX912253    0.0 Enterobacter asburiae   AA414   AI467   order   Enterobacterales    order   Enterobacterales    family  Enterobacteriaceae  20851899; 23711894

I expected that the results might differ slightly after rotation, since rotation can recover a broken gene (i.e., a gene that spans the beginning and end of the contig becomes contiguous again). However, the results are completely different in this case, and most of the genes used for typing seem to be intact before the rotation. I've attached FASTA files of the plasmid before and after the rotation (the file extension is .txt because GitHub does not allow .fasta attachments): KX912253_RC.txt KX912253_RC_rotated.txt
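To make the rotation argument concrete, here is a minimal Python sketch of what rotating a circular sequence does. The `rotate` helper and the toy sequence are made up for illustration and have nothing to do with how the attached files were produced. It only shows that a feature split across the ends of a linearised record becomes contiguous after rotation.

```python
def rotate(seq: str, offset: int) -> str:
    """Rotate a circular sequence so that position `offset` becomes the new start."""
    if not seq:
        return seq
    offset %= len(seq)
    return seq[offset:] + seq[:offset]


# Toy record: the motif "GATTACA" wraps around the end of the linear sequence.
linear = "ACACCCCCCCCCGATT"
motif = "GATTACA"

print(motif in linear)              # False: the motif is split across the ends
print(motif in rotate(linear, 12))  # True: rotation makes it contiguous
```

Rotation changes only where the record starts, not its length or base composition, which is consistent with the identical size and GC values in the two result tables above.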

If you have any ideas, please let me know.

Thank you,

ryotag commented 3 years ago

It would be helpful if someone could tell me whether this issue is reproducible. I'm using mob_typer 3.0.0 on my Mac (macOS High Sierra version 10.13.6).

Thank you,

ryotag commented 3 years ago

I finally found the reason for this. I changed the header of the FASTA file from >FRI-2_plasmid_KX912253-RC Enterobacter asburiae strain H162620587 plasmid pJF-587, complete sequence. to >FRI-2_plasmid_KX912253-RC, and I got the correct results for the original FASTA file without any rotation. I think this is an important bug to fix, because MOB-typer fails to detect plasmid replicons/relaxases and returns wrong results for FASTA files with certain headers.
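For anyone stuck on 3.0.0, the manual header edit described above can be scripted. This is only a sketch of the workaround, not part of MOB-suite: the function names `shorten_headers` and `rewrite_fasta` are hypothetical, and it assumes that keeping just the first whitespace-delimited token of each header is enough and leaves the IDs unique.

```python
from typing import Iterable, Iterator


def shorten_headers(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines with each header cut down to its first token."""
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            # Keep only the sequence ID; drop the free-text description.
            yield ">" + (fields[0] if fields else "") + "\n"
        else:
            yield line


def rewrite_fasta(in_path: str, out_path: str) -> None:
    """Copy a FASTA file, shortening every header along the way."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(shorten_headers(src))
```

For example, `rewrite_fasta("KX912253_RC.fasta", "KX912253_RC_short.fasta")` would write a copy whose header is just >FRI-2_plasmid_KX912253-RC (the output file name here is made up), and that copy can then be passed to mob_typer as before.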

jrober84 commented 2 years ago

There seem to be some issues with BLAST and the length of headers. I have implemented a fix in 3.1.0 where all sequences are renamed internally for all of the BLAST and search calls, and the results are then reported back under the original sequence identifiers.
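The rename-and-restore idea can be sketched roughly as follows. This is not the actual MOB-suite 3.1.0 code: the function names, the `seq_N` placeholder scheme, and the `(sequence_id, hit)` row layout are all assumptions chosen for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def assign_internal_ids(headers: Iterable[str]) -> Dict[str, str]:
    """Map short placeholder IDs (seq_0, seq_1, ...) to the original headers."""
    return {f"seq_{i}": header for i, header in enumerate(headers)}


def restore_original_ids(
    rows: List[Tuple[str, str]], id_map: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Replace placeholder IDs in (sequence_id, hit) rows with the original headers."""
    return [(id_map.get(seq_id, seq_id), hit) for seq_id, hit in rows]


original = "FRI-2_plasmid_KX912253-RC Enterobacter asburiae strain H162620587 plasmid pJF-587, complete sequence."
id_map = assign_internal_ids([original])

# The search tools would only ever see the short ID "seq_0".
toy_rows = [("seq_0", "IncFII")]
print(restore_original_ids(toy_rows, id_map))
```

Because the search step only sees short, whitespace-free IDs, header length and content no longer affect it, while the report still uses the names the user supplied.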