phac-nml / mob-suite

MOB-suite: Software tools for clustering, reconstruction and typing of plasmids from draft assemblies
Apache License 2.0

Relaxase database of MOB-typer: several truncated proteins and a transposase #170

Open Phytobacteriology-UPNA opened 3 months ago

Phytobacteriology-UPNA commented 3 months ago

First, thanks so much for this fantastic tool.

I was analyzing a collection of plasmids from pseudomonads and came across unexpected relaxase predictions: MOB-typer reported more relaxases than I obtained with MOBscan.

Example:

NC_019265: predicted conjugative, but appears non-mobilizable in vivo (https://doi.org/10.3389/fmicb.2022.1076710). MOB-typer predicts a relaxase by comparison with the type accession NC_019265_00015.

After some searching, I found a post explaining that these accessions can be found in the database: https://zenodo.org/record/3786915/files/data.tar.gz?download=1

I located NC_019265_00015 in mob.proteins.faa from that database and retrieved the corresponding sequence from NC_019265.

NC_019265_00015 is a transposase (WP_095178853.1).
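For anyone who wants to repeat the lookup, here is a minimal sketch of pulling an entry out of a FASTA file by its identifier. It assumes the file is plain FASTA with `>` header lines; the function names `read_fasta` and `find_entries` are my own, and the records in the demo are made-up toy data, not database content.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one first.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_entries(handle, query):
    """Return every (header, sequence) whose header contains `query`."""
    return [(h, s) for h, s in read_fasta(handle) if query in h]


if __name__ == "__main__":
    # Toy records with made-up IDs and sequences, NOT real database content.
    toy = io.StringIO(">TOY_00001|MOBP\nMKVLA\nGGT\n>TOY_00002|MOBF\nMSSRK\n")
    for header, seq in find_entries(toy, "TOY_00001"):
        print(header, len(seq))
```

With the real data, one would open mob.proteins.faa (wherever the archive was unpacked) and call `find_entries(fh, "NC_019265_00015")`; the retrieved sequence can then be checked by BLAST.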

I looked at other sequences in the database and found several that are truncated relaxases. Could these lead to the spurious identification of truncated relaxases when analyzing plasmids?

Some other proteins were relaxases that MOBscan was not able to identify. Thanks!

Phytobacteriology-UPNA commented 3 months ago

Two other putative transposases found in the relaxase DB; you can confirm by BLAST against the ISfinder database (https://www-is.biotoul.fr/blast.php?prog_blast=blastp):

NC_007507_00032|MOBP(ISXac3) MCRVLRVNRSGYYAWLCSPNSERAKEDDRLLGLIKHHWLASGSVYGHRKITTDLRDLGERCSRHRVHRLMRTEGLRAQVGYGRKPRFHGGMQCKAAANLLDRQFDVTEPDTAWASDFTFIRTHEGWMYLAVVIDLFSRQVVGWAMRDRADTELVVQAVLSAVWRRKPNAGCLVHSDQGSVYTSDDWRSFLASHGLVCSMSRRGNCHDNAPVESFFGLLKRERIRRLTYPTKDAARAEVFDYIEMFYNPNRRHGSTGDLSPVEFERRYAQRGS

CP026563_00069|MOBP(ISPsy4) MLTQEQSVEIKVLARQGHGIKFIARELGISRNTVRKYLRKARSLPSDKVRPARPCKIDPFKDYLHERIEAARPHWIPATVLLREITALGYSGGVSRLKAYIRPFKRKAEEPVVRFETLPGKQIQVDFTTIRRGRQPLKAFVATLGFSRASFVRFSEREDSEAWLTGLREAFAYFGGVPEQALFDNAGMNMVAAQSRRCQDPVYHFILSWRENELPTDAQIFECAEHCIRQLGMEGHQYVTAIHQDTDNTHCHVAVNRVNPITYKAAALWNDADTLQKSCRVLERKYGFIQDNGSWQWGVNDQLVRAPFRYGSAPQGTVPLQVYSNTESLYHYAVREVREKVSELIESRAITWRQIHLALHERGLGLREQGEGLVIYDFLRPEGPVVKASSVHPTLTKFRLEAHIGAFEGPPTFEHEEWSYGIFSSYQPAFELRDKDVRFDRRQARAEARLDLKMRYKRYREGWEKPDLHVKDRYQQVAARYQAMKADVKRSQHDPLLRKLLYRVAEFDRMKAMAELRIELRDERQALAEKGLLRPLAYRPWVEQQALRGDVAAVSQLRGFVYREKRKERTPNGGFDRVIQCGQADDSAVYHLRSYTSHLHRDGTVEYLRDGRVGVIDRGDFVQVKPGFNDDDDLDNYRLAANLVSTKSGDAVKIIGDDQFVDQVLDAGCGVNHRGSQYVFQVTDPEQLARYDVIERDHRQYYGYDEPSRPQSPVRHDPVDDAPDDGYQPPRPFGG

kbessonov1984 commented 1 month ago

Thank you for pointing out the need to review the mob.proteins.faa database and the potential issue with the NC_019265_00015 entry. We will also take a look at the MOBscan web app (https://castillo.dicom.unican.es/mobscan_about/) and MOBfamDB (https://castillo.dicom.unican.es/mobscan_about/MOBfamDB.gz).

Phytobacteriology-UPNA commented 1 month ago

Thanks for your reply. I have found a few other issues:

Relaxases database: Some other sequences are classified as relaxases but are not relaxases. In fact, any short sequence is suspicious. For instance, this entry:

FJ696405|Col(Ye4449) TTAACGATCACGGTGCTGCTCCAGCAGTTCACCGAGATTGCGGTCGAGGCTGTTTAACACCGCCAGCAGGGACACACGCTCAGTGGGATTCAGTCCGTCGAGCTGGTTAAGACGGCGGGCGATCTGGTTGAGGTTGTTGCCTATCCCGCTGACCTGTCGCAGCAGTTCGGGCGCCACATCGGGCAGACGGCGCT

This entry is a partial sequence. Additionally, it does not correspond to a relaxase but to a gene (generally called mobC, or relaxase accessory protein, RAP) that usually precedes the relaxase gene. A hit to this entry therefore does not detect a relaxase, and the actual relaxase might be truncated, for instance.

Rep protein database: This database also contains spurious entries. Again, anything that is too short is, in my opinion, suspicious. For instance:

AJ851089|IncFII CACACCATCCTGCACTTATGTTGCACAGAAGGAGTGAGCACAGAAAGAAGTCTTGAACTTTTCCGGTCATATAACTATACTCCCCGCATAGAGCAACAGCTTCTATGCAGTTTCTTGTTAGCCCCGGTAATCTTCTCTTAGTCGCCAAACCTGGTGAAGATTATCGGGGTTTTTGCTTTTCTGGCTCCTGTAGATCCACATCAGAACCAGTTCCCTGCCACCTTACGGCGTGGCCAGCCACAAAATTCCTTAAACGATCAG

It corresponds to a putative regulatory protein of the kind that commonly precedes RepA, not to the actual RepA, which is:

https://www.ncbi.nlm.nih.gov/nuccore/AJ851089.1?from=21947&to=22804&report=gbwithparts
https://www.ncbi.nlm.nih.gov/protein/62550802

The same applies to these entries:

000124__KP125893_00142|IncFII
000129CP018340|IncFII
000124KP125893_00142|IncFII
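The "anything too short is suspicious" check described for both databases could be run over a whole FASTA file with something like the sketch below. The names `read_fasta` and `flag_short` and the `min_len` cutoff are my own assumptions; a sensible cutoff would have to be chosen per database (and differs for protein and nucleotide entries). The demo records are toy data.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one first.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def flag_short(handle, min_len):
    """Return (header, length) for entries shorter than `min_len`, shortest first.

    `min_len` is a hypothetical cutoff supplied by the caller; no value is
    given in this thread.
    """
    short = [(h, len(s)) for h, s in read_fasta(handle) if len(s) < min_len]
    return sorted(short, key=lambda item: item[1])


if __name__ == "__main__":
    # Toy records with made-up IDs and sequences, NOT real database content.
    toy = io.StringIO(
        ">TOY_LONG|IncFII\n" + "A" * 900 + "\n"
        ">TOY_SHORT|IncFII\n" + "A" * 120 + "\n"
    )
    for header, length in flag_short(toy, min_len=300):
        print(header, length)
```

Flagged entries would still need manual review (for example against the GenBank annotation), since a short sequence is only a hint and not proof that the entry is wrong.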

If you would not mind, I would also like to offer some suggestions that, I think, could make the program easier to use and more useful:

Again, thanks very much for this nice tool, and I hope that these comments are of use.