Relaxase database of MOB-typer: several truncated proteins and a transposase

Phytobacteriology-UPNA commented 3 months ago

First, thanks so much for this fantastic tool.

I was analyzing a collection of plasmids from pseudomonads and came across unexpected relaxase predictions: MOB-typer reported more relaxases than MOBscan did.

Example:

NC_019265: predicted conjugative, but appears non-mobilizable in vivo (https://doi.org/10.3389/fmicb.2022.1076710). MOB-typer predicts a relaxase by comparison with the type accession NC_019265_00015.

After some searching to find out what that accession was, I found a post explaining that these accessions can be found in the database: https://zenodo.org/record/3786915/files/data.tar.gz?download=1

I extracted the NC_019265_00015 entry from mob.proteins.faa in that database and located the sequence in NC_019265.

NC_019265_00015 is a transposase (WP_095178853.1).
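For anyone who wants to repeat this lookup, here is a minimal sketch in plain Python. It assumes that mob.proteins.faa is an ordinary FASTA file and that each header starts with the accession followed by `|`, as in the entries quoted in this thread. The function names and the toy records are hypothetical and are not part of MOB-suite.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_entries(lines, accession):
    """Return every record whose header field before '|' equals `accession`."""
    return [(h, s) for h, s in read_fasta(lines) if h.split("|")[0] == accession]


# Toy records standing in for the real file (sequences are made up).
demo = """>NC_019265_00015|EXAMPLE
MKT
AAA
>OTHER_00001|EXAMPLE
MSS
""".splitlines()

hits = find_entries(demo, "NC_019265_00015")
```

With the real file, `find_entries(open("mob.proteins.faa"), "NC_019265_00015")` would return the matching record, which can then be searched against NCBI or ISfinder.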

I looked at other sequences in the database and found several that are truncated relaxases. Could these result in the spurious identification of truncated relaxases when analyzing plasmids?

Some other proteins were relaxases that MOBscan was not able to identify. Thanks!

Phytobacteriology-UPNA commented 3 months ago

Two other putative transposases found in the relaxase DB; you can confirm by BLAST against the ISfinder database (https://www-is.biotoul.fr/blast.php?prog_blast=blastp):

NC_007507_00032|MOBP(ISXac3) MCRVLRVNRSGYYAWLCSPNSERAKEDDRLLGLIKHHWLASGSVYGHRKITTDLRDLGERCSRHRVHRLMRTEGLRAQVGYGRKPRFHGGMQCKAAANLLDRQFDVTEPDTAWASDFTFIRTHEGWMYLAVVIDLFSRQVVGWAMRDRADTELVVQAVLSAVWRRKPNAGCLVHSDQGSVYTSDDWRSFLASHGLVCSMSRRGNCHDNAPVESFFGLLKRERIRRLTYPTKDAARAEVFDYIEMFYNPNRRHGSTGDLSPVEFERRYAQRGS

CP026563_00069|MOBP(ISPsy4) MLTQEQSVEIKVLARQGHGIKFIARELGISRNTVRKYLRKARSLPSDKVRPARPCKIDPFKDYLHERIEAARPHWIPATVLLREITALGYSGGVSRLKAYIRPFKRKAEEPVVRFETLPGKQIQVDFTTIRRGRQPLKAFVATLGFSRASFVRFSEREDSEAWLTGLREAFAYFGGVPEQALFDNAGMNMVAAQSRRCQDPVYHFILSWRENELPTDAQIFECAEHCIRQLGMEGHQYVTAIHQDTDNTHCHVAVNRVNPITYKAAALWNDADTLQKSCRVLERKYGFIQDNGSWQWGVNDQLVRAPFRYGSAPQGTVPLQVYSNTESLYHYAVREVREKVSELIESRAITWRQIHLALHERGLGLREQGEGLVIYDFLRPEGPVVKASSVHPTLTKFRLEAHIGAFEGPPTFEHEEWSYGIFSSYQPAFELRDKDVRFDRRQARAEARLDLKMRYKRYREGWEKPDLHVKDRYQQVAARYQAMKADVKRSQHDPLLRKLLYRVAEFDRMKAMAELRIELRDERQALAEKGLLRPLAYRPWVEQQALRGDVAAVSQLRGFVYREKRKERTPNGGFDRVIQCGQADDSAVYHLRSYTSHLHRDGTVEYLRDGRVGVIDRGDFVQVKPGFNDDDDLDNYRLAANLVSTKSGDAVKIIGDDQFVDQVLDAGCGVNHRGSQYVFQVTDPEQLARYDVIERDHRQYYGYDEPSRPQSPVRHDPVDDAPDDGYQPPRPFGG

kbessonov1984 commented 1 month ago

Thank you for pointing out the need for a review of the mob.proteins.faa database and the potential issue with the NC_019265_00015 entry. We will also take a look at the MOBscan web app (https://castillo.dicom.unican.es/mobscan_about/) and MOBfamDB (https://castillo.dicom.unican.es/mobscan_about/MOBfamDB.gz).

Phytobacteriology-UPNA commented 1 month ago

Thanks for your reply. I have found a few other issues:

Relaxases database: There are other sequences classified as relaxases that are not relaxases. In fact, any short sequence is suspicious. For instance, this entry:

FJ696405|Col(Ye4449) TTAACGATCACGGTGCTGCTCCAGCAGTTCACCGAGATTGCGGTCGAGGCTGTTTAACACCGCCAGCAGGGACACACGCTCAGTGGGATTCAGTCCGTCGAGCTGGTTAAGACGGCGGGCGATCTGGTTGAGGTTGTTGCCTATCCCGCTGACCTGTCGCAGCAGTTCGGGCGCCACATCGGGCAGACGGCGCT

This entry is a partial sequence. Additionally, it does not correspond to a relaxase but to a gene (generally called mobC, or relaxase accessory protein, RAP) that usually precedes the relaxase gene. A hit to this entry therefore does not mean that a relaxase was detected; the relaxase itself might, for instance, be truncated.
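As a rough way to shortlist such entries for manual review, the sketch below flags records that are much shorter than the median length of the database. The 0.5 fraction is an arbitrary assumption, not a validated cutoff, and `flag_short` is a hypothetical helper. Relaxase families differ in length, so a flagged entry is only a candidate for inspection.

```python
from statistics import median


def flag_short(records, fraction=0.5):
    """Return (header, length) for records shorter than fraction * median length.

    `records` is a list of (header, sequence) pairs; `fraction` is an
    arbitrary, assumed cutoff that should be tuned per database.
    """
    lengths = [len(seq) for _, seq in records]
    if not lengths:
        return []
    cutoff = fraction * median(lengths)
    return [(header, len(seq)) for header, seq in records if len(seq) < cutoff]


# Toy records (made-up sequences): two plausible lengths and one short outlier.
records = [
    ("entry_a|MOBP", "M" * 300),
    ("entry_b|MOBP", "M" * 320),
    ("entry_c|MOBP", "M" * 90),
]
suspicious = flag_short(records)
```

Applying the check per MOB family rather than across the whole file would probably give fewer false alarms, since the median would then reflect proteins of comparable size.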

Rep protein database: This database also contains spurious entries. Again, anything that is too short is, in my opinion, suspicious. For instance:

AJ851089|IncFII CACACCATCCTGCACTTATGTTGCACAGAAGGAGTGAGCACAGAAAGAAGTCTTGAACTTTTCCGGTCATATAACTATACTCCCCGCATAGAGCAACAGCTTCTATGCAGTTTCTTGTTAGCCCCGGTAATCTTCTCTTAGTCGCCAAACCTGGTGAAGATTATCGGGGTTTTTGCTTTTCTGGCTCCTGTAGATCCACATCAGAACCAGTTCCCTGCCACCTTACGGCGTGGCCAGCCACAAAATTCCTTAAACGATCAG

This corresponds to a putative regulatory protein, of the kind that commonly precedes RepA, but not to the actual RepA, which is:

https://www.ncbi.nlm.nih.gov/nuccore/AJ851089.1?from=21947&to=22804&report=gbwithparts
https://www.ncbi.nlm.nih.gov/protein/62550802

Similar thing with these:

000124__KP125893_00142|IncFII 000129CP018340|IncFII 000124KP125893_00142|IncFII

If you would not mind, I would also like to offer some suggestions that, I think, could make the program easier to use and more useful:

Use a database of Rep proteins instead of rep genes (or at least let the user choose one or the other), in order to detect remote homologs and to limit the reporting of truncated sequences as if they were full-length.
Define the Rep groups based on homology. Right now, some groups contain sequences that do not show significant homology. For instance, the translated sequence of 000188NC_013973_00002|IncP does not show significant homology with the products of 000177AB237782_00014|IncP, 000180NC_006830_00001|IncP or 000182CP002151_00001|IncP, despite all being classified as IncP. The same happens with certain IncFII entries.
Change the headers of the entries in the database. I think it would be much easier and more useful for the user if rep_type_accession (and relaxase_type_accession) provided protein accession numbers instead of the current codes.
Provide some (simple?) instructions to indicate what parameters can be changed and how to customize the search.
Remove partial sequences. These readily match truncated genes, which are then reported as if they were full-length.
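To illustrate the first and last suggestions, here is a minimal sketch of how nucleotide entries could be translated and screened for completeness. It assumes the standard codon assignments and the common bacterial start codons ATG, GTG and TTG, and it checks both strands because database entries may be stored in either orientation. All function names are hypothetical; this is not how MOB-suite builds its databases.

```python
from itertools import product

# Standard codon assignments, listed in the conventional TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase or lowercase DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate frame 1; codons with ambiguous bases become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def _complete_on_strand(dna, starts):
    protein = translate(dna)
    return (len(dna) % 3 == 0
            and dna[:3] in starts
            and protein.endswith("*")
            and "*" not in protein[:-1])


def looks_complete(dna, starts=("ATG", "GTG", "TTG")):
    """True if either strand reads as start codon ... single terminal stop."""
    dna = dna.upper()
    return (_complete_on_strand(dna, starts)
            or _complete_on_strand(reverse_complement(dna), starts))


# Toy sequences (made up): a tiny complete ORF and a fragment.
full = "ATGAAATAA"
fragment = "ATGAAATT"
```

Entries for which `looks_complete` returns False would be candidates for removal or manual curation, and the output of `translate` for the remaining entries could seed a protein-level Rep database.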

Again, thanks very much for this nice tool, and I hope that these comments are of use.

phac-nml / mob-suite

Relaxase database of MOB-typer: several truncated proteins and a transposase #170