mtisza1 / Cenote_Unlimited_Breadsticks

DEPRECATED: Discover divergent virus sequences, prune flanking cellular sequences, make basic report
MIT License

Unlimited Breadsticks and plasmids #1

Closed (aag1 closed this issue 3 years ago)

aag1 commented 3 years ago

Dear Mike,

Thank you for developing this useful tool! Could you please help me with one question about its application? The GitHub page describing Unlimited Breadsticks says “Unlimited Breadsticks does not do post-hallmark-gene-identification computations to flag plasmid and conjugative element sequences that occasionally slip through”, but Unlimited Breadsticks has an option “--filter_out_plasmids”. If I specify “--filter_out_plasmids True” when running Unlimited Breadsticks, will it try to remove plasmids from the viral contig list in the same way as Cenote-Taker2?

Kind regards, Anastasia

mtisza1 commented 3 years ago

Hi Anastasia,

Thanks for opening the issue. Some plasmids, conjugative elements, and phages encode replication genes (such as PolB or certain DNA helicases) that are homologous to each other, and horizontal gene transfer of replication modules probably occurs between these types of elements. Therefore, HMMs generated from virus replication genes can sometimes 'ping' plasmid sequences and vice versa. To get around this, I've made hallmark gene HMMs from a bunch of plasmid replication genes. Unlimited Breadsticks takes the TOP HMM hit from these hallmark gene searches. My thinking is that, in general, HMMs made from plasmid replication genes will be the top hit for plasmid sequences, and the HMMs made from virus replication genes will be the top hit for virus sequences. This works a large majority of the time.
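
To make the top-hit idea concrete, here is a minimal Python sketch. It is only an illustration and not the actual Unlimited Breadsticks code: the `Hit` record, the `source` labels, the gene and model names, and the scores are all hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    gene: str     # gene/ORF identifier (hypothetical)
    model: str    # HMM name (hypothetical)
    source: str   # "virus" or "plasmid": which hallmark set the HMM came from
    score: float  # bit score; higher is better


def top_hits(hits):
    """Return the single best-scoring hit for each gene."""
    best = {}
    for hit in hits:
        current = best.get(hit.gene)
        if current is None or hit.score > current.score:
            best[hit.gene] = hit
    return best


# Made-up example: each gene matches both a virus-derived and a
# plasmid-derived replication model, with different scores.
hits = [
    Hit("contig1_gene3", "virus_PolB", "virus", 85.0),
    Hit("contig1_gene3", "plasmid_rep", "plasmid", 140.2),
    Hit("contig2_gene1", "virus_helicase", "virus", 210.5),
    Hit("contig2_gene1", "plasmid_rep", "plasmid", 60.1),
]

best = top_hits(hits)
for gene, hit in best.items():
    print(gene, hit.source, hit.model)
```

In this toy input, the gene on contig1 is called by the plasmid model and the gene on contig2 by the virus model, even though both genes match both kinds of model.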

However, only Cenote-Taker 2 does the post-hallmark-gene-identification steps. For plasmids, it runs a blastx/blastp search against a database of plasmids and viruses, and the result is used for the final "taxonomy" call. For conjugative elements, it flags contigs that do NOT encode virion genes but do encode conjugation machinery genes. These are labeled as conjugative elements.
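
The conjugative element rule can be written as a small predicate. This is a rough sketch of the logic as described here, not code from Cenote-Taker 2: the category labels ("virion", "conjugation", "replication") and the function name are assumptions made for illustration.

```python
def label_contig(gene_categories):
    """Label a contig from the set of functional categories of its genes.

    gene_categories: set of strings such as "virion", "conjugation",
    "replication" (hypothetical labels).
    """
    has_virion = "virion" in gene_categories
    has_conjugation = "conjugation" in gene_categories
    # Flag only contigs with conjugation machinery and no virion genes.
    if has_conjugation and not has_virion:
        return "conjugative element"
    return "unflagged"


print(label_contig({"replication", "conjugation"}))
print(label_contig({"replication", "conjugation", "virion"}))
```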

With that said, --filter_out_plasmids True still includes the plasmid hallmark models in the search, but removes any gene whose top hit is a plasmid model from the record. With this setting, a contig with only plasmid hallmark gene(s) would not be present in the final output, but a contig with both plasmid hallmark gene(s) and virus hallmark gene(s) would be "kept".
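
As a sketch of that behaviour (an illustration only, not the tool's implementation), assume each contig comes with a list recording whether the top hit of each hallmark gene was a "virus" or a "plasmid" model. The function name, the input layout, and the contig names are hypothetical.

```python
def filter_plasmid_top_hits(contig_top_hits):
    """Drop plasmid top hits and keep contigs that still have a virus hallmark.

    contig_top_hits: dict mapping contig name -> list of "virus"/"plasmid"
    labels, one per hallmark gene (hypothetical layout).
    Returns a dict mapping each kept contig -> number of virus hallmark genes.
    """
    kept = {}
    for contig, sources in contig_top_hits.items():
        # Genes whose top hit is a plasmid model are removed from the record.
        virus_count = sum(1 for source in sources if source == "virus")
        if virus_count > 0:
            kept[contig] = virus_count
    return kept


example = {
    "plasmid_only": ["plasmid"],
    "mixed": ["plasmid", "virus"],
    "virus_only": ["virus", "virus"],
}
kept = filter_plasmid_top_hits(example)
print(kept)
```

The contig with only plasmid hallmark genes disappears from the result, while the mixed contig is kept on the strength of its virus hallmark gene.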

A final point is that this whole issue only really comes into play when using -db standard, which includes hallmarks that are made from replication genes of DNA viruses/plasmids. Using -db virion (recommended for WGS metagenomes and bacterial genomes) looks for sequences with genes encoding virion structural proteins, and -db rna_virus only looks for RNA virus RdRp and capsid genes, which have almost no chance of overlapping with plasmid/conjugative element genes.

That was a bit of a rambling reply, but please let me know if this was helpful.

Best,

Mike

aag1 commented 3 years ago

Dear Mike,

Thank you very much for the detailed reply. Now I understand how Unlimited Breadsticks and Cenote-Taker2 approach plasmids and conjugative elements. It is very helpful!

Kind regards, Anastasia

mtisza1 commented 3 years ago

Great!