Clarification about how CT2 version 2.0.1 identifies a contig as a "Conjugative Transposon" in the *.tsv file?

RachelRodgers commented 3 years ago

Hi, I'm looking for some clarification on how contigs are annotated as a "Conjugative Transposon" both from a technical and biological perspective.

We have been analyzing data run through version 2.0.1, so that is the version of the code I have been studying.

I'll start with my understanding of the technical part to make sure I've got the flow of the code right. I tried to follow the path of a contig that ends up as a "Conjugative Transposon" in the final output report, and here's my understanding:

1. During the "guess taxonomy" step, blastx writes out a results file *.tax_guide.blastx.out.
2. During the "combine tbl files" step, the TAX_ORF variable is assigned as a "Conjugative Transposon" if CONJ_COUNT > 0 and STRUCTURAL_COUNT == 0 when grepping the .tax_guide.blastx.out file. TAX_ORF is then written over the .tax_guide.blastx.out file.
3. In the section "Getting info for virus nomenclature and divergence", the variable tax_guess is assigned from the *.tax_guide.blastx.out file.
4. tax_guess is written to /sequin_directory/*.fsa.
5. Information from the .fsa file is pulled into the .tsv report file.

Does this seem correct?

My biological colleagues have broader conceptual questions: (1) Why is "taxonomy" reported for phage predictions, but not for CTns (why do we not annotate CTn predictions like we do for phages)? (2) What database of proteins was used for the CTn assignments (assuming it is a protein similarity approach?), and what are the minimum criteria for making that assignment (how many genes/proteins?)

I'm not sure how to answer either of these questions outside of stating what the code is doing!

Any help is greatly appreciated, thanks so much!

mtisza1 commented 3 years ago

Hi Rachel,

Thanks for opening the issue.

You have the technical part down exactly! Bravo for going through and understanding the code that, especially in the older version, is pretty Byzantine.

(1) I'm not sure I fully understand what you are asking. Are you asking why Cenote-Taker2 doesn't systematically identify CTns? If so, it's because I'm not an expert on this, and I don't know if I could do it systematically. I'm just trying to give the cleanest possible set of virus contigs. Phages and CTns often have homologous replication genes, and this sometimes causes CTns to get "caught in the nets" when using a hallmark gene approach that contains virus replication genes. Since we just want the phage, CTns are false positives. Using the above-mentioned downstream approach to identify contigs without virion structural genes but with conjugative machinery is an additional layer of security against false positives. However, since this approach looks at gene content, it requires the contigs to be fully annotated. And, since the contigs identified as CTns are already fully annotated, I just spit them out in the final output so users can visually inspect the genome maps.

Or, are you asking why CTns are not subdivided into a CTn taxonomical hierarchy? If so, it's because I wasn't aware that there was a formalized taxonomy of CTns. Maybe another tool like IslandViewer 4 does this?

(2) The databases used are CDD/Pfam and PDB. I have been using a basic grep command on the feature table

CONJ_COUNT=$( grep -i "virb\|type-IV\|secretion system\|conjuga\|transposon\|tra[a-z] \|trb[b-z]\|pilus" $feat_tbl2 | wc -l )

Looking a little further into it, I think this grep gets basically all the conjugation machinery models, but it also finds a number of CDD models that are not conjugation machinery, so I will shore this up in future updates. While I assume that conjugative transposons all have multiple genes involved in conjugation, I only require 1 conjugation gene (and 0 virus virion genes), as most metagenomic contigs of phage/CTns are short, incomplete fragments.
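
For readers following along, the rule described above (at least 1 conjugation hit and 0 virion structural hits) can be sketched as a small shell function. This is a minimal sketch under stated assumptions, not the Cenote-Taker 2 source: the conjugation pattern is copied from the grep above, but the function name `classify_contig`, the "not flagged" label, and the structural pattern `STRUCT_PATTERN` are hypothetical placeholders, since the actual structural-gene pattern is not shown in this thread.

```shell
#!/usr/bin/env bash
# Sketch of the rule: >=1 conjugation hit AND 0 virion structural hits.
# CONJ_PATTERN is the grep pattern quoted in this thread.
# STRUCT_PATTERN is a hypothetical stand-in, NOT the pattern Cenote-Taker 2 uses.
CONJ_PATTERN='virb\|type-IV\|secretion system\|conjuga\|transposon\|tra[a-z] \|trb[b-z]\|pilus'
STRUCT_PATTERN='capsid\|portal\|tail\|terminase'

classify_contig() {
    local feat_tbl="$1"
    local conj_count struct_count
    # grep -c counts matching lines (equivalent to grep | wc -l);
    # "|| true" keeps a zero-match result from aborting under "set -e".
    conj_count=$(grep -ci "$CONJ_PATTERN" "$feat_tbl" || true)
    struct_count=$(grep -ci "$STRUCT_PATTERN" "$feat_tbl" || true)
    if [ "$conj_count" -gt 0 ] && [ "$struct_count" -eq 0 ]; then
        echo "Conjugative Transposon"
    else
        echo "not flagged"
    fi
}
```

Calling `classify_contig path/to/feature_table.tbl` prints one of the two labels. Because a single conjugation hit is enough, any false-positive match in the conjugation pattern flags the contig, which is why tightening that pattern matters.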

Does this help, or am I misunderstanding your question?

Mike

RachelRodgers commented 3 years ago

Thanks so much for the thorough (and super quick) response. I've passed off this info to my more biologically-savvy colleagues and will see if they want more clarification.

We are in desperate need of a tool like CT2, so I will be upgrading to the latest version today and spending time reading through the new code as well (and of course running tons of samples through). It will probably be our go-to for all our phage-y studies for the foreseeable future since we have had such good results from the program before :) Thanks again Mike!

mtisza1 commented 3 years ago

OK great. It is fantastic that you are liking the tool! Happy to answer any other questions about how to interpret the results, or which settings to use. For example, once you update to v2.1.1, you might want to use -db virion for metagenomic sequencing or bacterial genomes, as this database should avoid getting CTns "caught in the nets".

I'll leave this issue open for a while in case you come back with additional questions on this topic.

RachelRodgers commented 3 years ago

Thanks Mike. There were two more questions:

(1) Does CT2 make an attempt to identify any closely related sequences (analogous to the phage BlastP "taxonomy")?
(2) Are the CDD/Pfam hits for the CTns saved in any of the output files?

mtisza1 commented 3 years ago

(1) No, it does not do this. If you are aware of a well-curated database of CTn sequences, I could possibly add this feature, depending on how it went in testing.

(2) No, but this is a great idea. I'll definitely add this in the next update (v2.1.2, coming in the next week or two).

Best,

Mike

RachelRodgers commented 3 years ago

Hi Mike, I was asked to pass this along:

this is the only CTn database I know about, last updated in 2019:

https://db-mml.sjtu.edu.cn/ICEberg/

I think this clarifies the conceptual questions for now - I may have some technical ones coming up on the newer version! Thanks again for all your help! Much appreciated.

RachelRodgers commented 3 years ago

Me again (sorry).

I upgraded to the latest version and have been running a variety of samples through. I notice the output .tsv file no longer includes Element Name but instead has info about the hallmarks. I was curious why that is, and whether you think it's still appropriate to pull the tax guess from the .tax_guide.blastx.out files. My main focus is to quickly identify what's in a sample from these *.tsv files and leave the hard work to the smart people, and I was mainly doing this by looking at those Element Name columns.

mtisza1 commented 3 years ago

No problem. You are right that the tsv summary doesn't include the Element Name/organism name. I could see how that would be useful, so I'll include that in the forthcoming v2.1.2 update.

In the meantime, that information is in the header of the corresponding .fsa file in the sequin_and_genome_maps/ directory (match the file using the 2nd column of the tsv file, i.e. CENOTE_NAME). The format in the header is [organism=SOMETHING sp. ct1234]. You could also pull it from the .gbf file.
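
One way to script this lookup is sketched below. It is only a sketch under assumptions the thread does not confirm: that the tsv is tab-delimited with one header row, and that each file is named `<CENOTE_NAME>.fsa`. The function names `organism_from_fsa` and `list_organisms` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print CENOTE_NAME and organism name for each row of a summary tsv.
# Assumptions (not confirmed in this thread): the tsv is tab-delimited with a
# header row, and each .fsa file is named <CENOTE_NAME>.fsa.

organism_from_fsa() {
    local fsa="$1"
    # Read the first (header) line and keep only the text inside [organism=...]
    head -n 1 "$fsa" | sed -n 's/.*\[organism=\([^]]*\)].*/\1/p'
}

list_organisms() {
    local tsv="$1" maps_dir="$2"
    local name
    # Skip the header row, then read the 2nd column (CENOTE_NAME)
    tail -n +2 "$tsv" | cut -f 2 | while IFS= read -r name; do
        # Silently skip rows with no matching .fsa file
        if [ -f "$maps_dir/$name.fsa" ]; then
            printf '%s\t%s\n' "$name" "$(organism_from_fsa "$maps_dir/$name.fsa")"
        fi
    done
}
```

Usage would look like `list_organisms your_summary.tsv sequin_and_genome_maps`, where `your_summary.tsv` stands in for the run's actual summary file.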

mtisza1 commented 3 years ago

Hi Rachel,

Just wanted you to know I've pushed a new release (v2.1.2) that includes most of the features you asked about. Genes flagged by Cenote-Taker 2 as conjugative machinery will be in a .gtf file in the sequin_and_genome_maps directory corresponding to the sequence. Also, organism/taxonomy names are in the output .tsv file now. Please check the release notes for other details.

Best,

Mike

RachelRodgers commented 3 years ago

Sweet, thanks!

mtisza1 commented 3 years ago

Hi Rachel,

Not sure if something weird is going on on my end: I received an email about your issue, but I can't see it on the GitHub issues page:

Unrelated issue, but... I keep getting the message "bedtools is not found" after installing the latest version. Also saw a small typo in the log script that says: "Cenote-Taker2 should now run. Use: python /path/to/Cenote-Taker2/run_cenote-taker2.0.1.py." But think it should just be run_cenote-taker2.py. No big deal tho.

Regarding the bedtools issue, I'm curious whether you are getting this error at the "pre-check" stage or when bedtools is actually invoked. Are you getting the "not found" error at the beginning of your run with the message "bedtools is not found. Exiting.", so that the run quits before anything happens in regard to virus discovery/annotation? Or are you getting it near the end of the run, with everything else seeming to run to completion? Also, are you able to activate the cenote-taker2_env conda environment and run the command bedtools intersect? A log file with the error would be great if you have one on hand.
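
A quick way to confirm whether a tool is visible in the active environment is sketched below. This is a generic check, not Cenote-Taker 2's own pre-check; the function name `check_tool` is hypothetical, and it is meant to be run after activating the environment (e.g. `conda activate cenote-taker2_env`).

```shell
#!/usr/bin/env bash
# Sketch: report whether a tool is on PATH in the current shell environment.
# Generic check only; this is NOT the pre-check code from Cenote-Taker 2.
check_tool() {
    local tool="$1"
    if command -v "$tool" >/dev/null 2>&1; then
        # Show where the tool resolves from, which helps spot a wrong environment
        echo "$tool found at $(command -v "$tool")"
    else
        echo "$tool is not found" >&2
        return 1
    fi
}
```

Running `check_tool bedtools` inside and outside the conda environment shows whether the problem is a missing install or an environment that was not activated.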

I fixed the message in the install script, and will push the update momentarily. Thanks!

Mike

RachelRodgers commented 3 years ago

Hi Mike - yeah I deleted my comment because I figured it out like a second later. I simply hadn't run the extra commands for installing bedtools. It's working fine now! Our cluster periodically trashes files older than 60 days so when I forget to run a recursive touch, bits and pieces of the installation get removed and sometimes it's easier to re-install. But I thought something was totally messed up. Just me being dumb. Thanks!

mtisza1 / Cenote-Taker2

Clarification about how CT2 version 2.0.1 identifies a contig as a "Conjugative Transposon" in the *.tsv file? #7