mtisza1 / Cenote-Taker3

Discover and annotate the virome
MIT License

keeping terminal repeats #9

Closed xvazquezc closed 2 months ago

xvazquezc commented 5 months ago

Hi there,

First of all, I must say this is the only program that managed to annotate many of the usual viral marker genes in a very weird virus we are working on.

Second, is there a way to keep the terminal repeats at both ends after re-circularisation? Quite a few downstream analysis tools, e.g. CheckV, don't like it when they are missing.

Thank you in advance, Xabi

mtisza1 commented 5 months ago

Xabi,

Thanks for your kind words.

I hope using the flag --wrap F will do the trick for you. This flag will leave DTRs at the ends. However, it will also not re-orient/re-circularize the contig.

If you still want the re-orientation behavior but want the DTRs as well, you'll have to come up with a custom downstream solution after running this tool with --wrap T. I suggest a script to read the CT3 output *_virus_summary.tsv and *_virus_sequences.fna files then add a short (e.g. 25 nt) DTR to the end of contigs with the DTR label. Then pipe these modified .fna files to checkv. You could use biopython in a python script or seqkit in a bash script to accomplish this.
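A minimal sketch of such a script, using only the Python standard library, could look like the following. It is not part of Cenote-Taker 3, and several things in it are assumptions: the summary column names (`contig` and `end_feature`), the label value `DTR`, and the idea that "adding a DTR" means copying the first 25 nt of the wrapped contig onto its end. Check the header of your own *_virus_summary.tsv and adjust the arguments before using it.

```python
import csv
import io


def read_fasta(handle):
    """Yield (header, sequence) tuples from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def dtr_contigs(summary_handle, name_col="contig",
                label_col="end_feature", label="DTR"):
    """Return the set of contig names whose end label equals `label`.

    The default column names are assumptions; pass the names that
    actually appear in the header of your summary table.
    """
    reader = csv.DictReader(summary_handle, delimiter="\t")
    return {row[name_col] for row in reader if row[label_col] == label}


def add_dtr(records, dtr_names, dtr_len=25):
    """Append the first `dtr_len` nt to the end of each labelled contig.

    Contigs not in `dtr_names`, or not longer than `dtr_len`, pass through
    unchanged. Matching uses the first whitespace-delimited token of the
    FASTA header.
    """
    for header, seq in records:
        seq_id = header.split()[0]
        if seq_id in dtr_names and len(seq) > dtr_len:
            seq = seq + seq[:dtr_len]
        yield header, seq


def write_fasta(records, handle, width=60):
    """Write (header, sequence) tuples as FASTA with wrapped lines."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")
```

To use it, open the summary table and the .fna file, pass the result of `dtr_contigs` and `read_fasta` to `add_dtr`, and write the output with `write_fasta` to a new file that is then given to CheckV. If the contig names in the summary table do not match the FASTA header IDs exactly, the matching step needs to be adapted.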

Unrelatedly, if you notice that this tool is missing any hallmark genes on your very weird viruses, it would help me out if you could share those protein sequences (to be included in database updates). This would totally be up to you, of course.

Mike

xvazquezc commented 5 months ago

I completely misunderstood the wrap function :facepalm:

Without getting into details, although it is a jumbo virus according to its genome size, it is hard to say whether it belongs to any known class. cenotetaker3 only detected 3 hallmark genes out of nearly 600 (not surprised there, based on results with other tools), and I'm not convinced by one of them (MCP): PF13252.5 matches about 120 aa in a ~1700 aa protein.

I noticed that much of the evidence coming from PDB derives from complexes, and so is the evidence_description. However, the evidence_accession points to specific chains in those complexes. For example, I got 3 instances of "DNA-directed-RNA-polymerase-II", but they refer to PDB:4V1N_M. The accession 4V1N corresponds to "Architecture of the RNA polymerase II-Mediator core transcription initiation complex", but the subunit M specifically corresponds to the "transcription initiation factor IIB" and not to the DNA-directed-RNA-polymerase-II.
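To check such cases in bulk, the accession can be split into the PDB entry and the chain, so that each chain can be looked up separately from the complex-level description. The helper below is a small sketch; the `PDB:<entry>_<chain>` layout is inferred only from the single example above (PDB:4V1N_M), and the function name is hypothetical.

```python
def split_pdb_accession(accession):
    """Split an accession such as 'PDB:4V1N_M' into (entry, chain).

    The 'PDB:<entry>_<chain>' layout is an assumption based on one example.
    Returns chain=None when no chain suffix is present.
    """
    prefix, sep, rest = accession.partition(":")
    if sep == "" or prefix != "PDB" or rest == "":
        raise ValueError(f"not a PDB-style accession: {accession!r}")
    entry, _, chain = rest.partition("_")
    return entry, (chain or None)
```

With the entry and chain separated, the chain-level annotation (here, chain M of 4V1N) can be compared against the evidence_description reported for the hit.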