mtisza1 / Cenote-Taker3

Discover and annotate the virome
MIT License
32 stars 1 forks source link

extra-long incorrect prophage calls #6

Closed k6logc closed 5 months ago

k6logc commented 8 months ago

Hi Mike,

Thank you for Cenote-Taker3! We liked Cenote-Taker2 and are excited to explore your update. In our early runs we've noticed some examples of extra-long regions (>2,000,000 bases) being incorrectly called as prophages. In general our runs look reasonable and we haven't yet sorted out why this is happening in some cases. Do you have any insight into what might be going on?

We have v3.3.0 and the updated databases, and are running as: cenotetaker3 -c $inDIR/$1 -r $modified_base -p T --lin_minimum_hallmark_genes 2 --cpu 6 --cenote-dbs $ct3DBpath

Examples of genomes that yield wonky hits: https://www.homd.org/ftp/genomes/PROKKA/V10.1/fna/SEQF1065.1.fna https://www.homd.org/ftp/genomes/PROKKA/V10.1/fna/SEQF9972.1.fna

Any feedback or guidance much appreciated, thank you!

-Kathryn (& looping in @AmrutaIdagunji who is working on this)

mtisza1 commented 8 months ago

Hi Kathryn,

Thanks for opening this issue. I just tested the SEQF1065.1.fna file, and I found the problem. CT3 is behaving as expected, in this case, the sequence "SEQF1065.1|CP003503.1 HMT-643 Prevotella intermedia 17" has DTRs/Direct Terminal Repeats. Therefore, CT3 treats it as circular and therefore does not prune. See log message "assess_virus_genes1.py: No non-DTR virus contigs >= 10,000 nt. So pruning will not happen" when running this file.

Would it be useful to you if I added a flag to force pruning on DTR sequences? I'm planning some small updates and I believe I can implement this.

Let me know,

Mike

k6logc commented 8 months ago

Hi Mike,

Thank you very much for checking this out and for your quick diagnosis. Yes, it would be fantastic if you could add the option to force a prune on DTR seqs. In the case of SEQF1065.1 we expect to have two 30-40kb prophages predicted on that contig and it will be great to see if the force prune recovers them. Do you have a ballpark sense on timeline for your next planned update?

Many thanks!

Best, Kathryn

mtisza1 commented 8 months ago

Kathryn,

I'm happy to help. I would anticipate the update to be live in about 1 week.

I know that this doesn't help from a pipeline perspective, but if you are just curious about the few cases you mentioned in your earlier post, you can remove the DTR sequence from the beginning or end of your contigs before feeding it to CT3. You can find these sequences in CT3 output files, e.g. "final_genes_to_contigs_annotation_summary.tsv"

Mike

k6logc commented 8 months ago

Hi Mike,

Great about the timing on the update! We are looking forward to reflecting those changes in our pipeline as well. And thank you for mentioning about trimming the DTRs for seeing what's going on in those cases, makes sense.

Thank you!

Best, Kathryn

mtisza1 commented 7 months ago

Hi Kathryn,

I've put a new version v3.3.1 on Bioconda. see release.

I've added the flag --max_dtr_assess (which defaults to 1,000,000). This allows you to set the length at which CT3 stops looking for DTRs. This way, assemblies of complete circular bacterial chromosomes with prophage will get pruned instead of treated like circular/DTR viruses.

Please update to v3.3.1. This should do it:

conda activate ct3_env

mamba install bioconda::cenote-taker3=3.3.1

Let me know if this fixes your issue.

Mike

k6logc commented 7 months ago

Hi Mike, Fantastic, thank you!! Will report back. Best, Kathryn

mtisza1 commented 5 months ago

Kathryn, I'm going to close this issue with the assumption that this new function is working for you. Please reopen it if that's not the case.

Best,

Mike