mtisza1 / Cenote-Taker2

Cenote-Taker2: Discover and Annotate Divergent Viral Contigs (Please use Cenote-Taker 3 instead)
MIT License

DTR removing #40

Closed ARastrojo closed 1 year ago

ARastrojo commented 1 year ago

Hi all,

I am using version 2.1.2 and I do not know if this was corrected in newer versions. When I search for circular genomes assembled with a De Bruijn graph based assembler, I expect to find a duplicated sequence at both contig ends, the same size as the k-mer used for the assembly. These duplicated sequences are produced by the assembler because it cannot "decide" where to continue (I think this is called a "bubble" in graph theory). In any case, I know the "real" genome has only one copy of this sequence, and one of them should be removed. Here, in Cenote-Taker, this is called a DTR (Direct Terminal Repeat), and after the rotation process I would expect one copy to be removed from the final genome, but this is not happening, at least in version 2.1.2. I have checked that several circular genomes in the sequin_and_genome_maps folder contain both copies of the repeat produced by the assembler.

Was this corrected in a newer version? I could not find any information on that.

Thanks,

Alberto

mtisza1 commented 1 year ago

Alberto,

Thanks for reaching out. Yes, this was a bug I found that affected v2.1.2 and probably a few earlier versions. I fixed this in v2.1.5 and added the --wrap option. When you pass --wrap True (the default), the DTR will be clipped and the sequence will be rotated. When you pass --wrap False, the sequence will not be clipped or rotated, but the DTR sequences will be reported as a feature in the .gbf file. It looks like I forgot to report the bug fix in the release notes.

Relatedly, you can pass --exact_dtrs True to require that the 5' and 3' DTR sequences match exactly (the default mode is based on the bitscore of the alignment).
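
For illustration, the clipping of an exact DTR described above can be sketched in Python. This is not Cenote-Taker2's actual code: the function names and the min_len threshold are hypothetical, only the exact-match case is covered (not the bitscore-based default), and the rotation step is left out.

```python
def find_exact_dtr(seq: str, min_len: int = 10) -> int:
    """Return the length of the longest exact repeat shared by both contig ends.

    Returns 0 when no repeat of at least min_len bases is found.
    min_len is a hypothetical threshold chosen for this sketch.
    """
    seq = seq.upper()
    # Cap at half the contig so the two copies cannot overlap each other.
    max_len = len(seq) // 2
    for k in range(max_len, min_len - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0


def clip_dtr(seq: str, min_len: int = 10) -> str:
    """Drop the 3' copy of an exact DTR so the repeat appears only once."""
    k = find_exact_dtr(seq, min_len)
    return seq[:-k] if k else seq


# Toy contig: a 20 bp "genome" followed by a second copy of its first 5 bases,
# mimicking the k-mer-sized overlap left at the ends of a circular assembly.
genome = "ACGTTGCAAGGCTTAACCGG"
contig = genome + genome[:5]
clipped = clip_dtr(contig, min_len=5)
```

With this toy input the clipped sequence is the original 20 bp genome, and a contig without a terminal repeat is returned unchanged.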

To update to v2.1.5:

conda activate cenote-taker2_env
pip install phanotate
conda install -c conda-forge -c bioconda hhsuite last=1282 seqkit

### you may have to use conda to install biopython and bedtools as well ###

cd Cenote-Taker2
git pull

python update_ct2_databases.py --hmm True --protein True

I hope this was helpful for you!

Best,

Mike