Closed zihengluo closed 3 weeks ago
Hi Linxing,
I processed 614 viral contigs along with their corresponding metagenomes using COBRA with the following command:
cobra-meta -f metagenome.fasta -q query.fasta -o query.fasta.COBRA.out -c coverage.txt -m mapping.sam -a megahit -mink 21 -maxk 141
The output resulted in:
- 54 self-circular contigs
- 11 extended circular contigs
- 181 extended partial contigs
- 157 orphan contigs
- 211 failed contigs
I then manually checked the terminal direct repeats (TDRs) for all the circular genomes and observed that:
- 45 out of the 54 self-circular genomes contained TDRs of exactly 141 base pairs.
- All 11 extended circular genomes also had TDRs of 141 base pairs.
I am concerned that the identification of TDRs may be biased due to the '-maxk 141' parameter, as it seems unusual that almost all TDRs are of the same length. In my understanding, the length of TDRs should vary randomly rather than being fixed.
Could you please clarify whether this fixed TDR length is a consequence of how I used COBRA or if there is another explanation for this observation? Additionally, do you have any recommendations to improve the accuracy of identifying circular genomes?
Best regards, Ziheng
Hi Ziheng,
Thank you for the questions, which are great ones.
A true DTR (direct terminal repeat, not TDR) is a biological feature of some, but not all, viruses (the literature covers this if you are interested). The end overlap you see in assembled sequences is a different thing: it is an artifact of the assembly. Its length usually equals the maximum k-mer size used in the de novo assembly; in rare cases you will see other lengths, which are generally shorter than the maximum k-mer size. For this reason, we never call the overlap between the two ends of a sequence a DTR in COBRA.
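To see how the end overlap relates to the maximum k-mer size, one could measure the longest exact match between the start and the end of each circular sequence and compare it with the -maxk value. The following is a minimal sketch, not COBRA's own code; the function name and the min_len and max_len parameters are hypothetical and chosen only for illustration.

```python
def end_overlap_length(seq: str, min_len: int = 10, max_len: int = 500) -> int:
    """Return the length of the longest exact match between the start
    and the end of seq, or 0 if no match of at least min_len exists.

    Hypothetical helper for illustration; it only handles exact matches.
    """
    seq = seq.upper()
    # The overlap cannot be longer than half of the sequence.
    limit = min(max_len, len(seq) // 2)
    # Try the longest candidate first so the first hit is the longest overlap.
    for k in range(limit, min_len - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0
```

An overlap equal to the maximum k-mer size (141 in the command above) is then consistent with an assembly artifact, while a different length may deserve a closer look.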
As for improving the accuracy of identifying circular genomes: checking for an overlap between the two ends is one thing you can do, but it is not 100% reliable. COBRA therefore also checks whether paired-end reads span the two ends of the genome (example below).
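The idea of paired-end reads spanning the two ends can be sketched as follows. This is an illustrative simplification, not COBRA's implementation: the Pair record, the function name, and the max_insert and end_overlap parameters are all assumptions, and the mate positions are assumed to have been extracted from the mapping file beforehand. A pair is counted when the forward mate maps near the end of the contig, the reverse mate maps near the start, and the fragment length implied by joining the two ends is plausible.

```python
from typing import Iterable, NamedTuple


class Pair(NamedTuple):
    """Hypothetical, simplified record of one mapped read pair."""
    fwd_start: int  # 0-based leftmost position of the forward-strand mate
    rev_end: int    # 0-based exclusive end position of the reverse-strand mate


def count_junction_spanning_pairs(
    pairs: Iterable[Pair],
    contig_len: int,
    max_insert: int = 1000,
    end_overlap: int = 0,
) -> int:
    """Count pairs consistent with the two contig ends being joined."""
    count = 0
    for p in pairs:
        # In linear coordinates a junction-spanning pair looks outward-facing:
        # the reverse mate lies entirely to the left of the forward mate.
        if p.rev_end <= p.fwd_start:
            # Fragment length if the contig end is joined to its start,
            # minus the duplicated end overlap (assumed known).
            implied = (contig_len - p.fwd_start) + p.rev_end - end_overlap
            if 0 < implied <= max_insert:
                count += 1
    return count
```

A sequence with an end overlap but no such spanning pairs would be a weaker candidate for a circular genome than one supported by both signals.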
Please let me know if you have any other concerns.
Best, LINXING
Thanks for your clear explanation!