linxingchen / cobra

A tool to raise the quality of viral genomes assembled from short-read metagenomes via resolving and joining of contigs fragmented during de novo assembly.
MIT License
59 stars 10 forks

Query Regarding Terminal Direct Repeats Identification in COBRA Analysis #46

Closed: zihengluo closed this issue 3 weeks ago

zihengluo commented 1 month ago

Hi Linxing,

I processed 614 viral contigs along with their corresponding metagenomes using COBRA with the following command: cobra-meta -f metagenome.fasta -q query.fasta -o query.fasta.COBRA.out -c coverage.txt -m mapping.sam -a megahit -mink 21 -maxk 141

The output resulted in:

  • 54 self-circular contigs
  • 11 extended circular contigs
  • 181 extended partial contigs
  • 157 orphan contigs
  • 211 failed contigs

I then manually checked the terminal direct repeats (TDRs) for all the circular genomes and observed that:

  • 45 out of the 54 self-circular genomes contained TDRs of exactly 141 base pairs.
  • All 11 extended circular genomes also had TDRs of 141 base pairs.

I am concerned that the identification of TDRs may be biased due to the '-maxk 141' parameter, as it seems unusual that almost all TDRs are of the same length. In my understanding, the length of TDRs should vary randomly rather than being fixed.

Could you please clarify whether this fixed TDR length is a consequence of how I used COBRA or if there is another explanation for this observation? Additionally, do you have any recommendations to improve the accuracy of identifying circular genomes?

Best regards, Ziheng

linxingchen commented 1 month ago

Hi Ziheng,

Thank you for the great questions.

A real DTR (direct terminal repeat, not TDR) is a biological feature of some viruses, but not all (the literature covers this if you are interested). The end overlap you see in assembled sequences is a different thing: it is generated by the assembly itself. Its length is usually the maximum k-mer size used in de novo assembly, which is why you see 141 bp with '-maxk 141'. In rare cases you will see other lengths, generally shorter than the maximum k-mer size. For this reason, we never call the overlap between the two ends of a sequence a DTR in COBRA.

Regarding how to improve the accuracy of identifying circular genomes: checking the overlap between the two ends is one thing you can do, but it is not 100% reliable. COBRA therefore also checks whether there are paired-end reads spanning the two ends of the genome (example below).

[Image: example of paired-end reads spanning the two ends of a genome]
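The paired-end check can also be approximated by hand from the SAM file. The sketch below is an assumption-laden illustration and not how COBRA implements it: the function name `count_end_spanning_pairs` and the `window` size are hypothetical, read orientation is ignored, and contig lengths are taken from the `@SQ` header lines. It counts, per contig, the read pairs with one mate mapped near the start and the other near the end.

```python
from collections import Counter
from typing import Dict, Iterable


def count_end_spanning_pairs(sam_lines: Iterable[str], window: int = 500) -> Dict[str, int]:
    """Count read pairs whose mates map within `window` bp of opposite
    ends of the same contig. `window` is a hypothetical threshold."""
    lengths: Dict[str, int] = {}
    counts: Counter = Counter()
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            # Contig lengths come from header lines such as "@SQ SN:name LN:5000".
            if line.startswith("@SQ"):
                tags = dict(field.split(":", 1) for field in line.split("\t")[1:])
                lengths[tags["SN"]] = int(tags["LN"])
            continue
        fields = line.split("\t")
        flag, rname, pos = int(fields[1]), fields[2], int(fields[3])
        rnext, pnext = fields[6], int(fields[7])
        # Keep paired reads with both mates mapped; use only the first mate
        # (0x40) so that each pair is counted once.
        if not flag & 0x1 or flag & 0x4 or flag & 0x8 or not flag & 0x40:
            continue
        # Skip secondary (0x100) and supplementary (0x800) alignments.
        if flag & 0x900:
            continue
        # Both mates must be on the same contig, and its length must be known.
        if rnext not in ("=", rname) or rname not in lengths:
            continue
        length = lengths[rname]
        if length <= 2 * window:
            continue  # contig too short to tell its two ends apart
        left, right = min(pos, pnext), max(pos, pnext)
        if left <= window and right > length - window:
            counts[rname] += 1
    return dict(counts)


if __name__ == "__main__":
    with open("mapping.sam") as handle:
        for contig, n in sorted(count_end_spanning_pairs(handle).items()):
            print(contig, n)
```

A contig with an end overlap but no such pairs deserves a closer look before it is treated as circular.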

Please let me know if you have any other concerns.

Best, Linxing

zihengluo commented 3 weeks ago

Thanks for your clear explanation!