mtisza1 / Cenote-Taker3

Discover and annotate the virome

Coverage calculation output and general question #4

Closed: charlesfoster closed this issue 5 months ago

charlesfoster commented 9 months ago

Thanks for the really interesting tool. I've run it on a metagenomics assembly and I was surprised by how fast it was, and the output looks great! I just have a couple of questions...

My command for the analysis was: cenotetaker3 -c assembly.fa -r ct3_len500 --minimum_length_linear 500 --minimum_length_circular 500 -p T --reads reads_R1.fq.gz reads_R2.fq.gz.

When running with the --reads flag, does the pipeline take paired-end information into account somehow? And should a coverage estimate be included in the final output files? For now, I can only find what appears to be an intermediate file in the ct_processing directory called oriented_hallmark_contigs.pruned.coverage.tsv.

Within this file are the following results for one contig:

rname | startpos | endpos | numreads | covbases | coverage | meandepth | meanbaseq | meanmapq
-- | -- | -- | -- | -- | -- | -- | -- | --
ct3_len500_1 | 1 | 8032 | 1104152 | 8032 | 100 | 17034.1 | 38 | 2.39

Everything seems great except for the meanmapq, which I found surprising initially. Then I remembered that in this case I'd assembled the reads with both megahit and spades-meta and combined the outputs. Both assemblers assembled this particular virus well, and the highly similar contigs were leading to the low mapq. If using multiple assemblers, would you recommend running cenotetaker3 on the resulting assemblies separately? Or running some kind of contig de-duplication step first?

Finally, as you can see above, I set the minimum contig length as 500 nt. Is there a particular reason that the default threshold is 1000 nt, or is it more that there needs to be some kind of threshold and 1000 seemed sensible?

Thanks!

mtisza1 commented 9 months ago

Hi Charles,

Thanks for the nice comments and I appreciate the feedback and questions.

1) Cenote-Taker 3 uses minimap2 for read alignment, so it may use paired-end info depending on how you specify the reads (see the minimap2 documentation).

In your case, it seems like it is using paired-end info. Here's the relevant piece of code from Cenote-Taker 3; the ${READS} variable is just a space-separated list of the files passed to --reads.

# Map the reads to the pruned hallmark contigs with minimap2's short-read preset,
# sort the alignments, and write per-contig coverage stats with samtools coverage.
minimap2 -t $CPU -ax sr ${TEMP_DIR}/oriented_hallmark_contigs.pruned.fasta ${READS} |\
    samtools sort - | samtools coverage -o ${TEMP_DIR}/mapping_reads/oriented_hallmark_contigs.pruned.coverage.tsv -
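(For what it's worth, the minimap2 documentation states that when two query files are given with the short-read preset they are mapped as read pairs, so passing R1 and R2 via --reads should indeed be treated as paired-end.)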

2) You can see that there is nothing too complicated about the read-mapping step, so I would imagine you are correct that the meanmapq score is so low due to duplicate contigs. You could probably split the contigs from the Cenote-Taker 3 output based on their assembler (regex using the header names that megahit and metaspades produce). Then, you could redo the minimap2/samtools steps with your reads and check the meanmapq values.
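Something along these lines could work as a quick check. This is only a sketch: it assumes metaspades headers start with ">NODE_" and megahit headers with ">k" plus the k-mer size (e.g. ">k141_..."), and that those original names are still visible in the FASTA you split; adjust the file names, patterns, and thread count to your run.

# Split the combined contig FASTA by assembler using the original header prefixes
# (patterns are assumptions; adjust them to match your actual headers).
awk '/^>/ {out = (/^>NODE_/ ? "metaspades_only.fasta" : "megahit_only.fasta")} {print > out}' assembly.fa

# Re-run the same minimap2/samtools steps against each subset and compare the meanmapq columns.
for SUBSET in metaspades_only megahit_only; do
    minimap2 -t 8 -ax sr ${SUBSET}.fasta reads_R1.fq.gz reads_R2.fq.gz |\
        samtools sort - | samtools coverage -o ${SUBSET}.coverage.tsv -
done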

You can get the read depth values for each virus contig in the .gbf and .cmt files in the sequin_and_genome_maps directory. I will add this as a field in the output table in the next update (not sure when this will happen).
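In the meantime, a quick way to pull those values out for all contigs at once (a sketch, assuming the structured-comment field in the .cmt files is labelled "Genome Coverage"; adjust the pattern if your files differ):

# Collect the per-contig read depth values recorded in the .cmt files.
grep -H "Genome Coverage" sequin_and_genome_maps/*.cmt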

I understand you want to get the best contigs from your metagenome and are therefore combining megahit and metaspades contigs. This is kind of a tricky proposition because deduplication might not work 100% of the time. Depending on where the contigs "break", you may have a contig from megahit that aligns a bit with two different contigs from metaspades, or vice versa. In this case, your deduplication pipeline wouldn't resolve the complexity and you'd have some duplicated sequences in your final contig set. There's no easy solution here, AFAIK.
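That said, if you do want to dereplicate the combined set before annotation, clustering at high identity is a common (if imperfect) starting point. This is only a sketch with illustrative parameters, and it still won't resolve the partial-overlap cases described above:

# Sketch only: cluster contigs at >=95% identity, requiring the shorter sequence to be
# covered over >=85% of its length; cd-hit-est keeps the longest sequence in each cluster.
cd-hit-est -i assembly.fa -o assembly.derep.fa -c 0.95 -aS 0.85 -n 10 -M 0 -T 8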

I'm by no means endorsing this tool (I have not tried it and can imagine a lot of artifactual results), but COBRA tries to improve metagenomic virus contigs.

3) Feel free to use whatever minimum contig length you are comfortable with. 1000 is just a nice round number that I use. You should consider the goals of your project (are 500 nt contigs going to further these goals?).

Good luck and I hope this helped.

Feel free to follow up.

Mike

charlesfoster commented 9 months ago

Hi Mike,

Thanks for the useful and comprehensive reply, and apologies for my slow response. It all makes sense, and the tool has slotted nicely into my workflows.

Cheers, Charles

charlesfoster commented 9 months ago

Sorry, just an additional note: the coverage does not seem to propagate through to the .cmt and .gbf files. For example, this info is in the ct_processing/mapping_reads/oriented_hallmark_contigs.pruned.coverage.tsv file:

rname | startpos | endpos | numreads | covbases | coverage | meandepth | meanbaseq | meanmapq
-- | -- | -- | -- | -- | -- | -- | -- | --
ct3_1 | 1 | 9379 | 176187 | 9379 | 100 | 2535.66 | 37.7 | 56.8

Yet the .cmt file says "Genome Coverage x", and the .gbf file says "Genome Coverage :: x".

Also, I notice that often the 'organism' field in the summary will say something like "Hepacivirus sp. ctQ24PE". Is the "ctQ24PE" part just randomly generated? Thanks

mtisza1 commented 9 months ago

Charles, thanks for following through on this. Clearly, parsing of the coverage calculation was not tested thoroughly on my end!

I'm going to revamp that module to make it more robust and add the coverage value to the summary table. I hope to have this feature in a new release on bioconda within a week.

To your other question: yes, the tag at the end of the organism string (e.g. the "ctQ24PE" in "Hepacivirus sp. ctQ24PE") is randomly generated. The idea is to give all virus sequences a concise and unique tag if they end up in GenBank.
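Purely for illustration (a simplified sketch, not necessarily the exact code Cenote-Taker 3 uses), a tag like that can be built from a handful of random alphanumeric characters:

# Illustrative sketch only: make a "ct" tag followed by 5 random alphanumeric characters (e.g. ctQ24PE).
echo "ct$(tr -dc 'A-Za-z0-9' < /dev/urandom | head -c 5)"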

Best,

Mike

mtisza1 commented 8 months ago

Charles,

I believe your issue has been fixed in the new release.

The coverage is now correctly displayed in the .cmt files and is included as a column in the virus summary table.

v.3.3.0 should be live on bioconda by the end of the week (takes a couple days for it to go through, usually).

Please let me know if you have any other issues or questions.

Mike