Polishing of metagenomes

jessydit commented 3 years ago

Dear all,

I am using Medaka 1.2.3 to polish bacterial genomes. To give some context, I sequenced the total DNA of an insect on a MinION 9.4 flowcell to assemble the genomes of two bacterial symbionts that co-infect this insect. The reads were basecalled using Guppy HAC and assembled using Flye --metagenome. This produced two contigs corresponding to the complete genomes of the two bacteria.

However, I have trouble getting the polishing with Medaka (v1.2.3) right for both genomes. The problem is that the bacteria are closely related and one has a much higher coverage than the other (about 200x vs 30x). Polishing of the high-coverage genome worked perfectly, but the polishing of the low-coverage genome seems problematic, maybe due to interference of the reads from the high-coverage genome in the same dataset. As such, after polishing, the 16S and 23S ribosomal RNA gene sequences of the low-coverage genome were completely changed into those of the high-abundance genome. As this is a new genome and no complete reference sequence is available, I cannot tell in how far the rest of the genome may be altered in a similar way.

I tried filtering the reads prior to polishing with Medaka, in order to use only the reads that map to the target genome with very high mapping quality. This improved the result somewhat, but did not resolve the issue completely.

Is there a way to work around this problem within Medaka? I was wondering if variant calling could help, if there is a way to extract the reads belonging to each of the bacteria?

Many thanks!

cjw85 commented 3 years ago

Hi @jessydit,

Apologies for the lack of response here. I agree with you observations and possible rationale for the difficulties you are having. Deciding which reads it is meaningful to extract a consensus from is not a problem that medaka attempts to solve.

Performing diploid variant calling might prove fruitful if there is a sufficient density of differences to partition the reads into the two contigs. If flye has managed to detect the differences and output two contigs, it is possible that the variant calling approach will work to assign the reads. On the otherhand that the alignment of reads back to the contigs is leading to merging information would indicate that this might be difficult.

You may wish to ask @fenderglass if Flye provides a way to identify which reads support which contigs.

mikolmogorov commented 3 years ago

Flye does not keep the information about read assignment, unfortunately. I think filtering reads might be a solution - I would start from keeping only (i) primary alignments, (ii) MAPQ > 30 (iii) length > 10k. Visually checking IGV alignments in problematic regions (e.g. 16S) could be helpful too.

nanoporetech / medaka

Polishing of metagenomes #303