rotary-genomics / rotary

Assembly/annotation workflow for Nanopore-based microbial genome data containing circular DNA elements
BSD 3-Clause "New" or "Revised" License

Nanopore Read Decontamination #84

Open LeeBergstrand opened 7 months ago

LeeBergstrand commented 7 months ago

Problem Description

We are currently decontaminating the short reads using bbduk.sh. Would it make sense to decontaminate the long reads as well?

Problem Solution

I've seen certain pipelines use minimap2 read mapping to decontaminate nanopore reads. Map the reads to the contaminant reference genome and keep those reads that don't match. Selecting the best cutoff values may be an issue.
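A minimal sketch of the filtering step, assuming the long reads have already been mapped to the contaminant reference with minimap2 and the alignments written in PAF format (minimap2's default output). The function name `contaminant_read_ids` and the cutoffs `min_identity` and `min_query_cov` are hypothetical placeholders, not benchmarked values. Reads whose IDs are returned would be dropped; everything else is kept.

```python
def contaminant_read_ids(paf_lines, min_identity=0.90, min_query_cov=0.50):
    """Return IDs of reads that align confidently to the contaminant reference.

    Both cutoffs are illustrative assumptions and would need benchmarking.
    """
    flagged = set()
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        read_id = fields[0]
        read_len = int(fields[1])                     # PAF column 2: query length
        q_start, q_end = int(fields[2]), int(fields[3])
        n_match = int(fields[9])                      # PAF column 10: matching bases
        block_len = int(fields[10])                   # PAF column 11: alignment block length
        if read_len == 0 or block_len == 0:
            continue
        identity = n_match / block_len                # approximate alignment identity
        query_cov = (q_end - q_start) / read_len      # fraction of the read aligned
        if identity >= min_identity and query_cov >= min_query_cov:
            flagged.add(read_id)
    return flagged


# Two made-up alignment records: read1 aligns over most of its length,
# read2 only over a short stretch.
example_paf = [
    "read1\t1000\t0\t950\t+\tcontam\t5000\t100\t1050\t920\t960\t60",
    "read2\t1000\t0\t200\t+\tcontam\t5000\t100\t300\t190\t200\t10",
]
print(contaminant_read_ids(example_paf))  # prints {'read1'}
```

Requiring both identity and query coverage is one possible design choice: it avoids discarding a target read just because a short region (for example, a shared mobile element) matches the contaminant.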

LeeBergstrand commented 7 months ago

@jmtsuji, what are your thoughts on this?

jmtsuji commented 7 months ago

@LeeBergstrand minimap2-based read mapping should work in theory, but I have never tried it and don't know how accurate it is. I would be especially concerned about decontamination accuracy for old 9.4.1 flow cell data (5-7% error), whereas I would be more confident trying this on newer 10.4.1 (Q20 = 1% error) data. We would need to benchmark any decontamination rule on a good reference dataset or find a nice paper that performs this benchmarking. Also, I wonder if a long-read decontamination rule should be turned off by default in the config even if we add the feature. Thoughts?
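For reference, the error rates quoted above relate to Phred quality scores through Q = -10 · log10(P), so Q20 is a 1% per-base error rate and 5-7% error is roughly Q11.5 to Q13. A small sketch of the conversion (the function names are illustrative):

```python
import math


def phred_to_error(q):
    """Convert a Phred quality score to a per-base error probability."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Convert a per-base error probability to a Phred quality score."""
    return -10 * math.log10(p)


print(phred_to_error(20))                 # Q20 expressed as an error probability
print(round(error_to_phred(0.05), 1))     # 5% error as a Phred score
print(round(error_to_phred(0.07), 1))     # 7% error as a Phred score
```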

LeeBergstrand commented 7 months ago

@jmtsuji This makes sense to me. Mike Lynch said it is a low-priority task because of the testing required, and our sequencing lab is quite good. Would high molecular weight DNA be more challenging to contaminate? We should do it down the line, but it is probably not a top priority.

jmtsuji commented 7 months ago

@LeeBergstrand OK, sounds good to set this as a low-priority task for now. Regarding high molecular weight (HMW) DNA: HMW DNA will get you better quality assemblies in general. I haven't done any reading on this, but it seems to me that using HMW DNA should also substantially reduce the chances of having a chimeric assembly of your target organism and a contaminant sequence. (For example, even if the contaminant sequence and your target organism share the same ~5 kb transposon region, ~10 kb+ long reads that span the whole transposon region should allow the assembler to discriminate which reads belong to the target organism vs. the contaminating DNA.) If you are able to assemble closed circular contigs of your target organism, my guess is that, even if the reads aren't super long, chimeric assemblies with contaminant sequences should already be rare... the contamination should show up as a separate contig and not be integrated into any of the circular ones that belong to the target organism, unless long regions of the contaminant and the target are identical. Does that make sense?

P.S. I imagine that you should see a trend where higher average read length correlates with a lower frequency of chimeras in the assembly. It would be interesting to know at what average read length chimeras effectively become a non-issue for most use cases with microorganisms. For example, it might be that an average read length > 20 kb basically solves the chimera problem in most use cases (this would be my uneducated guess).

LeeBergstrand commented 7 months ago

This makes sense. Good point. Thanks!

LeeBergstrand commented 7 months ago

Would high molecular weight DNA be more challenging to contaminate?

@jmtsuji My thought is that DNA tends to fragment over time due to naturally present DNase enzymes and other environmental factors. So, if you have good aseptic technique, any remaining contamination that enters the sequencing run from kits, tools, or air should fall in the low molecular weight category because it has been fragmented over time. These short contaminating fragments should be filtered out during long-read size selection. I don't know if this is true, but that's my intuition.

jmtsuji commented 7 months ago

@LeeBergstrand Good thoughts. I think it's important to consider the impact of different sources of contamination:

I agree that kit contamination (from the manufacturer) should be short DNA. The concentration of this kind of contaminant DNA should also be very low. If you have enough DNA for PCR-free Nanopore sequencing (which is the typical way that Nanopore-based DNA sequencing is done at the moment), then I assume that kit contamination shouldn't really be a problem (you might get a couple reads or none at all due to having so much real sample). I assume that contamination from air/tools would be similarly low in concentration (and short in DNA length), in most cases. So overall, I agree that working with long DNA (or using a relatively long min. length cutoff during analysis) would help remove this kind of contamination, but I also think this kind of trace contamination shouldn't really be an issue for ligation-based (PCR-free) Nanopore libraries. By contrast, sometimes short read sequencing is done with very low input amounts that are amplified by PCR, so in this case trace DNA contamination might be more of an issue.

If you have contamination from something like a dirty lab or sampling environment (e.g., sampling or DNA extraction was done in the field), then I assume you will have longer DNA fragments from the non-target organisms. I imagine this would be an issue especially for metagenomics from environmental samples. If some of your own (human) DNA got into the sampling vessel, then yes, the DNA length might be shorter on average than DNA from the target organism (I am assuming you are probably shedding dead cells with degrading DNA... not sure if this is true), so in this case working with long DNA from the target organism should help. But otherwise, I think the most valuable part of using long DNA is to avoid chimeric assemblies with other fragments. Sorry for the roundabout and long-winded discussion... I haven't thought about this in much detail before, so it is interesting to discuss!
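The "relatively long min. length cutoff during analysis" idea can be sketched as a simple read-length filter. This is a minimal illustration, assuming unwrapped 4-line FASTQ records; the function name `filter_fastq_by_length` and the default `min_length` of 1000 are hypothetical, not values used by the pipeline.

```python
from itertools import islice


def filter_fastq_by_length(lines, min_length=1000):
    """Yield (header, seq, plus, qual) for reads with at least min_length bases.

    Assumes each record is exactly four lines (no wrapped sequences).
    """
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            break  # end of input, or a truncated trailing record
        header, seq, plus, qual = (x.rstrip("\n") for x in record)
        if len(seq) >= min_length:
            yield header, seq, plus, qual


# Tiny made-up example: one 4-base read and one 10-base read.
example = ["@short", "ACGT", "+", "IIII", "@long", "ACGTACGTAC", "+", "IIIIIIIIII"]
kept = list(filter_fastq_by_length(example, min_length=8))
print([rec[0] for rec in kept])  # prints ['@long']
```

A filter like this would only help with short, degraded trace contamination; longer contaminant fragments from a field sample would pass straight through it.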

LeeBergstrand commented 7 months ago

I've seen some interesting results from the eDNA (environmental DNA) space regarding DNA turnover rates in certain environments. I'll give you the details on our upcoming call.

jmtsuji commented 7 months ago

Indeed, eDNA work would be a good place to look regarding sample contamination -- nice idea. Looking forward to chatting soon :-)