nanoporetech / tombo

Tombo is a suite of tools primarily for the identification of modified nucleotides from raw nanopore sequencing data.

Resquiggle to a subset of the genome #366

Open jcolicchio-soundag opened 3 years ago

jcolicchio-soundag commented 3 years ago

I was wondering if anyone has recommendations for using tombo resquiggle to align fast5 files to a subset of a reference genome rather than to the whole reference. We have been doing adaptive sampling for a species with a large genome to enrich for a small fraction of the genome, and now want to map the data to our reference. We were hoping to map only to this small subset rather than to the whole genome in order to save time. However, as one might expect, with the default parameters we are getting far more reads mapping to this subset than we expect (even accounting for good enrichment from adaptive sampling). This is almost certainly because reads that would map better elsewhere in the genome map "ok" to somewhere in our subset and get mis-mapped there instead of being thrown out.

To get around this, I see two options:

1) Just map to the whole reference genome. I am hoping to avoid this, as it will be very time-consuming and computationally intensive.
2) Adjust the resquiggle/minimap parameters to keep only reads that map much better than the defaults require. My intuition is to simply lower the signal-matching-score, but I was wondering if there are other parameters that make sense to tweak as well to prevent off-target reads from mapping to our subset reference genome.
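One way to make option 2 concrete, offered as a rough sketch rather than anything Tombo itself provides: align the basecalled reads to the subset reference with minimap2 beforehand, then keep only the read IDs whose alignment has high identity and covers most of the read. The code below parses standard PAF columns (query length, query start/end, residue matches, alignment block length). The function name `confident_read_ids` and both thresholds are hypothetical and would need tuning on real data. It deliberately ignores mapping quality, since MAPQ may stay high for a mis-mapped read when its true locus is missing from the reference.

```python
from typing import Iterable, Set


def confident_read_ids(paf_lines: Iterable[str],
                       min_identity: float = 0.90,    # hypothetical threshold
                       min_query_cov: float = 0.80    # hypothetical threshold
                       ) -> Set[str]:
    """Return IDs of reads with at least one alignment passing both thresholds."""
    keep = set()
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or truncated lines
        read_id = fields[0]
        q_len, q_start, q_end = int(fields[1]), int(fields[2]), int(fields[3])
        n_match, block_len = int(fields[9]), int(fields[10])
        if q_len == 0 or block_len == 0:
            continue
        identity = n_match / block_len            # matches per aligned column
        query_cov = (q_end - q_start) / q_len     # fraction of the read aligned
        if identity >= min_identity and query_cov >= min_query_cov:
            keep.add(read_id)
    return keep


# Made-up PAF records for illustration only.
example = [
    "read_a\t1000\t10\t990\t+\tchr1_sub\t50000\t100\t1080\t940\t985\t60",
    "read_b\t1000\t400\t700\t+\tchr1_sub\t50000\t2000\t2300\t240\t310\t60",
]
print(sorted(confident_read_ids(example)))
```

Here `read_a` aligns over 98% of its length at about 95% identity and is kept, while `read_b` aligns over only 30% of its length and is dropped. The resulting ID set could then be used to choose which fast5 files to pass on to resquiggle.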

jcolicchio-soundag commented 3 years ago

As expected, minimap2 gives an error (it just says "Failed") while trying to build an index for the whole (~14gb) genome.