nanoporetech / megalodon

Megalodon is a research command line tool to extract high accuracy modified base and sequence variant calls from raw nanopore reads by anchoring the information-rich basecalling neural network output to a reference genome/transcriptome.

Barcode support/ demultiplexing #43

Open ChloeDG opened 4 years ago

ChloeDG commented 4 years ago

Hi there, is there a way to demultiplex simultaneously while basecalling for mod bases using Megalodon? :-)

marcus1487 commented 4 years ago

Megalodon does not currently support demultiplexing. There is no time frame for this support at the moment.

mollybrothers commented 3 years ago

Megalodon can take a list of readIDs to analyze, right? So is a possible (somewhat silly) workaround for this to run guppy basecaller and then barcoder separately from megalodon, extract the readIDs from each barcode, and then run megalodon separately for each list?
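
The read-ID extraction step of this workaround can be sketched as follows. This is a minimal sketch, not part of Megalodon or Guppy: it assumes the barcoder writes a tab-separated summary with `read_id` and `barcode_arrangement` columns (check the header of your own file and adjust the column names), and the function names and output file names are hypothetical.

```python
import csv
from collections import defaultdict
from pathlib import Path


def read_ids_by_barcode(summary_path, id_col="read_id",
                        barcode_col="barcode_arrangement"):
    """Group read IDs by barcode from a tab-separated summary file.

    The default column names are assumptions; pass your own if they differ.
    """
    groups = defaultdict(list)
    with open(summary_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            groups[row[barcode_col]].append(row[id_col])
    return dict(groups)


def write_read_id_lists(groups, out_dir):
    """Write one <barcode>.read_ids.txt per barcode, one read ID per line."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    for barcode, read_ids in groups.items():
        path = out_dir / f"{barcode}.read_ids.txt"  # hypothetical naming scheme
        path.write_text("\n".join(read_ids) + "\n")
        paths[barcode] = path
    return paths
```

Each resulting text file is then a candidate read-ID list for one per-barcode Megalodon run.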

Having the ability to demultiplex in megalodon would really help with throughput!

marcus1487 commented 3 years ago

Yes. This would certainly be one workaround.

Another workaround (without the overhead of running the basecalling twice) would be to run Megalodon with only per-read outputs including basecalls (e.g. --outputs basecalls per_read_mods). The output basecalls.fastq could then be run through a demultiplexing program, and the resulting per-barcode lists of read IDs passed to the megalodon_extras aggregate run command to produce the desired results for each barcode.

Standard bioinformatic analysis could also be used to demultiplex other Megalodon outputs in this workaround. For example, mappings and mod_mappings produce BAM output, which could be split by read ID into a new BAM file for each barcode.
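
Splitting alignments by read ID could look roughly like the sketch below. It works on SAM text lines (for example the output of `samtools view -h` on a BAM file) using only the standard library; the function name, the `read_to_barcode` mapping, and the `unclassified` label are assumptions for illustration, and a real pipeline would more likely use samtools or pysam directly.

```python
def split_sam_by_barcode(sam_lines, read_to_barcode, unassigned="unclassified"):
    """Split SAM text lines into per-barcode line lists.

    sam_lines: iterable of SAM lines (header lines start with '@').
    read_to_barcode: dict mapping read ID (QNAME) to a barcode label.
    Returns {barcode: [header lines..., record lines...]}; every barcode
    gets a full copy of the header so each piece is a valid SAM file.
    """
    header = []
    records = {}
    for line in sam_lines:
        if line.startswith("@"):
            header.append(line)
            continue
        read_id = line.split("\t", 1)[0]  # QNAME is the first SAM column
        barcode = read_to_barcode.get(read_id, unassigned)
        records.setdefault(barcode, []).append(line)
    return {barcode: header + lines for barcode, lines in records.items()}
```

Each value can be written to its own SAM file and converted back to BAM. Reads missing from the mapping fall into the `unclassified` group rather than being dropped.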

The issue with fully integrated barcoding support is that many output streams would have to be implemented for every output type (even though most would likely never be used; e.g. signal_mappings). This also adds complexity to an already pretty complex system, likely introducing bugs and more maintenance. Thus fully integrated demultiplexing is not likely to be implemented soon.

mollybrothers commented 3 years ago

I appreciate the complexity, so thank you for the additional possible workarounds!

marcus1487 commented 3 years ago

Thinking about this a bit further, it seems a solution to this might be to add barcode assignment to applicable outputs. New issues can be raised as further barcoding outputs might be requested. This would bypass the issue of opening many output streams while providing the barcoding output with hopefully minimal fuss for downstream processing.

As an initial implementation I would propose adding barcoding results to the sequencing_summary.txt and mapping_summary.txt output files and adding a read group to mapping outputs. The read group annotated SAM/BAM/CRAM output would allow splitting into barcode files via samtools split. While mods and variants outputs would not be directly supported (as this would require multiple output streams), using the barcode assignments from the mapping_summary.txt output would feed directly into the megalodon_extras aggregate run command (via the read ids option) given the per_read_mods or per_read_variants outputs. This proposal would leave out some of the other outputs from barcoding that seem less applicable (signal_mappings, per_read_refs).
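
To illustrate what read-group annotation would give downstream tools, here is a minimal sketch that tags SAM text lines with a barcode read group. It is not Megalodon's implementation (none exists in this thread): the function name and the `read_to_barcode` mapping are hypothetical, and the input records are assumed to carry no existing RG tag.

```python
def add_read_groups(sam_lines, read_to_barcode):
    """Return SAM lines with @RG header lines and RG:Z tags added.

    sam_lines: list of SAM text lines, each ending in a newline.
    read_to_barcode: dict mapping read ID (QNAME) to a barcode label.
    Reads without a barcode assignment are passed through untagged.
    """
    header = [line for line in sam_lines if line.startswith("@")]
    records = [line for line in sam_lines if not line.startswith("@")]
    # One @RG header line per barcode, using the barcode label as the group ID.
    rg_header = [f"@RG\tID:{b}\n" for b in sorted(set(read_to_barcode.values()))]
    tagged = []
    for line in records:
        read_id = line.split("\t", 1)[0]  # QNAME is the first SAM column
        barcode = read_to_barcode.get(read_id)
        if barcode is not None:
            line = line.rstrip("\n") + f"\tRG:Z:{barcode}\n"
        tagged.append(line)
    return header + rg_header + tagged
```

Once the tagged output is converted to BAM, `samtools split` can write one file per read group, which is the splitting step the proposal relies on.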

Does this seem like a sufficient resolution to this issue? Still no timeline for implementation/release, just want to figure out the work involved here.

mollybrothers commented 3 years ago

Sounds reasonable to me for those that want aggregated reads separated by barcode.

I'm also very interested in getting the per_read_modified_base_calls.db and/or per_read_modified_base_calls.txt separated by barcode as well. The SAM/BAM/CRAM split-by-barcode you're proposing would allow the info from Mm and Ml tags to be separated along with their reads, so demultiplexing those values would be possible with your approach if I'm understanding correctly. But is there a way to demultiplex the per_read_modified_base_calls files?

marcus1487 commented 3 years ago

Yes, the proposal would allow the mod_mappings and mod_basecalls outputs (with Mm and Ml tags) to be separated by barcode.

It might make sense to add a barcode field to the mod and variant database reads table (though this might be integrating barcoding a bit too deeply). In either case a command megalodon_extras modified_bases split_by_barcode or megalodon_extras modified_bases split_by_read_ids could be added (similar to megalodon_extras modified_bases split_by_motif).

amauryavril commented 2 years ago

Hello,

Does Megalodon now support barcoding? Besides what you already proposed, would it be possible to feed Megalodon with demultiplexed fast5 files? I was thinking, for example, of getting the fast5 files for each barcode with another tool such as fast5_demultiplexer (https://github.com/duceppemo/fast5_demultiplexer), and using these fast5 files as input for Megalodon. Thanks!

mollybrothers commented 2 years ago

If you have a list of read IDs (as a .txt file) that corresponds to a given barcode, you can feed it into megalodon using the --read-ids-filename $READ_IDS flag, and megalodon will only analyze those reads. You still have to call megalodon separately for each barcode.
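
A minimal sketch of preparing those separate calls, one per barcode. The helper name, the output directory naming, and the example file names are hypothetical; the other arguments a real run needs (reference, basecaller settings, and so on) are deliberately left to `extra_args` rather than guessed here.

```python
import shlex


def megalodon_commands(fast5_dir, read_id_files, extra_args=()):
    """Build one megalodon command (as an argument list) per barcode.

    read_id_files: dict mapping barcode label to its read-ID list file.
    extra_args: any further megalodon arguments, passed through unchanged.
    """
    commands = {}
    for barcode, ids_path in sorted(read_id_files.items()):
        commands[barcode] = [
            "megalodon", str(fast5_dir),
            "--read-ids-filename", str(ids_path),
            # Separate output directory per barcode (hypothetical naming).
            "--output-directory", f"megalodon_{barcode}",
            *extra_args,
        ]
    return commands


# Print the commands for inspection instead of running them.
example = megalodon_commands(
    "fast5/",
    {"barcode01": "barcode01.read_ids.txt"},
    ["--outputs", "basecalls", "mod_mappings"],
)
for barcode, cmd in example.items():
    print(shlex.join(cmd))
```

Each argument list can be passed to `subprocess.run(cmd, check=True)` once the remaining arguments for your setup are filled in.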

There is also more info in #126 (which I have not tried, but looks to be successful)

amauryavril commented 2 years ago

Thank you for the quick answer! I might try that then. Otherwise, do you see any problem with running Megalodon on the demultiplexed fast5 files? This sounds easier to me.

mollybrothers commented 2 years ago

I'm not totally sure whether megalodon works for single fast5 files (which looks like the output from the demultiplexer you linked above). I would look at the megalodon documentation. I think in this situation, you would still need to either feed megalodon each list of fast5 files separately or use the suggestion by Marcus above and at #126 to demultiplex at the aggregation step.

mollybrothers commented 2 years ago

@amauryavril it does look like single fast5 files are supported (https://nanoporetech.github.io/megalodon/common_arguments.html?highlight=single%20fast5#required-argument)

amauryavril commented 2 years ago

Thank you for looking into it! I will try both methods and see what I can get.