How to aggregate many batches per-read results into site level outputs?

nanoporetech / megalodon

Megalodon is a research command-line tool to extract high-accuracy modified base and sequence variant calls from raw nanopore reads by anchoring the information-rich basecalling neural network output to a reference genome/transcriptome.


How to aggregate many batches per-read results into site level outputs? #296

Closed liuyang2006 closed 2 years ago

liuyang2006 commented 2 years ago

Hi, I am running 10 or more batches of Megalodon in parallel on a single sample to produce per-read outputs. I then want to combine the per-read files from all batches (i.e., the per_read_modified_base_calls.txt files) and aggregate the combined per-read results into site-level outputs.

Is there an existing function in Megalodon to combine batches, or to perform aggregation on the combined per-read outputs to produce site-level results?

Thank you! Best, Yang

marcus1487 commented 2 years ago

See the megalodon_extras merge modified_bases command to merge the per-read modified base database files, and the megalodon_extras aggregate run command to aggregate from the merged per-read statistics. Alternatively, you could aggregate each batch separately and then use the megalodon_extras merge aggregated_modified_bases command to merge the per-site aggregated results. This second approach gives the same results, but it avoids copying the per-read statistics into a single database and saves the compute needed to sort and index that larger database. I would highly recommend aggregating each batch and then merging the results.
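To illustrate why merging per-site aggregated results can give the same answer as aggregating one combined per-read database, here is a minimal Python sketch. It is not Megalodon's implementation: it assumes each batch's site-level output can be reduced to a read count (coverage) and a modified-read count per site, and that batches contain disjoint reads. The function name `merge_site_counts` and the data layout are hypothetical.

```python
from collections import defaultdict


def merge_site_counts(batches):
    """Merge per-batch site-level counts by summing coverage and modified reads.

    Hypothetical layout (an assumption, not Megalodon's file format):
    each batch maps a site key (chrom, pos, strand) to (coverage, n_modified).
    Returns a dict mapping each site to (coverage, n_modified, percent_modified).
    """
    totals = defaultdict(lambda: [0, 0])
    for batch in batches:
        for site, (coverage, n_mod) in batch.items():
            totals[site][0] += coverage
            totals[site][1] += n_mod
    return {
        site: (cov, n_mod, 100.0 * n_mod / cov if cov else 0.0)
        for site, (cov, n_mod) in totals.items()
    }


# Two toy batches that share one site and differ on another.
batch_a = {("chr1", 100, "+"): (10, 4), ("chr1", 200, "+"): (5, 5)}
batch_b = {("chr1", 100, "+"): (30, 6)}

merged = merge_site_counts([batch_a, batch_b])
for site, (cov, n_mod, pct) in sorted(merged.items()):
    print(site, cov, n_mod, f"{pct:.1f}%")
```

Because the counts are additive across disjoint sets of reads, summing them per site gives the same totals as counting over all reads at once. Note that averaging the per-batch percentages without weighting by coverage would not.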

liuyang2006 commented 2 years ago

Great! Thank you @marcus1487! This is what I wanted.