nanoporetech / megalodon

Megalodon is a research command line tool to extract high accuracy modified base and sequence variant calls from raw nanopore reads by anchoring the information-rich basecalling neural network output to a reference genome/transcriptome.

Reusing modified basecalling results #300

Open ziczhang opened 2 years ago

ziczhang commented 2 years ago

Hi,

I used Megalodon to call CpG with the following command

megalodon $FAST5 \
--output-directory $OUT \
--outputs basecalls mappings mod_mappings mods \
--reference $REF \
--devices $GPU1 \
--processes $THREADS \
--guppy-server-path $PATH \
--guppy-params $PARAM \
--guppy-config res_dna_r941_prom_modbases_5mC_v001.cfg \
--mod-motif m CG 0 \
--overwrite

If I then want to call another motif, such as mCNG, can I reuse the CpG output files with some commands?

Thanks, Zicong

colindaven commented 2 years ago

If you've got a modified bam file from megalodon, you can use this tool to extract other modified motifs AFAIK

https://github.com/epi2me-labs/modbam2bed

ziczhang commented 2 years ago

Hi Colin, yes, I know modbam2bed, but the motifs it can detect are limited to CpG, CHG, and CHH. So I'm looking for a way to detect more sequence motifs without having to run Megalodon multiple times.

colindaven commented 2 years ago

OK, great.

What about these (have never tried, but curious about them) ?

  -m, --mod_base=BASE        Modified base of interest, one of: 5mC, 5hmC, 5fC,
                             5caC, 5hmU, 5fU, 5caU, 6mA, 5oxoG, Xao.

ziczhang commented 2 years ago

Sorry, by sequence motif I mean more complex motifs like TAmCAG or something similar.

marcus1487 commented 2 years ago

I would suggest outputting all context modified bases from megalodon. Calls in new contexts cannot be generated without rerunning megalodon.

From the all context calls output, motif specific calls can be extracted with the megalodon_extras modified_bases create_motif_bed command followed by the bedtools intersect command.
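
The logic behind that two-step extraction can be pictured with a small pure-Python sketch. It does not call megalodon_extras or bedtools; the function names (`motif_sites`, `filter_calls`), the tuple layout of the calls, and the toy sequence are all hypothetical, and only the forward strand is considered. The motif TACAG with the modified C at offset 2 stands in for the TAmCAG example from this thread.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def motif_sites(ref_seq, motif, rel_pos):
    """Return 0-based forward-strand reference positions of the modified
    base for every motif occurrence (hypothetical helper)."""
    pattern = "".join(IUPAC[base] for base in motif.upper())
    # A lookahead lets overlapping motif occurrences be reported too.
    regex = re.compile("(?=(" + pattern + "))")
    return {m.start() + rel_pos for m in regex.finditer(ref_seq.upper())}

def filter_calls(calls, sites):
    """Keep only all-context calls whose position is a motif site.
    `calls` is an assumed list of (position, fraction_modified) tuples."""
    return [(pos, frac) for pos, frac in calls if pos in sites]

# Toy data: a made-up reference and made-up all-context calls.
ref = "TTACAGGCTACAGCG"
calls = [(3, 0.9), (7, 0.1), (10, 0.8), (13, 0.5)]

sites = motif_sites(ref, "TACAG", 2)
selected = filter_calls(calls, sites)
print(sorted(sites), selected)
```

Because the motif sites come from the reference sequence alone, this kind of filter can only ever select positions where the reference itself matches the motif.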

ziczhang commented 2 years ago

Thanks Marcus. I have read the same suggestion from you in another issue. It is very helpful, but as I understand it, megalodon_extras modified_bases create_motif_bed is limited to motifs present in the reference. Can it also detect motifs created by variants? For example, if the following mutation is present in the sample, can the modification on this C be detected?

Ref:  NNNAGNNN
         ↓
Read: NNNCGNNN

marcus1487 commented 2 years ago

I would suggest that this is a custom processing request which I am not sure we can support without a larger use case. You can see the core logic to implement such a request in the remora code here.

Could you specify what other output you would require in this output format? For example, such output would be per-read, so it could not include fraction modified or other aggregated results, since each read will have differing basecalls at reference locations.
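
To make the per-read point concrete, here is a toy Python sketch. It is not the remora logic referred to above; the helper name `read_motif_hits` and the sequences are hypothetical, and it assumes ungapped alignments so that read and reference coordinates line up one to one (real alignments contain indels and would need coordinate mapping).

```python
def read_motif_hits(seq, motif, rel_pos):
    """Return 0-based positions of the candidate modified base wherever
    `seq` contains `motif` exactly (hypothetical helper)."""
    k = len(motif)
    return [
        start + rel_pos
        for start in range(len(seq) - k + 1)
        if seq[start:start + k] == motif
    ]

# Toy data mirroring the A>C example in this thread (N replaced by T).
ref = "TTTAGTTT"
reads = {
    "read1": "TTTCGTTT",  # carries the variant, so a CG motif appears
    "read2": "TTTAGTTT",  # matches the reference, so no CG motif
}

ref_hits = read_motif_hits(ref, "CG", 0)
per_read = {name: read_motif_hits(seq, "CG", 0) for name, seq in reads.items()}
print(ref_hits, per_read)
```

The reference has no CG site here, and the two reads disagree on whether one exists, which is why such calls would be reported per read rather than aggregated at a reference position.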

ziczhang commented 2 years ago

I see where I went wrong. The modified_bases.5mC.bed is called against the reference genome sequence, not the basecalled fastq. So, if I want to find a specific motif, your suggestion of using megalodon_extras modified_bases create_motif_bed is correct, but if I also want to examine methylation differences around SNPs, I have to use a modified reference genome to re-call methylation, right?

Is it possible to use a previously basecalled fastq file and skip the basecall step when calling methylation using a different reference genome?