nanoporetech / megalodon

Megalodon is a research command line tool to extract high accuracy modified base and sequence variant calls from raw nanopore reads by anchoring the information rich basecalling neural network output to a reference genome/transriptome.
Other
197 stars 30 forks source link

"megalodon_extras modified_bases create_motif_bed" does not return all of the positions #335

Open ziczhang opened 1 year ago

ziczhang commented 1 year ago

Hi,

I used the following command to produce a bed file of the location of CpGs in hg38.

megalodon_extras modified_bases create_motif_bed --motif CG 0 --out-filename motif_sites.bed UCSC.hg38.fa

and the motif_sites.bed started from

chr1    11460   11461   .   .   +
chr1    11461   11462   .   .   -
chr1    11479   11480   .   .   +
chr1    11480   11481   .   .   -

However, the actual CpGs in hg38 start at

chr1    10468   10469   .   .   +
chr1    10469   10470   .   .   -
chr1    10470   10471   .   .   +
chr1    10471   10472   .   .   -

I don't know if this is the intended behavior, but this is because "megalodon_extras modified_bases create_motif_bed" ignores lowercase sequences in the fasta file.

This can be fixed by modifying

def compile_motif_pat(raw_motif):
    ambig_pat_str = "".join(
        "[{}]".format(SINGLE_LETTER_CODE[letter]) for letter in raw_motif
    )
    # add lookahead group to serach for overlapping motif hits
-    return re.compile("(?=({}))".format(ambig_pat_str))
+    return re.compile("(?=({}))".format(ambig_pat_str),flags=re.IGNORECASE)

def compile_rev_comp_motif_pat(raw_motif):
    ambig_pat_str = "".join(
        "[{}]".format("".join(comp(b) for b in SINGLE_LETTER_CODE[letter]))
        for letter in raw_motif[::-1]
    )
    # add lookahead group to serach for overlapping motif hits
-    return re.compile("(?=({}))".format(ambig_pat_str))
+    return re.compile("(?=({}))".format(ambig_pat_str),flags=re.IGNORECASE)

in megalodon_helper.py.

Best,

Zicong