nanoporetech / modkit

A bioinformatics tool for working with modified bases
https://nanoporetech.com/

Request: additional output columns for extract #270

Open OberonDixon opened 1 day ago

OberonDixon commented 1 day ago

Hi @ArtRand,

Per our recent conversation at the T2T consortium meeting, I was hoping for the option to add a few extra columns to the extract output to improve downstream processing in our new dimelo package, which uses modkit as part of its backend. The fields I would currently like to see are the following (my naming could probably be improved):

read_start_in_ref: 
    the start of the read in reference genome coordinates. Currently this can be roughly inferred from the 
    ref_position and read_position columns, but any indels in the read relative to the reference mean that 
    the inferred reference start for the read is slightly off, and off by different amounts 
    for different modification motifs.

read_end_in_ref: 
    the end of the read in reference genome coordinates, for the same reasons as read_start_in_ref
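
To illustrate why the inferred start drifts, here is a minimal sketch that walks a CIGAR string and compares the true alignment start with the value obtained by subtracting read_position from ref_position. The function names, the example CIGAR, and the assumption of 0-based forward-strand coordinates are mine and are not taken from modkit:

```python
import re

# Hypothetical helper names; this is not modkit code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def aligned_pairs(ref_start, cigar):
    """Yield (read_pos, ref_pos) for every aligned base, 0-based."""
    read_pos, ref_pos = 0, ref_start
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in "M=X":          # consumes both read and reference
            for _ in range(n):
                yield read_pos, ref_pos
                read_pos += 1
                ref_pos += 1
        elif op in "IS":         # insertion / soft clip: read only
            read_pos += n
        elif op in "DN":         # deletion / skip: reference only
            ref_pos += n
        # H and P consume neither read nor reference

def naive_start(read_pos, ref_pos):
    """Start inferred from a single row as ref_position - read_position."""
    return ref_pos - read_pos

# Assumed example: true start 1000, one 2 bp deletion and one 3 bp insertion.
pairs = dict(aligned_pairs(1000, "10M2D10M3I10M"))
estimates = {rp: naive_start(rp, pairs[rp]) for rp in (5, 15, 25)}
print(estimates)  # {5: 1000, 15: 1002, 25: 999}
```

Each row gives a different estimate depending on how many inserted or deleted bases precede it, which is why a dedicated column taken from the alignment record would be more reliable.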

motif: 
    to make it easy to extract multiple, potentially overlapping motifs into the same .txt file and identify them 
    properly later, it would be helpful if, in addition to the kmer context, we could get the specific motif that 
    the line in question was queried with. Currently I would need to re-implement a motif finder on the 5mer, 
    including handling ambiguity codes like H or D, which you've clearly already got under the hood, so I'd
    prefer not to replicate existing logic and instead get a readout I can parse.
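
For context, the downstream workaround I would otherwise need looks roughly like the sketch below. It assumes the 5mer is centered on the modified base (index 2) and that each motif is given as a sequence plus the offset of the modified base within it; the function names and the "MOTIF,offset" label format are illustrative only:

```python
# IUPAC nucleotide ambiguity codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def motif_matches(kmer, motif, offset, center=2):
    """True if `motif` matches `kmer` with motif[offset] on kmer[center]."""
    start = center - offset
    end = start + len(motif)
    if start < 0 or end > len(kmer):
        # The motif does not fit inside the kmer window, so it cannot be checked.
        return False
    return all(base in IUPAC[code]
               for base, code in zip(kmer[start:end], motif))

def matching_motifs(kmer, motifs):
    """Label every (motif, offset) pair that matches the kmer."""
    return [f"{m},{o}" for m, o in motifs if motif_matches(kmer, m, o)]

motifs = [("CG", 0), ("CHH", 0), ("CHG", 0), ("GATC", 1)]
print(matching_motifs("ATCGA", motifs))  # ['CG,0']
print(matching_motifs("GACAT", motifs))  # ['CHH,0']
print(matching_motifs("TGATC", motifs))  # ['GATC,1']
```

A motif longer than the window on either side cannot be resolved this way, which is another reason a motif column written by the tool itself would be preferable.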

Let me know if there's anything I can clarify! Best, Oberon

ArtRand commented 1 day ago

Hello @OberonDixon,

Nice to hear from you. These all seem like good suggestions. I'll send you a test build once I have it. Thanks!