skovaka / uncalled4

MIT License
43 stars 4 forks source link

Usage advice #24

Open cnk113 opened 4 months ago

cnk113 commented 4 months ago

Hi Sam,

Thanks for all the effort on Uncalled4, this is really a great package. I've noticed you started working on readstats on the dev branch which would be helpful for what I'm trying to do, which is consensus correction using the features outputted per read. I was wondering aside from current and dwell do you perhaps have an intuition on what other features are relatively independent from the sequencer that is useful? Also in terms of pulling the current and dwell time do you have a more efficient way to pull the information from the BAM tag? I'm having trouble pulling it per read per-base resolution for current/dwell.

Thanks, Chang

skovaka commented 4 months ago

I think uncalled4 convert --bam-in <file.bam> --tsv-cols dtw.current,dtw.current_sd,dtw.dwell --tsv-out might accomplish what you want? See alignment layers for more --tsv-col options. I should expose readstats too, but the idea behind that is to output statistics associated with an entire read, not per-position-statistics, which doesn't seem very useful and I mainly planned to include it for symmetry. If you have an application for per-read statistics I could finalize readstats sooner though.

You can also iterate over alignments in Python, which might be slightly more efficient than outputting to a TSV, although the module isn't very user-friendly currently. I might work on refining it to provide an actual Uncalled4 API, but for now you can use this code:

from uncalled4.tracks import Tracks
from uncalled4 import Config
import sys

bam_in = sys.argv[1]

conf = Config()
conf.tracks.io.bam_in = bam_in
tracks = Tracks(conf=conf)

for read_id, aln in tracks.iter_reads():
    print(aln.read_id)
    print(aln.to_pandas(["dtw"]))                                     #print basic DTW statistics
    print(aln.to_pandas(["dtw.dwell", "dtw.model_diff"])) #print derived statistics 

tracks.close()
cnk113 commented 3 months ago

Thanks for the snippet! I think misinterpreted your dev branch, it looks to be signal to the basecall alignment? So would this mean that the alignment is done on the entire read such as over the primer sequence/barcodes (if there)/etc.? If so this would also be very helpful to correct across UMI sequences which don't end up getting aligned, usually it's in the clipped read and as a BAM tag.

skovaka commented 3 months ago

I recently implemented signal-to-basecall alignment in the dev branch, but it should just be an option currently available via the "resquiggle" command. The standard "align" command should still perform signal-to-reference alignment, and if not that is an error which I should fix. It's possible such an error could occur if Uncalled4 can't find the reference which the reads were aligned to, which might I might need to add a check for.

cnk113 commented 3 months ago

So I've been fiddling with the API further, I just wanted to clarify if the seq layer current values are what the dtw mean current averages over? I know the tool isn't really meant for per-base statistics but I'm trying to extract as much information possible. Also I'm assuming the seq.kmer value is just the index of a kmer combination?

skovaka commented 3 months ago

seq.current is the "reference" current derived from the pore model that the signal is aligned to, so seq.current - dtw.current == dtw.model_diff. And yes, seq.kmer is the kmer index. I think tracks.model.kmer_to_str(seq.kmer) should translate to bases, or you may need to wrap the inner part with np.array(seq.kmer) (I recently fixed that but not sure when I pushed the change)

cnk113 commented 3 months ago

I see thanks for the clarification, is there perhaps a way to interpolate the kmer current signal back to a per base current value?

skovaka commented 3 months ago

I'm not sure if this is what you mean, but this will give you the read current, reference current, and centeral base for each k-mer: aln.to_pandas(["dtw.current","seq.current","seq.base"]). (seq.base is an integer where A=0,C=1,G=2,T=3)

The value of seq.current is always determined by the full k-mer though, not individual bases. You could generate a "1-mer" model by averaging current for k-mers which share the central base, but that would be a very low-resolution pore model