Open Psy-Fer opened 1 year ago
Hi James,
num_minknow_events is a count of the number of internal MinKNOW events in the file. This can be used to estimate the bases in the file by multiplying by some conversion ratio.
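As a rough illustration of that estimate, here is a minimal sketch. The function name and the ratio passed in are hypothetical; the thread does not say what conversion ratio MinKNOW or any downstream tool actually uses.

```python
def estimate_bases(num_minknow_events: int, bases_per_event: float) -> int:
    """Rough base-count estimate: event count times a conversion ratio."""
    return round(num_minknow_events * bases_per_event)


# 0.5 is a made-up placeholder ratio, not a value taken from MinKNOW.
print(estimate_bases(12_000, bases_per_event=0.5))
```

The ratio is left as a required argument so that any real value has to be supplied explicitly by the caller.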
The scales and shifts are pairs of numbers which relate to the level of the signal of the read. They are derived from some of the internal metrics MinKNOW builds as the experiment progresses. The intention is to make them available for scaling in the basecaller. The tracked value is based on previous reads from the same channel/mux, so it is complex to recalculate. Dorado doesn't use the values at present.
num_reads_since_mux_change and time_since_mux_change are also intended for use in a downstream analysis pipeline where you need to decide which scaling parameters to use.
Alternatively if not needed, then is it okay to provide these with some appropriate null type when writing the pod5 read?
MinKNOW won't ever produce reads with null values here, but our fast5 converter will use "nan" values in place of the tracking values, and 0 for num_reads_since_mux_change and time_since_mux_change. These seem like the safe null replacements to me.
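A minimal sketch of that fallback for a slow5 -> pod5 conversion, assuming the aux fields arrive as a plain dict. The helper name and the two tracked-scaling field names are assumptions for illustration and should be checked against the pod5 spec before use.

```python
import math

# Placeholder values as described above: NaN for the tracking values,
# 0 for the mux-change counters. The tracked_* names are assumed, not
# taken from this thread.
NULL_DEFAULTS = {
    "tracked_scaling_scale": math.nan,
    "tracked_scaling_shift": math.nan,
    "num_reads_since_mux_change": 0,
    "time_since_mux_change": 0.0,
}


def with_null_defaults(aux: dict) -> dict:
    """Return a copy of aux with any missing field set to its placeholder."""
    return {**NULL_DEFAULTS, **aux}


read_fields = with_null_defaults({"num_reads_since_mux_change": 7})
print(read_fields)
```

Values already present in the input win over the placeholders, so a file that does carry these fields converts losslessly.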
Hey George,
As always, very helpful.
For all the fields, are they calculated by MinKNOW and provided to the pod5 writer as is, or is there some extra calculation that happens?
To clarify, I imagine there is some pod5 read object that is created by MinKNOW, and that is handed over to the writer module to handle. I'm trying to get a handle on where these values are calculated: before or after being handed over to the pod5 writer.
Thanks
James
The values are calculated deep in the minknow analysis engine, where we have context over the ongoing state of each channel.
Hope that helps,
Yep, that helps.
That's all for now. I'll be sure to ping again if I run into issues. As for now, I've got a working s2p and p2s converter for pod5<->slow5. Doing some thorough checks before I release it. I appreciate the help so far. Cheers, James
Hey,
sorry just re-opening this for a quick related question.
[fields.num_minknow_events]
type = "int8"
description = "Number of minknow events that the read contains"
The spec shows this field type as int8
So between -128 and 127
Is that big enough to contain the number of minknow events in all cases? (Again, I still don't quite understand what an "event" is in this context.)
Also, are events ever negative?
Cheers, James
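To make the range concern concrete, here is a small sketch using Python's struct module to check whether a count fits in a signed 8-bit field compared with an unsigned 64-bit one. The helper name is hypothetical, and this only illustrates integer ranges, not how pod5 itself stores the field.

```python
import struct


def fits(fmt: str, value: int) -> bool:
    """Return True if value can be packed with the given struct format."""
    try:
        struct.pack(fmt, value)
        return True
    except struct.error:
        return False


# "<b" is a signed 8-bit integer, "<Q" an unsigned 64-bit integer.
for count in (127, 128, 50_000):
    print(count, "int8:", fits("<b", count), "uint64:", fits("<Q", count))
```

Any count above 127 fails the int8 check, while an unsigned 64-bit field holds any realistic event count but rejects negative values.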
Hmm,
I think the docs are wrong.
https://github.com/search?q=repo%3Ananoporetech%2Fpod5-file-format+num_minknow_events&type=code
Can you please update?
Also, I think this should be returning an int rather than a float, although I'm not sure what the .as_py() is doing to the type here.
https://github.com/nanoporetech/pod5-file-format/blob/56bc9f773654801750c470d51fa22240a732de5a/python/pod5/src/pod5/reader.py#L133-L139
I'll use a uint64 for that field for slow5.
Cheers, James
I agree - I'll update.
Cheers,
I have also updated the pyslow5 API to handle those fields above, and my converter is now working pretty well.
Now time to move on and make it go fast.
Thanks for your help. James
@jorj1988
I have three questions regarding the scaling/shift values. I guess these values are going to be used later in basecallers instead of z-score or quantile scaling, possibly because they are better at handling reads from genomic regions with an unbalanced mix of A, C, G and T.
Hey again,
I wanted to know a little more about the following fields.
Specifically:
num_minknow_events: What is this actually? Is it the number of signal chunks in the signal table, or something else? What is an event? How are they detected? What is this used for?
These 4, I kinda understand what they are. But how are they actually calculated? Are they used for anything? Do any ONT tools use them? If not, are there plans to use them/why are they captured and stored?
lastly
I understand what these are, pretty self-explanatory, but why are they tracked? Are they used for something?
Converting from pod5->slow5 I can always store this stuff in aux fields if it makes sense to do so, but if they are not used for anything I don't see the point in storing it at all other than for completeness/lossless conversion.
If going from slow5->pod5, if the fields are needed, then I need to know how they are calculated so I can redo this if a slow5 file doesn't contain them in the aux fields already. Alternatively if not needed, then is it okay to provide these with some appropriate null type when writing the pod5 read?
Thanks for any insight you can give me.
Cheers, James
P.S. When I'm done with my converter, it would be good to have a chat, as I have some thoughts about the API and internal data structures, and it would be good anyway to catch up on how things are going with pod5/slow5