Closed Psy-Fer closed 2 years ago
Hi James,
You're right - there is an issue in the digitisation conversion to ADC range - I will get a release in to fix this ASAP.
WRT keeping range + digitisation, what is the need for these fields explicitly?
As you say, you can calculate digitisation (when everything is working) using adc_max - adc_min, and range can be derived using calibration.scale * digitisation.
We have chosen to store scale and offset in the pod5 format, as these are the values actually needed to convert ADC values to pA.
However, if there is a user need to actually know range and digitisation we could add it to the format - or add methods to the reader to recover the data?
Thanks,
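A minimal sketch of the two derivations mentioned above, assuming digitisation is taken as adc_max - adc_min exactly as the field description in this thread states (whether the bounds are inclusive is not settled here, so treat the formula as an assumption). The function names and sample values are hypothetical and are not part of the pod5 API.

```python
def digitisation_from_adc_bounds(adc_min: int, adc_max: int) -> float:
    """Digitisation as described in this thread: adc_max - adc_min."""
    if adc_max <= adc_min:
        # Covers the reported case where both bounds were written as zero.
        raise ValueError(
            f"cannot derive digitisation from adc_min={adc_min}, adc_max={adc_max}"
        )
    return float(adc_max - adc_min)


def range_from_scale(scale: float, digitisation: float) -> float:
    """Recover the fast5-style range from the stored scale."""
    return scale * digitisation


if __name__ == "__main__":
    # Hypothetical bounds chosen to give the 2048.0 mentioned in the issue.
    digitisation = digitisation_from_adc_bounds(adc_min=0, adc_max=2048)
    print(digitisation)  # 2048.0
    print(range_from_scale(scale=0.5, digitisation=digitisation))  # 1024.0
```

The guard makes the zero-valued adc_min/adc_max case fail loudly instead of silently yielding a digitisation of 0.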
Hello George,
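There are 2 steps in the pA conversion: (1) derive the scale from range and digitisation, then (2) apply the scale and offset to the raw ADC values.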
Keeping scale and offset in pod5 means you can skip straight to step 2. However, step 1 was never data-intensive, and multiple third-party tools expect range and digitisation as inputs for pA conversion. This adds an extra complication when it comes to integration. Normally, when integrating a file format that contains the same data, you just write the data ingester and then match the fields to the internal data structure used in the software, without changing anything else other than maybe input arguments.
In this case, integration would have to add some switch for using pod5 and skipping step 1.
It's not a huge deal, it's mostly around design in the existing ecosystem for ease of adoption.
One last, though probably the more important, factor: if you do it this way, fast5->pod5 becomes non-reversible, as there is no way to get the range or digitisation values going backwards pod5->fast5. I know that isn't high on the priority list for moving forward, but from a scientific reproducibility point of view (and probably troubleshooting and growing pains), it is very important.
So it would be good if they could be included, or at least calculated from whatever is in pod5.
Cheers, James
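The two-step conversion discussed above can be sketched as follows. This assumes the commonly used convention pA = (raw + offset) * scale with scale = range / digitisation. All function names and sample values are made up for illustration and do not come from pod5 or any specific third-party tool.

```python
from typing import Iterable, List


def scale_from_range(adc_range: float, digitisation: float) -> float:
    """Step 1: collapse range and digitisation into a single scale factor."""
    return adc_range / digitisation


def to_picoamps(raw: Iterable[int], offset: float, scale: float) -> List[float]:
    """Step 2: convert raw ADC samples to pA."""
    return [(sample + offset) * scale for sample in raw]


def fast5_style_to_picoamps(
    raw: Iterable[int], offset: float, adc_range: float, digitisation: float
) -> List[float]:
    """Both steps, as a tool that expects range and digitisation would run them."""
    return to_picoamps(raw, offset, scale_from_range(adc_range, digitisation))


if __name__ == "__main__":
    raw = [100, 200, 300]  # made-up ADC samples
    # fast5-style inputs: range and digitisation are given, so step 1 runs first.
    print(fast5_style_to_picoamps(raw, offset=10.0, adc_range=1024.0, digitisation=2048.0))
    # pod5-style inputs: scale is already stored, so step 1 is skipped.
    print(to_picoamps(raw, offset=10.0, scale=0.5))
```

Both calls give the same pA values. The difference is only in which inputs the caller must have, which is the integration switch described above.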
Hi @Psy-Fer ,
That's really good info, thanks. I will immediately add accessors so the values can be extracted - as you say, with some potential loss.
I will also discuss with the team internally which fields can be stored in the file, and I will refresh my knowledge of what MinKNOW handles internally - if we internally use offset + scale, then storing range on disk seems pointless.
Thanks,
Hi @Psy-Fer ,
I hope 0.0.17 has resolved a lot of these issues - please let us know any feedback!
Hey George,
Yep that fixed it. Thanks for that.
Cheers, James
Hey George,
Just re-opening this one. Any chance we can get the same hooks you fixed up for the Python API also put into the C API?
So we can get the range and digitisation values using the C API? (The Python ones are working great.)
If they are already there and I've just missed them, please let me know where I can find them. Otherwise, yeah, could we get them added?
Cheers, James
Hi James,
Good shout - I will get the C API updated with these values too.
Hi @Psy-Fer ,
There is a new API call available in 0.0.21 with these values present:
https://github.com/nanoporetech/pod5-file-format/blob/master/c++/pod5_format/c_api.h#L221
Thanks,
Thanks George, I'll give it a go and let you know. Appreciate the turnaround on this too.
Hello,
When I convert a set of fast5 files to pod5, the adc_max/min values are zero.
The description of these fields states that the digitisation comes from the max-min of these values; however, they are zero in all of my reads, so I can't calculate the expected 2048.0.
An alternative way to calculate digitisation is by knowing the adc_range; however, when a fast5 file is read by ( https://github.com/nanoporetech/pod5-file-format/blob/dcc0b99a45f742f06fe45d7d99f4dc8a0255e5a7/python/pod5_format/pod5_format/writer.py ), this value is used to calculate the scale with the digitisation, and only the scale is recorded. adc_range is discarded.
Is it possible to maintain the adc_range value in the conversion step, or ideally digitisation and adc_range?
Data dumps below.
Cheers, James
[types.RunInfo.fields.adc_max]
type = "int16"
description = "The maximum ADC value that might be encountered. This is a hardware constraint."

[types.RunInfo.fields.adc_min]
type = "int16"
description = "The minimum ADC value that might be encountered. This is a hardware constraint. adc_max - adc_min is the digitisation."
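To illustrate why recording only the scale loses information, here is a small sketch assuming scale = adc_range / digitisation as described in the post. The function names and the (adc_range, digitisation) pairs are hypothetical. It shows that different pairs can produce the same scale, so neither value can be recovered from the scale alone.

```python
from typing import Iterable, Set, Tuple


def scale_from_range(adc_range: float, digitisation: float) -> float:
    """Collapse range and digitisation into the single stored scale."""
    return adc_range / digitisation


def distinct_scales(pairs: Iterable[Tuple[float, float]]) -> Set[float]:
    """Return the set of scales produced by (adc_range, digitisation) pairs."""
    return {scale_from_range(adc_range, digitisation) for adc_range, digitisation in pairs}


if __name__ == "__main__":
    # Three different hypothetical (adc_range, digitisation) pairs.
    pairs = [(1024.0, 2048.0), (2048.0, 4096.0), (4096.0, 8192.0)]
    print(distinct_scales(pairs))  # {0.5}
```

Recovering the original pair therefore needs a second piece of information, such as valid adc_min and adc_max values, stored alongside the scale.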