nanoporetech / pod5-file-format

Pod5: a high performance file format for nanopore reads.
https://pod5-file-format.readthedocs.io/
Other
136 stars 18 forks source link

adc_max/min=zero and adc_range missing, so can't calculate digitisation #6

Closed Psy-Fer closed 2 years ago

Psy-Fer commented 2 years ago

Hello,

When I convert a set of fast5 files to pod5, the adc_max/min values are zero

The description of these fields states that the digitisation comes from the max-min of these values, however, they are zero in all of my reads, so I can't calculate the expected 2048.0

An alternative way to calculate digitisation is by knowing the adc_range; however, when a fast5 file is read by ( https://github.com/nanoporetech/pod5-file-format/blob/dcc0b99a45f742f06fe45d7d99f4dc8a0255e5a7/python/pod5_format/pod5_format/writer.py ) , this value is used to calculate the scale with the digitisation, and only the scale is recorded adc_range is discarded.

Is it possible to maintain the adc_range value in the conversion step, or ideally digitisation and adc_range?

data dumps below

Cheers, James


[types.RunInfo.fields.adc_max] type = "int16" description = "The maximum ADC value that might be encountered. This is a hardware constraint."

[types.RunInfo.fields.adc_min] type = "int16" description = "The minimum ADC value that might be encountered. This is a hardware constraint. adc_max - adc_min is the digitisation."

#### read.run_info dump

run info
    acquisition_id: bfdfd1d840e2acaf5c061241fd9b8e5c3cfe729f
    acquisition_start_time: 2020-10-27 05:41:50+00:00
    adc_max: 0                                                         <-----| these are zero
    adc_min: 0                                                          <-----|
    context_tags
      barcoding_enabled: 0
      basecall_config_filename: dna_r9.4.1_450bps_hac_prom.cfg
      experiment_duration_set: 4320
      experiment_type: genomic_dna
      local_basecalling: 1
      package: bream4
      package_version: 6.0.7
      sample_frequency: 4000
      sequencing_kit: sqk-lsk109
    experiment_name:
    flow_cell_id: PAF25452
    flow_cell_product_code: FLO-PRO002
    protocol_name: sequencing/sequencing_PRO002_DNA:FLO-PRO002:SQK-LSK109
    protocol_run_id: 97d631c6-c622-473d-9e7d-3cb9297b0036
    protocol_start_time: 1970-01-01 00:00:00+00:00
    sample_id: NA12878_SRE
    sample_rate: 4000
    sequencing_kit: sqk-lsk109
    sequencer_position: 3A
    sequencer_position_type: promethion
    software: python-pod5-converter
    system_name:
    system_type:
    tracking_id
      asic_id: 0004A30B00F25467
      asic_id_eeprom: 0004A30B00F25467
      asic_temp: 31.996552
      asic_version: Unknown
      auto_update: 0
      auto_update_source: https://mirror.oxfordnanoportal.com/software/MinKNOW/
      bream_is_standard: 0
      configuration_version: 4.0.13
      device_id: 3A
      device_type: promethion
      distribution_status: stable
      distribution_version: 20.06.9
      exp_script_name: sequencing/sequencing_PRO002_DNA:FLO-PRO002:SQK-LSK109
      exp_script_purpose: sequencing_run
      exp_start_time: 2020-10-27T05:41:50Z
      flow_cell_id: PAF25452
      flow_cell_product_code: FLO-PRO002
      guppy_version: 4.0.11+f1071ce
      heatsink_temp: 32.164288
      hostname: PC24A004
      hublett_board_id: 013b01308fa78662
      hublett_firmware_version: 2.0.14
      installation_type: nc
      ip_address:
      local_firmware_file: 1
      mac_address:
      operating_system: ubuntu 16.04
      protocol_group_id: PLPN243131
      protocol_run_id: 97d631c6-c622-473d-9e7d-3cb9297b0036
      protocols_version: 6.0.7
      run_id: bfdfd1d840e2acaf5c061241fd9b8e5c3cfe729f
      sample_id: NA12878_SRE
      satellite_board_id: 013c763bef6cca9d
      satellite_firmware_version: 2.0.14
      usb_config: firm_1.2.3_ware#rbt_4.5.6_rbt#ctrl#USB3
      version: 4.0.3

### read and read.calibration

read_id: 000dab68-15a2-43c1-b33d-9598d95b37de
channel: 861
well: 1
pore_type: not_set
read_number: 261
start_sample: 3856185
end_reason: data_service_unblock_mux_change
median_before: 204.2
sample_count: 331742
byte_count: 226302
signal_compression_ratio: 0.341
scale: 0.36551764607429504
offset: -223.0
0x55555555 commented 2 years ago

Hi James,

You're right - there is an issue in the digitisaion conversion to adc range - I will get a release in to fix this asap.

WRT keeping range + digitisation, what is the need for these fields explicitly?

As you say, you can calculate digitisation (when everything is working) using: adc_max - adc_min, and range can be derived using: calibration.scale * digitisaion.

We have chosen to store scale and offset in the pod5 format, as its the value actually needed to convert ADC values to pA.

However, if there is a user need to actually know range and digitisation we could add it to the format - or add methods to the reader to recover the data?

Thanks,

Psy-Fer commented 2 years ago

Hello George,

These 2 steps in the pA conversion

  1. scale = range / digitisation
  2. pA = scale * (signal + offset)

Keeping scale and offset in pod5, means you can skip straight to step 2. However step 1 was never data-intensive, and multiple 3rd party tools expect range and digitisation as inputs for pA conversion. This adds an extra complication when it comes to integration. Normally, when integrating a file format that contains the same data, you want to just write the data ingester, and then match the fields with the internal data structure being used in the software, and you don't have to change anything else other than maybe input arguments.

In this case, integration would have to add some switch for using pod5 and skipping step 1.

It's not a huge deal, it's mostly around design in the existing ecosystem for ease of adoption.

One last, though probably the more important factor. If you do it this way, fast5->pod5 becomes non-reversible. As there is no way to get the range or digitisation values going backwards pod5->fast5. I know that isn't high on the priority list for moving forward, but from a scientific reproducibility point of view (and probably troubleshooting and growing pains), it is very important.

So it would be good if they could be included, or at least calculated from whatever is in pod5.

Cheers, James

0x55555555 commented 2 years ago

Hi @Psy-Fer ,

Thats really good info thanks, I will immediately add accessors so the values can be extracted - as you say with some potential loss.

I will also discuss with the team internally around which fields can be stored in the file, and I will refresh my knowledge on what minknow handles internally - if we internally use offset + scale, then using on disk then storing range on disk seems pointless.

Thanks,

0x55555555 commented 2 years ago

Hi @Psy-Fer ,

I hope 0.0.17 has resolved a lot of these issues - please let us know any feedback!

Psy-Fer commented 2 years ago

Hey George,

Yep that fixed it. Thanks for that.

Cheers, James

Psy-Fer commented 2 years ago

Hey George,

Just re-opening this one. Any chance we can get the same hooks you fixed up for python API also put into the C API?

So we can get the range and digitisation values using the C API ? (python ones are working great).

If I've missed something and they are there and I've just missed them, please let me know where I can find them. Otherwise, yea, could we get them added?

Cheers, James

0x55555555 commented 2 years ago

Hi James,

Good shout - I will get the C API updated with these values too

0x55555555 commented 2 years ago

Hi @Psy-Fer ,

There is a new API call available in 0.0.21 with these values present:

https://github.com/nanoporetech/pod5-file-format/blob/master/c++/pod5_format/c_api.h#L221

Thanks,

Psy-Fer commented 2 years ago

Thanks George, I'll give it a go and let you know. Appreciate the turnaround on this too.