hasindu2008 / slow5tools

Slow5tools is a toolkit for converting (FAST5 <-> SLOW5), compressing, viewing, indexing and manipulating data in SLOW5 format.
https://hasindu2008.github.io/slow5tools
MIT License

slow5tools degrade (1.3.0) does not detect ULK kit? #134

Open jelber2 opened 3 days ago

jelber2 commented 3 days ago

For the following S/BLOW5 header, made with blue-crab (0.1.2), slow5tools degrade (1.3.0) does not seem to recognize the ULK kit.

#slow5_version  0.2.0
#num_read_groups        1
@acquisition_id ca82937006c473b34e065122cf6a8ed73c55ce18
@acquisition_start_time 2024-06-26 09:25:49.033000+00:00
@adc_max        2047
@adc_min        0
@asic_id        FFFFFC0FE73734C0
@asic_id_eeprom FFFFFC0FE73734C0
@asic_temp      28.228447
@asic_version   Unknown
@barcoding_enabled      0
@basecall_config_filename       dna_r10.4.1_e8.2_400bps_5khz_modbases_5hmc_5mc_cg_hac_prom.cfg
@configuration_version  5.9.18
@data_source    real_device
@device_id      A
@device_type    p2_solo
@distribution_status    stable
@distribution_version   24.02.16
@exp_script_name        sequencing/sequencing_PRO114_DNA_e8_2_400K_long_read:FLO-PRO114M:SQK-ULK114:400
@exp_script_purpose     sequencing_run
@exp_start_time 2024-06-26T11:25:49.033544+02:00
@experiment_name        Blood-WGS_ONT_24062024
@experiment_type        genomic_dna
@flow_cell_id   PAU99561
@flow_cell_product_code FLO-PRO114M
@fpga_board_id  0018f5206e51685c
@fpga_firmware_version  2.1.0
@guppy_version  7.3.11+0112dde09
@heatsink_temp  34.045727
@host_product_code      GRD-MK1
@host_product_serial_number     GXB04189
@hostname       GXB04189
@installation_type      nc
@is_simulated   0
@local_basecalling      1
@operating_system       ubuntu 20.04
@package        bream4
@package_version        7.9.8
@protocol_group_id      Blood-WGS_ONT_24062024
@protocol_name  sequencing/sequencing_PRO114_DNA_e8_2_400K_long_read:FLO-PRO114M:SQK-ULK114:400
@protocol_run_id        d2c3e09e-da67-4bba-aecf-0c004874a607
@protocol_start_time    2024-06-26T11:24:03.627544+02:00
@protocols_version      7.9.8
@run_id ca82937006c473b34e065122cf6a8ed73c55ce18
@sample_frequency       5000
@sample_id      Blood-WGS_L3_26062024
@sample_rate    5000
@selected_speed_bases_per_second        400
@sequencer_hardware_revision    HW-30
@sequencer_position     P2S-00581-A
@sequencer_position_type        PromethION
@sequencer_product_code PRO-SEQ002
@sequencer_serial_number        P2S-00581
@sequencing_kit sqk-ulk114
@software       MinKNOW 24.02.16 (Bream 7.9.8, Core 5.9.12, Dorado 7.3.11+0112dde09)
@system_name    GXB04189
@system_type    GridION Mk1
@usb_config     fx3_0.0.0#fpga_0.0.0#unknown#unknown
@usb_firmware_version   2.5.1
@version        5.9.12
~/bin/slow5tools-v1.3.0/slow5tools degrade -s ex-zd -c zstd PAU99561_d2c3e09e_ca829370_21.blow5 -o PAU99561_d2c3e09e_ca829370_21.3.blow5

[degrade_main::WARNING] This tool performs lossy compression which is an irreversible operation. Just making sure it is intended. 
[slow5_hdr_get_dataset] Not detected: MinION DNA lsk114 5kHz
[slow5_hdr_get_dataset] Not detected: PromethION DNA lsk109 4kHz
[slow5_hdr_get_dataset] Not detected: PromethION DNA lsk114 4kHz
[slow5_hdr_get_dataset] Not detected: PromethION DNA lsk114 5kHz
[slow5_hdr_get_dataset] Not detected: PromethION RNA rna002 3kHz
[slow5_hdr_get_dataset] Not detected: PromethION RNA rna004 4kHz
[slow5_hdr_get_dataset::ERROR] No suitable bits suggestion
[degrade_main::ERROR] Use option -b to manually specify
~/bin/slow5tools-v1.3.0/slow5tools degrade -s ex-zd -c zstd PAU99561_d2c3e09e_ca829370_21.blow5 -o PAU99561_d2c3e09e_ca829370_21.3.blow5 -b4
[degrade_main::WARNING] This tool performs lossy compression which is an irreversible operation. Just making sure it is intended. 
[slow5_encode_signal_press::WARNING] Signal compression method ex-zd is new. While it is stable, just keep an eye. At src/slow5_press.c:116

[main] cmd: /home/jelber43/bin/slow5tools-v1.3.0/slow5tools degrade -s ex-zd -c zstd PAU99561_d2c3e09e_ca829370_21.blow5 -o PAU99561_d2c3e09e_ca829370_21.3.blow5 -b4
[main] real time = 40.577 sec | CPU time = 117.731 sec | peak RAM = 3.700 GB
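
To make the effect of -b more concrete, here is a minimal Python sketch. It assumes that -b N means the N least significant bits of each raw signal sample are discarded; the function name and sample values are hypothetical, and slow5tools itself may round rather than simply truncate, so this only illustrates the idea.

```python
def zero_low_bits(samples, bits):
    """Clear the `bits` least significant bits of each integer sample.

    Assumption: this mimics the general idea behind `-b`, not the exact
    arithmetic used by slow5tools degrade.
    """
    mask = ~((1 << bits) - 1)  # e.g. bits=4 -> ...11110000
    return [s & mask for s in samples]


raw = [517, 523, 498, 1001]  # made-up raw ADC values
degraded = zero_low_bits(raw, 4)
print(degraded)  # every value is now a multiple of 16
```

Fewer distinct values compress better, which is why the choice of -b trades file size against signal fidelity.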

If it is possible to parse the ULK part, I guess that would be fine. Alternatively, could the tool show the user which bit values to use for different datasets?
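
For anyone who wants to inspect these fields themselves, a rough Python sketch of pulling attributes out of a SLOW5 ASCII header like the one above might look as follows. The function name is hypothetical, it assumes a single read group with one value per attribute, and it is not how slow5tools parses headers internally.

```python
def parse_slow5_header(text):
    """Collect '@key<whitespace>value' header lines into a dict.

    Assumption: one read group, so each attribute has a single value.
    Lines starting with '#' (file-level metadata, column names) are skipped.
    """
    attrs = {}
    for line in text.splitlines():
        if not line.startswith("@"):
            continue
        parts = line[1:].split(None, 1)  # key, then the rest of the line
        if not parts:
            continue
        attrs[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return attrs


header = (
    "#slow5_version\t0.2.0\n"
    "@sample_frequency\t5000\n"
    "@sequencer_position_type\tPromethION\n"
    "@sequencing_kit\tsqk-ulk114\n"
)
attrs = parse_slow5_header(header)
print(attrs["sequencing_kit"], attrs["sample_frequency"])
```

The values come back as strings, so numeric fields such as sample_frequency need converting before comparison.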

hasindu2008 commented 2 days ago

Hello, we are parsing the ulk part properly, but the tool checks whether the kit matches one of those we have exhaustively tested. As this is lossy compression, we are being very pedantic to prevent users from inadvertently affecting their data. More kits will be added as we come across them and test them. I have not had access to GridION sqk-ulk114 data, but the suitable -b is very likely 3. Is this a publicly available dataset?
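
The allow-list behaviour described here can be pictured with a small Python sketch. The tested combinations are copied from the "Not detected" lines in the log above; everything else is an assumption for illustration (the function name, which header attributes are consulted, and the substring match on the kit name) and does not reflect the actual slow5tools source.

```python
# Combinations taken from the log output in this issue:
# (device, molecule, kit, sample rate in Hz)
TESTED = [
    ("MinION", "DNA", "lsk114", 5000),
    ("PromethION", "DNA", "lsk109", 4000),
    ("PromethION", "DNA", "lsk114", 4000),
    ("PromethION", "DNA", "lsk114", 5000),
    ("PromethION", "RNA", "rna002", 3000),
    ("PromethION", "RNA", "rna004", 4000),
]


def find_tested_dataset(attrs):
    """Return the matching tested combination, or None if there is none.

    Assumption: device, molecule, kit and rate are read from the header
    attributes named below; the real tool may use different fields.
    """
    device = attrs.get("sequencer_position_type", "").lower()
    is_rna = "rna" in attrs.get("experiment_type", "").lower()
    molecule = "RNA" if is_rna else "DNA"
    kit = attrs.get("sequencing_kit", "").lower()
    rate = int(float(attrs.get("sample_frequency", "0")))
    for entry in TESTED:
        t_device, t_molecule, t_kit, t_rate = entry
        if (device == t_device.lower() and molecule == t_molecule
                and t_kit in kit and rate == t_rate):
            return entry
    return None  # caller would then ask the user to pass -b explicitly


ulk = {
    "sequencer_position_type": "PromethION",
    "experiment_type": "genomic_dna",
    "sequencing_kit": "sqk-ulk114",
    "sample_frequency": "5000",
}
print(find_tested_dataset(ulk))
```

Under these assumptions the ULK header finds no match, which corresponds to the "No suitable bits suggestion" error reported above.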

jelber2 commented 12 hours ago

As per the Twitter conversation (https://x.com/jpelbers/status/1842484817885073502), here is a Dropbox link to ~30x average coverage ONT ULK reads for HG002 chr22 (based on alignment to hg38 no alts). The sample was HG002 cells, with DNA extracted following a BioNano DNA extraction protocol, prepared with the ONT ULK library preparation, then sequenced on an ONT PromethION P2 solo device with an r10.4.1 flowcell connected to an ONT GridION for data acquisition. Provided is an ex-zd, zstd blow5 file that you can download with

wget 'https://www.dropbox.com/scl/fi/8s0p4ttpuy1amiuulzu3v/WGS_HG002_Bionano_recover_13022024.chr22.readids.blow5?rlkey=395acerl9ewgyqkafi7g15ipe&st=giubcawn' -O WGS_HG002_Bionano_recover_13022024.chr22.blow5

on a computer with wget.

Best, Jean Elbers

*NOTE: the blow5 file on Dropbox does not match the header above in this GitHub issue, as I realized those squiggles did not belong to HG002.