hasindu2008 / slow5lib

slow5lib is a software library for reading & writing SLOW5 files.
https://hasindu2008.github.io/slow5lib
MIT License

Error in write_record() function when initializing aux_meta fields #70

Closed KavinduJayas closed 1 year ago

KavinduJayas commented 1 year ago

Issue Description: I encountered an error while using the write_record() function in the pyslow5 library. The error occurs when attempting to initialize the aux_meta fields. The error message indicates that the initialization of several aux_meta fields failed for each record, leading to the inability to set them in the C s5.header.aux_meta struct. This error persists for all aux fields.

Steps to Reproduce:

Get the dataset: wget https://slow5.page.link/hg2_prom_subsub

Untar: tar -xf /hg2_prom_subsub

Run the code:

import pyslow5

#open two files for reading and writing 
hg2_original = pyslow5.Open('/hg2_prom_lsk114_subsubsample/reads.blow5','r')
hg2_modified = pyslow5.Open('/hg2_prom_lsk114_subsubsample/reads_modified.blow5','w')

reads = hg2_original.seq_reads(aux='all')

# For each read in s5_read...
for read in reads:
    # get an empty record and aux dictionary
    record, aux = hg2_original.get_empty_record(aux=True)
    # for each field in read...
    for i in read:
        # if the field is in the record dictionary...
        if i in record:
            # copy the value over...
            record[i] = read[i]
        #do same for aux dictionary
        if i in aux:
            aux[i] = read[i]
    # write the record
    ret = hg2_modified.write_record(record, aux)
    print("ret: write_record(): {}".format(ret))

Error message:

ERROR:pyslow5:write_record: aux_meta fields failed to initialise
ERROR:pyslow5:write_record: slow5_aux_add channel_number: 2511 could not set to C s5.header.aux_meta struct
ERROR:pyslow5:write_record: slow5_aux_add median_before: 196.59095764160156 could not set to C s5.header.aux_meta struct
ERROR:pyslow5:write_record: slow5_aux_add read_number: 18679 could not set to C s5.header.aux_meta struct
ERROR:pyslow5:write_record: slow5_aux_add start_mux: 2 could not set to C s5.header.aux_meta struct
ERROR:pyslow5:write_record: slow5_aux_add start_time: 86691581 could not set to C s5.header.aux_meta struct
ERROR:pyslow5:write_record: slow5_aux_add_enum end_reason: 5 could not set to C s5.header.aux_meta struct

I would greatly appreciate any guidance or solution to resolve this issue. Thank you in advance for your assistance!

Psy-Fer commented 1 year ago

Hey,

Thanks for the detailed information.

Could you give me the python version and pyslow5 version you are using?

I'll have a look in the morning to see if I can reproduce this, but any extra info would be very helpful.

Cheers, James.

KavinduJayas commented 1 year ago

Hey James,

Thanks for the quick response.

I am using a Google Colab notebook, the python version is 3.10.12 and pyslow5 version is pyslow5-1.0.0.

If you need more details please let me know.

Best regards, Kavindu.

Psy-Fer commented 1 year ago

Hey,

Okay that's helpful. I'll be sure to also do some tests in google colab.

I'll get back to you soon.

James

Psy-Fer commented 1 year ago

Hello,

Ahh so I see the problem.

There are actually 2 problems.

  1. Your example is missing the header-writing step. This matters because, although that step doesn't actually write anything when you run it, the header gets written on the first record write. This is so I can do a number of type checks and other things before actually writing data. I'll add an updated example for you below.

  2. ONT have added a 6th end_reason to their data: data_service_unblock_mux_change is a new value. While we catch this dynamically, and you can write it with python, my helper function for getting the values is hard coded with only 5 values. This was to help those writing files from scratch. I think my favourite part of this is that they didn't add it to the end of the end_reason list, they added it to the middle. So the integer values for signal_positive and signal_negative no longer match those in all other file versions. Getting away from the insanity that is ONT file scheme handling is another reason we wrote slow5. You can just set your own end_reason data labels, so it's not too bad.
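To make the second point concrete, here is a minimal sketch in plain Python (it does not use pyslow5) of how inserting a label into the middle of the list changes what a stored integer means. The new label list is the one used in this thread. The older list is an assumption: it is simply the new list with data_service_unblock_mux_change removed. The helper names decode and remap are hypothetical and exist only for this illustration.

```python
# Label ordering with the newly added value (taken from this thread).
NEW_LABELS = ['unknown', 'partial', 'mux_change', 'unblock_mux_change',
              'data_service_unblock_mux_change', 'signal_positive',
              'signal_negative']

# Assumption: the older ordering is the same list without the new label.
OLD_LABELS = [l for l in NEW_LABELS if l != 'data_service_unblock_mux_change']


def decode(value, labels):
    """Map a stored integer end_reason to its label, or None if out of range."""
    return labels[value] if 0 <= value < len(labels) else None


def remap(value, src_labels, dst_labels):
    """Translate an integer between two label orderings by going via the name."""
    return dst_labels.index(src_labels[value])


stored = 5  # the end_reason value shown in the error log above
print(decode(stored, NEW_LABELS))             # signal_positive
print(decode(stored, OLD_LABELS))             # signal_negative
print(remap(stored, NEW_LABELS, OLD_LABELS))  # 4
```

The same integer decodes to a different label depending on which list is in force, which is why the labels written to the header need to match the ordering the data was produced with.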

This has given me some things to sort out in the python library, so thank you very much for posting the issue.

The code example below should solve your problems as things are now.

Cheers, James

import pyslow5
import numpy as np

F = pyslow5.Open('reads.blow5','r', DEBUG=1)
W = pyslow5.Open('modified_reads.blow5','w', DEBUG=1)

header, end_reason_labels = F.get_empty_header(aux=True)
header_original = F.get_all_headers()

new_end_reason_labels = ['unknown', 'partial', 'mux_change', 'unblock_mux_change', 'data_service_unblock_mux_change', 'signal_positive', 'signal_negative']

for i in header_original:
    if i in header:
        if header_original[i] is None:
            continue
        else:
            header[i] = header_original[i]

ret = W.write_header(header, end_reason_labels=new_end_reason_labels)
print("ret: write_header(): {}".format(ret))

reads = F.seq_reads(aux='all')

records = {}
auxs = {}
for read in reads:
    record, aux = F.get_empty_record(aux=True)
    # record = F.get_empty_record()
    for i in read:
        if i == "read_id":
            readID = read[i]
        if i in record:
            record[i] = read[i]
        if i in aux:
            aux[i] = read[i]
    records[readID] = record
    auxs[readID] = aux
print(records)
print(auxs)
ret = W.write_record_batch(records, threads=8, batchsize=200, aux=auxs)
print("ret: write_record_batch(): {}".format(ret))

F.close()
W.close()

KavinduJayas commented 1 year ago

Hey James,

I tested the revised code you provided, and I'm happy to inform you that it is now working correctly. I also appreciate the detailed explanation, thank you for your assistance.

Best regards, Kavindu