hasindu2008 / slow5tools

Slow5tools is a toolkit for converting (FAST5 <-> SLOW5), compressing, viewing, indexing and manipulating data in SLOW5 format.
https://hasindu2008.github.io/slow5tools
MIT License

Could not create read group #108

Closed maximilianmordig closed 4 months ago

maximilianmordig commented 4 months ago

I am writing a blow5 file using the slow5 C API (using slow5_open). When I try to convert it to FAST5 with slow5tools s2f, the conversion fails:

slow5tools s2f -p "$(nproc)" -o reads1_2.fast5 reads1.blow5
[s2f_main] 1 files found - took 0.000s
[s2f_main::INFO] No. of input files (1) < no. of processes (48). For faster parallel conversion, consider splitting your input into multiple files using slow5tools split.
[s2f_main] Just before forking, peak RAM = 0.000 GB
[s2f_iop] 1 proceses will be used
[write_fast5::ERROR] Could not create read group in fast5 file 'reads1_2.fast5'.

If I instead write the file as slow5 (simply changing the name of the file), the conversion works:

slow5tools s2f -p "$(nproc)" -o reads1_1.fast5 reads1.slow5
[s2f_main] 1 files found - took 0.000s
[s2f_main::INFO] No. of input files (1) < no. of processes (48). For faster parallel conversion, consider splitting your input into multiple files using slow5tools split.
[s2f_main] Just before forking, peak RAM = 0.000 GB
[s2f_iop] 1 proceses will be used
[s2f_main] Converting 1 s/blow5 files took 0.050s
[s2f_main] Children processes: CPU time = 0.000 sec | peak RAM = 0.000 GB

[main] cmd: slow5tools s2f -p 48 -o reads1_1.fast5 reads1.slow5
[main] real time = 0.051 sec | CPU time = 0.055 sec | peak RAM = 0.009 GB

I have attached both a blow5 file and a slow5 file (with different contents); the former cannot be converted: reads.zip

Is there some compression flag I need to set when writing blow5?

slow5tools --version
slow5tools 1.1.0

[main] cmd: slow5tools --version
[main] real time = 0.000 sec | CPU time = 0.004 sec | peak RAM = 0.006 GB

hasindu2008 commented 4 months ago

Had a look, and the readIDs in the .blow5 file are empty.

slow5tools skim reads1.blow5
#read_id        read_group      digitisation    offset  range   sampling_rate   len_raw_signal  raw_signal      channel_number  median_before   read_number     start_mux       start_time
        0       8192    13.722261       1443.030273     4000    12923   .       1       0.1     1       1       100
        0       8192    13.722261       1443.030273     4000    17304   .       1       0.1     2       1       100
        0       8192    13.722261       1443.030273     4000    20421   .       1       0.1     3       1       100
        0       8192    13.722261       1443.030273     4000    20422   .       1       0.1     4       1       100
        0       8192    13.722261       1443.030273     4000    20422   .       1       0.1     5       1       100
        0       8192    13.722261       1443.030273     4000    20422   .       1       0.1     6       1       100
        0       8192    13.722261       1443.030273     4000    20423   .       1       0.1     7       1       100
        0       8192    13.722261       1443.030273     4000    20423   .       1       0.1     8       1       100
        0       8192    13.722261       1443.030273     4000    6700    .       1       0.1     9       1       100
        0       8192    13.722261       1443.030273     4000    20424   .       1       0.1     10      1       100
        0       8192    13.722261       1443.030273     4000    2320    .       1       0.1     11      1       100
        0       8192    13.722261       1443.030273     4000    20424   .       1       0.1     12      1       100
        0       8192    13.722261       1443.030273     4000    20424   .       1       0.1     13      1       100
        0       8192    13.722261       1443.030273     4000    20424   .       1       0.1     14      1       100
        0       8192    13.722261       1443.030273     4000    20425   .       1       0.1     15      1       100
        0       8192    13.722261       1443.030273     4000    20425   .       1       0.1     16      1       100
        0       8192    13.722261       1443.030273     4000    20425   .       1       0.1     17      1       100
        0       8192    13.722261       1443.030273     4000    20425   .       1       0.1     18      1       100
        0       8192    13.722261       1443.030273     4000    20426   .       1       0.1     19      1       100
        0       8192    13.722261       1443.030273     4000    20426   .       1       0.1     20      1       100
        0       8192    13.722261       1443.030273     4000    20426   .       1       0.1     21      1       100
        0       8192    13.722261       1443.030273     4000    20426   .       1       0.1     22      1       100

Trying to index the file gives this error:

slow5tools index reads1.blow5
[slow5_idx_insert::ERROR] Read ID '' is duplicated At src/slow5_idx.c:495
[slow5_idx_build::ERROR] Inserting '' to index failed At src/slow5_idx.c:335
Error running slow5idx_build on reads1.blow5

Do you know how this file was generated?

maximilianmordig commented 4 months ago

I generated the file myself. I have attached a condensed version of the code, compiled as C++20. The slow5 file has the read ID set, but the blow5 file does not.

slow5_blow5_missing_read_id.cpp.zip

hasindu2008 commented 4 months ago

Oh I see, you should also set read_id_len, like this: https://github.com/hasindu2008/slow5lib/blob/e0d0d0f3da18374519b60924850c9e7f900a6bb3/examples/write.c#L163.

maximilianmordig commented 4 months ago

Of course, thanks. Somehow it worked with slow5, but not with blow5. It may be worth checking that read_id_len > 0 before writing. Having to set the length by hand is also a bit inconsistent with slow5_rec_set_string, which detects the length automatically with strlen.
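
The fix and the suggested check can be sketched in plain C as follows. This is only an illustration, not slow5lib code: demo_rec_t is a stand-in struct that mirrors just the read_id and read_id_len fields discussed above (the 16-bit length type is an assumption), and demo_rec_set_read_id and demo_rec_read_id_ok are hypothetical helpers that do not exist in the library. The idea is to set the string and its length together, and to refuse records whose length is zero or does not match the string before they are written.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the two relevant record fields; NOT the real slow5_rec_t.
 * The uint16_t length type is an assumption made for this sketch. */
typedef struct {
    char *read_id;        /* NUL-terminated read ID, owned by the record */
    uint16_t read_id_len; /* length of read_id, excluding the terminator */
} demo_rec_t;

/* Hypothetical helper: copy the ID and set its length in one step,
 * so the two fields cannot get out of sync.
 * Returns 0 on success, -1 on a NULL, empty or too-long ID, or if
 * allocation fails. */
int demo_rec_set_read_id(demo_rec_t *rec, const char *id) {
    assert(rec != NULL);
    if (id == NULL) {
        return -1;
    }
    size_t len = strlen(id);
    if (len == 0 || len > UINT16_MAX) {
        return -1;
    }
    char *copy = malloc(len + 1);
    if (copy == NULL) {
        return -1;
    }
    memcpy(copy, id, len + 1); /* includes the terminating NUL */
    free(rec->read_id);        /* release any previous ID */
    rec->read_id = copy;
    rec->read_id_len = (uint16_t)len;
    return 0;
}

/* Hypothetical pre-write check: returns 1 if the read ID is present and
 * read_id_len matches the actual string length, 0 otherwise. */
int demo_rec_read_id_ok(const demo_rec_t *rec) {
    assert(rec != NULL);
    return rec->read_id != NULL
        && rec->read_id_len > 0
        && strlen(rec->read_id) == rec->read_id_len;
}
```

A writer could call a check like demo_rec_read_id_ok right before writing each record and report an error instead of producing a file with empty read IDs. With the real library, the equivalent of the setter is to assign both read_id and read_id_len on the record, as in the linked examples/write.c.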