hasindu2008 / slow5lib

slow5lib is a software library for reading & writing SLOW5 files.
https://hasindu2008.github.io/slow5lib
MIT License
41 stars 4 forks source link

Create fast5 #1

Closed hiruna72 closed 3 years ago

hiruna72 commented 3 years ago

The two important changes are line 242 and line 978 in f2s.c file. I have included CMake because having a cmake build helps to debug on Clion IDE. Don't worry about cmake.

hasindu2008 commented 3 years ago

@hiruna72 Great job! An important thing I would like to request is not to use global variables as we would have trouble multi-threading later if required. Instead, just pass all those to the recursive function as an argument struct.

We tested it on a tiny dataset and it segfaults - when I run through Valgrind, there seem to be some invalid memory operations. Could you look into it?

valgrind ./slow5tools f2s  data/raw/chr22_meth_example-subset-multi/ > a.slow5
==2402== Memcheck, a memory error detector
==2402== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==2402== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==2402== Command: ./slow5tools f2s data/raw/chr22_meth_example-subset-multi/
==2402==
==2402== error calling PR_SET_PTRACER, vgdb might block
[recurse_dir::0.009*55.48] Extracting fast5 from data/raw/chr22_meth_example-subset-multi/
[recurse_dir::0.033*16.01] Extracting fast5 from data/raw/chr22_meth_example-subset-multi//fast5_files
==2402== Invalid read of size 1
==2402==    at 0x4C30F62: strlen (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2402==    by 0x5D324FD: strdup (strdup.c:41)
==2402==    by 0x404110: op_func_attr(int, char const*, H5A_info_t const*, void*) (f2s.c:625)
==2402==    by 0x4E7C2CD: H5A_attr_iterate_table (in /usr/lib/x86_64-linux-gnu/libhdf5_serial.so.10.1.0)
==2402==    by 0x4F4B70B: H5O_attr_iterate_real (in /usr/lib/x86_64-linux-gnu/libhdf5_serial.so.10.1.0)
==2402==    by 0x4F4BE16: H5O_attr_iterate (in /usr/lib/x86_64-linux-gnu/libhdf5_serial.so.10.1.0)
==2402==    by 0x4E72A4E: H5Aiterate2 (in /usr/lib/x86_64-linux-gnu/libhdf5_serial.so.10.1.0)
==2402==    by 0x4077FB: op_func_group(int, char const*, H5L_info_t const*, void*) (f2s.c:948)
==2402==    by 0x4EFBE2B: ??? (in /usr/lib/x86_64-linux-gnu/libhdf5_serial.so.10.1.0)
==2402==    by 0x4F01A98: H5G__node_iterate (in /usr/lib/x86_64-linux-gnu/libhdf5_serial.so.10.1.0)
==2402==    by 0x4E80DB2: ??? (in /usr/lib/x86_64-linux-gnu/libhdf5_serial.so.10.1.0)
==2402==    by 0x4E82275: H5B_iterate (in /usr/lib/x86_64-linux-gnu/libhdf5_serial.so.10.1.0)
==2402==  Address 0x3ff0000000000000 is not stack'd, malloc'd or (recently) free'd
==2402==
[segv_handler::ERROR] I regret to inform that a segmentation fault occurred. But at least it is better than a wrong answer.
[DEBUG] src/main.c:segv_handler:54: Here is the backtrace:
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so(_vgr20070ZU_libcZdsoZa_strlen+0x2)[0x4c30f62]
/lib/x86_64-linux-gnu/libc.so.6(__strdup+0xe)[0x5d324fe]
./slow5tools(_Z12op_func_attriPKcPK10H5A_info_tPv+0x251)[0x404111]
/usr/lib/x86_64-linux-gnu/libhdf5_serial.so.10(H5A_attr_iterate_table+0x1ce)[0x4e7c2ce]
/usr/lib/x86_64-linux-gnu/libhdf5_serial.so.10(H5O_attr_iterate_real+0x1dc)[0x4f4b70c]
/usr/lib/x86_64-linux-gnu/libhdf5_serial.so.10(H5O_attr_iterate+0x77)[0x4f4be17]
/usr/lib/x86_64-linux-gnu/libhdf5_serial.so.10(H5Aiterate2+0x1ff)[0x4e72a4f]
./slow5tools(_Z13op_func_groupiPKcPK10H5L_info_tPv+0x1cc)[0x4077fc]
/usr/lib/x86_64-linux-gnu/libhdf5_serial.so.10(+0xc1e2c)[0x4efbe2c]
/usr/lib/x86_64-linux-gnu/libhdf5_serial.so.10(H5G__node_iterate+0xe9)[0x4f01a99]
/usr/lib/x86_64-linux-gnu/libhdf5_serial.so.10(+0x46db3)[0x4e80db3]
/usr/lib/x86_64-linux-gnu/libhdf5_serial.so.10(H5B_iterate+0x6)[0x4e82276]
/usr/lib/x86_64-linux-gnu/libhdf5_serial.so.10(H5G__stab_iterate+0xcf)[0x4f0729f]
/usr/lib/x86_64-linux-gnu/libhdf5_serial.so.10(H5G__obj_iterate+0x110)[0x4f043e0]
/usr/lib/x86_64-linux-gnu/libhdf5_serial.so.10(H5G_iterate+0x108)[0x4efce38]
/usr/lib/x86_64-linux-gnu/libhdf5_serial.so.10(H5Literate+0x21f)[0x4f3138f]
./slow5tools(_Z14fast5_to_slow5PKcP8_IO_FILE9FormatOutP10z_stream_sS2_+0x956)[0x405c86]
./slow5tools(_Z11recurse_dirPKcP8_IO_FILE9FormatOutP10z_stream_sS2_+0x268)[0x406818]
./slow5tools(_Z11recurse_dirPKcP8_IO_FILE9FormatOutP10z_stream_sS2_+0x1e5)[0x406795]
./slow5tools(_Z11recurse_dirPKcP8_IO_FILE9FormatOutP10z_stream_sS2_+0x1e5)[0x406795]
./slow5tools(_Z8f2s_mainiPPcP12program_meta+0x66f)[0x406ecf]
./slow5tools(main+0x7c4)[0x403c04]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x5cc7840]
./slow5tools(_start+0x29)[0x403cd9]

==2402== Invalid read of size 4
==2402==    at 0x4ECE21C: H5F_close (in /usr/lib/x86_64-linux-gnu/libhdf5_serial.so.10.1.0)
==2402==    by 0x4F2C26A: ??? (in /usr/lib/x86_64-linux-gnu/libhdf5_serial.so.10.1.0)
==2402==    by 0x4FAC8E9: H5SL_try_free_safe (in /usr/lib/x86_64-linux-gnu/libhdf5_serial.so.10.1.0)
==2402==    by 0x4F2CB9B: H5I_clear_type (in /usr/lib/x86_64-linux-gnu/libhdf5_serial.so.10.1.0)
==2402==    by 0x4EC798F: H5F_term_interface (in /usr/lib/x86_64-linux-gnu/libhdf5_serial.so.10.1.0)
==2402==    by 0x4E6620D: H5_term_library (in /usr/lib/x86_64-linux-gnu/libhdf5_serial.so.10.1.0)
==2402==    by 0x5CE1007: __run_exit_handlers (exit.c:82)
==2402==    by 0x5CE1054: exit (exit.c:104)
==2402==    by 0x403E5B: segv_handler(int) (main.c:61)
==2402==    by 0x5CDC4BF: ??? (in /lib/x86_64-linux-gnu/libc-2.23.so)
==2402==    by 0x4C30F61: strlen (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2402==    by 0x5D324FD: strdup (strdup.c:41)
==2402==  Address 0x52c is not stack'd, malloc'd or (recently) free'd
==2402==
==2402==
==2402== Process terminating with default action of signal 11 (SIGSEGV)
==2402==  Access not within mapped region at address 0x52C
==2402==    at 0x4ECE21C: H5F_close (in /usr/lib/x86_64-linux-gnu/libhdf5_serial.so.10.1.0)
==2402==    by 0x4F2C26A: ??? (in /usr/lib/x86_64-linux-gnu/libhdf5_serial.so.10.1.0)
==2402==    by 0x4FAC8E9: H5SL_try_free_safe (in /usr/lib/x86_64-linux-gnu/libhdf5_serial.so.10.1.0)
==2402==    by 0x4F2CB9B: H5I_clear_type (in /usr/lib/x86_64-linux-gnu/libhdf5_serial.so.10.1.0)
==2402==    by 0x4EC798F: H5F_term_interface (in /usr/lib/x86_64-linux-gnu/libhdf5_serial.so.10.1.0)
==2402==    by 0x4E6620D: H5_term_library (in /usr/lib/x86_64-linux-gnu/libhdf5_serial.so.10.1.0)
==2402==    by 0x5CE1007: __run_exit_handlers (exit.c:82)
==2402==    by 0x5CE1054: exit (exit.c:104)
==2402==    by 0x403E5B: segv_handler(int) (main.c:61)
==2402==    by 0x5CDC4BF: ??? (in /lib/x86_64-linux-gnu/libc-2.23.so)
==2402==    by 0x4C30F61: strlen (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2402==    by 0x5D324FD: strdup (strdup.c:41)
==2402==  If you believe this happened as a result of a stack
==2402==  overflow in your program's main thread (unlikely but
==2402==  possible), you can try to increase the size of the
==2402==  main thread stack using the --main-stacksize= flag.
==2402==  The main thread stack size used in this run was 8388608.
==2402==
==2402== HEAP SUMMARY:
==2402==     in use at exit: 834,934 bytes in 2,515 blocks
==2402==   total heap usage: 2,989 allocs, 474 frees, 980,701 bytes allocated
==2402==
==2402== LEAK SUMMARY:
==2402==    definitely lost: 0 bytes in 0 blocks
==2402==    indirectly lost: 0 bytes in 0 blocks
==2402==      possibly lost: 0 bytes in 0 blocks
==2402==    still reachable: 834,934 bytes in 2,515 blocks
==2402==                       of which reachable via heuristic:
==2402==                         length64           : 2,080 bytes in 8 blocks
==2402==         suppressed: 0 bytes in 0 blocks
==2402== Rerun with --leak-check=full to see details of leaked memory
==2402==
==2402== For counts of detected and suppressed errors, rerun with: -v
==2402== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)
Segmentation fault (core dumped)
hiruna72 commented 3 years ago

image

The fast5 files you have used have a strange attribute called file_version under each read group that has a float datatype (left). This is not observed in other files (right)

h5dump -H expected_1.fast5 produced the following.

HDF5 "expected_1.fast5" { GROUP "/" { ATTRIBUTE "file_version" { DATATYPE H5T_STRING { STRSIZE H5T_VARIABLE; STRPAD H5T_STR_NULLTERM; CSET H5T_CSET_UTF8; CTYPE H5T_C_S1; } DATASPACE SCALAR } GROUP "read_75d7303c-726a-407f-8df6-59e98ef86e34" { ATTRIBUTE "file_version" { DATATYPE H5T_IEEE_F64LE DATASPACE SCALAR } GROUP "Analyses" {......

hiruna72 commented 3 years ago

Valgrind memory test passed. As a temporary patch, I check datatype before setting variables., e.g., whether file_version attribute is string type. If this new float based file_version is common in fast5 files we can fix it later.

hasindu2008 commented 3 years ago

This is cool!