nanoporetech / vbz_compression

VBZ compression plugin for nanopore signal data
https://nanoporetech.com/
Mozilla Public License 2.0

Segmentation Fault when reading multiple fast5 files #27

Open BlkPingu opened 1 year ago

BlkPingu commented 1 year ago

To assist reproducing bugs, please include the following:

Crash report:

Thread 0 Crashed::  Dispatch queue: com.apple.main-thread
0   libsystem_kernel.dylib                 0x1a41e8724 __pthread_kill + 8
1   libsystem_pthread.dylib                0x1a421fc28 pthread_kill + 288
2   libsystem_c.dylib                      0x1a40f646c raise + 32
3   Python                                 0x10114f8b0 faulthandler_fatal_error + 440
4   libsystem_platform.dylib               0x1a424ea24 _sigtramp + 56
5   libvbz_hdf_plugin_m1.dylib             0x126905744 StreamVByteWorkerV0<short, true>::decompress(gsl::span<char const>, gsl::span<char>) + 180
6   libvbz_hdf_plugin_m1.dylib             0x1269070f8 vbz_decompress + 340
7   libvbz_hdf_plugin_m1.dylib             0x126903fbc vbz_filter(unsigned int, unsigned long, unsigned int const*, unsigned long, unsigned long*, void**) + 748
8   libhdf5.200.dylib                      0x1020d1438 H5Z_pipeline + 508
9   libhdf5.200.dylib                      0x101e670c8 H5D__chunk_lock + 884
10  libhdf5.200.dylib                      0x101e622ac H5D__chunk_read + 780
11  libhdf5.200.dylib                      0x101e855b8 H5D__read + 1224
12  libhdf5.200.dylib                      0x1020c2f58 H5VL__native_dataset_read + 116
13  libhdf5.200.dylib                      0x1020ab1f0 H5VL_dataset_read + 180
14  libhdf5.200.dylib                      0x101e8454c H5Dread + 744
15  defs.cpython-311-darwin.so             0x102d6bfbc __pyx_f_4h5py_4defs_H5Dread + 96
16  _selector.cpython-311-darwin.so        0x10372cfe4 __pyx_pw_4h5py_9_selector_6Reader_3read + 336
17  Python                                 0x1010e988c _PyEval_EvalFrameDefault + 46824
18  Python                                 0x1010ec72c _PyEval_Vector + 116
19  _objects.cpython-311-darwin.so         0x102d392f8 __pyx_pw_4h5py_8_objects_9with_phil_1wrapper + 564
20  Python                                 0x10100fe34 _PyObject_MakeTpCall + 128
21  Python                                 0x1010133bc method_vectorcall + 564
22  Python                                 0x1010798c8 vectorcall_method + 128
23  Python                                 0x101078d00 slot_mp_subscript + 52
24  Python                                 0x1010dfa34 _PyEval_EvalFrameDefault + 6288
25  Python                                 0x1010dd660 PyEval_EvalCode + 168
26  Python                                 0x1011300ec run_eval_code_obj + 84
27  Python                                 0x101130050 run_mod + 112
28  Python                                 0x10112fe90 pyrun_file + 148
29  Python                                 0x10112f8e4 _PyRun_SimpleFileObject + 268
30  Python                                 0x10112f274 _PyRun_AnyFileObject + 216
31  Python                                 0x10114b16c pymain_run_file_obj + 220
32  Python                                 0x10114aaac pymain_run_file + 72
33  Python                                 0x10114a38c Py_RunMain + 704
34  Python                                 0x10114b4c4 Py_BytesMain + 40
35  dyld                                   0x1a3ec7f28 start + 2236

Faulthandler:

Current thread 0x00000001ff1f9e00 (most recent call first):
  File "/Users/Tobias/Documents/Studium/Master_AI/Semester/Masterarbeit/Masterarbeit_Versuch_2/Code/.venv/lib/python3.11/site-packages/h5py/_hl/dataset.py", line 758 in __getitem__
  File "/Users/Tobias/Documents/Studium/Master_AI/Semester/Masterarbeit/Masterarbeit_Versuch_2/Code/.venv/lib/python3.11/site-packages/ont_fast5_api/fast5_read.py", line 527 in _load_raw
  File "/Users/Tobias/Documents/Studium/Master_AI/Semester/Masterarbeit/Masterarbeit_Versuch_2/Code/.venv/lib/python3.11/site-packages/ont_fast5_api/fast5_read.py", line 161 in get_raw_data
  File "/Users/Tobias/Documents/Studium/Master_AI/Semester/Masterarbeit/Masterarbeit_Versuch_2/Code/testcode/barcode_to_parquet.py", line 17 in raw_data_to_numpy_array
  File "/Users/Tobias/Documents/Studium/Master_AI/Semester/Masterarbeit/Masterarbeit_Versuch_2/Code/testcode/barcode_to_parquet.py", line 54 in combine_files
  File "/Users/Tobias/Documents/Studium/Master_AI/Semester/Masterarbeit/Masterarbeit_Versuch_2/Code/testcode/barcode_to_parquet.py", line 62 in iter_run
  File "/Users/Tobias/Documents/Studium/Master_AI/Semester/Masterarbeit/Masterarbeit_Versuch_2/Code/testcode/barcode_to_parquet.py", line 74 in run_all
  File "/Users/Tobias/Documents/Studium/Master_AI/Semester/Masterarbeit/Masterarbeit_Versuch_2/Code/testcode/barcode_to_parquet.py", line 83 in main
  File "/Users/Tobias/Documents/Studium/Master_AI/Semester/Masterarbeit/Masterarbeit_Versuch_2/Code/testcode/barcode_to_parquet.py", line 86 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, h5py._errors, h5py.defs, h5py._objects, h5py.h5, h5py.h5r, h5py.utils, h5py.h5s, h5py.h5ac, h5py.h5p, h5py.h5t, h5py._conv, h5py.h5z, h5py._proxy, h5py.h5a, h5py.h5d, h5py.h5ds, h5py.h5g, h5py.h5i, h5py.h5f, h5py.h5fd, h5py.h5pl, h5py.h5o, h5py.h5l, h5py._selector, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pandas._libs.hashing, pyarrow.lib, pyarrow._hdfsio, pandas._libs.tslib, pandas._libs.ops, numexpr.interpreter, pyarrow._compute, pandas._libs.arrays, pandas._libs.sparse, pandas._libs.reduction, pandas._libs.indexing, pandas._libs.index, pandas._libs.internals, pandas._libs.join, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.testing, pandas._libs.parsers, pandas._libs.json, pyarrow._parquet, pyarrow._fs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs (total: 87)
[1]    88801 segmentation fault  python barcode_to_parquet.py

Script:

from ont_fast5_api.fast5_interface import get_fast5_file
import numpy as np
import h5py
import pandas as pd
import os
import faulthandler

def raw_data_to_numpy_array(readfile):
    data = []

    fast5_filepath = readfile # This can be a single- or multi-read file
    with get_fast5_file(fast5_filepath, mode="r") as f5:
        for read in f5.get_reads():
            raw_data = read.get_raw_data()
            data.append((read.read_id, raw_data))

    return pd.DataFrame(data, columns=['read_id', 'raw_data'])

def get_paths(source_dir, iterator):
    barcode_iter = 'barcode' + iterator
    barcode_pass = os.path.join(source_dir, 'fast5_pass', barcode_iter)
    barcode_fail = os.path.join(source_dir, 'fast5_fail', barcode_iter)
    if os.path.exists(barcode_pass) and os.path.exists(barcode_fail):
        print(barcode_pass)
        print(barcode_fail)

        return [barcode_pass, barcode_fail]
    else:
        print('No barcodes found')

def walk_through_files(path, file_extension='.fast5'):
    for (dirpath, dirnames, filenames) in os.walk(path):
        for filename in filenames:
            if filename.endswith(file_extension):
                yield os.path.join(dirpath, filename)

def combine_files(source_dir, iterator):
    barcode_list = get_paths(source_dir, iterator)
    if barcode_list is None:
        print('File not found')
        return None
    barcode_parts = []
    for barcode in barcode_list:
        for fname in walk_through_files(barcode):
            barcode_part = raw_data_to_numpy_array(fname)
            barcode_parts.append(barcode_part)

    return pd.concat(barcode_parts)

def iter_run(run_path, run_id):
    for iterator in ["{0:02}".format(i) for i in range(1, 100)]:
        df = combine_files(run_path, iterator)
        if df is None:
            continue
        df.to_parquet(run_id + '_barcode_' + iterator + '.parquet', engine='pyarrow', compression='snappy')
        df = None

def run_all(source_dir, run_id_list):
    dirs = os.listdir(source_dir)
    print(dirs)
    for run_id in run_id_list:
        print(run_id)
        run_folder_name = [d for d in dirs if d.endswith(run_id)]
        print(run_folder_name[0])
        run_path = os.path.join(source_dir, run_folder_name[0])
        iter_run(run_path, run_id)

test_path = 'Path/to/data/MinION_sample_data/RawData/RKI/MinION_fast5/unpacked/'
prod_path = 'Path/to/data/extracted/'

run_id_list = ['024', '084', '085', '086', '107', '123', '135']

def main():
    faulthandler.enable()
    run_all(source_dir=prod_path, run_id_list=run_id_list)

if __name__ == '__main__':
    main()

Basically, when reading multiple fast5 files, my script segfaults. I have no idea why, but faulthandler points me in the direction of h5py's dataset.py, and the crash report indicates the segfault occurred in the VBZ compression plugin, libvbz_hdf_plugin_m1.dylib.

The script's purpose is to combine the data of multiple barcodes, each with passes and fails, into one Parquet file per barcode. For some reason it segfaults after writing a few GB worth of data. Please don't roast me for the terrible code quality; it's just meant to get some data into a different random-access format.

Any advice?
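
One way to narrow this down (a minimal sketch, not part of the original script; find_crashing_file and its paths argument are illustrative) is to print and flush each path immediately before reading it. Because the segfault kills the interpreter outright, the last line on stdout then names the file that triggered the crash:

from ont_fast5_api.fast5_interface import get_fast5_file

def find_crashing_file(paths):
    # Flush each path to stdout before touching the file, so the last
    # printed line identifies the file that caused the crash.
    for path in paths:
        print(path, flush=True)
        with get_fast5_file(path, mode="r") as f5:
            for read in f5.get_reads():
                read.get_raw_data()  # forces VBZ decompression of the raw signal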

0x55555555 commented 1 year ago

Hi @BlkPingu ,

Have you tried on multiple input datasets, or only one?

I'll give it a go now with some input data I have locally.

0x55555555 commented 1 year ago

Right @BlkPingu ,

It does all seem to work as expected on my end; I converted ~10 GB of data through the script and it didn't crash.

Is it possible there is a specific part of the input datasets that is corrupted?

BlkPingu commented 1 year ago

Hello George, that is wild. Did you run the script on a macOS device or something else? It could very well be that the data is corrupted, at least partially. Thanks for the suggestion.

0x55555555 commented 1 year ago

Hi @BlkPingu ,

It was on an Apple M1 Max with 32 GB of memory. I did notice the script using > 40 GB of memory while running, which was quite exciting.
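
On the memory side, a sketch of one way to keep the peak down, assuming all files produce the same two columns (write_barcode_parquet and out_path are illustrative names, not part of the original script): stream each file's table into a single ParquetWriter instead of concatenating every per-file DataFrame before writing.

import pyarrow as pa
import pyarrow.parquet as pq

def write_barcode_parquet(fast5_paths, out_path):
    # Convert and write one fast5 file at a time so only a single file's
    # reads are held in memory, rather than a whole barcode's worth.
    writer = None
    for fname in fast5_paths:
        df = raw_data_to_numpy_array(fname)
        table = pa.Table.from_pandas(df, preserve_index=False)
        if writer is None:
            writer = pq.ParquetWriter(out_path, table.schema, compression='snappy')
        writer.write_table(table)
    if writer is not None:
        writer.close()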

If you find it reproduces specifically with one file, I could have a look at the file?
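
One way to hunt for such a file without the whole run dying on the first crash (a sketch; scan_for_crashes and READER are illustrative, and it assumes a POSIX system where a segfault shows up as a negative return code): open each file in a throwaway child process and record the ones whose child exits abnormally.

import subprocess
import sys

READER = """
import sys
from ont_fast5_api.fast5_interface import get_fast5_file
with get_fast5_file(sys.argv[1], mode="r") as f5:
    for read in f5.get_reads():
        read.get_raw_data()
"""

def scan_for_crashes(paths):
    # Decompress every read of every file in a separate child process;
    # a negative return code (e.g. -11 for SIGSEGV) flags the file.
    bad = []
    for path in paths:
        result = subprocess.run([sys.executable, "-c", READER, path])
        if result.returncode != 0:
            bad.append((path, result.returncode))
    return bad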