tensorflow / io

Dataset, streaming, and file system extensions maintained by TensorFlow SIG-IO
Apache License 2.0

tf.data.IODataset.from_parquet crashes with Segmentation fault #1781

Open ToufiPF opened 1 year ago

ToufiPF commented 1 year ago

Hello, when reading a parquet file with specified dtypes and a column full of NaNs, tfio.IODataset.from_parquet crashes. I'm guessing this is somehow related to how data types are deduced or how NaNs are interpreted (but note that I'm providing the TensorSpecs manually).

Reproducible example

import tensorflow as tf
import tensorflow_io as tfio
import time
import pandas as pd
import numpy as np

if __name__ == '__main__':
    # features in the .parquet file
    feats = ['speed', 'pressure']
    # crashing column (full NaN)
    open_columns = ['speed']
    fname = 'crash.parquet'

    # Generate the dataset, 'speed' is a column full of NaN, and 'pressure' a column full of valid floats.
    nb_rows = 10
    df = pd.DataFrame()
    df['speed'] = np.full(nb_rows, np.nan, dtype=np.float32)
    df['pressure'] = np.random.uniform(0, 10, size=nb_rows).astype(np.float32)
    df.to_parquet(fname)

    df = pd.read_parquet(fname, columns=open_columns)
    df.info(verbose=True)  # pandas works fine
    for i, row in df.iterrows():
        pass
    print('Pandas OK')

    # changes the crash type, but crashes in any case
    # tf.data.experimental.enable_debug_mode()

    features = {f: tf.TensorSpec(shape=tf.TensorShape([]), dtype=tf.float32) for f in open_columns}
    ds = tfio.IODataset.from_parquet(fname, columns=features)
    # transform the ordered dict (with D named columns) to 1 tensor of shape (None,D)
    ds = ds.map(lambda dictionary: tf.stack(list(dictionary.values()), axis=0))

    for x in ds:
        time.sleep(0.01)

Stacktraces

The stacktraces below were obtained by running python -X dev crash.py. I removed the CUDA-related output, which I doubt is relevant. The stacktrace differs depending on whether tf.data.experimental.enable_debug_mode() is called.

Stacktrace with tf.data.experimental.enable_debug_mode()

/home/clea/.venv310/lib/python3.10/site-packages/tensorflow/__init__.py:29: DeprecationWarning: The distutils package is deprecated and slated for removal in Python 3.12. Use setuptools or check PEP 632 for potential alternatives
  import distutils as _distutils
2023-03-14 12:16:24.723564: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   speed   0 non-null      float32
dtypes: float32(1)
memory usage: 168.0 bytes
Pandas OK
2023-03-14 12:16:27.894764: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING:tensorflow:From /home/clea/.venv310/lib/python3.10/site-packages/tensorflow/python/autograph/pyct/static_analysis/liveness.py:83: Analyzer.lamba_check (from tensorflow.python.autograph.pyct.static_analysis.liveness) is deprecated and will be removed after 2023-09-23.
Instructions for updating:
Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089
2023-03-14 12:16:28.075214: W tensorflow_io/core/kernels/audio_video_mp3_kernels.cc:271] libmp3lame.so.0 or lame functions are not available
2023-03-14 12:16:28.075431: I tensorflow_io/core/kernels/cpu_check.cc:128] Your CPU supports instructions that this TensorFlow IO binary was not compiled to use: AVX2 FMA
Fatal Python error: Segmentation fault

Current thread 0x00007f3e39ffb640 (most recent call first):
  File "/home/clea/.venv310/lib/python3.10/site-packages/tensorflow/python/eager/execute.py", line 52 in quick_execute
  File "<string>", line 11138 in io_parquet_readable_read_eager_fallback
  File "<string>", line 11067 in io_parquet_readable_read
  File "/home/clea/.venv310/lib/python3.10/site-packages/tensorflow_io/python/ops/parquet_dataset_ops.py", line 93 in f
  File "/home/clea/.venv310/lib/python3.10/site-packages/tensorflow/python/data/ops/structured_function.py", line 212 in py_function_wrapper
  File "/home/clea/.venv310/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py", line 642 in wrapper
  File "/home/clea/.venv310/lib/python3.10/site-packages/tensorflow/python/ops/script_ops.py", line 154 in _call
  File "/home/clea/.venv310/lib/python3.10/site-packages/tensorflow/python/ops/script_ops.py", line 147 in __call__
  File "/home/clea/.venv310/lib/python3.10/site-packages/tensorflow/python/ops/script_ops.py", line 269 in __call__

Thread 0x00007f3f31fef000 (most recent call first):
  File "/home/clea/.venv310/lib/python3.10/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 3012 in iterator_get_next
  File "/home/clea/.venv310/lib/python3.10/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 770 in _next_internal
  File "/home/clea/.venv310/lib/python3.10/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 787 in __next__
  File "/mnt/c/Programming/ml-training/./src/reproduce_crash.py", line 51 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, google.protobuf.pyext._message, tensorflow.python.framework.fast_tensor_util, charset_normalizer.md, h5py._errors, h5py.defs, h5py._objects, h5py.h5, h5py.h5r, h5py.utils, h5py.h5s, h5py.h5ac, h5py.h5p, h5py.h5t, h5py._conv, h5py.h5z, h5py._proxy, h5py.h5a, h5py.h5d, h5py.h5ds, h5py.h5g, h5py.h5i, h5py.h5f, h5py.h5fd, h5py.h5pl, h5py.h5o, h5py.h5l, h5py._selector, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._isolve._iterative, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg._cythonized_array_utils, scipy.linalg._flinalg, scipy.linalg._solve_toeplitz, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_lapack, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, PIL._imaging, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, 
pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.tslib, pandas._libs.lib, pandas._libs.hashing, pyarrow.lib, pyarrow._hdfsio, pandas._libs.ops, pyarrow._compute, pandas._libs.arrays, pandas._libs.index, pandas._libs.join, pandas._libs.sparse, pandas._libs.reduction, pandas._libs.indexing, pandas._libs.internals, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.tslibs.strptime, pandas._libs.groupby, pandas._libs.testing, pandas._libs.parsers, pandas._libs.json, scipy.ndimage._nd_image, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, _ni_label, scipy.ndimage._ni_label, pyarrow._parquet, pyarrow._fs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, pyarrow._csv, pyarrow._dataset, pyarrow._dataset_orc, pyarrow._dataset_parquet (total: 126)
Segmentation fault

Stacktrace without tf.data.experimental.enable_debug_mode()

[...] same stuff

terminate called after throwing an instance of 'parquet::ParquetException'
  what():  Unexpected end of stream
Fatal Python error: Aborted

Thread 0x00007efc57514000 (most recent call first):
  File "/home/clea/.venv310/lib/python3.10/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 3012 in iterator_get_next
  File "/home/clea/.venv310/lib/python3.10/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 770 in _next_internal
  File "/home/clea/.venv310/lib/python3.10/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 787 in __next__
  File "/mnt/c/Programming/ml-training/./src/reproduce_crash.py", line 51 in <module>

[...] same stuff

Environment

requirements.txt

numpy~=1.23.5
pandas~=1.5.3
pyarrow~=11.0.0
tensorflow-io[tensorflow]~=0.31.0

This is similar to #1667, except there shouldn't be any type mixup in this scenario.

ToufiPF commented 1 year ago

Leaving this here for anyone having the same issue.

The workaround I found was to replace the NaNs with a magic number (e.g., -infinity) in the data using pandas. Then at loading time, I use Dataset.map to map back the -infinity to NaNs.
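
For anyone who wants to try the same workaround, here is a minimal sketch of what it might look like. The names SENTINEL, encode_nans and decode_nans are hypothetical, and the parquet-backed dataset is replaced by tf.data.Dataset.from_tensor_slices so that the snippet runs without tensorflow-io; in the real pipeline the map would be applied to the dataset returned by tfio.IODataset.from_parquet instead.

```python
import numpy as np
import pandas as pd
import tensorflow as tf

# Hypothetical sentinel: any value that never occurs in the real data would do.
SENTINEL = -np.inf


def encode_nans(df: pd.DataFrame) -> pd.DataFrame:
    """Replace NaNs with the sentinel before writing the parquet file."""
    return df.fillna(SENTINEL)


def decode_nans(x: tf.Tensor) -> tf.Tensor:
    """Map the sentinel back to NaN at loading time."""
    nan = tf.constant(float("nan"), dtype=x.dtype)
    return tf.where(tf.equal(x, SENTINEL), nan, x)


df = pd.DataFrame({
    "speed": np.full(4, np.nan, dtype=np.float32),   # column full of NaN
    "pressure": np.arange(4, dtype=np.float32),      # column of valid floats
})
encoded = encode_nans(df)
# In the real pipeline: encoded.to_parquet(fname), then
# tfio.IODataset.from_parquet(fname, columns=features) and the tf.stack map.

# Stand-in for the parquet-backed dataset (assumption, see the note above).
ds = tf.data.Dataset.from_tensor_slices(encoded.to_numpy(dtype=np.float32))
ds = ds.map(decode_nans)

restored = np.stack([row.numpy() for row in ds])
```

Note that with this approach any genuine -infinity in the data would also come back as NaN, so the sentinel has to be a value the data can never contain.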