tensorflow / io

Dataset, streaming, and file system extensions maintained by TensorFlow SIG-IO
Apache License 2.0

"InvalidArgumentError: null value in column" when loading Parquet into tfio.IODataset when column has no actual null values #1254

Closed dgoldenberg-audiomack closed 3 years ago

dgoldenberg-audiomack commented 3 years ago

Loading MovieLens data from Parquet files I pre-generated into an IODataset yields the error below (stack trace included).

I'm attaching sample parquet files. I do not see any records where the movie_title column value would be null.

Code is also attached. The logic flow is:

  1. Load the CSV movielens data from AWS S3.
  2. Convert this data to Parquet and store Parquet in S3. This generates the Parquet we can test with.
  3. Download the Parquet files from S3 to the local temp dir. Have to do this as a workaround for #1252 (can't stream Parquet directly into TF IO to load a dataset).
  4. Load a TF IO dataset from the Parquet data.
  5. Observe the error tensorflow.python.framework.errors_impl.InvalidArgumentError: null value in column: movie_title.
  6. Note: I tried Parquet with the compression option (snappy) and without, same error.

Check either of the attached .parquet files, especially the ratings file, e.g. with parquet-tools cat --json part-00000-b1706a65-3a09-4e0b-a7fb-33e5144d046c-c000.snappy.parquet. There are no null or empty movie_title values.
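The manual check described above can be automated. The following is a minimal sketch, assuming that parquet-tools cat --json emits one JSON object per line; the helper name count_missing and the sample records are hypothetical and only illustrate the idea.

```python
import json


def count_missing(json_lines, column):
    """Count records whose `column` is absent, null, or an empty string.

    Returns a (missing, total) tuple.
    """
    missing = 0
    total = 0
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        total += 1
        value = record.get(column)
        if value is None or value == "":
            missing += 1
    return missing, total


# Illustrative records only; real input would be the parquet-tools output.
sample = [
    '{"user_id": 1, "movie_id": 6713, "movie_title": "Millennium Actress (Sennen joy\\u00fb) (2001)", "rating": 4.0}',
    '{"user_id": 2, "movie_id": 1, "movie_title": "Toy Story (1995)", "rating": 5.0}',
]
print(count_missing(sample, "movie_title"))
```

Passing sys.stdin as json_lines would let the same function consume the piped output of parquet-tools directly. A result whose first element is 0 supports the claim that the column really contains no nulls.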

The error appears to be coming from ./tensorflow_io/core/kernels/parquet_kernels.cc.

Trace:

2020-12-30 20:33:28.965622: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at io_kernel.h:78 : Invalid argument: null value in column: movie_title
Traceback (most recent call last):
  File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/eager/context.py", line 2113, in execution_mode
    yield
  File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 733, in _next_internal
    output_shapes=self._flat_output_shapes)
  File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 2579, in iterator_get_next
    _ops.raise_from_not_ok_status(e, name)
  File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 6862, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: null value in column: movie_title [[{{node IO>ParquetReadableRead}}]] [Op:IteratorGetNext]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/tmp/spark-e0c9234e-cfad-460a-8c83-1d4f2b86aef8/my_app.py", line 367, in main(sys.argv)
  File "/mnt/tmp/spark-e0c9234e-cfad-460a-8c83-1d4f2b86aef8/my_app.py", line 81, in main
    movies_ds, test, train, unique_movie_titles, unique_user_ids = prepare_data(movies_ds, ratings_ds)
  File "/mnt/tmp/spark-e0c9234e-cfad-460a-8c83-1d4f2b86aef8/my_app.py", line 197, in prepare_data
    print(">> SIZE(ratings_no_titles_ds) = " + str(ds_size(ratings_no_titles_ds)))
  File "/mnt/tmp/spark-e0c9234e-cfad-460a-8c83-1d4f2b86aef8/my_app.py", line 176, in ds_size
    for elem in ds:
  File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 747, in next
    return self._next_internal()
  File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 739, in _next_internal
    return structure.from_compatible_tensor_list(self._element_spec, ret)
  File "/usr/lib64/python3.7/contextlib.py", line 130, in exit
    self.gen.throw(type, value, traceback)
  File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/eager/context.py", line 2116, in execution_mode
    executor_new.wait()
  File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/eager/executor.py", line 69, in wait
    pywrap_tfe.TFE_ExecutorWaitForAllPendingNodes(self._handle)
tensorflow.python.framework.errors_impl.InvalidArgumentError: null value in column: movie_title [[{{node IO>ParquetReadableRead}}]]

tf_io_issue_parquet_to_ds_null_column.zip

dgoldenberg-audiomack commented 3 years ago

Tried with the column list as a list of just names, same error.

    # ratings_columns = {
    #     "user_id": tf.TensorSpec(tf.TensorShape([]), tf.int32),
    #     "movie_id": tf.TensorSpec(tf.TensorShape([]), tf.int32),
    #     "movie_title": tf.TensorSpec(tf.TensorShape([]), tf.string),
    #     "rating": tf.TensorSpec(tf.TensorShape([]), tf.float32),
    #     "timestamp": tf.TensorSpec(tf.TensorShape([]), tf.int32),
    # }
    ratings_columns = {
        "user_id",
        "movie_id",
        "movie_title",
        "rating",
        "timestamp"
    }

kvignesh1420 commented 3 years ago

@dgoldenberg-audiomack I explored your data and prepared a Parquet IO dataset using the local copy. I think this error is caused by the presence of special characters in movie names.

For example: (from csv)

6713 | Millennium Actress (Sennen joyû) (2001) | Animation\|Drama\|Romance

The ratings ParquetIODataset was unable to read rows such as these and raised the error:

InvalidArgumentError: null value in column: movie_title

Please pre-process your dataset and then try to load the data.

dgoldenberg-audiomack commented 3 years ago

@kvignesh1420 Interesting. I'll add some string cleansing code. However, a) why is the parquet reader so sensitive to special characters? and b) the error message is both wrong and misleading. It should say something like "invalid format", "invalid characters detected", or the like.

Can we add an enhancement to the framework's code to either allow the special characters or explicitly disallow them? If some characters are disallowed, can the documentation state which ones and why? From peeking at the Parquet documentation, I don't see any restrictions on character data.

kvignesh1420 commented 3 years ago

@dgoldenberg-audiomack yes, the error message seems misleading. The parquet-cpp reader that we use is unable to read a few rows such as these properly, and the initial assumption was that the row value might be null.

However, when I tried preparing the ParquetIODataset using the movies data, I found this row as follows:

 <tf.Tensor: shape=(), dtype=string, numpy=b'Millennium Actress (Sennen joy\xc3\xbb) (2001)'>),

This means the parquet-cpp reader is able to read the characters; they come back as UTF-8 encoded bytes (\xc3\xbb is the UTF-8 encoding of û).
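A short check, using only the Python standard library, shows that the bytes in the tensor above are simply the UTF-8 encoding of the original title, so nothing was lost when the string was read. The variable names are illustrative.

```python
# Title as it appears in the CSV; \u00fb is the character û.
title = "Millennium Actress (Sennen joy\u00fb) (2001)"

# Bytes as they appear in the tf.Tensor printed above.
raw = b"Millennium Actress (Sennen joy\xc3\xbb) (2001)"

# Encoding the title as UTF-8 reproduces the tensor bytes exactly.
encoded = title.encode("utf-8")
print(encoded == raw)

# Decoding the tensor bytes as UTF-8 recovers the original title.
decoded = raw.decode("utf-8")
print(decoded == title)
```

Both comparisons print True, which suggests the non-ASCII character is round-tripped correctly and is unlikely to be what the reader treats as null.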

However, just to cross-verify, can you try once again after you clean up the data?

kvignesh1420 commented 3 years ago

UPDATE: After further exploration, I found the following behaviour of core_ops.io_parquet_readable_read to be the root cause of the issue:

In [85]: data = core_ops.io_parquet_readable_read(
    ...:                         input=filename,
    ...:                         shared=filename,
    ...:                         component=components[2],
    ...:                         shape=shapes[2],
    ...:                         start=36362,
    ...:                         stop=36364,
    ...:                         dtype=dtypes[2],
    ...:                         container="ParquetIODataset",
    ...:                     )
2020-12-31 23:24:26.358031: E tensorflow_io/core/kernels/parquet_kernels.cc:246] Levels: 1 Values: 1
2020-12-31 23:24:26.358082: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at io_kernel.h:78 : Invalid argument: null value in column: movie_title
---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
<ipython-input-85-c5071ae579a4> in <module>
      7                         stop=36364,
      8                         dtype=dtypes[2],
----> 9                         container="ParquetIODataset",
     10                     )

<string> in io_parquet_readable_read(input, shared, component, shape, start, stop, dtype, container, name)

<string> in io_parquet_readable_read_eager_fallback(input, shared, component, shape, start, stop, dtype, container, name, ctx)

~/.tf-io-venv/lib/python3.7/site-packages/tensorflow/python/eager/execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     58     ctx.ensure_initialized()
     59     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
---> 60                                         inputs, attrs, num_outputs)
     61   except core._NotOkStatusException as e:
     62     if name is not None:

InvalidArgumentError: null value in column: movie_title [Op:IO>ParquetReadableRead]

However, when we read the items individually, there is no exception:

In [89]: data = core_ops.io_parquet_readable_read(
    ...:                         input=filename,
    ...:                         shared=filename,
    ...:                         component=components[2],
    ...:                         shape=shapes[2],
    ...:                         start=36362,
    ...:                         stop=36363,
    ...:                         dtype=dtypes[2],
    ...:                         container="ParquetIODataset",
    ...:                     )
2020-12-31 23:27:12.787952: E tensorflow_io/core/kernels/parquet_kernels.cc:246] Levels: 1 Values: 1

In [90]: data = core_ops.io_parquet_readable_read(
    ...:                         input=filename,
    ...:                         shared=filename,
    ...:                         component=components[2],
    ...:                         shape=shapes[2],
    ...:                         start=36363,
    ...:                         stop=36364,
    ...:                         dtype=dtypes[2],
    ...:                         container="ParquetIODataset",
    ...:                     )
2020-12-31 23:27:16.651217: E tensorflow_io/core/kernels/parquet_kernels.cc:246] Levels: 1 Values: 1

@dgoldenberg-audiomack as mentioned above, until I get some insight on this, please let me know after you try using clean data.
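The comparison above (a two-row read fails while each single-row read succeeds) can be repeated systematically over a neighbourhood of rows. The sketch below is hypothetical: failing_windows and fake_read_range are illustrative names, and fake_read_range is a stand-in that only mimics the observed behaviour so the example runs without a Parquet file.

```python
def failing_windows(read_range, start, stop, max_width=2):
    """Try every [lo, hi) window up to max_width rows wide inside [start, stop).

    Returns a list of (lo, hi, message) tuples for the windows where
    read_range raised an exception.
    """
    failures = []
    for width in range(1, max_width + 1):
        for lo in range(start, stop - width + 1):
            hi = lo + width
            try:
                read_range(lo, hi)
            except Exception as exc:
                failures.append((lo, hi, str(exc)))
    return failures


def fake_read_range(lo, hi):
    # Stand-in reader (an assumption, not the real op): it fails only when
    # a single read covers both row 36362 and row 36363, as seen above.
    if lo <= 36362 and hi >= 36364:
        raise ValueError("null value in column: movie_title")
    return list(range(lo, hi))


print(failing_windows(fake_read_range, 36360, 36366))
```

With the real file, read_range would wrap the core_ops.io_parquet_readable_read call shown in the session above, passing lo and hi as start and stop. A result that lists only multi-row windows points at how a range is read rather than at the content of any single row.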

dgoldenberg-audiomack commented 3 years ago

@kvignesh1420 Vignesh, I'll add a cleansing step for the string data and update the ticket.

dgoldenberg-audiomack commented 3 years ago

Hi @kvignesh1420,

The cleansing step is not helping. In the specific example:

6713 | Millennium Actress (Sennen joyû) (2001) | Animation\|Drama\|Romance

The "special" character is û, a Unicode character. In theory it should cause no problems anywhere; I believe it's entirely legitimate for it to appear in Parquet. I've added a filter for non-ASCII characters and re-ran my tester. I'm still hitting the "null" problem on movie titles.
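For reference, a non-ASCII filter of the kind mentioned above might look like the following sketch. This is not the actual cleansing code from the attached application; to_ascii is a hypothetical helper that uses the standard unicodedata module to strip accents before dropping anything that is still outside ASCII.

```python
import unicodedata


def to_ascii(text):
    """Return an ASCII-only approximation of `text`.

    NFKD normalization splits accented characters into a base letter plus
    combining marks; encoding with errors="ignore" then drops the marks and
    any other character that has no ASCII form.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Millennium Actress (Sennen joy\u00fb) (2001)"))
```

Since the error persists after this kind of filtering, the non-ASCII characters are probably not the cause.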

I can't even iterate the ratings dataset to get to the record where I presumably have a null movie title. Attaching the parquet file.

    count = 0
    for el in ratings_ds.take(100000).as_numpy_iterator():
        title = el["movie_title"]
        if not title:
            print("@@@@@@@ RATING - NO TITLE: " + str(el))
            count += 1
        if count > 10:
            break

Stack:

2021-01-05 19:24:55.904563: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at io_kernel.h:78 : Invalid argument: null value in column: movie_title
Traceback (most recent call last):
  File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/eager/context.py", line 2113, in execution_mode
    yield
  File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 733, in _next_internal
    output_shapes=self._flat_output_shapes)
  File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 2579, in iterator_get_next
    _ops.raise_from_not_ok_status(e, name)
  File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 6862, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: null value in column: movie_title [[{{node IO>ParquetReadableRead}}]] [Op:IteratorGetNext]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/tmp/spark-b91b0aea-9037-4642-a7c2-b46731650e6a/recsys_tfrs_proto.py", line 414, in main(sys.argv)
  File "/mnt/tmp/spark-b91b0aea-9037-4642-a7c2-b46731650e6a/recsys_tfrs_proto.py", line 112, in main
    movies_ds, test, train, unique_movie_titles, unique_user_ids = prepare_data(movies_ds, ratings_ds)
  File "/mnt/tmp/spark-b91b0aea-9037-4642-a7c2-b46731650e6a/recsys_tfrs_proto.py", line 228, in prepare_data
    for el in ratings_ds.take(100000).as_numpy_iterator():
  File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 3942, in next
    return nest.map_structure(lambda x: x.numpy(), next(self._iterator))
  File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 747, in next
    return self._next_internal()
  File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 739, in _next_internal
    return structure.from_compatible_tensor_list(self._element_spec, ret)
  File "/usr/lib64/python3.7/contextlib.py", line 130, in exit
    self.gen.throw(type, value, traceback)
  File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/eager/context.py", line 2116, in execution_mode
    executor_new.wait()
  File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/eager/executor.py", line 69, in wait
    pywrap_tfe.TFE_ExecutorWaitForAllPendingNodes(self._handle)
tensorflow.python.framework.errors_impl.InvalidArgumentError: null value in column: movie_title [[{{node IO>ParquetReadableRead}}]]

tf_io_1254_parquet_with_cleansed.zip

dgoldenberg-audiomack commented 3 years ago

@kvignesh1420 So, what's the problem in core_ops.io_parquet_readable_read? Will you be implementing a fix? And in the meantime, is there a workaround? Thanks.

kvignesh1420 commented 3 years ago

I believe this might be due to the failure of this condition (for the BYTE_ARRAY type):

https://github.com/tensorflow/io/blob/master/tensorflow_io/core/kernels/parquet_kernels.cc#L193-L197.

However, I am somewhat puzzled based on this behaviour: https://github.com/tensorflow/io/issues/1254#issuecomment-753017262

@yongtang can you please help us with this?

Helper snippet that reproduces the failure for the affected column:

import tensorflow as tf
import tensorflow_io as tfio
from tensorflow_io.core.python.ops import core_ops

filename = "/Users/vignesh/Downloads/part-00000-ca0e89bf-ccd7-47e1-925c-9b42c8716c84-c000.snappy.parquet"
columns = None

# Query the column names, shapes, and dtypes stored in the Parquet file.
components, shapes, dtypes = core_ops.io_parquet_readable_info(
    filename, shared=filename, container="ParquetIODataset"
)
shapes = tf.unstack(shapes)
dtypes = [tf.as_dtype(dtype.numpy()) for dtype in tf.unstack(dtypes)]
components = [component.numpy() for component in tf.unstack(components)]


def dataset_f(component, shape, dtype):
    # Build (start, stop) row ranges that cover the column in steps of 4096.
    step = 4096
    indices_start = tf.data.Dataset.range(0, shape[0], step)
    indices_stop = indices_start.skip(1).concatenate(
        tf.data.Dataset.from_tensor_slices([shape[0]])
    )
    dataset = tf.data.Dataset.zip((indices_start, indices_stop))

    def f(start, stop):
        # Read rows [start, stop) of this column.
        return core_ops.io_parquet_readable_read(
            input=filename,
            shared=filename,
            component=component,
            shape=shape,
            start=start,
            stop=stop,
            dtype=dtype,
            container="ParquetIODataset",
        )

    dataset = dataset.map(f)
    dataset = dataset.unbatch()
    return dataset


entries = list(zip(components, shapes, dtypes))
datasets = [
    dataset_f(component, shape, dtype)
    for component, shape, dtype in entries
]

# Iterating the column at index 2 raises the InvalidArgumentError.
for i in datasets[2]:
    x = i
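The chunking done by dataset_f can be mirrored in plain Python to see which read range a given row falls into. This is a minimal sketch: chunk_ranges and chunk_containing are hypothetical helpers, and the total of 100000 rows is an assumption for illustration, not a value taken from the attached file.

```python
def chunk_ranges(total_rows, step=4096):
    """Return the [start, stop) pairs that cover total_rows in fixed steps.

    Mirrors the indices_start / indices_stop construction in dataset_f.
    """
    starts = list(range(0, total_rows, step))
    stops = starts[1:] + [total_rows]
    return list(zip(starts, stops))


def chunk_containing(row, total_rows, step=4096):
    """Return the (start, stop) chunk that contains `row`, or None."""
    for start, stop in chunk_ranges(total_rows, step):
        if start <= row < stop:
            return (start, stop)
    return None


print(chunk_ranges(10000))
print(chunk_containing(36362, 100000))
```

Under these assumptions, rows 36362 and 36363 both fall in the chunk starting at 32768, so the boundary between the two rows does not coincide with a boundary of the 4096-row read ranges.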

yongtang commented 3 years ago

I will take a look.

yongtang commented 3 years ago

Added PR #1262 for the fix.

kvignesh1420 commented 3 years ago

@dgoldenberg-audiomack since the PR has been merged, you can install the tensorflow-io-nightly Python package to get this fix until the next release.