Closed: dgoldenberg-audiomack closed this issue 3 years ago
Tried with the column list as a list of just names, same error.
# ratings_columns = {
#     "user_id": tf.TensorSpec(tf.TensorShape([]), tf.int32),
#     "movie_id": tf.TensorSpec(tf.TensorShape([]), tf.int32),
#     "movie_title": tf.TensorSpec(tf.TensorShape([]), tf.string),
#     "rating": tf.TensorSpec(tf.TensorShape([]), tf.float32),
#     "timestamp": tf.TensorSpec(tf.TensorShape([]), tf.int32),
# }
ratings_columns = {
    "user_id",
    "movie_id",
    "movie_title",
    "rating",
    "timestamp",
}
@dgoldenberg-audiomack I explored your data and prepared a parquet IO dataset using a local copy. I think this error is caused by the presence of special characters in the movie names.
For example (from the csv):
6713 | Millennium Actress (Sennen joyû) (2001) | Animation|Drama|Romance
The ratings ParquetIODataset was unable to read rows such as these and raised:
InvalidArgumentError: null value in column: movie_title
Please pre-process your dataset and then try to load the data.
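For what it's worth, here is a minimal sketch of such a pre-processing step, assuming the goal is simply to replace non-ASCII characters before the data is written to Parquet. `strip_non_ascii` is a hypothetical helper, not part of the code in this issue:

```python
def strip_non_ascii(text: str) -> str:
    # Replace every non-ASCII character with a plain "?" placeholder.
    return "".join(ch if ord(ch) < 128 else "?" for ch in text)

title = "Millennium Actress (Sennen joyû) (2001)"
clean = strip_non_ascii(title)
# clean == "Millennium Actress (Sennen joy?) (2001)"
```

Whether dropping the characters (as above) or transliterating them is appropriate depends on how the titles are used downstream.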
@kvignesh1420 Interesting. I'll add some string cleansing code. However, a) why is the parquet reader so sensitive to special characters? and b) the error message is both wrong and misleading; it should say something like "invalid format", "invalid characters detected", or the like.
Can we add an enhancement to the framework's code to either allow the special chars or explicitly disallow them? If some characters are disallowed, can the documentation state which ones and why? From peeking at the Parquet documentation, I don't see any restrictions on character data.
@dgoldenberg-audiomack yes, the error message does seem misleading: the parquet-cpp reader that we use is unable to read a few rows such as these properly, and the initial assumption was that the row value might be null.
However, when I tried preparing the ParquetIODataset using the movies data, I found this row, as follows:
<tf.Tensor: shape=(), dtype=string, numpy=b'Millennium Actress (Sennen joy\xc3\xbb) (2001)'>
This means the parquet-cpp reader is able to read the characters, but in a different encoded format.
However, just to cross-verify can you try once again after you clean up the data?
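A quick sanity check (plain Python, nothing tfio-specific) confirms that `\xc3\xbb` is just the two-byte UTF-8 encoding of û, i.e. the value itself is intact and only displayed as raw bytes:

```python
# The byte string shown in the tensor repr is valid UTF-8, not corruption:
raw = b"Millennium Actress (Sennen joy\xc3\xbb) (2001)"
decoded = raw.decode("utf-8")
assert decoded == "Millennium Actress (Sennen joyû) (2001)"
# ...and it round-trips back to the same bytes:
assert decoded.encode("utf-8") == raw
```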
UPDATE: After further exploration, I found the following behaviour of core_ops.io_parquet_readable_read to be the root cause of the issue:
In [85]: data = core_ops.io_parquet_readable_read(
...: input=filename,
...: shared=filename,
...: component=components[2],
...: shape=shapes[2],
...: start=36362,
...: stop=36364,
...: dtype=dtypes[2],
...: container="ParquetIODataset",
...: )
2020-12-31 23:24:26.358031: E tensorflow_io/core/kernels/parquet_kernels.cc:246] Levels: 1 Values: 1
2020-12-31 23:24:26.358082: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at io_kernel.h:78 : Invalid argument: null value in column: movie_title
---------------------------------------------------------------------------
InvalidArgumentError Traceback (most recent call last)
<ipython-input-85-c5071ae579a4> in <module>
7 stop=36364,
8 dtype=dtypes[2],
----> 9 container="ParquetIODataset",
10 )
<string> in io_parquet_readable_read(input, shared, component, shape, start, stop, dtype, container, name)
<string> in io_parquet_readable_read_eager_fallback(input, shared, component, shape, start, stop, dtype, container, name, ctx)
~/.tf-io-venv/lib/python3.7/site-packages/tensorflow/python/eager/execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
58 ctx.ensure_initialized()
59 tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
---> 60 inputs, attrs, num_outputs)
61 except core._NotOkStatusException as e:
62 if name is not None:
InvalidArgumentError: null value in column: movie_title [Op:IO>ParquetReadableRead]
However, when we read the items individually, there is no exception:
In [89]: data = core_ops.io_parquet_readable_read(
...: input=filename,
...: shared=filename,
...: component=components[2],
...: shape=shapes[2],
...: start=36362,
...: stop=36363,
...: dtype=dtypes[2],
...: container="ParquetIODataset",
...: )
2020-12-31 23:27:12.787952: E tensorflow_io/core/kernels/parquet_kernels.cc:246] Levels: 1 Values: 1
In [90]: data = core_ops.io_parquet_readable_read(
...: input=filename,
...: shared=filename,
...: component=components[2],
...: shape=shapes[2],
...: start=36363,
...: stop=36364,
...: dtype=dtypes[2],
...: container="ParquetIODataset",
...: )
2020-12-31 23:27:16.651217: E tensorflow_io/core/kernels/parquet_kernels.cc:246] Levels: 1 Values: 1
@dgoldenberg-audiomack as mentioned above, until I get some insight on this, please let me know after you try using clean data.
@kvignesh1420 Vignesh, I'll add a cleansing step for the string data and update the ticket.
Hi @kvignesh1420,
The cleansing step is not helping. In the specific example
6713 | Millennium Actress (Sennen joyû) (2001) | Animation|Drama|Romance
the "special" character is û, a Unicode character. In theory it should cause no problems anywhere; I believe it is entirely legitimate for it to appear in Parquet. I added a filter for non-ASCII characters and re-ran my tester, and I'm still hitting the "null" problem on movie titles.
I can't even iterate the ratings dataset to get to the record where I presumably have a null movie title. Attaching the parquet file.
count = 0
for el in ratings_ds.take(100000).as_numpy_iterator():
    title = el["movie_title"]
    if not title:
        print("@@@@@@@ RATING - NO TITLE: " + str(el))
        count += 1
        if count > 10:
            break
Stack:
2021-01-05 19:24:55.904563: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at io_kernel.h:78 : Invalid argument: null value in column: movie_title
Traceback (most recent call last):
  File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/eager/context.py", line 2113, in execution_mode
    yield
  File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 733, in _next_internal
    output_shapes=self._flat_output_shapes)
  File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 2579, in iterator_get_next
    _ops.raise_from_not_ok_status(e, name)
  File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 6862, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: null value in column: movie_title [[{{node IO>ParquetReadableRead}}]] [Op:IteratorGetNext]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/tmp/spark-b91b0aea-9037-4642-a7c2-b46731650e6a/recsys_tfrs_proto.py", line 414, in <module>
    main(sys.argv)
  File "/mnt/tmp/spark-b91b0aea-9037-4642-a7c2-b46731650e6a/recsys_tfrs_proto.py", line 112, in main
    movies_ds, test, train, unique_movie_titles, unique_user_ids = prepare_data(movies_ds, ratings_ds)
  File "/mnt/tmp/spark-b91b0aea-9037-4642-a7c2-b46731650e6a/recsys_tfrs_proto.py", line 228, in prepare_data
    for el in ratings_ds.take(100000).as_numpy_iterator():
  File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 3942, in __next__
    return nest.map_structure(lambda x: x.numpy(), next(self._iterator))
  File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 747, in __next__
    return self._next_internal()
  File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 739, in _next_internal
    return structure.from_compatible_tensor_list(self._element_spec, ret)
  File "/usr/lib64/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/eager/context.py", line 2116, in execution_mode
    executor_new.wait()
  File "/home/hadoop/.local/lib/python3.7/site-packages/tensorflow/python/eager/executor.py", line 69, in wait
    pywrap_tfe.TFE_ExecutorWaitForAllPendingNodes(self._handle)
tensorflow.python.framework.errors_impl.InvalidArgumentError: null value in column: movie_title [[{{node IO>ParquetReadableRead}}]]
@kvignesh1420 So, what's the problem in core_ops.io_parquet_readable_read? Will you be instrumenting a fix? And in the meantime, is there a workaround? Thanks.
I believe this might be due to the failure of this condition for the BYTE_ARRAY type:
https://github.com/tensorflow/io/blob/master/tensorflow_io/core/kernels/parquet_kernels.cc#L193-L197
However, I am somewhat puzzled by this behaviour: https://github.com/tensorflow/io/issues/1254#issuecomment-753017262
@yongtang can you please help us with this?
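For background (a concept from the Parquet format itself, not tensorflow-io specifics): for an optional column, the definition levels record which slots hold real values and which are null, and the values array holds only the non-null entries, so a reader that gets back fewer values than levels concludes a null is present. A rough pure-Python illustration of that re-assembly, in spirit only (parquet-cpp's actual logic is in C++):

```python
def assemble_optional_column(def_levels, values, max_def_level=1):
    """Re-assemble an optional Parquet column from its definition levels.

    A slot whose definition level equals max_def_level carries a real
    value; any lower level means null at that position.
    """
    out, value_index = [], 0
    for level in def_levels:
        if level == max_def_level:
            out.append(values[value_index])
            value_index += 1
        else:
            out.append(None)
    return out

# Two definition levels but only one value: the second slot is genuinely null.
assemble_optional_column([1, 0], [b"Millennium Actress"])  # -> [b'Millennium Actress', None]
```

In this issue the column apparently has no nulls at all, which is why a levels/values mismatch reported by the kernel is surprising.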
Helper snippet for the column which fails:
import tensorflow_io as tfio
import tensorflow as tf
from tensorflow_io.core.python.ops import core_ops

filename = '/Users/vignesh/Downloads/part-00000-ca0e89bf-ccd7-47e1-925c-9b42c8716c84-c000.snappy.parquet'
columns = None

components, shapes, dtypes = core_ops.io_parquet_readable_info(
    filename, shared=filename, container="ParquetIODataset"
)
shapes = tf.unstack(shapes)
dtypes = [tf.as_dtype(dtype.numpy()) for dtype in tf.unstack(dtypes)]
components = [component.numpy() for component in tf.unstack(components)]

def dataset_f(component, shape, dtype):
    step = 4096
    indices_start = tf.data.Dataset.range(0, shape[0], step)
    indices_stop = indices_start.skip(1).concatenate(
        tf.data.Dataset.from_tensor_slices([shape[0]])
    )
    dataset = tf.data.Dataset.zip((indices_start, indices_stop))

    def f(start, stop):
        return core_ops.io_parquet_readable_read(
            input=filename,
            shared=filename,
            component=component,
            shape=shape,
            start=start,
            stop=stop,
            dtype=dtype,
            container="ParquetIODataset",
        )

    dataset = dataset.map(f)
    dataset = dataset.unbatch()
    return dataset

entries = list(zip(components, shapes, dtypes))
datasets = [
    dataset_f(component, shape, dtype)
    for component, shape, dtype in entries
]
datasets

for i in datasets[2]:
    x = i
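As a side note on the snippet above: dataset_f splits the row range into [start, stop) windows of step 4096, pairing each window's start with the next window's start and capping the last window at the total row count. With step 4096, rows 36362 and 36363 land in the same window, so the failing pair from earlier is read by a single batched call. The same pairing in plain Python, with a small step for illustration:

```python
def chunk_windows(num_rows, step):
    # Mirrors indices_start / indices_stop in dataset_f: each start is
    # paired with the next start, and the final window is capped at num_rows.
    starts = list(range(0, num_rows, step))
    stops = starts[1:] + [num_rows]
    return list(zip(starts, stops))

chunk_windows(10, 4)  # -> [(0, 4), (4, 8), (8, 10)]
```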
I will take a look.
Added PR #1262 for the fix.
@dgoldenberg-audiomack since the PR has been merged, you can use the tensorflow-io-nightly python package to get this fix until the next release.
Loading movielens data from Parquet I pre-generated into an IODataset yields the error below (stack included).
I'm attaching sample parquet files. I do not see any records where the movie_title column value would be null. Code is also attached. The logic flow is:

- Check either of the attached .parquet files, esp. the ratings file, e.g. with parquet-tools cat --json part-00000-b1706a65-3a09-4e0b-a7fb-33e5144d046c-c000.snappy.parquet. There are no null/empty movie_title values.
- The error appears to be coming from ./tensorflow_io/core/kernels/parquet_kernels.cc.
Trace:
tf_io_issue_parquet_to_ds_null_column.zip