uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0

[ML-10156] Fix array type field inferred shape #517

Closed WeichenXu123 closed 4 years ago

WeichenXu123 commented 4 years ago

This PR sets the inferred shape of an array-type field to (None,) instead of (). This fixes shape errors when such fields are read through a TensorFlow dataset.
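
To illustrate why the declared shape matters, here is a minimal pure-Python sketch of a static shape compatibility check. The helper `is_compatible` is hypothetical and is neither Petastorm's nor TensorFlow's code; it only assumes the usual convention that `None` stands for a dimension of unknown size.

```python
def is_compatible(declared, actual):
    """Return True if a concrete shape fits a declared shape.

    Hypothetical helper for illustration only. A `None` entry in
    `declared` means "any size" for that dimension.
    """
    if len(declared) != len(actual):
        # Different rank: a scalar declaration cannot hold an array.
        return False
    return all(d is None or d == a for d, a in zip(declared, actual))


array_value_shape = (10,)  # each row of column 'v' holds 10 floats

# Declared as a scalar: rank 0 does not match a rank-1 value.
print(is_compatible((), array_value_shape))

# Declared as a 1-D array of unknown length: any length is accepted.
print(is_compatible((None,), array_value_shape))
```

Under this convention, a field declared as () can never match a length-10 array, while (None,) accepts arrays of any length.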

Test code:

import numpy as np
import tensorflow as tf
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

from petastorm.spark import make_spark_converter

# Reuse the active session (for example in the pyspark shell) or create one.
spark = SparkSession.builder.getOrCreate()
spark.conf.set('petastorm.spark.converter.parentCacheDirUrl', 'file:/tmp/converter')

@pandas_udf('array<float>')
def gen_array(v):
    # Produce a length-10 float array for every input row.
    return v.map(lambda x: np.random.rand(10))

df1 = spark.range(10).withColumn('v', gen_array('id')).repartition(2)
cv1 = make_spark_converter(df1)

# The shape of a one-dimensional array field is now inferred automatically.
# This snippet uses the TensorFlow 1.x session API.
with cv1.make_tf_dataset(batch_size=4, num_epochs=1) as dataset:
    iterator = dataset.make_one_shot_iterator()
    next_op = iterator.get_next()
    with tf.Session() as sess:
        for i in range(3):
            batch = sess.run(next_op)
            print(batch)

Before this change, the code raises an error like:

2020-03-25 12:02:26.230197: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at tensor_slice_dataset_op.cc:193 : Invalid argument: Incompatible shapes at component 1: expected [10] but got [].
2020-03-25 12:02:26.231091: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at tensor_slice_dataset_op.cc:193 : Invalid argument: Incompatible shapes at component 1: expected [10] but got [].
Traceback (most recent call last):
  File "/Users/weichenxu/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/Users/weichenxu/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/Users/weichenxu/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node __inference_Dataset_flat_map_from_tensor_slices_45}} Incompatible shapes at component 1: expected [10] but got [].
     [[{{node TensorSliceDataset}}]]
     [[IteratorGetNext]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 7, in <module>
  File "/Users/weichenxu/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/Users/weichenxu/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/Users/weichenxu/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/Users/weichenxu/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError:  Incompatible shapes at component 1: expected [10] but got [].
     [[{{node TensorSliceDataset}}]]
     [[IteratorGetNext]]

After this change, the code runs without error.

WeichenXu123 commented 4 years ago

Note:

With this PR, the reader generates a TensorFlow dataset field of shape (?, ?) for an array field.

If we run a Keras model without setting the shape on that field, we may still get an error like:

raise ValueError('The last dimension of the inputs to `Dense` '
ValueError: The last dimension of the inputs to `Dense` should be defined. Found `None`.

To make it work with Keras model.fit, we need to manually set the shape of the TensorFlow dataset field, for example:

def set_shape(x):
    x.features.set_shape((None, 784))
    return x
# map() returns a new dataset, so keep the result.
tf_dataset = tf_dataset.map(set_shape)

or

tf_dataset = tf_dataset.map(lambda x: (tf.reshape(x.features, shape=...), x.label))
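
As a rough illustration of why the last dimension has to be known, the following pure-Python sketch models the two steps above. Both helpers (`dense_kernel_shape` and `set_static_shape`) are hypothetical and are not Keras or TensorFlow code. The sketch assumes that a fully connected layer needs a weight matrix with one row per input feature, and that setting a static shape fills in unknown (`None`) dimensions.

```python
def dense_kernel_shape(input_shape, units):
    """Shape of the weight matrix a fully connected layer would need.

    Hypothetical helper, not Keras code: the matrix has one row per
    input feature, so the last input dimension must be known.
    """
    last_dim = input_shape[-1]
    if last_dim is None:
        raise ValueError("last dimension of the input must be defined")
    return (last_dim, units)


def set_static_shape(inferred, declared):
    """Fill unknown (None) dimensions of `inferred` from `declared`."""
    if len(inferred) != len(declared):
        raise ValueError("rank mismatch")
    merged = []
    for known, wanted in zip(inferred, declared):
        if known is not None and wanted is not None and known != wanted:
            raise ValueError("incompatible dimension")
        merged.append(known if known is not None else wanted)
    return tuple(merged)


inferred = (None, None)  # batched array field with unknown length
fixed = set_static_shape(inferred, (None, 784))
print(dense_kernel_shape(fixed, units=10))
```

With the inferred shape (None, None), `dense_kernel_shape` raises because it cannot size the weight matrix. Once the feature dimension is declared as 784, the weight shape is determined.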

codecov[bot] commented 4 years ago

Codecov Report

Merging #517 into master will increase coverage by 0.00%. The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #517   +/-   ##
=======================================
  Coverage   86.18%   86.18%           
=======================================
  Files          81       81           
  Lines        4465     4467    +2     
  Branches      717      717           
=======================================
+ Hits         3848     3850    +2     
  Misses        505      505           
  Partials      112      112           

Impacted Files            Coverage Δ
petastorm/unischema.py    94.76% <100.00%> (+0.05%)


WeichenXu123 commented 4 years ago

@selitvin The test from the PR description has been added as a unit test.

WeichenXu123 commented 4 years ago

@selitvin Spark 2.x has a compatibility issue with pyarrow>=0.15, so I skip the test on pyarrow>=0.15. But don't worry, we will soon upgrade to Spark 3.0 here, and the vector support also requires Spark 3.0: https://github.com/uber/petastorm/pull/521
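
A version gate of this kind could look roughly like the sketch below. The actual skip condition used in the PR is not shown in this thread, so the helper names (`parse_version`, `should_skip_array_test`) and the exact rule are assumptions for illustration only.

```python
def parse_version(text):
    """Parse the leading numeric parts of a version string.

    For example '0.15.1' becomes (0, 15, 1). Parsing stops at the
    first component that does not start with a digit.
    """
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def should_skip_array_test(pyarrow_version, spark_version):
    """Hypothetical skip rule: Spark 2.x combined with pyarrow >= 0.15."""
    spark_major = parse_version(spark_version)[0]
    return spark_major < 3 and parse_version(pyarrow_version) >= (0, 15)


print(should_skip_array_test("0.15.1", "2.4.5"))
print(should_skip_array_test("0.14.1", "2.4.5"))
```

In a pytest suite, a boolean like this would typically be passed to `pytest.mark.skipif` together with a reason string.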