intel-analytics / analytics-zoo

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
https://analytics-zoo.readthedocs.io/
Apache License 2.0

OverflowError encountered when calling `tf2.Estimator.predict()` #422

Open GZHoffie opened 3 years ago

GZHoffie commented 3 years ago

I wanted to use a tf2.Estimator with an LSTM network. The network looks like the following:

from tensorflow import keras
from tensorflow.keras import layers

def build_model(config):
    model = keras.Sequential()
    model.add(layers.LSTM(128, input_shape=(60, 57)))
    model.add(layers.Dense(57, activation='softmax'))
    optimizer = keras.optimizers.RMSprop(lr=0.01)
    model.compile(loss='categorical_crossentropy', optimizer=optimizer)
    return model

The input (x) of this network has shape (1, 60, 57), where each of the 60 arrays is a one-hot array indicating which character is present. The output of the network is a (1, 57) softmax array giving the probability of each of the 57 characters being the next character. The output is compared with a (1, 57) one-hot array during training.
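For concreteness, a minimal sketch of how such one-hot data might be constructed (the variable names and the random indices below are illustrative, not my actual preprocessing):

import numpy as np

SEQ_LEN, VOCAB_SIZE = 60, 57

# a hypothetical sequence of character indices, one per time step
char_indices = np.random.randint(0, VOCAB_SIZE, size=SEQ_LEN)

# one-hot encode the sequence into an input of shape (1, 60, 57)
x = np.zeros((1, SEQ_LEN, VOCAB_SIZE), dtype=np.float32)
x[0, np.arange(SEQ_LEN), char_indices] = 1.0

# one-hot target for the next character, shape (1, 57)
y = np.zeros((1, VOCAB_SIZE), dtype=np.float32)
y[0, np.random.randint(0, VOCAB_SIZE)] = 1.0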

And my Estimator is built as follows:

from zoo.orca.learn.tf2 import Estimator

est = Estimator.from_keras(model_creator=build_model,
                           config={},
                           workers_per_node=1,
                           verbose=0)

The Estimator works fine when training on my training set.
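For reference, training used the Estimator's fit API on XShards of numpy dicts; the sketch below uses placeholder names and hyper-parameters (train_shards, the epoch count, and the batch size are illustrative, not my exact values):

# train_shards: an XShards of {"x": ..., "y": ...} numpy arrays (placeholder name)
est.fit(data=train_shards,
        epochs=10,       # placeholder value
        batch_size=32)   # placeholder value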

However, when it comes to prediction, I used an np.array called `sampled` of shape (1, 60, 57), where again each of the 60 arrays is a one-hot array. I transformed it into XShards and made predictions using

from zoo.orca.data import XShards

sample_shards = XShards.partition({"x": sampled})
preds = est.predict(sample_shards)

But it does not work, and reports the following error:

/intern/spark/spark-2.4.3-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 (TID 9, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/intern/spark/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 377, in main
    process()
  File "/intern/spark/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 372, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/zhenhao/analytics-zoo/pyzoo/zoo/orca/data/ray_rdd.py", line 133, in <lambda>
    lambda idx, _: get_from_ray(idx, address, password, meta_store_name))
  File "/home/zhenhao/analytics-zoo/pyzoo/zoo/orca/data/ray_rdd.py", line 99, in get_from_ray
    partition = ray.get(object_id)
  File "/home/zhenhao/.local/lib/python3.7/site-packages/ray/worker.py", line 1513, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OverflowError): ray::TFRunner.predict() (pid=3152, ip=10.239.44.107)
  File "python/ray/_raylet.pyx", line 452, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 407, in ray._raylet.execute_task.function_executor
  File "/home/zhenhao/analytics-zoo/pyzoo/zoo/orca/learn/tf2/tf_runner.py", line 451, in predict
    new_part = [predict_fn(shard) for shard in partition]
  File "/home/zhenhao/analytics-zoo/pyzoo/zoo/orca/learn/tf2/tf_runner.py", line 451, in <listcomp>
    new_part = [predict_fn(shard) for shard in partition]
  File "/home/zhenhao/analytics-zoo/pyzoo/zoo/orca/learn/tf2/tf_runner.py", line 448, in predict_fn
    y = local_model.predict(shard["x"], **params)
  File "/home/zhenhao/.local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 130, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/home/zhenhao/.local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 1613, in predict
    callbacks.on_predict_end()
  File "/home/zhenhao/.local/lib/python3.7/site-packages/tensorflow/python/keras/callbacks.py", line 582, in on_predict_end
    callback.on_predict_end(logs)
  File "/home/zhenhao/.local/lib/python3.7/site-packages/tensorflow/python/keras/callbacks.py", line 979, in on_predict_end
    self._finalize_progbar(logs)
  File "/home/zhenhao/.local/lib/python3.7/site-packages/tensorflow/python/keras/callbacks.py", line 1026, in _finalize_progbar
    self.progbar.update(self.seen, list(logs.items()), finalize=True)
  File "/home/zhenhao/.local/lib/python3.7/site-packages/tensorflow/python/keras/utils/generic_utils.py", line 581, in update
    numdigits = int(np.log10(self.target)) + 1
OverflowError: cannot convert float infinity to integer

For the code and the full error message, you can refer to this notebook. What might be the problem? Thank you!
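Looking at the last frames of the traceback, the failure appears to come from Keras' progress bar rather than from the model itself: if a shard contains zero samples, the progress bar's target is 0, np.log10(0) is -inf, and casting that to int raises the OverflowError. A standalone reproduction of just that arithmetic:

import numpy as np

target = 0                               # number of samples in an empty shard
numdigits = int(np.log10(target)) + 1    # OverflowError: cannot convert float infinity to integer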

GZHoffie commented 3 years ago

Resolved by collecting sampled arrays of shape (1, 60, 57) to form a np.array of shape (4, 60, 57) or larger, so that no shard is empty. (XShards.partition splits the array into 4 parts in my case.)
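A rough sketch of this workaround, assuming a single `sampled` array of shape (1, 60, 57) and reusing the `est` and XShards from above; the tiling factor 4 matches the number of partitions I observed:

import numpy as np

# tile the single sample into 4 identical copies so that each of the
# 4 partitions created by XShards.partition receives at least one row
padded = np.tile(sampled, (4, 1, 1))   # shape (4, 60, 57)

sample_shards = XShards.partition({"x": padded})
preds = est.predict(sample_shards)     # no shard is empty, so predict succeeds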

jason-dai commented 3 years ago

> Resolved by collecting sampled arrays of shape (1, 60, 57) to form a np.array of shape (4, 60, 57) or larger, so that no shard is empty. (XShards.partition splits the array into 4 parts in my case.)

Shall we check and skip empty shards? @jenniew

jenniew commented 3 years ago

Yes, we will need to check for empty shards in TFRunner.
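For reference, a hypothetical sketch of such a guard around the predict_fn call in tf_runner.py (illustrative only, not the actual patch):

# inside TFRunner.predict(): drop shards that carry no samples, so Keras'
# progress bar never sees a target of 0 and the OverflowError is avoided
new_part = [predict_fn(shard) for shard in partition
            if shard["x"] is not None and len(shard["x"]) > 0]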