Open yangw1234 opened 1 year ago
@lalalapotter Take a look at this?
From my testing, the issue seems to come from `tf.cond` here: https://github.com/intel-analytics/BigDL/blob/main/python/orca/example/learn/tf2/yolov3/yoloV3.py#L91
If I remove `tf.cond` and just use `reduce(y_true, anchor_eq, grid_size)`, it works.
@yangw1234 Any ideas on this? It seems `tf.cond` doesn't accept a Python bool as the first argument, but the error says "Using a symbolic `tf.Tensor` as a Python `bool` is not allowed", which is quite strange...
As I understand it, `@tf.function` compiles Python boolean operations into TensorFlow graph operations. The error seems to indicate that `@tf.function` does not take effect when we deserialize the function.
I have done two experiments:
model = model_creator({})
dataset = data_creator({}, batch_size)
model.fit(dataset)
I remember that cloudpickle does something special with functions defined in a notebook (whose module is `__main__`). I guess this might be where the problem comes from.
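To make the `@tf.function` point above concrete, here is a minimal sketch (not taken from the example code) that runs the same Python `and` / `not` expression with AutoGraph enabled and disabled. The function name `has_match` and the scalar `flag`, which stands in for `y_true[i][j][2]`, are hypothetical. Disabling AutoGraph here is only an assumption used to mimic the case where the decorator "does not take effect"; it is not a confirmed description of what happens after deserialization.

```python
import tensorflow as tf


def has_match(anchor_eq, flag):
    # Python `and` / `not` applied to tensors, as in the original expression.
    return tf.reduce_any(anchor_eq) and not tf.equal(flag, 0)


# AutoGraph is on by default and rewrites `and` / `not` into graph ops.
with_autograph = tf.function(has_match)
# With AutoGraph off, Python tries to call bool() on a symbolic tensor.
without_autograph = tf.function(has_match, autograph=False)

anchor_eq = tf.constant([False, True])
flag = tf.constant(1.0)

result = with_autograph(anchor_eq, flag)  # scalar boolean tensor

try:
    without_autograph(anchor_eq, flag)
    raised = False
except tf.errors.OperatorNotAllowedInGraphError:
    # Same error class as reported in this issue.
    raised = True

print(result.numpy(), raised)
```

If this sketch behaves the same way in the notebook environment, it would support the idea that the AutoGraph conversion is being skipped somewhere along the serialization path.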
It seems we can use `tf.math.logical_and(tf.reduce_any(anchor_eq), tf.math.logical_not(tf.equal(y_true[i][j][2], 0)))` to replace the expression `tf.reduce_any(anchor_eq) and not tf.equal(y_true[i][j][2], 0)`. The error `OperatorNotAllowedInGraphError` is caused by mixing Python and TensorFlow APIs (refer to link).
Besides, I also have two concerns:
- Why does using `model_creator` and `data_creator` directly work fine? It also worked in my tests.
Using the `@tf.function` decorator should "compile" Python's `and`, `or`, and `not` operations into TensorFlow graph operations such as `tf.math.logical_and` and `tf.math.logical_not`, and that is why using `model_creator` and `data_creator` directly works. The whole script also works when we run it with `python yolov3.py`.
So the question is why it does not work in the notebook.
- As the link describes, the error should occur when eager execution is disabled. However, in TF2 eager execution is active by default, and I have double-checked that as well. So why do we encounter the error in our example?
Notebook: https://colab.research.google.com/drive/1flIDO5FUS0iHofKju-7dbAARM8V9qRd3?usp=sharing
We can first change our code as a workaround so that users won't have this problem when running our example.
Investigating this Ray and Colab issue is low priority (it is not the focus of our work).
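A rough sketch of what the workaround could look like, using the explicit `tf.math` ops suggested earlier in this thread so that the predicate passed to `tf.cond` never goes through Python's `and` / `not`. The helper names `should_update` and `select`, the scalar `flag` (standing in for `y_true[i][j][2]`), and the branch values are hypothetical and not taken from `yoloV3.py`. `autograph=False` is used only to show that this form does not rely on AutoGraph rewriting.

```python
import tensorflow as tf


def should_update(anchor_eq, flag):
    # Explicit graph ops instead of Python `and` / `not`.
    return tf.math.logical_and(
        tf.reduce_any(anchor_eq),
        tf.math.logical_not(tf.equal(flag, 0)),
    )


@tf.function(autograph=False)
def select(anchor_eq, flag, if_true, if_false):
    # The predicate is a boolean tensor, which is what tf.cond expects.
    pred = should_update(anchor_eq, flag)
    return tf.cond(pred, lambda: if_true, lambda: if_false)


anchor_eq = tf.constant([False, True])
hit = select(anchor_eq, tf.constant(1.0), tf.constant(10), tf.constant(-1))
miss = select(anchor_eq, tf.constant(0.0), tf.constant(10), tf.constant(-1))

print(hit.numpy(), miss.numpy())
```

Because the predicate is built only from TensorFlow ops, it should behave the same whether or not AutoGraph gets to process the function.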
Issue raised by bigdl-user-group: https://groups.google.com/g/bigdl-user-group/c/EFmpV6yWzYw
Example:
https://github.com/intel-analytics/BigDL/tree/main/python/orca/example/learn/tf2/yolov3
Error message:
Potential root cause:
I guess the problem is caused by serializing a `@tf.function`-decorated function defined in a notebook cell. After I moved the function definition into a `.py` file, it worked.
Notebook: https://colab.research.google.com/drive/1odHW_qXNk2TcS2r443YRaRmZtc41CdZV?usp=sharing