intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.

[orca] yolov3 example does not work on colab notebook #7349

Open · yangw1234 opened this issue 1 year ago

yangw1234 commented 1 year ago

Issue raised in the bigdl-user-group: https://groups.google.com/g/bigdl-user-group/c/EFmpV6yWzYw

Example:

https://github.com/intel-analytics/BigDL/tree/main/python/orca/example/learn/tf2/yolov3

Error message:

tensorflow.python.framework.errors_impl.OperatorNotAllowedInGraphError: Using a symbolic `tf.Tensor` as a Python `bool` is not allowed: AutoGraph did convert this function. This might indicate you are trying to use an unsupported feature.

Potential root cause: I guess the problem is caused by serializing a `@tf.function`-decorated function that is defined in a notebook cell.

After I moved the function definition into a .py file, it worked.

Notebook: https://colab.research.google.com/drive/1odHW_qXNk2TcS2r443YRaRmZtc41CdZV?usp=sharing

hkvision commented 1 year ago

@lalalapotter Take a look at this?

hkvision commented 1 year ago

From my testing, the issue seems to come from the `tf.cond` call at https://github.com/intel-analytics/BigDL/blob/main/python/orca/example/learn/tf2/yolov3/yoloV3.py#L91. If I remove `tf.cond` and just use `reduce(y_true, anchor_eq, grid_size)`, it works.

@yangw1234 Any ideas on this? It seems `tf.cond` doesn't accept a Python bool as the first argument, yet the error says "Using a symbolic `tf.Tensor` as a Python `bool` is not allowed", which is quite strange...

yangw1234 commented 1 year ago

> From my testing, the issue seems to come from the `tf.cond` call at https://github.com/intel-analytics/BigDL/blob/main/python/orca/example/learn/tf2/yolov3/yoloV3.py#L91. If I remove `tf.cond` and just use `reduce(y_true, anchor_eq, grid_size)`, it works.
>
> @yangw1234 Any ideas on this? It seems `tf.cond` doesn't accept a Python bool as the first argument, yet the error says "Using a symbolic `tf.Tensor` as a Python `bool` is not allowed", which is quite strange...

As I understand it, `@tf.function` will compile the Python bool operations into TensorFlow graph operations. The error seems to indicate that `@tf.function` does not take effect after we deserialize the function.
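
To illustrate, here is a minimal sketch (my own repro, not code from the example; the function and variable names are made up): the same Python `and`/`not` on symbolic tensors traces fine when AutoGraph conversion kicks in, and raises the very `OperatorNotAllowedInGraphError` from this issue when it does not (simulated here by disabling AutoGraph explicitly).

```python
import tensorflow as tf

def mixed_bool(x):
    # Python `and`/`not` applied to symbolic tensors: inside a tf.function this
    # only works if AutoGraph rewrites the boolean operators into graph ops.
    return x > 0 and not tf.equal(x, 1)

# With AutoGraph (the default), tracing succeeds and returns a bool tensor.
converted = tf.function(mixed_bool)
print(converted(tf.constant(2.0)))  # tf.Tensor(True, shape=(), dtype=bool)

# With AutoGraph disabled, Python calls bool() on a symbolic tensor during
# tracing, reproducing the error class reported in this issue.
unconverted = tf.function(mixed_bool, autograph=False)
try:
    unconverted(tf.constant(2.0))
except Exception as err:
    print(type(err).__name__)  # OperatorNotAllowedInGraphError
```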

I have done two experiments:

  1. Directly call model.fit in the notebook: this worked, so the way we write this function is correct.
    model = model_creator({})
    dataset = data_creator({}, batch_size)
    model.fit(dataset)
  2. Move the `@tf.function`-decorated function into a separate .py file and import it in the notebook: this also worked, so the problem is specific to the notebook.

I remember that cloudpickle does something special to functions defined in a notebook (whose module is `__main__`). I guess this might be where the problem comes from.
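
As a reference point, here is a minimal sketch of that cloudpickle behavior (my own illustration; `notebook_like_fn` is a made-up stand-in): functions living in `__main__` (scripts and notebook cells) are pickled by value, while importable library functions are pickled by reference.

```python
import os

import cloudpickle

def notebook_like_fn(x):
    # When this runs as a script or in a notebook cell, the function's module is
    # __main__, so cloudpickle serializes it by value (its code object is embedded).
    return x + 1

by_value = cloudpickle.dumps(notebook_like_fn)
# An importable library function is serialized by reference: just module + name.
by_reference = cloudpickle.dumps(os.path.getsize)

print(len(by_value), len(by_reference))  # the by-value payload is noticeably larger
```

If the `@tf.function` wrapper or its AutoGraph state does not fully survive that by-value round trip, that would match the symptom of the decorator not taking effect after deserialization.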

lalalapotter commented 1 year ago

It seems we can use `tf.math.logical_and(tf.reduce_any(anchor_eq), tf.math.logical_not(tf.equal(y_true[i][j][2], 0)))` to replace the expression `tf.reduce_any(anchor_eq) and not tf.equal(y_true[i][j][2], 0)`. The `OperatorNotAllowedInGraphError` is caused by the mixed usage of Python and TensorFlow APIs (refer to the link).
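
For completeness, here is a self-contained sketch of the replacement, with dummy tensors standing in for `anchor_eq` and `y_true[i][j][2]` (the values and shapes below are illustrative only, not taken from the example):

```python
import tensorflow as tf

# Dummy stand-ins for the tensors used in the example's target-transform logic.
anchor_eq = tf.constant([False, True, False])
obj = tf.constant(0.7)  # plays the role of y_true[i][j][2]

# Graph-safe condition: both operands stay symbolic tensors, so it can serve as
# the predicate of tf.cond inside a @tf.function without relying on AutoGraph
# to rewrite Python `and`/`not`.
cond = tf.math.logical_and(
    tf.reduce_any(anchor_eq),
    tf.math.logical_not(tf.equal(obj, 0)),
)
print(cond)  # tf.Tensor(True, shape=(), dtype=bool)
```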

Besides, I also have two concerns:

  1. Why does directly using `model_creator` and `data_creator` work? It also worked in my tests.
  2. As the link describes, the error should only occur when eager execution is disabled; however, in TF2 eager execution should be active by default, and I have double-checked that it is (see the probe sketched below). So why can the error still be encountered in our example?
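
For reference, a minimal probe (my own sketch, not code from the example) of eager mode at the top level versus inside a traced function:

```python
import tensorflow as tf

# At the top level, TF2 indeed runs eagerly by default.
print(tf.executing_eagerly())  # True

@tf.function
def probe():
    # Inside a tf.function the body is traced as a graph, so eager execution is
    # reported as disabled here, even though it is enabled globally.
    return tf.constant(tf.executing_eagerly())

print(probe())  # tf.Tensor(False, shape=(), dtype=bool)
```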

Notebook: https://colab.research.google.com/drive/1flIDO5FUS0iHofKju-7dbAARM8V9qRd3?usp=sharing

yangw1234 commented 1 year ago

> It seems we can use `tf.math.logical_and(tf.reduce_any(anchor_eq), tf.math.logical_not(tf.equal(y_true[i][j][2], 0)))` to replace the expression `tf.reduce_any(anchor_eq) and not tf.equal(y_true[i][j][2], 0)`. The `OperatorNotAllowedInGraphError` is caused by the mixed usage of Python and TensorFlow APIs (refer to the link).
>
> Besides, I also have two concerns:
>
> 1. Why does directly using `model_creator` and `data_creator` work? It also worked in my tests.

The `@tf.function` decorator should be able to "compile" Python's `and`/`not` operations into TensorFlow graph operations such as `tf.math.logical_and` and `tf.math.logical_not`, and that is why the direct usage of `model_creator` and `data_creator` works. The whole script also works when we run it with `python yolov3.py`.

So the question is why it does not work in the notebook.

> 2. As the link describes, the error should only occur when eager execution is disabled; however, in TF2 eager execution should be active by default, and I have double-checked that it is. So why can the error still be encountered in our example?
>
> Notebook: https://colab.research.google.com/drive/1flIDO5FUS0iHofKju-7dbAARM8V9qRd3?usp=sharing

hkvision commented 1 year ago

We can first change our code as a workaround so that users won't have this problem when running our example.

Investigating this underlying issue with Ray and Colab is of low priority (it is not the focus of our work).