recommenders-team / recommenders

Best Practices on Recommendation Systems
https://recommenders-team.github.io/recommenders/intro.html
MIT License
19.12k stars 3.09k forks source link

Resolve issue #2018 #2022

Closed SimonYansenZhao closed 8 months ago

SimonYansenZhao commented 1 year ago

Description

This PR resolves the issue #2018 by commenting out the tests in tests/ci/azureml_tests/test_groups.py to disable testing for deeprec.

Related Issues

References

Checklist:

SimonYansenZhao commented 1 year ago

First, I got NameError: name 'XDeepFMModel' is not defined (See https://github.com/recommenders-team/recommenders/actions/runs/6544943740/job/17772448740#step:3:2992).

I think something might be wrong when importing XDeepFMModel, then after removing the try-except in the commit (https://github.com/recommenders-team/recommenders/pull/2022/commits/88e43a5f112c411cf6801b356ee690f2a0ab5cb4#) (See https://github.com/recommenders-team/recommenders/actions/runs/6545498077/job/17774104408?pr=2022#step:3:3012) I got ModuleNotFoundError: No module named 'keras.layers.legacy_rnn' .

It means Keras.layers.legacy_rnn imported in https://github.com/recommenders-team/recommenders/blob/b000b78ceb3cbe52a0200922f2b2412d830274af/recommenders/models/deeprec/models/sequential/sum_cells.py#L6 is removed from Keras. Need to find a way to replace it.

FYI @miguelgfierro @anargyri

anargyri commented 1 year ago

First, I got NameError: name 'XDeepFMModel' is not defined (See https://github.com/recommenders-team/recommenders/actions/runs/6544943740/job/17772448740#step:3:2992).

I think something might be wrong when importing XDeepFMModel, then after removing the try-except in the commit (88e43a5) (See https://github.com/recommenders-team/recommenders/actions/runs/6545498077/job/17774104408?pr=2022#step:3:3012) I got ModuleNotFoundError: No module named 'keras.layers.legacy_rnn' .

It means Keras.layers.legacy_rnn imported in

https://github.com/recommenders-team/recommenders/blob/b000b78ceb3cbe52a0200922f2b2412d830274af/recommenders/models/deeprec/models/sequential/sum_cells.py#L6

is removed from Keras. Need to find a way to replace it. FYI @miguelgfierro @anargyri

It looks like this class is deprecated in the new versions; if I recall, it was one of the temporary fixes when we upgraded from v1 to v2. Now you would have to rewrite the sum cell in the new syntax using this, which may not be straightforward. Maybe we should migrate the entire notebook to PyTorch. @Leavingseason what do you think?

That said, I see the code is still in TF master https://github.com/tensorflow/tensorflow/tree/r2.14/tensorflow/python/keras/layers/legacy_rnn They stopped exposing the module but it is still used internally e.g. here I wonder if copying and pasting the legacy_rnn code in our repo would work (provided that the TF license allows for this).

It looks like the conda env is installing 2.13. How about restricting to < 2.13 or < 2.12, do you still get the error? If not, we could restrict the TF version from above.

anargyri commented 1 year ago

@SimonYansenZhao I checked locally with different versions of TF and this import from keras.layers.legacy_rnn.rnn_cell_impl import LayerRNNCell only works up to (and including) version 2.8.

anargyri commented 1 year ago

We need to discuss a bit more how to proceed, it is not clear to me what is the best way forward. For now, let's bound TF with < 2.9.

miguelgfierro commented 1 year ago

pinged Jianxun on teams

Master-PLC commented 1 year ago

Hi, seen from SimonYansenZhao's discovery, when the TF>=2.9, the reference in the file recommenders/models/deeprec/models/sequential/sum_cells.py, from keras.layers.legacy_rnn.rnn_cell_impl import LayerRNNCell becomes invalid, which causes the XDeepFMModel to fail to import and run.

Upon examining the LayerRNNCell and RNNCell classes in the tf.keras.layers.legacy_rnn.rnn_cell_impl file, I found that LayerRNNCell simply inherits from the RNNCell class. When calling, it reduces one step of building scope compared to RNNCell, so we can directly use "from keras.layers.legacy_rnn.rnn_cell_impl import RNNCell" as the parent class of SUM_Cell.

In TF>=2.9, the RNNCell class has been deprecated, but it has been rewritten as AbstractRNNCell in tf.keras.layers.recurrent with similar implementation and comments.

Try replace "from keras.layers.legacy_rnn.rnn_cell_impl import LayerRNNCell" with: try: from keras.layers.legacy_rnn.rnn_cell_impl import LayerRNNCell except: from keras.layers.recurrent import AbstractRNNCell as LayerRNNCell to see whether it works.

SimonYansenZhao commented 9 months ago

I got the following error when adding from keras.layers.rnn import AbstractRNNCell as LayerRNNCell. It looks like that AbstractRNNCell does not have _reuse. @Master-PLC

Screenshot 2024-01-22 at 22 40 36

SimonYansenZhao commented 9 months ago

When using the latest TensorFlow (2.15), I got this error:

E               713 try:
E               714   # pylint: disable=protected-access
E               715   with self._graph._c_graph.get() as c_graph:
E           --> 716     self._session = tf_session.TF_NewSessionRef(c_graph,opts)
E               717   # pylint: enable=protected-access
E               718 finally:
E               719   tf_session.TF_DeleteSessionOptions(opts)
E           
E           InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version

It means CUDA driver version used in the VM is lower than the requirement of TensorFlow. I am trying to figure out how to do that. @miguelgfierro @anargyri

anargyri commented 9 months ago

When using the latest TensorFlow (2.15), I got this error:

E               713 try:
E               714   # pylint: disable=protected-access
E               715   with self._graph._c_graph.get() as c_graph:
E           --> 716     self._session = tf_session.TF_NewSessionRef(c_graph,opts)
E               717   # pylint: enable=protected-access
E               718 finally:
E               719   tf_session.TF_DeleteSessionOptions(opts)
E           
E           InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version

It means CUDA driver version used in the VM is lower than the requirement of TensorFlow. I am trying to figure out how to do that. @miguelgfierro @anargyri

Yes, because the latest TF requires cuda 12, previous ones required cuda 11 or less. See https://www.tensorflow.org/install/source#gpu

miguelgfierro commented 9 months ago

@SimonYansenZhao if we upgrade the docker image in the test to cuda 12, will the code with previous versions of cuda 11 work?

SimonYansenZhao commented 8 months ago

@SimonYansenZhao if we upgrade the docker image in the test to cuda 12, will the code with previous versions of cuda 11 work?

I think it will work. Only the NVIDIA GPU Driver in the CUDA installed in the Docker image is used when testing, because a new Conda env will be created with a CUDA required by Recommenders, the CUDA in this Conda env will not conflict with the CUDA in the Docker image. The CUDA installed in the Conda env will interact with the driver which is back compatible.