Resolve issue #2018 - Githubissues

SimonYansenZhao commented 1 year ago

Description

This PR resolves the issue #2018 by commenting out the tests in tests/ci/azureml_tests/test_groups.py to disable testing for deeprec.

Related Issues

2018

References

TensorFlow
- TensorFlow release history@PyPI
- Requirements for older TensorFlow versions
Versioning
- PEP 440 – Version Identification and Dependency Specification
- Version comparison

Checklist:

[X] I have followed the contribution guidelines and code style for this project.
[ ] I have added tests covering my contributions.
[ ] I have updated the documentation accordingly.
[X] This PR is being made to staging branch and not to main branch.

SimonYansenZhao commented 1 year ago

First, I got NameError: name 'XDeepFMModel' is not defined (See https://github.com/recommenders-team/recommenders/actions/runs/6544943740/job/17772448740#step:3:2992).

I think something might be wrong when importing XDeepFMModel, then after removing the try-except in the commit (https://github.com/recommenders-team/recommenders/pull/2022/commits/88e43a5f112c411cf6801b356ee690f2a0ab5cb4#) (See https://github.com/recommenders-team/recommenders/actions/runs/6545498077/job/17774104408?pr=2022#step:3:3012) I got ModuleNotFoundError: No module named 'keras.layers.legacy_rnn' .

It means Keras.layers.legacy_rnn imported in https://github.com/recommenders-team/recommenders/blob/b000b78ceb3cbe52a0200922f2b2412d830274af/recommenders/models/deeprec/models/sequential/sum_cells.py#L6 is removed from Keras. Need to find a way to replace it.

FYI @miguelgfierro @anargyri

anargyri commented 1 year ago

First, I got NameError: name 'XDeepFMModel' is not defined (See https://github.com/recommenders-team/recommenders/actions/runs/6544943740/job/17772448740#step:3:2992).

I think something might be wrong when importing XDeepFMModel, then after removing the try-except in the commit (88e43a5) (See https://github.com/recommenders-team/recommenders/actions/runs/6545498077/job/17774104408?pr=2022#step:3:3012) I got ModuleNotFoundError: No module named 'keras.layers.legacy_rnn' .

It means Keras.layers.legacy_rnn imported in

https://github.com/recommenders-team/recommenders/blob/b000b78ceb3cbe52a0200922f2b2412d830274af/recommenders/models/deeprec/models/sequential/sum_cells.py#L6

is removed from Keras. Need to find a way to replace it. FYI @miguelgfierro @anargyri

It looks like this class is deprecated in the new versions; if I recall, it was one of the temporary fixes when we upgraded from v1 to v2. Now you would have to rewrite the sum cell in the new syntax using this, which may not be straightforward. Maybe we should migrate the entire notebook to PyTorch. @Leavingseason what do you think?

That said, I see the code is still in TF master https://github.com/tensorflow/tensorflow/tree/r2.14/tensorflow/python/keras/layers/legacy_rnn They stopped exposing the module but it is still used internally e.g. here I wonder if copying and pasting the legacy_rnn code in our repo would work (provided that the TF license allows for this).

It looks like the conda env is installing 2.13. How about restricting to < 2.13 or < 2.12, do you still get the error? If not, we could restrict the TF version from above.

anargyri commented 1 year ago

@SimonYansenZhao I checked locally with different versions of TF and this import from keras.layers.legacy_rnn.rnn_cell_impl import LayerRNNCell only works up to (and including) version 2.8.

anargyri commented 1 year ago

We need to discuss a bit more how to proceed, it is not clear to me what is the best way forward. For now, let's bound TF with < 2.9.

miguelgfierro commented 1 year ago

pinged Jianxun on teams

Master-PLC commented 1 year ago

Hi, seen from SimonYansenZhao's discovery, when the TF>=2.9, the reference in the file recommenders/models/deeprec/models/sequential/sum_cells.py, from keras.layers.legacy_rnn.rnn_cell_impl import LayerRNNCell becomes invalid, which causes the XDeepFMModel to fail to import and run.

Dive into LayerRNNCell

Upon examining the LayerRNNCell and RNNCell classes in the tf.keras.layers.legacy_rnn.rnn_cell_impl file, I found that LayerRNNCell simply inherits from the RNNCell class. When calling, it reduces one step of building scope compared to RNNCell, so we can directly use "from keras.layers.legacy_rnn.rnn_cell_impl import RNNCell" as the parent class of SUM_Cell.

Alternative for LayerRNNCell

In TF>=2.9, the RNNCell class has been deprecated, but it has been rewritten as AbstractRNNCell in tf.keras.layers.recurrent with similar implementation and comments.

Suggested solution

Try replace "from keras.layers.legacy_rnn.rnn_cell_impl import LayerRNNCell" with: try: from keras.layers.legacy_rnn.rnn_cell_impl import LayerRNNCell except: from keras.layers.recurrent import AbstractRNNCell as LayerRNNCell to see whether it works.

SimonYansenZhao commented 9 months ago

I got the following error when adding from keras.layers.rnn import AbstractRNNCell as LayerRNNCell. It looks like that AbstractRNNCell does not have _reuse. @Master-PLC

Screenshot 2024-01-22 at 22 40 36

SimonYansenZhao commented 9 months ago

When using the latest TensorFlow (2.15), I got this error:

E               713 try:
E               714   # pylint: disable=protected-access
E               715   with self._graph._c_graph.get() as c_graph:
E           --> 716     self._session = tf_session.TF_NewSessionRef(c_graph,opts)
E               717   # pylint: enable=protected-access
E               718 finally:
E               719   tf_session.TF_DeleteSessionOptions(opts)
E           
E           InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version

It means CUDA driver version used in the VM is lower than the requirement of TensorFlow. I am trying to figure out how to do that. @miguelgfierro @anargyri

anargyri commented 9 months ago

When using the latest TensorFlow (2.15), I got this error:

E               713 try:
E               714   # pylint: disable=protected-access
E               715   with self._graph._c_graph.get() as c_graph:
E           --> 716     self._session = tf_session.TF_NewSessionRef(c_graph,opts)
E               717   # pylint: enable=protected-access
E               718 finally:
E               719   tf_session.TF_DeleteSessionOptions(opts)
E           
E           InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version

It means CUDA driver version used in the VM is lower than the requirement of TensorFlow. I am trying to figure out how to do that. @miguelgfierro @anargyri

Yes, because the latest TF requires cuda 12, previous ones required cuda 11 or less. See https://www.tensorflow.org/install/source#gpu

miguelgfierro commented 9 months ago

@SimonYansenZhao if we upgrade the docker image in the test to cuda 12, will the code with previous versions of cuda 11 work?

SimonYansenZhao commented 8 months ago

@SimonYansenZhao if we upgrade the docker image in the test to cuda 12, will the code with previous versions of cuda 11 work?

I think it will work. Only the NVIDIA GPU Driver in the CUDA installed in the Docker image is used when testing, because a new Conda env will be created with a CUDA required by Recommenders, the CUDA in this Conda env will not conflict with the CUDA in the Docker image. The CUDA installed in the Conda env will interact with the driver which is back compatible.

recommenders-team / recommenders

Resolve issue #2018 #2022

Description

Related Issues

2018

References

Checklist: