Closed SimonYansenZhao closed 8 months ago
First, I got NameError: name 'XDeepFMModel' is not defined
(See https://github.com/recommenders-team/recommenders/actions/runs/6544943740/job/17772448740#step:3:2992).
I think something might be wrong when importing XDeepFMModel
, then after removing the try-except
in the commit (https://github.com/recommenders-team/recommenders/pull/2022/commits/88e43a5f112c411cf6801b356ee690f2a0ab5cb4#) (See https://github.com/recommenders-team/recommenders/actions/runs/6545498077/job/17774104408?pr=2022#step:3:3012) I got ModuleNotFoundError: No module named 'keras.layers.legacy_rnn'
.
It means Keras.layers.legacy_rnn
imported in https://github.com/recommenders-team/recommenders/blob/b000b78ceb3cbe52a0200922f2b2412d830274af/recommenders/models/deeprec/models/sequential/sum_cells.py#L6 is removed from Keras. Need to find a way to replace it.
FYI @miguelgfierro @anargyri
First, I got
NameError: name 'XDeepFMModel' is not defined
(See https://github.com/recommenders-team/recommenders/actions/runs/6544943740/job/17772448740#step:3:2992).I think something might be wrong when importing
XDeepFMModel
, then after removing thetry-except
in the commit (88e43a5) (See https://github.com/recommenders-team/recommenders/actions/runs/6545498077/job/17774104408?pr=2022#step:3:3012) I gotModuleNotFoundError: No module named 'keras.layers.legacy_rnn'
.It means
Keras.layers.legacy_rnn
imported inis removed from Keras. Need to find a way to replace it. FYI @miguelgfierro @anargyri
It looks like this class is deprecated in the new versions; if I recall, it was one of the temporary fixes when we upgraded from v1 to v2. Now you would have to rewrite the sum cell in the new syntax using this, which may not be straightforward. Maybe we should migrate the entire notebook to PyTorch. @Leavingseason what do you think?
That said, I see the code is still in TF master
https://github.com/tensorflow/tensorflow/tree/r2.14/tensorflow/python/keras/layers/legacy_rnn
They stopped exposing the module but it is still used internally e.g. here
I wonder if copying and pasting the legacy_rnn
code in our repo would work (provided that the TF license allows for this).
It looks like the conda env is installing 2.13. How about restricting to < 2.13 or < 2.12, do you still get the error? If not, we could restrict the TF version from above.
@SimonYansenZhao I checked locally with different versions of TF and this import
from keras.layers.legacy_rnn.rnn_cell_impl import LayerRNNCell
only works up to (and including) version 2.8.
We need to discuss a bit more how to proceed, it is not clear to me what is the best way forward. For now, let's bound TF with < 2.9.
pinged Jianxun on teams
Hi, seen from SimonYansenZhao's discovery, when the TF>=2.9, the reference in the file recommenders/models/deeprec/models/sequential/sum_cells.py, from keras.layers.legacy_rnn.rnn_cell_impl import LayerRNNCell
becomes invalid, which causes the XDeepFMModel to fail to import and run.
Upon examining the LayerRNNCell and RNNCell classes in the tf.keras.layers.legacy_rnn.rnn_cell_impl file, I found that LayerRNNCell simply inherits from the RNNCell class. When calling, it reduces one step of building scope compared to RNNCell, so we can directly use "from keras.layers.legacy_rnn.rnn_cell_impl import RNNCell" as the parent class of SUM_Cell.
In TF>=2.9, the RNNCell class has been deprecated, but it has been rewritten as AbstractRNNCell in tf.keras.layers.recurrent with similar implementation and comments.
Try replace "from keras.layers.legacy_rnn.rnn_cell_impl import LayerRNNCell" with:
try: from keras.layers.legacy_rnn.rnn_cell_impl import LayerRNNCell except: from keras.layers.recurrent import AbstractRNNCell as LayerRNNCell
to see whether it works.
I got the following error when adding from keras.layers.rnn import AbstractRNNCell as LayerRNNCell
. It looks like that AbstractRNNCell
does not have _reuse
. @Master-PLC
When using the latest TensorFlow (2.15), I got this error:
E 713 try:
E 714 # pylint: disable=protected-access
E 715 with self._graph._c_graph.get() as c_graph:
E --> 716 self._session = tf_session.TF_NewSessionRef(c_graph,opts)
E 717 # pylint: enable=protected-access
E 718 finally:
E 719 tf_session.TF_DeleteSessionOptions(opts)
E
E InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version
It means CUDA driver version used in the VM is lower than the requirement of TensorFlow. I am trying to figure out how to do that. @miguelgfierro @anargyri
When using the latest TensorFlow (2.15), I got this error:
E 713 try: E 714 # pylint: disable=protected-access E 715 with self._graph._c_graph.get() as c_graph: E --> 716 self._session = tf_session.TF_NewSessionRef(c_graph,opts) E 717 # pylint: enable=protected-access E 718 finally: E 719 tf_session.TF_DeleteSessionOptions(opts) E E InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version
It means CUDA driver version used in the VM is lower than the requirement of TensorFlow. I am trying to figure out how to do that. @miguelgfierro @anargyri
Yes, because the latest TF requires cuda 12, previous ones required cuda 11 or less. See https://www.tensorflow.org/install/source#gpu
@SimonYansenZhao if we upgrade the docker image in the test to cuda 12, will the code with previous versions of cuda 11 work?
@SimonYansenZhao if we upgrade the docker image in the test to cuda 12, will the code with previous versions of cuda 11 work?
I think it will work. Only the NVIDIA GPU Driver in the CUDA installed in the Docker image is used when testing, because a new Conda env will be created with a CUDA required by Recommenders, the CUDA in this Conda env will not conflict with the CUDA in the Docker image. The CUDA installed in the Conda env will interact with the driver which is back compatible.
Description
This PR resolves the issue #2018 by commenting out the tests in tests/ci/azureml_tests/test_groups.py to disable testing for deeprec.
Related Issues
2018
References
Checklist:
staging branch
and not tomain branch
.