rwth-i6 / returnn

The RWTH extensible training framework for universal recurrent neural networks
http://returnn.readthedocs.io/

Compilation of custom operations failing on TF 2.15/CUDA 12 #1523

Open Icemole opened 2 months ago

Icemole commented 2 months ago

Hi, the compilation of NativeLstm2.cc is failing with TF 2.15/CUDA 12, whereas it worked with TF 2.13/CUDA 11. A colleague of mine is having similar issues when compiling GetCtcFsaFastBwOp.cc.

The compiler throws many errors, but most of them look rather "silly", for example:

  1. error: expected a ";"
  2. error: function "Ndarray_get_n_total_elements" has already been defined
  3. error: name followed by "::" must be a class or namespace name

This leads me to think that the nvcc compiler might be mishandling something here, and as a consequence that the operations don't work with CUDA 12 as they are. I was also told that TF might play a role here, which is why I also posted the TF versions. Could there be a redundant file? Maybe incompatible CUDA versions?

nvcc version where the compilation works:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

nvcc version where the compilation doesn't work:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0

Let me know if I can provide any further details. Thanks in advance!

albertz commented 2 months ago

Related is also #1513.

Can you post the full output?

albertz commented 2 months ago

Can you try to run on CPU only (export DISABLE_CUDA=1)? Can you try to run test_TFNativeOp.py?
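For context, here is a minimal sketch of what such an environment flag typically does. This is illustrative only; RETURNN's actual handling of DISABLE_CUDA may differ, and the helper name cuda_allowed is made up:

```python
# Hypothetical sketch (not RETURNN's actual code) of how an environment
# flag like DISABLE_CUDA can gate whether a run uses CUDA at all.
import os

def cuda_allowed(env=None):
    # Any non-empty value of DISABLE_CUDA means "CPU only".
    if env is None:
        env = os.environ
    return not env.get("DISABLE_CUDA")

print(cuda_allowed({"DISABLE_CUDA": "1"}))  # False -> CPU-only run
print(cuda_allowed({}))                     # True  -> CUDA may be used
```

Running the test suite under such a flag isolates whether the failure is specific to the CUDA code path.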

Icemole commented 2 months ago

Please find here the full output of the compilation on CUDA.

Answering your questions:

  1. A non-CUDA environment, CPU only works!
  2. python3 -m pytest test_TFNativeOp.py also works, but I'm not sure whether I'm running the test with CUDA enabled (if that makes any difference for the test). Besides, there are some skipped tests as well as some warnings. I'm running these on a machine that has GPUs available. Please see the results below.
test_TFNativeOp.py ......................................................sssssss                                                                                    [100%] 

============================================================================ warnings summary =============================================================================
../../../../../../../../../../../.venvs/singularity/returnn_test_native_op/lib/python3.10/site-packages/nose/plugins/manager.py:418
  /home/nbeneitez/.venvs/singularity/returnn_test_native_op/lib/python3.10/site-packages/nose/plugins/manager.py:418: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
    import pkg_resources

../../../../../../../../../../../.venvs/singularity/returnn_test_native_op/lib/python3.10/site-packages/nose/importer.py:12
  /home/nbeneitez/.venvs/singularity/returnn_test_native_op/lib/python3.10/site-packages/nose/importer.py:12: DeprecationWarning: the imp module is deprecated in favour of importlib and slated for removal in Python 3.12; see the module's documentation for alternative uses
    from imp import find_module, load_module, acquire_lock, release_lock

../../../../../../../../../../../.venvs/singularity/returnn_test_native_op/lib/python3.10/site-packages/numpy/__config__.py:155
  /home/nbeneitez/.venvs/singularity/returnn_test_native_op/lib/python3.10/site-packages/numpy/__config__.py:155: UserWarning: Install `pyyaml` for better output
    warnings.warn("Install `pyyaml` for better output", stacklevel=1)

tests/test_TFNativeOp.py::test_py_viterbi
  /home/nbeneitez/work/returnn/native_op_issue/work/i6_core/tools/git/CloneGitRepositoryJob.nH5B7CKRCU89/output/repository/tests/test_TFNativeOp.py:2224: RuntimeWarning: divide by zero encountered in log
    am_scores = numpy.log(am_scores)  # in +log space

tests/test_TFNativeOp.py::test_fast_viterbi
  /home/nbeneitez/work/returnn/native_op_issue/work/i6_core/tools/git/CloneGitRepositoryJob.nH5B7CKRCU89/output/repository/tests/test_TFNativeOp.py:2277: RuntimeWarning: divide by zero encountered in log
    am_scores = numpy.log(am_scores)  # in +log space

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================================== 54 passed, 7 skipped, 5 warnings in 193.13s (0:03:13) ==========================================================

Should test_TFNativeOp.py be failing for me? As I said, I might be doing something wrong.

albertz commented 2 months ago

python3 -m pytest test_TFNativeOp.py also works

I assume you tested that with export DISABLE_CUDA=1, i.e. only for CPU? Can you also try with CUDA?

albertz commented 2 months ago

Note, the main error is error: name followed by "::" must be a class or namespace name on perftools:

/home/nbeneitez/work/returnn/native_op_issue/work/i6_core/tools/git/CloneGitRepositoryJob.nH5B7CKRCU89/output/repository/returnn/native_op.cpp(240): error: name followed by "::" must be a class or namespace name
  perftools::gputools::DeviceMemory<T> AsDeviceMemory(const T* cuda_memory) {
  ^

I guess they moved/renamed that. I see in other TF code that it is se::DeviceMemory<T> (or maybe tensorflow::se::DeviceMemory<T> or stream_executor::DeviceMemory<T> or so) now.

Similarly, in our static perftools::gputools::blas::Transpose get_transpose, I think it is stream_executor::blas::Transpose or so now.