recommenders-team / recommenders

Best Practices on Recommendation Systems
https://recommenders-team.github.io/recommenders/intro.html
MIT License

Programmatic execution of notebooks #2031

Closed · miguelgfierro closed this 8 months ago

miguelgfierro commented 10 months ago

Description

This PR removes papermill and scrapbook and reimplements the same functionality with no new dependencies.
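The replacement is built directly on nbformat and nbconvert (the same stack papermill uses underneath). A minimal sketch of the execution piece, with illustrative names; the real code lives in recommenders/utils/notebook_utils.py:

import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

def run_notebook(input_path, output_path, kernel_name="python3", timeout=600):
    # Read the notebook, execute every cell in order, and write the result out.
    notebook = nbformat.read(input_path, as_version=4)
    execute_preprocessor = ExecutePreprocessor(timeout=timeout, kernel_name=kernel_name)
    execute_preprocessor.preprocess(notebook, {"metadata": {"path": "."}})
    nbformat.write(notebook, output_path)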

Related Issues

Fixes https://github.com/recommenders-team/recommenders/issues/2012


miguelgfierro commented 10 months ago

Weird error. The input is 100k, but the regex parser ends up passing 10k (see the hypothetical repro after the traceback below):

tests/functional/examples/test_notebooks_gpu.py F                                                                                                                     [100%]

================================================================================= FAILURES ==================================================================================
______________________________________________________ test_ncf_deep_dive_functional[100k-10-512-expected_values0-42] _______________________________________________________

notebooks = {'als_deep_dive': '/home/u/MS/recommenders/examples/02_model_collaborative_filtering/als_deep_dive.ipynb', 'als_pyspar...aseline_deep_dive.ipynb', 'benchmark_movielens': '/home/u/MS/recommenders/examples/06_benchmarks/movielens.ipynb', ...}
output_notebook = 'output.ipynb', kernel_name = 'python3', size = '100k', epochs = 10, batch_size = 512
expected_values = {'map': 0.0435856, 'map2': 0.0510391, 'ndcg': 0.37586, 'ndcg2': 0.202186, ...}, seed = 42

    @pytest.mark.gpu
    @pytest.mark.notebooks
    @pytest.mark.parametrize(
        "size, epochs, batch_size, expected_values, seed",
        [
            (
                "100k",
                10,
                512,
                {
                    "map": 0.0435856,
                    "ndcg": 0.37586,
                    "precision": 0.169353,
                    "recall": 0.0923963,
                    "map2": 0.0510391,
                    "ndcg2": 0.202186,
                    "precision2": 0.179533,
                    "recall2": 0.106434,
                },
                42,
            )
        ],
    )
    def test_ncf_deep_dive_functional(
        notebooks,
        output_notebook,
        kernel_name,
        size,
        epochs,
        batch_size,
        expected_values,
        seed,
    ):
        notebook_path = notebooks["ncf_deep_dive"]
>       execute_notebook(
            notebook_path,
            output_notebook,
            kernel_name=kernel_name,
            parameters=dict(
                TOP_K=10,
                MOVIELENS_DATA_SIZE=size,
                EPOCHS=epochs,
                BATCH_SIZE=batch_size,
                SEED=seed,
            ),
        )

tests/functional/examples/test_notebooks_gpu.py:91:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
recommenders/utils/notebook_utils.py:99: in execute_notebook
    executed_notebook, _ = execute_preprocessor.preprocess(
../../anaconda/envs/recommenders/lib/python3.9/site-packages/nbconvert/preprocessors/execute.py:100: in preprocess
    self.preprocess_cell(cell, resources, index)
../../anaconda/envs/recommenders/lib/python3.9/site-packages/nbconvert/preprocessors/execute.py:121: in preprocess_cell
    cell = self.execute_cell(cell, index, store_history=True)
../../anaconda/envs/recommenders/lib/python3.9/site-packages/jupyter_core/utils/__init__.py:166: in wrapped
    return loop.run_until_complete(inner)
../../anaconda/envs/recommenders/lib/python3.9/asyncio/base_events.py:647: in run_until_complete
    return future.result()
../../anaconda/envs/recommenders/lib/python3.9/site-packages/nbclient/client.py:1058: in async_execute_cell
    await self._check_raise_for_error(cell, cell_index, exec_reply)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <nbconvert.preprocessors.execute.ExecutePreprocessor object at 0x7fc7855dc6d0>
cell = {'cell_type': 'code', 'execution_count': 3, 'metadata': {'execution': {'iopub.status.busy': '2023-10-31T16:26:54.66711...oad_pandas_df(\n    size=MOVIELENS_DATA_SIZE,\n    header=["userID", "itemID", "rating", "timestamp"]\n)\n\ndf.head()'}
cell_index = 9
exec_reply = {'buffers': [], 'content': {'ename': 'ValueError', 'engine_info': {'engine_id': -1, 'engine_uuid': 'e1defabf-6d1f-40f2...e, 'engine': 'e1defabf-6d1f-40f2-a86b-3533e758ecca', 'started': '2023-10-31T16:26:54.667305Z', 'status': 'error'}, ...}

    async def _check_raise_for_error(
        self, cell: NotebookNode, cell_index: int, exec_reply: t.Optional[t.Dict]
    ) -> None:
        if exec_reply is None:
            return None

        exec_reply_content = exec_reply['content']
        if exec_reply_content['status'] != 'error':
            return None

        cell_allows_errors = (not self.force_raise_errors) and (
            self.allow_errors
            or exec_reply_content.get('ename') in self.allow_error_names
            or "raises-exception" in cell.metadata.get("tags", [])
        )
        await run_hook(
            self.on_cell_error, cell=cell, cell_index=cell_index, execute_reply=exec_reply
        )
        if not cell_allows_errors:
>           raise CellExecutionError.from_cell_and_msg(cell, exec_reply_content)
E           nbclient.exceptions.CellExecutionError: An error occurred while executing the following cell:
E           ------------------
E           df = movielens.load_pandas_df(
E               size=MOVIELENS_DATA_SIZE,
E               header=["userID", "itemID", "rating", "timestamp"]
E           )
E
E           df.head()
E           ------------------
E
E
E           ---------------------------------------------------------------------------
E           ValueError                                Traceback (most recent call last)
E           Cell In[3], line 1
E           ----> 1 df = movielens.load_pandas_df(
E                 2     size=MOVIELENS_DATA_SIZE,
E                 3     header=["userID", "itemID", "rating", "timestamp"]
E                 4 )
E                 6 df.head()
E
E           File ~/MS/recommenders/recommenders/datasets/movielens.py:201, in load_pandas_df(size, header, local_cache_path, title_col, genres_col, year_col)
E               199 size = size.lower()
E               200 if size not in DATA_FORMAT and size not in MOCK_DATA_FORMAT:
E           --> 201     raise ValueError(f"Size: {size}. " + ERROR_MOVIE_LENS_SIZE)
E               203 if header is None:
E               204     header = DEFAULT_HEADER
E
E           ValueError: Size: 10k. Invalid data size. Should be one of {100k, 1m, 10m, or 20m, or mock100}

../../anaconda/envs/recommenders/lib/python3.9/site-packages/nbclient/client.py:914: CellExecutionError
============================================================================= warnings summary ==============================================================================
../../anaconda/envs/recommenders/lib/python3.9/site-packages/jupyter_client/connect.py:20
  /home/u/anaconda/envs/recommenders/lib/python3.9/site-packages/jupyter_client/connect.py:20: DeprecationWarning: Jupyter is migrating its paths to use standard platformdirs
  given by the platformdirs library.  To remove this warning and
  see the appropriate new directories, set the environment variable
  `JUPYTER_PLATFORM_DIRS=1` and then run `jupyter --paths`.
  The use of platformdirs will be the default in `jupyter_core` v6
    from jupyter_core.paths import jupyter_data_dir, jupyter_runtime_dir, secure_write

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================================================== short test summary info ==========================================================================
FAILED tests/functional/examples/test_notebooks_gpu.py::test_ncf_deep_dive_functional[100k-10-512-expected_values0-42] - nbclient.exceptions.CellExecutionError: An error occurred while executing the following cell:
======================================================================= 1 failed, 1 warning in 5.23s ========================================================================
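A hypothetical minimal repro of that failure mode, assuming the injection does a plain str.replace of an old parameter value over the whole cell source (the parameter names and defaults here are made up for illustration):

>>> cell_source = 'TOP_K = 100\nMOVIELENS_DATA_SIZE = "100k"\n'
>>> cell_source.replace("100", "10")  # substituting one parameter's value...
'TOP_K = 10\nMOVIELENS_DATA_SIZE = "10k"\n'

Note how the substring "100" inside "100k" gets rewritten too, which would explain the 10k in the error above.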

Similar error but with a different notebook:

    @pytest.mark.notebooks
    @pytest.mark.experimental
    def test_rlrmc_quickstart_runs(notebooks, output_notebook, kernel_name):
        notebook_path = notebooks["rlrmc_quickstart"]
>       execute_notebook(
            notebook_path,
            output_notebook,
            kernel_name=kernel_name,
            parameters=dict(rank_parameter=2, MOVIELENS_DATA_SIZE="mock100"),
        )

tests/unit/examples/test_notebooks_python.py:88: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
recommenders/utils/notebook_utils.py:99: in execute_notebook
    executed_notebook, _ = execute_preprocessor.preprocess(
/azureml-envs/azureml_8854b0bdccc7bb7425b7c3f2145bc96f/lib/python3.9/site-packages/nbconvert/preprocessors/execute.py:102: in preprocess
    self.preprocess_cell(cell, resources, index)
/azureml-envs/azureml_8854b0bdccc7bb7425b7c3f2145bc96f/lib/python3.9/site-packages/nbconvert/preprocessors/execute.py:123: in preprocess_cell
    cell = self.execute_cell(cell, index, store_history=True)
/azureml-envs/azureml_8854b0bdccc7bb7425b7c3f2145bc96f/lib/python3.9/site-packages/jupyter_core/utils/__init__.py:173: in wrapped
    return loop.run_until_complete(inner)
/azureml-envs/azureml_8854b0bdccc7bb7425b7c3f2145bc96f/lib/python3.9/asyncio/base_events.py:647: in run_until_complete
    return future.result()
/azureml-envs/azureml_8854b0bdccc7bb7425b7c3f2145bc96f/lib/python3.9/site-packages/nbclient/client.py:1058: in async_execute_cell
    await self._check_raise_for_error(cell, cell_index, exec_reply)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <nbconvert.preprocessors.execute.ExecutePreprocessor object at 0x14e3f8053940>
cell = {'cell_type': 'code', 'execution_count': 4, 'metadata': {'execution': {'iopub.status.busy': '2023-10-31T16:49:29.42054...= movielens.load_pandas_df(\n    size=MOVIELENS_DATA_SIZE,\n    header=["userID", "itemID", "rating", "timestamp"]\n)'}
cell_index = 7
exec_reply = {'buffers': [], 'content': {'ename': 'ValueError', 'engine_info': {'engine_id': -1, 'engine_uuid': '79883de8-cb38-47df...e, 'engine': '79883de8-cb38-47df-a3b2-e63115050117', 'started': '2023-10-31T16:49:29.420956Z', 'status': 'error'}, ...}

    async def _check_raise_for_error(
        self, cell: NotebookNode, cell_index: int, exec_reply: t.Optional[t.Dict]
    ) -> None:
        if exec_reply is None:
            return None

        exec_reply_content = exec_reply['content']
        if exec_reply_content['status'] != 'error':
            return None

        cell_allows_errors = (not self.force_raise_errors) and (
            self.allow_errors
            or exec_reply_content.get('ename') in self.allow_error_names
            or "raises-exception" in cell.metadata.get("tags", [])
        )
        await run_hook(
            self.on_cell_error, cell=cell, cell_index=cell_index, execute_reply=exec_reply
        )
        if not cell_allows_errors:
>           raise CellExecutionError.from_cell_and_msg(cell, exec_reply_content)
E           nbclient.exceptions.CellExecutionError: An error occurred while executing the following cell:
E           ------------------
E           
E           df = movielens.load_pandas_df(
E               size=MOVIELENS_DATA_SIZE,
E               header=["userID", "itemID", "rating", "timestamp"]
E           )
E           ------------------
E           
E           
E           ---------------------------------------------------------------------------
E           ValueError                                Traceback (most recent call last)
E           Cell In[4], line 1
E           ----> 1 df = movielens.load_pandas_df(
E                 2 size=MOVIELENS_DATA_SIZE,
E                 3 header=["userID","itemID","rating","timestamp"]
E                 4 )
E           
E           File /mnt/azureml/cr/j/f1d53f64bdb5410196f4cc9b6e069605/exe/wd/recommenders/datasets/movielens.py:201, in load_pandas_df(size, header, local_cache_path, title_col, genres_col, year_col)
E               199 size = size.lower()
E               200 if size not in DATA_FORMAT and size not in MOCK_DATA_FORMAT:
E           --> 201     raise ValueError(f"Size: {size}. " + ERROR_MOVIE_LENS_SIZE)
E               203 if header is None:
E               204     header = DEFAULT_HEADER
E           
E           ValueError: Size: 2m. Invalid data size. Should be one of {100k, 1m, 10m, or 20m, or mock100}
loomlike commented 10 months ago

@miguelgfierro Sorry to ask a dumb question as I missed the discussion, but why do we reinvent the wheel here? Couldn't we use papermill to execute the notebooks (it still seems actively developed, unlike scrapbook), plus other open-source recording packages, e.g. mlflow, for recording and verifying metrics? mlflow recording code would also make a good example of how to record metrics in our notebooks...

miguelgfierro commented 10 months ago

It seems that papermill is also not maintained: https://pypi.org/project/papermill/#history. It hasn't been updated in over a year. MLflow for recording is an interesting idea; the only problem is that it would add another dependency. One of the reasons to do this from scratch is to reduce dependencies.

This code doesn't add any new dependency, and it gives us the same functionality we had. If in the future there is an appetite to change the data recording to MLflow, we can add it.

SimonYansenZhao commented 9 months ago

@miguelgfierro I think the pattern matching is incorrect. See the example below that uses the pattern matching in execute_notebook():

>>> import re
>>> pattern = re.compile(rf"\bmy_param\s*=\s*([^#\n]+)(?:#.*$)?", re.MULTILINE)
>>> cell_source = "\"my_param = 'abc'\n\", \"another_param = 'abc'\n\""
>>> matches = re.findall(pattern, "\"my_param = 'abc'\n\", \"another_param = 'abc'\n\"")
>>> matches
["'abc'"]
>>> cell_source.replace(matches[0].strip(), '10')
'"my_param = 10\n", "another_param = 10\n"'

All parameters whose value is 'abc' above are changed.
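One way to avoid the over-replacement is to substitute on the match itself rather than doing a global replace of the captured value, e.g. with re.sub anchored to the parameter name (an illustrative sketch, not necessarily the fix that landed):

>>> import re
>>> def substitute_param(cell_source, name, new_value):
...     # Rewrite only the right-hand side of `name = ...`; other parameters stay untouched.
...     pattern = re.compile(rf"\b{name}\s*=\s*([^#\n]+)", re.MULTILINE)
...     return pattern.sub(f"{name} = {new_value!r}", cell_source)
...
>>> substitute_param("my_param = 'abc'\nanother_param = 'abc'\n", "my_param", 10)
"my_param = 10\nanother_param = 'abc'\n"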

SimonYansenZhao commented 9 months ago

@miguelgfierro I fixed the pattern matching bug. Now a new error is caught. I'll take a look the day after tomorrow.

loomlike commented 9 months ago

> @miguelgfierro I think the pattern matching is incorrect. See the example below that uses the pattern matching in execute_notebook(): [...] All parameters whose value is 'abc' above are changed.

@SimonYansenZhao can we modularize the parameter pattern-matching and replace logic, pulling it out of execute_notebook, so that we can unit test it better? Something like the sketch below.
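A hypothetical split, so the substitution logic gets its own unit tests (all names here are illustrative, not the PR's actual API):

import re

def update_parameters(cell_source, parameters):
    # Hypothetical helper pulled out of execute_notebook: substitute each
    # parameter's value, anchored to the parameter name.
    for name, value in parameters.items():
        pattern = re.compile(rf"\b{name}\s*=\s*([^#\n]+)", re.MULTILINE)
        cell_source = pattern.sub(f"{name} = {value!r}", cell_source)
    return cell_source

def test_update_parameters():
    source = "my_param = 'abc'\nanother_param = 'abc'\n"
    assert update_parameters(source, {"my_param": 10}) == "my_param = 10\nanother_param = 'abc'\n"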

SimonYansenZhao commented 9 months ago

The pattern matching in the notebook utils currently cannot extract multiline parameter values. For example, the following test https://github.com/recommenders-team/recommenders/blob/b000b78ceb3cbe52a0200922f2b2412d830274af/tests/unit/examples/test_notebooks_gpu.py#L77-L94

when substituting the value of RANKING_METRICS in 00_quick_start/wide_deep_movielens.ipynb

RANKING_METRICS = [
    evaluator.ndcg_at_k.__name__,
    evaluator.precision_at_k.__name__,
]

leads to the following result:

RANKING_METRICS = ["ndcg_at_k"]
    evaluator.ndcg_at_k.__name__,
    evaluator.precision_at_k.__name__,
]

So the current solution is to rewrite all multiline parameters into one line, as shown below. See the commit.
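For instance, the RANKING_METRICS cell above becomes:

RANKING_METRICS = [evaluator.ndcg_at_k.__name__, evaluator.precision_at_k.__name__]

so the single-line pattern captures the full value.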

SimonYansenZhao commented 9 months ago

> @SimonYansenZhao can we modularize the parameter pattern-matching and replace logic, pulling it out of execute_notebook, so that we can unit test it better?

@loomlike Sure, but first we need to make all the tests pass before refactoring.

miguelgfierro commented 9 months ago

There is an error: the system doesn't install CUDA 11 but CUDA 12:

2023-11-18 08:08:06.357546: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /azureml-envs/azureml_34c56b1c46d7f5ae137d78e9a4192235/lib:
2023-11-18 08:08:06.357591: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2023-11-18 08:08:09.200050: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /azureml-envs/azureml_34c56b1c46d7f5ae137d78e9a4192235/lib:
2023-11-18 08:08:09.200183: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /azureml-envs/azureml_34c56b1c46d7f5ae137d78e9a4192235/lib:
2023-11-18 08:08:09.200272: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /azureml-envs/azureml_34c56b1c46d7f5ae137d78e9a4192235/lib:
2023-11-18 08:08:09.200354: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /azureml-envs/azureml_34c56b1c46d7f5ae137d78e9a4192235/lib:
2023-11-18 08:08:09.200437: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcurand.so.10'; dlerror: libcurand.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /azureml-envs/azureml_34c56b1c46d7f5ae137d78e9a4192235/lib:
2023-11-18 08:08:09.200518: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /azureml-envs/azureml_34c56b1c46d7f5ae137d78e9a4192235/lib:
2023-11-18 08:08:09.200624: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /azureml-envs/azureml_34c56b1c46d7f5ae137d78e9a4192235/lib:
2023-11-18 08:08:09.200711: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /azureml-envs/azureml_34c56b1c46d7f5ae137d78e9a4192235/lib:

This was installed:
INFO:submit_groupwise_azureml_pytest.py: nvidia-cublas-cu12:12.1.3.1
INFO:submit_groupwise_azureml_pytest.py: nvidia-cuda-cupti-cu12:12.1.105
INFO:submit_groupwise_azureml_pytest.py: nvidia-cuda-nvrtc-cu12:12.1.105
INFO:submit_groupwise_azureml_pytest.py: nvidia-cuda-runtime-cu12:12.1.105
INFO:submit_groupwise_azureml_pytest.py: nvidia-cudnn-cu12:8.9.2.26
INFO:submit_groupwise_azureml_pytest.py: nvidia-cufft-cu12:11.0.2.54
INFO:submit_groupwise_azureml_pytest.py: nvidia-curand-cu12:10.3.2.106
INFO:submit_groupwise_azureml_pytest.py: nvidia-cusolver-cu12:11.4.5.107
INFO:submit_groupwise_azureml_pytest.py: nvidia-cusparse-cu12:12.1.0.106
INFO:submit_groupwise_azureml_pytest.py: nvidia-ml-py3:7.352.0
INFO:submit_groupwise_azureml_pytest.py: nvidia-nccl-cu12:2.18.1
INFO:submit_groupwise_azureml_pytest.py: nvidia-nvjitlink-cu12:12.3.101
INFO:submit_groupwise_azureml_pytest.py: nvidia-nvtx-cu12:12.1.105

INFO:submit_groupwise_azureml_pytest.py: tensorboard:2.8.0
INFO:submit_groupwise_azureml_pytest.py: tensorboard-data-server:0.6.1
INFO:submit_groupwise_azureml_pytest.py: tensorboard-plugin-wit:1.8.1
INFO:submit_groupwise_azureml_pytest.py: tensorflow:2.8.4
INFO:submit_groupwise_azureml_pytest.py: tensorflow-estimator:2.8.0
INFO:submit_groupwise_azureml_pytest.py: tensorflow-io-gcs-filesystem:0.34.0

INFO:submit_groupwise_azureml_pytest.py: torch:2.1.1
INFO:submit_groupwise_azureml_pytest.py: torchvision:0.16.1

I tried nvidia-ml-py3>=7.352.0,<12, but CUDA is still 12. See: https://github.com/recommenders-team/recommenders/actions/runs/6980488739/job/18995849434

I tried nvidia-ml-py3>=7.352.0,<11, removed all tests except the GPU ones, and triggered the PR gate -> same error https://github.com/recommenders-team/recommenders/actions/runs/6981304867/job/18998574212

Tried removing nvidia-ml-py3 and commenting out transformers from the base deps -> same error https://github.com/recommenders-team/recommenders/actions/runs/6982265055/job/19001085823. It is not clear what is installing the nvidia packages.

Tried commenting out pytorch -> CUDA 12 still gets installed https://github.com/recommenders-team/recommenders/actions/runs/6987799139/job/19014709919

Tried commenting out pytorch, fastai and tf-slim, leaving only tensorflow==2.8.4 -> I don't get CUDA 12 here https://github.com/recommenders-team/recommenders/actions/runs/6988238260/job/19015619226, so one of them is installing it.

Tried with TF and torch -> torch is installing CUDA 12 packages like nvidia-cublas-cu12. https://github.com/recommenders-team/recommenders/actions/runs/6991615620/job/19022370652

Trying tensorflow==2.8.4 and torch>=1.13.1,<2 -> it installs some nvidia libs, but not all that are needed. Installed: nvidia-cublas-cu11:11.10.3.66, nvidia-cuda-nvrtc-cu11:11.7.99, nvidia-cuda-runtime-cu11:11.7.99, nvidia-cudnn-cu11:8.5.0.96; others are still missing, e.g. Could not load dynamic library 'libcudart.so.11.0'. See https://github.com/recommenders-team/recommenders/actions/runs/6993911589/job/19027085507

Trying tensorflow==2.8.4 and torch>=1.13.1,<2, adding all the nvidia-cu11 deps "nvidia-cublas-cu11","nvidia-cuda-cupti-cu11","nvidia-cuda-nvrtc-cu11","nvidia-cuda-runtime-cu11","nvidia-cudnn-cu11","nvidia-cufft-cu11","nvidia-curand-cu11","nvidia-cusolver-cu11","nvidia-cusparse-cu11","nvidia-ml-py3","nvidia-nccl-cu11","nvidia-nvjitlink-cu11","nvidia-nvtx-cu11" -> error https://github.com/recommenders-team/recommenders/actions/runs/6994198154: nvidia-nvjitlink-cu11 doesn't exist, so I removed it.

Tried again without nvidia-nvjitlink-cu11 -> I still get the timeout error. See https://github.com/recommenders-team/recommenders/actions/runs/7007905771/job/19063048904: nbclient.exceptions.CellTimeoutError: A cell timed out while it was being executed, after 600 seconds.

Tried torch>=1.13.1,<2@https://download.pytorch.org/whl/cu118 -> error: error in recommenders setup command: 'extras_require' must be a dictionary whose values are strings or lists of strings containing valid project/version requirement specifiers.
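A possible workaround for that setup error would be to keep a plain version specifier in extras_require and point pip at PyTorch's CUDA 11.8 wheel index at install time instead (illustrative; whether that index carries the pinned range is a separate question):

pip install "torch>=1.13.1,<2" --extra-index-url https://download.pytorch.org/whl/cu118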

Installed in local with:

        "nvidia-cublas-cu11",
        "nvidia-cuda-cupti-cu11",
        "nvidia-cuda-nvrtc-cu11",
        "nvidia-cuda-runtime-cu11",
        "nvidia-cudnn-cu11",
        "nvidia-cufft-cu11",
        "nvidia-curand-cu11",
        "nvidia-cusolver-cu11",
        "nvidia-cusparse-cu11",
        "nvidia-ml-py3",
        "nvidia-nccl-cu11",
        "nvidia-nvtx-cu11",
        "tensorflow==2.8.4",  # FIXME: Temporarily pinned due to issue with TF version > 2.10.1 See #2018
        "torch>=1.13.1,<2",

Got the same error:

~/MS/recommenders$ pytest tests/unit/examples/test_notebooks_gpu.py::test_dkn_quickstart
============================= test session starts ==============================
platform linux -- Python 3.9.18, pytest-7.4.3, pluggy-1.3.0
rootdir: /home/u/MS/recommenders
configfile: pyproject.toml
plugins: cov-4.1.0, typeguard-4.1.5, anyio-4.1.0, mock-3.12.0, hypothesis-6.91.0
collected 1 item

tests/unit/examples/test_notebooks_gpu.py F                              [100%]

=================================== FAILURES ===================================
_____________________________ test_dkn_quickstart ______________________________
self = <nbconvert.preprocessors.execute.ExecutePreprocessor object at 0x7fa6432c7130>
msg_id = '5e3667f3-7c5d09fe769d0a7452444f4c_11851_8'
cell = {'cell_type': 'code', 'execution_count': 7, 'metadata': {'pycharm': {'is_executing': False}, 'scrolled': True, 'execut..., 'iopub.execute_input': '2023-11-28T12:00:46.180970Z'}}, 'outputs': [], 'source': 'model.fit(train_file, valid_file)'}
timeout = 600
task_poll_output_msg = <Task pending name='Task-37' coro=<NotebookClient._async_poll_output_msg() running at /home/u/anaconda/envs/test_reco/...da/envs/test_reco/lib/python3.9/site-packages/zmq/_future.py:412, <TaskWakeupMethWrapper object at 0x7fa642661e50>()]>>
task_poll_kernel_alive = <Task cancelled name='Task-36' coro=<NotebookClient._async_poll_kernel_alive() done, defined at /home/u/anaconda/envs/test_reco/lib/python3.9/site-packages/nbclient/client.py:821>>

    async def _async_poll_for_reply(
        self,
        msg_id: str,
        cell: NotebookNode,
        timeout: int | None,
        task_poll_output_msg: asyncio.Future[t.Any],
        task_poll_kernel_alive: asyncio.Future[t.Any],
    ) -> dict[str, t.Any]:
        msg: dict[str, t.Any]
        assert self.kc is not None
        new_timeout: float | None = None
        if timeout is not None:
            deadline = monotonic() + timeout
            new_timeout = float(timeout)
        error_on_timeout_execute_reply = None
        while True:
            try:
                if error_on_timeout_execute_reply:
                    msg = error_on_timeout_execute_reply  # type:ignore[unreachable]
                    msg["parent_header"] = {"msg_id": msg_id}
                else:
>                   msg = await ensure_async(self.kc.shell_channel.get_msg(timeout=new_timeout))

../../anaconda/envs/test_reco/lib/python3.9/site-packages/nbclient/client.py:782:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../anaconda/envs/test_reco/lib/python3.9/site-packages/jupyter_core/utils/__init__.py:189: in ensure_async
    result = await obj
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <jupyter_client.channels.AsyncZMQSocketChannel object at 0x7fa643014c40>
timeout = 600000.0

    async def get_msg(  # type:ignore[override]
        self, timeout: t.Optional[float] = None
    ) -> t.Dict[str, t.Any]:
        """Gets a message if there is one that is ready."""
        assert self.socket is not None
        if timeout is not None:
            timeout *= 1000  # seconds to ms
        ready = await self.socket.poll(timeout)
        if ready:
            res = await self._recv()
            return res
        else:
>           raise Empty
E           _queue.Empty

../../anaconda/envs/test_reco/lib/python3.9/site-packages/jupyter_client/channels.py:315: Empty

During handling of the above exception, another exception occurred:

notebooks = {'als_deep_dive': '/home/u/MS/recommenders/examples/02_model_collaborative_filtering/als_deep_dive.ipynb', 'als_pyspar...aseline_deep_dive.ipynb', 'benchmark_movielens': '/home/u/MS/recommenders/examples/06_benchmarks/movielens.ipynb', ...}
output_notebook = 'output.ipynb', kernel_name = 'python3'

    @pytest.mark.notebooks
    @pytest.mark.gpu
    def test_dkn_quickstart(notebooks, output_notebook, kernel_name):
        notebook_path = notebooks["dkn_quickstart"]
>       execute_notebook(
            notebook_path,
            output_notebook,
            kernel_name=kernel_name,
            parameters=dict(EPOCHS=1, BATCH_SIZE=500),
        )

tests/unit/examples/test_notebooks_gpu.py:118:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
recommenders/utils/notebook_utils.py:107: in execute_notebook
    executed_notebook, _ = execute_preprocessor.preprocess(
../../anaconda/envs/test_reco/lib/python3.9/site-packages/nbconvert/preprocessors/execute.py:102: in preprocess
    self.preprocess_cell(cell, resources, index)
../../anaconda/envs/test_reco/lib/python3.9/site-packages/nbconvert/preprocessors/execute.py:123: in preprocess_cell
    cell = self.execute_cell(cell, index, store_history=True)
../../anaconda/envs/test_reco/lib/python3.9/site-packages/jupyter_core/utils/__init__.py:173: in wrapped
    return loop.run_until_complete(inner)
../../anaconda/envs/test_reco/lib/python3.9/asyncio/base_events.py:647: in run_until_complete
    return future.result()
../../anaconda/envs/test_reco/lib/python3.9/site-packages/nbclient/client.py:1005: in async_execute_cell
    exec_reply = await self.task_poll_for_reply
../../anaconda/envs/test_reco/lib/python3.9/site-packages/nbclient/client.py:806: in _async_poll_for_reply
    error_on_timeout_execute_reply = await self._async_handle_timeout(timeout, cell)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <nbconvert.preprocessors.execute.ExecutePreprocessor object at 0x7fa6432c7130>
timeout = 600
cell = {'cell_type': 'code', 'execution_count': 7, 'metadata': {'pycharm': {'is_executing': False}, 'scrolled': True, 'execut..., 'iopub.execute_input': '2023-11-28T12:00:46.180970Z'}}, 'outputs': [], 'source': 'model.fit(train_file, valid_file)'}

    async def _async_handle_timeout(
        self, timeout: int, cell: NotebookNode | None = None
    ) -> None | dict[str, t.Any]:
        self.log.error("Timeout waiting for execute reply (%is)." % timeout)
        if self.interrupt_on_timeout:
            self.log.error("Interrupting kernel")
            assert self.km is not None
            await ensure_async(self.km.interrupt_kernel())
            if self.error_on_timeout:
                execute_reply = {"content": {**self.error_on_timeout, "status": "error"}}
                return execute_reply
            return None
        else:
            assert cell is not None
>           raise CellTimeoutError.error_from_timeout_and_cell(
                "Cell execution timed out", timeout, cell
            )
E           nbclient.exceptions.CellTimeoutError: A cell timed out while it was being executed, after 600 seconds.
E           The message was: Cell execution timed out.
E           Here is a preview of the cell contents:
E           -------------------
E           model.fit(train_file, valid_file)
E           -------------------

../../anaconda/envs/test_reco/lib/python3.9/site-packages/nbclient/client.py:856: CellTimeoutError
----------------------------- Captured stderr call -----------------------------
2023-11-28 13:00:18.703631: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:922] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-11-28 13:00:18.774212: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory
2023-11-28 13:00:18.786748: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2023-11-28 13:00:19.809460: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-28 13:00:19.815773: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:922] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-11-28 13:00:19.815805: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2023-11-28 13:00:21.593795: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 300000000 exceeds 10% of free system memory.
2023-11-28 13:00:23.448917: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 300000000 exceeds 10% of free system memory.
2023-11-28 13:00:25.435549: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 300000000 exceeds 10% of free system memory.
2023-11-28 13:00:26.060217: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 300000000 exceeds 10% of free system memory.
2023-11-28 13:00:26.659435: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 300000000 exceeds 10% of free system memory.
------------------------------ Captured log call -------------------------------
ERROR    traitlets:client.py:845 Timeout waiting for execute reply (600s).
=============================== warnings summary ===============================
../../anaconda/envs/test_reco/lib/python3.9/site-packages/jupyter_client/connect.py:22
  /home/u/anaconda/envs/test_reco/lib/python3.9/site-packages/jupyter_client/connect.py:22: DeprecationWarning: Jupyter is migrating its paths to use standard platformdirs
  given by the platformdirs library.  To remove this warning and
  see the appropriate new directories, set the environment variable
  `JUPYTER_PLATFORM_DIRS=1` and then run `jupyter --paths`.
  The use of platformdirs will be the default in `jupyter_core` v6
    from jupyter_core.paths import jupyter_data_dir, jupyter_runtime_dir, secure_write

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/unit/examples/test_notebooks_gpu.py::test_dkn_quickstart - nbclient.exceptions.CellTimeoutError: A cell timed out while it was being e...
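For the timeout itself, nbclient's ExecutePreprocessor takes a timeout in seconds, so one option would be to raise it above the 600 s used in the failing run; a sketch, assuming the limit can be threaded through execute_notebook (the notebook path is illustrative):

import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

# Raise the per-cell timeout so the long model.fit cell doesn't trip the 600 s limit.
notebook = nbformat.read("examples/00_quick_start/dkn_MIND.ipynb", as_version=4)
execute_preprocessor = ExecutePreprocessor(timeout=2400, kernel_name="python3")
execute_preprocessor.preprocess(notebook, {"metadata": {"path": "."}})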