recommenders-team / recommenders

Best Practices on Recommendation Systems
https://recommenders-team.github.io/recommenders/intro.html
MIT License

Fixing TF to < 2.16 #2071

Closed. miguelgfierro closed this 5 months ago

miguelgfierro commented 5 months ago

Description

Related Issues

#2073


miguelgfierro commented 5 months ago

Getting an error in the FastAI notebook test (test_fastai):

2024-03-18T20:52:18.2319582Z =================================== FAILURES ===================================
2024-03-18T20:52:18.2320778Z _________________________________ test_fastai __________________________________
2024-03-18T20:52:18.2321604Z 
2024-03-18T20:52:18.2323932Z notebooks = ***'als_deep_dive': '/mnt/azureml/cr/j/cda5fa5f89704ed0a7056494d3d4bfae/exe/wd/examples/02_model_collaborative_filtering...rk_movielens': '/mnt/azureml/cr/j/cda5fa5f89704ed0a7056494d3d4bfae/exe/wd/examples/06_benchmarks/movielens.ipynb', ...***
2024-03-18T20:52:18.2325994Z output_notebook = 'output.ipynb', kernel_name = 'python3'
2024-03-18T20:52:18.2326452Z 
2024-03-18T20:52:18.2326631Z     @pytest.mark.notebooks
2024-03-18T20:52:18.2327071Z     @pytest.mark.gpu
2024-03-18T20:52:18.2327593Z     def test_fastai(notebooks, output_notebook, kernel_name):
2024-03-18T20:52:18.2328247Z         notebook_path = notebooks["fastai"]
2024-03-18T20:52:18.2328768Z >       execute_notebook(
2024-03-18T20:52:18.2329189Z             notebook_path,
2024-03-18T20:52:18.2329613Z             output_notebook,
2024-03-18T20:52:18.2330063Z             kernel_name=kernel_name,
2024-03-18T20:52:18.2330736Z             parameters=dict(TOP_K=10, MOVIELENS_DATA_SIZE="mock100", EPOCHS=1),
2024-03-18T20:52:18.2331392Z         )
2024-03-18T20:52:18.2331587Z 
2024-03-18T20:52:18.2331815Z tests/unit/examples/test_notebooks_gpu.py:22: 
2024-03-18T20:52:18.2332930Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2024-03-18T20:52:18.2333695Z recommenders/utils/notebook_utils.py:102: in execute_notebook
2024-03-18T20:52:18.2334321Z     executed_notebook, _ = execute_preprocessor.preprocess(
2024-03-18T20:52:18.2335407Z /azureml-envs/azureml_a0d14432a61fd07846aaa46d0fe66974/lib/python3.11/site-packages/nbconvert/preprocessors/execute.py:102: in preprocess
2024-03-18T20:52:18.2336325Z     self.preprocess_cell(cell, resources, index)
2024-03-18T20:52:18.2337370Z /azureml-envs/azureml_a0d14432a61fd07846aaa46d0fe66974/lib/python3.11/site-packages/nbconvert/preprocessors/execute.py:123: in preprocess_cell
2024-03-18T20:52:18.2338343Z     cell = self.execute_cell(cell, index, store_history=True)
2024-03-18T20:52:18.2339342Z /azureml-envs/azureml_a0d14432a61fd07846aaa46d0fe66974/lib/python3.11/site-packages/jupyter_core/utils/__init__.py:165: in wrapped
2024-03-18T20:52:18.2340163Z     return loop.run_until_complete(inner)
2024-03-18T20:52:18.2341044Z /azureml-envs/azureml_a0d14432a61fd07846aaa46d0fe66974/lib/python3.11/asyncio/base_events.py:653: in run_until_complete
2024-03-18T20:52:18.2342013Z     return future.result()
2024-03-18T20:52:18.2343235Z /azureml-envs/azureml_a0d14432a61fd07846aaa46d0fe66974/lib/python3.11/site-packages/nbclient/client.py:1062: in async_execute_cell
2024-03-18T20:52:18.2344210Z     await self._check_raise_for_error(cell, cell_index, exec_reply)
2024-03-18T20:52:18.2344850Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2024-03-18T20:52:18.2345239Z 
2024-03-18T20:52:18.2345637Z self = <nbconvert.preprocessors.execute.ExecutePreprocessor object at 0x14a5c65978d0>
2024-03-18T20:52:18.2347304Z cell = ***'cell_type': 'code', 'execution_count': 18, 'metadata': ***'execution': ***'iopub.status.busy': '2024-03-18T20:51:33.1791...  prediction_col=PREDICTION)\n\nprint("Took *** seconds for *** predictions.".format(test_time, len(training_removed)))'***
2024-03-18T20:52:18.2348511Z cell_index = 30
2024-03-18T20:52:18.2349807Z exec_reply = ***'buffers': [], 'content': ***'ename': 'RuntimeError', 'engine_info': ***'engine_id': -1, 'engine_uuid': '217e3d70-6a11-48...e, 'engine': '217e3d70-6a11-4870-9977-5df29757f686', 'started': '2024-03-18T20:51:33.179501Z', 'status': 'error'***, ...***
2024-03-18T20:52:18.2350858Z 
2024-03-18T20:52:18.2351017Z     async def _check_raise_for_error(
2024-03-18T20:52:18.2351627Z         self, cell: NotebookNode, cell_index: int, exec_reply: dict[str, t.Any] | None
2024-03-18T20:52:18.2352794Z     ) -> None:
2024-03-18T20:52:18.2353331Z         if exec_reply is None:
2024-03-18T20:52:18.2353820Z             return None
2024-03-18T20:52:18.2354211Z     
2024-03-18T20:52:18.2354614Z         exec_reply_content = exec_reply["content"]
2024-03-18T20:52:18.2355224Z         if exec_reply_content["status"] != "error":
2024-03-18T20:52:18.2355771Z             return None
2024-03-18T20:52:18.2356154Z     
2024-03-18T20:52:18.2356613Z         cell_allows_errors = (not self.force_raise_errors) and (
2024-03-18T20:52:18.2357244Z             self.allow_errors
2024-03-18T20:52:18.2357832Z             or exec_reply_content.get("ename") in self.allow_error_names
2024-03-18T20:52:18.2358658Z             or "raises-exception" in cell.metadata.get("tags", [])
2024-03-18T20:52:18.2359253Z         )
2024-03-18T20:52:18.2359602Z         await run_hook(
2024-03-18T20:52:18.2360250Z             self.on_cell_error, cell=cell, cell_index=cell_index, execute_reply=exec_reply
2024-03-18T20:52:18.2360953Z         )
2024-03-18T20:52:18.2361312Z         if not cell_allows_errors:
2024-03-18T20:52:18.2361984Z >           raise CellExecutionError.from_cell_and_msg(cell, exec_reply_content)
2024-03-18T20:52:18.2363053Z E           nbclient.exceptions.CellExecutionError: An error occurred while executing the following cell:
2024-03-18T20:52:18.2363919Z E           ------------------
2024-03-18T20:52:18.2364333Z E           with Timer() as test_time:
2024-03-18T20:52:18.2364823Z E               top_k_scores = score(learner, 
2024-03-18T20:52:18.2365590Z E                                    test_df=training_removed,
2024-03-18T20:52:18.2366160Z E                                    user_col=USER, 
2024-03-18T20:52:18.2366693Z E                                    item_col=ITEM, 
2024-03-18T20:52:18.2367251Z E                                    prediction_col=PREDICTION)
2024-03-18T20:52:18.2367760Z E           
2024-03-18T20:52:18.2368388Z E           print("Took *** seconds for *** predictions.".format(test_time, len(training_removed)))
2024-03-18T20:52:18.2369140Z E           ------------------
2024-03-18T20:52:18.2369526Z E           
2024-03-18T20:52:18.2369842Z E           
2024-03-18T20:52:18.2370410Z E           ---------------------------------------------------------------------------
2024-03-18T20:52:18.2371273Z E           RuntimeError                              Traceback (most recent call last)
2024-03-18T20:52:18.2371979Z E           Cell In[18], line 2
2024-03-18T20:52:18.2372720Z E                 1 with Timer() as test_time:
2024-03-18T20:52:18.2373914Z E           ----> 2     top_k_scores = score(learner, 
2024-03-18T20:52:18.2374945Z E                 3                          test_df=training_removed,
2024-03-18T20:52:18.2375914Z E                 4                          user_col=USER, 
2024-03-18T20:52:18.2376844Z E                 5                          item_col=ITEM, 
2024-03-18T20:52:18.2377920Z E                 6                          prediction_col=PREDICTION)
2024-03-18T20:52:18.2379553Z E                 8 print("Took *** seconds for *** predictions.".format(test_time, len(training_removed)))
2024-03-18T20:52:18.2380709Z E           
2024-03-18T20:52:18.2382034Z E           File /mnt/azureml/cr/j/cda5fa5f89704ed0a7056494d3d4bfae/exe/wd/recommenders/models/fastai/fastai_utils.py:67, in score(learner, test_df, user_col, item_col, prediction_col, top_k)
2024-03-18T20:52:18.2383787Z E                65 if torch.cuda.is_available():
2024-03-18T20:52:18.2384744Z E                66     x = x.to("cuda")
2024-03-18T20:52:18.2386215Z E           ---> 67 pred = learner.model.forward(x).detach().cpu().numpy()
2024-03-18T20:52:18.2387481Z E                68 scores = pd.DataFrame(
2024-03-18T20:52:18.2388360Z E                69     ***user_col: test_df[user_col], item_col: test_df[item_col], prediction_col: pred***
2024-03-18T20:52:18.2389044Z E                70 )
2024-03-18T20:52:18.2390087Z E                71 scores = scores.sort_values([user_col, prediction_col], ascending=[True, False])
2024-03-18T20:52:18.2391015Z E           
2024-03-18T20:52:18.2392094Z E           File /azureml-envs/azureml_a0d14432a61fd07846aaa46d0fe66974/lib/python3.11/site-packages/fastai/collab.py:48, in EmbeddingDotBias.forward(self, x)
2024-03-18T20:52:18.2393377Z E                46 def forward(self, x):
2024-03-18T20:52:18.2394415Z E                47     users,items = x[:,0],x[:,1]
2024-03-18T20:52:18.2395890Z E           ---> 48     dot = self.u_weight(users)* self.i_weight(items)
2024-03-18T20:52:18.2397789Z E                49     res = dot.sum(1) + self.u_bias(users).squeeze() + self.i_bias(items).squeeze()
2024-03-18T20:52:18.2399480Z E                50     if self.y_range is None: return res
2024-03-18T20:52:18.2400301Z E           
2024-03-18T20:52:18.2401492Z E           File /azureml-envs/azureml_a0d14432a61fd07846aaa46d0fe66974/lib/python3.11/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
2024-03-18T20:52:18.2403511Z E              1509     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
2024-03-18T20:52:18.2404612Z E              1510 else:
2024-03-18T20:52:18.2405952Z E           -> 1511     return self._call_impl(*args, **kwargs)
2024-03-18T20:52:18.2407031Z E           
2024-03-18T20:52:18.2408183Z E           File /azureml-envs/azureml_a0d14432a61fd07846aaa46d0fe66974/lib/python3.11/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
2024-03-18T20:52:18.2409661Z E              1515 # If we don't have any hooks, we want to skip the rest of the logic in
2024-03-18T20:52:18.2410598Z E              1516 # this function, and just call forward.
2024-03-18T20:52:18.2412450Z E              1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
2024-03-18T20:52:18.2414287Z E              1518         or _global_backward_pre_hooks or _global_backward_hooks
2024-03-18T20:52:18.2415370Z E              1519         or _global_forward_hooks or _global_forward_pre_hooks):
2024-03-18T20:52:18.2416779Z E           -> 1520     return forward_call(*args, **kwargs)
2024-03-18T20:52:18.2417830Z E              1522 try:
2024-03-18T20:52:18.2418487Z E              1523     result = None
2024-03-18T20:52:18.2419012Z E           
2024-03-18T20:52:18.2420115Z E           File /azureml-envs/azureml_a0d14432a61fd07846aaa46d0fe66974/lib/python3.11/site-packages/torch/nn/modules/sparse.py:163, in Embedding.forward(self, input)
2024-03-18T20:52:18.2421694Z E               162 def forward(self, input: Tensor) -> Tensor:
2024-03-18T20:52:18.2423378Z E           --> 163     return F.embedding(
2024-03-18T20:52:18.2425225Z E               164         input, self.weight, self.padding_idx, self.max_norm,
2024-03-18T20:52:18.2427535Z E               165         self.norm_type, self.scale_grad_by_freq, self.sparse)
2024-03-18T20:52:18.2428715Z E           
2024-03-18T20:52:18.2430061Z E           File /azureml-envs/azureml_a0d14432a61fd07846aaa46d0fe66974/lib/python3.11/site-packages/torch/nn/functional.py:2237, in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
2024-03-18T20:52:18.2431745Z E              2231     # Note [embedding_renorm set_grad_enabled]
2024-03-18T20:52:18.2432505Z E              2232     # XXX: equivalent to
2024-03-18T20:52:18.2433195Z E              2233     # with torch.no_grad():
2024-03-18T20:52:18.2433897Z E              2234     #   torch.embedding_renorm_
2024-03-18T20:52:18.2434697Z E              2235     # remove once script supports set_grad_enabled
2024-03-18T20:52:18.2435611Z E              2236     _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
2024-03-18T20:52:18.2437402Z E           -> 2237 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
2024-03-18T20:52:18.2438700Z E           
2024-03-18T20:52:18.2439831Z E           RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
2024-03-18T20:52:18.2440786Z 
2024-03-18T20:52:18.2441424Z /azureml-envs/azureml_a0d14432a61fd07846aaa46d0fe66974/lib/python3.11/site-packages/nbclient/client.py:918: CellExecutionError

Trying to move the model to CUDA as well, so that the model and the input tensor are on the same device. Tests: https://github.com/recommenders-team/recommenders/actions/runs/8333772519
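For reference, the idea of keeping the model and its input on one device can be sketched as below. This is only a minimal illustration, not the actual change made to fastai_utils.py: the helper names `pick_device` and `predict_on_one_device` are hypothetical, and a plain `nn.Embedding` stands in for the fastai collaborative filtering model.

```python
import torch
from torch import nn


def pick_device():
    # Use the GPU when one is visible, otherwise fall back to the CPU.
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")


def predict_on_one_device(model, x):
    # Hypothetical helper: move BOTH the model and the input to the same
    # device before the forward pass, so the embedding lookup never mixes
    # CPU and CUDA tensors.
    device = pick_device()
    model = model.to(device)
    x = x.to(device)
    with torch.no_grad():
        pred = model(x)
    # Bring the result back to the CPU before converting to NumPy.
    return pred.cpu().numpy()


# Stand-in for the real model: a small embedding table (an assumption
# made for illustration only).
embedding = nn.Embedding(num_embeddings=5, embedding_dim=3)
indices = torch.tensor([0, 2, 4])
scores = predict_on_one_device(embedding, indices)
```

An alternative with the same effect would be to read the device from the model itself (for example `next(model.parameters()).device`) and move only the input there.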

miguelgfierro commented 5 months ago

@SimonYansenZhao This PR fixes the issue with TF and FastAI. See unit tests: https://github.com/recommenders-team/recommenders/actions/runs/8333772519

I think we should fix TF to <2.16 instead of to 2.15.0, because the code works with 2.15.1, and they might still release a 2.15.2 or something.
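To illustrate the difference between the two constraints, here is a rough sketch that compares plain dotted version strings as integer tuples. It is a simplification and an assumption on my part: pip resolves specifiers according to PEP 440, which also covers pre-releases and other cases ignored here. The helper names are hypothetical, and 2.15.2 is included only as a hypothetical future patch release.

```python
def parse_version(text):
    # Turn "2.15.1" into (2, 15, 1) so versions compare numerically.
    return tuple(int(part) for part in text.split("."))


def satisfies_upper_bound(version, upper_bound):
    # Mimics a "<upper_bound" constraint for simple release versions.
    return parse_version(version) < parse_version(upper_bound)


def satisfies_exact_pin(version, pinned):
    # Mimics an "==pinned" constraint.
    return parse_version(version) == parse_version(pinned)


# 2.15.2 is hypothetical here; it stands for any later 2.15.x patch.
candidates = ["2.15.0", "2.15.1", "2.15.2", "2.16.0"]

allowed_by_bound = [v for v in candidates if satisfies_upper_bound(v, "2.16")]
allowed_by_pin = [v for v in candidates if satisfies_exact_pin(v, "2.15.0")]

print("allowed by <2.16:   ", allowed_by_bound)
print("allowed by ==2.15.0:", allowed_by_pin)
```

The upper bound keeps every 2.15.x patch release installable while still excluding 2.16, whereas the exact pin would have to be edited for each new patch.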

Please take a look