tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

[Bug Report] `create_venv.sh` failing to find and install PyTorch #11144

Closed marty1885 closed 1 month ago

marty1885 commented 2 months ago

Describe the bug

Running ./create_venv.sh now fails to install PyTorch.

./create_venv.sh 
Creating virtual env in: /home/marty/Documents/tt-metal/python_env
Forcefully using a version of pip that will work with our view of editable installs
Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cpu
Collecting pip==20.1.1
  Using cached pip-20.1.1-py2.py3-none-any.whl (1.5 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 22.0.2
    Uninstalling pip-22.0.2:
      Successfully uninstalled pip-22.0.2
Successfully installed pip-20.1.1
Setting up virtual env
Writing to /home/marty/.config/pip/pip.conf
Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cpu
Requirement already satisfied: setuptools in ./python_env/lib/python3.10/site-packages (59.6.0)
Requirement already satisfied: wheel in ./python_env/lib/python3.10/site-packages (0.44.0)
Installing dev dependencies
Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cpu
Collecting platformdirs<4.0.0
  Using cached platformdirs-3.11.0-py3-none-any.whl (17 kB)
Collecting pre-commit==3.0.4
  Using cached pre_commit-3.0.4-py2.py3-none-any.whl (202 kB)
Collecting black==24.3.0
  Using cached black-24.3.0-py3-none-any.whl (201 kB)
Collecting clang-format==18.1.5
  Using cached clang_format-18.1.5-py2.py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
Collecting build==0.10.0
  Using cached build-0.10.0-py3-none-any.whl (17 kB)
Collecting twine==4.0.2
  Using cached twine-4.0.2-py3-none-any.whl (36 kB)
Collecting yamllint==1.32.0
  Using cached yamllint-1.32.0-py3-none-any.whl (65 kB)
Collecting mypy==1.9.0
  Using cached mypy-1.9.0-py3-none-any.whl (2.6 MB)
Collecting docutils==0.18.1
  Using cached docutils-0.18.1-py2.py3-none-any.whl (570 kB)
Collecting sphinx==7.1.2
  Using cached sphinx-7.1.2-py3-none-any.whl (3.2 MB)
Collecting sphinx-rtd-theme==1.3.0
  Using cached sphinx_rtd_theme-1.3.0-py2.py3-none-any.whl (2.8 MB)
Collecting sphinxcontrib-email==0.3.5
  Using cached sphinxcontrib_email-0.3.5-py3-none-any.whl (6.3 kB)
Collecting lxml==4.9.4
  Using cached lxml-4.9.4.tar.gz (3.6 MB)
Collecting breathe==4.35.0
  Using cached breathe-4.35.0-py3-none-any.whl (92 kB)
Collecting nbsphinx==0.9.3
  Using cached nbsphinx-0.9.3-py3-none-any.whl (31 kB)
Collecting sphinxcontrib-jquery==4.1
  Using cached sphinxcontrib_jquery-4.1-py2.py3-none-any.whl (121 kB)
Collecting ipython==8.12.3
  Using cached ipython-8.12.3-py3-none-any.whl (798 kB)
Processing /home/marty/.cache/pip/wheels/76/27/c2/c26175310aadcb8741b77657a1bb49c50cc7d4cdbf9eee0005/pandoc-2.3-py3-none-any.whl
Collecting tabulate==0.9.0
  Using cached tabulate-0.9.0-py3-none-any.whl (35 kB)
Collecting myst-parser==3.0.0
  Using cached myst_parser-3.0.0-py3-none-any.whl (83 kB)
Collecting elasticsearch
  Using cached elasticsearch-8.14.0-py3-none-any.whl (480 kB)
Collecting termcolor
  Using cached termcolor-2.4.0-py3-none-any.whl (7.7 kB)
Collecting beautifultable
  Using cached beautifultable-1.1.0-py2.py3-none-any.whl (28 kB)
Collecting pytest==7.2.2
  Using cached pytest-7.2.2-py3-none-any.whl (317 kB)
Collecting pytest-timeout==2.2.0
  Using cached pytest_timeout-2.2.0-py3-none-any.whl (13 kB)
Collecting pytest-split==0.8.2
  Using cached pytest_split-0.8.2-py3-none-any.whl (11 kB)
Collecting pytest-xdist==3.6.1
  Using cached pytest_xdist-3.6.1-py3-none-any.whl (46 kB)
Processing /home/marty/.cache/pip/wheels/38/39/2a/ba16f42aa6cee6f9d58b851a7e00da3fe6e891c10e8a6f068d/jsbeautifier-1.14.7-py3-none-any.whl
Collecting datasets==2.9.0
  Using cached datasets-2.9.0-py3-none-any.whl (462 kB)
ERROR: Could not find a version that satisfies the requirement torch==2.2.1.0+cpu (from -r /home/marty/Documents/tt-metal/tt_metal/python_env/requirements-dev.txt (line 30)) (from versions: none)
ERROR: No matching distribution found for torch==2.2.1.0+cpu (from -r /home/marty/Documents/tt-metal/tt_metal/python_env/requirements-dev.txt (line 30))

To Reproduce

Steps to reproduce the behavior:

  1. Delete local package cache and existing venv.
  2. Run create_venv.sh
  3. Observe the error

Expected behavior: The environment is created successfully.


dmakoviichuk-tt commented 2 months ago

Could you check if using the latest pip fixes this issue? As I can see, you are downgrading it.

Collecting pip==20.1.1
  Using cached pip-20.1.1-py2.py3-none-any.whl (1.5 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 22.0.2
    Uninstalling pip-22.0.2:
      Successfully uninstalled pip-22.0.2
marty1885 commented 2 months ago

As I can see, you are downgrading it.

@dmakoviichuk-tt pip==20.1.1 is enforced by create_venv.sh itself: https://github.com/tenstorrent/tt-metal/blob/f188d4528457dc3e8d33f8867bd02880a0aea2fc/create_venv.sh#L28-L29
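For reference, the pinned-pip step there looks roughly like this (a paraphrase of the linked lines, not the exact script contents; variable names may differ):

 # create_venv.sh (paraphrased): force the venv onto an old pip "that will work with our view of editable installs"
 pip install --force-reinstall pip==20.1.1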

Disabling the forced install of the old pip and doing a clean install works! However, importing torch now fails with a NumPy version error.

Python 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>> print(numpy.__file__)
/home/marty/Documents/tt-metal/python_env/lib/python3.10/site-packages/numpy/__init__.py
>>> print(numpy.version.full_version)
2.0.1
>>> import torch

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.0.1 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/marty/Documents/tt-metal/python_env/lib/python3.10/site-packages/torch/__init__.py", line 1477, in <module>
    from .functional import *  # noqa: F403
  File "/home/marty/Documents/tt-metal/python_env/lib/python3.10/site-packages/torch/functional.py", line 9, in <module>
    import torch.nn.functional as F
  File "/home/marty/Documents/tt-metal/python_env/lib/python3.10/site-packages/torch/nn/__init__.py", line 1, in <module>
    from .modules import *  # noqa: F403
  File "/home/marty/Documents/tt-metal/python_env/lib/python3.10/site-packages/torch/nn/modules/__init__.py", line 35, in <module>
    from .transformer import TransformerEncoder, TransformerDecoder, \
  File "/home/marty/Documents/tt-metal/python_env/lib/python3.10/site-packages/torch/nn/modules/transformer.py", line 20, in <module>
    device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),
/home/marty/Documents/tt-metal/python_env/lib/python3.10/site-packages/torch/nn/modules/transformer.py:20: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
  device: torch.device = torch.device(torch._C._get_default_device()),  # torch.device('cpu'),
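Following the warning's own suggestion, the simplest workaround here (assuming nothing else in this venv needs NumPy 2) would be to pin NumPy back below 2.0 inside the environment:

 # Downgrade NumPy so the torch wheel (built against NumPy 1.x) can initialize it
 pip install "numpy<2"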
marty1885 commented 2 months ago

@dmakoviichuk-tt is there any information I can provide to help create a solution to this situation?

marty1885 commented 2 months ago

I don't know what changed, but I am able to create a working env now... Closing.

marty1885 commented 2 months ago

NVM, reopening. I replicated the issue by wiping the existing cache and env.

marty1885 commented 2 months ago

Hi, I figured out something fun. create_venv.sh forces an install of pip==20.1.1, but I have pip==24.2 locally. 20.1.1 fails to find and download PyTorch, while 24.2 does download and install it, but then the TTNN installation is broken: importing ttnn does absolutely nothing and ttnn.__file__ is None.

To install TTNN I have to perform a rather crazy maneuver (shown in the video below).

I've uploaded a video showing the bug. The file is too large to be uploaded as an attachment to GitHub. Feel free to contact me for the source footage. https://youtu.be/6Z0k0nHk5nE

Update: I can replicate the issue on Arch Linux (with much effort, due to distro differences), so it is not an Ubuntu-only issue. More than likely something changed in pip or torch.

namhyeong-kim commented 2 months ago

It still fails to install PyTorch in a TT-cloud Ubuntu 22.04 virtual machine.

JushBJJ commented 2 months ago

I'm having the same issue on my internal 22.04 Docker builds. As Marty said before, removing this line works:

 pip install --force-reinstall pip==20.1.1

I think we have to bump the minimum pip version to around pip==22.0.2; that's what worked for me.
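A rough sketch of what that change could look like in create_venv.sh (the exact line and surrounding variables in the script may differ):

 # Instead of force-reinstalling the old pip, require a reasonably new one in the venv:
 pip install --upgrade "pip>=22.0.2"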

tt-rkim commented 1 month ago

Sorry for the delay and the confusion. I'll put an explanation here on why we enforce this version.

tl;dr: Has to do with editable installs.

For reasons I don't fully understand yet, newer pip versions (22.0 and up) seem to not do editable installs the way we think of them from previous pip versions, causing import errors in a development environment. What I think is specifically the problem is that the .egg-link file is not created in the virtual environment's packages for metal-libs with the higher pip versions.

This is important in development because our developers depend on editable installs working: they want to be able to make changes in the Python code (irrelevant for C++) and see the results immediately. pip install -e . is how you do this, which is called editable mode.

I think this is the relevant PEP here: https://peps.python.org/pep-0660/. I'll be investigating this more with the team.
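A quick way to see which editable-install mechanism a given pip produced is to check what landed in site-packages and whether the package resolves to a real path (the paths below assume the python_env venv and Python 3.10 shown in the logs above):

 # Legacy editable installs leave a *.egg-link file; PEP 660 installs leave __editable__* files instead
 ls python_env/lib/python3.10/site-packages | grep -i -E "egg-link|__editable__"
 # A broken editable install imports ttnn as an empty namespace package:
 python -c "import ttnn; print(ttnn.__file__)"   # None means the install is broken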

We made this change as part of this PR (bumped up to 21 later after testing more pip versions): https://github.com/tenstorrent/tt-metal/pull/10751

As an unblocker for now while we figure this out, I would recommend the following when invoking create_venv.sh (a rough shell sketch of these steps follows the list):

  1. Delete the force install of the lower-version pip
  2. Build the stack with build_metal.sh + invoke create_venv.sh
  3. Set PYTHONPATH to <repo-dir>:<repo-dir>/ttnn

Then "editable" install should work. I will lower this to P2 for now.

Let us know if you have any further questions.

If you guys have any suggestions, please feel free to offer them. Even if these problems didn't happen for you guys, I don't like that we have to pin the pip version. I would like to solve this, as well.

tt-rkim commented 1 month ago

One of our engineers may have found the commit that changed editable installs. We are going to be looking at this as part of our Python upgrade: https://github.com/tenstorrent/tt-metal/pull/10841#issuecomment-2345016782

TT-billteng commented 1 month ago

Using pip==21.2.4 works on my 22.04 VM, and also in 20.04 CI!

(python_env) ubuntu@tt-metal-billteng-2204-n300:~/tt-metal$ python3
Python 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>> import ttnn
2024-09-14 01:58:20.883 | DEBUG    | ttnn:<module>:82 - Initial ttnn.CONFIG:
Config{cache_path=/home/ubuntu/.cache/ttnn,model_cache_path=/home/ubuntu/.cache/ttnn/models,tmp_dir=/tmp/ttnn,enable_model_cache=false,enable_fast_runtime_mode=true,throw_exception_on_fallback=false,enable_logging=false,enable_graph_report=false,enable_detailed_buffer_report=false,enable_detailed_tensor_report=false,enable_comparison_mode=false,comparison_mode_pcc=0.9999,root_report_path=generated/ttnn/reports,report_name=std::nullopt,std::nullopt}
2024-09-14 01:58:22.018 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.pearson_correlation_coefficient be migrated to C++?
2024-09-14 01:58:22.020 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.Conv1d be migrated to C++?
2024-09-14 01:58:22.028 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.conv2d be migrated to C++?
2024-09-14 01:58:22.032 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.reshape be migrated to C++?
2024-09-14 01:58:22.032 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.unsqueeze_to_4D be migrated to C++?
2024-09-14 01:58:22.032 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.squeeze be migrated to C++?
2024-09-14 01:58:22.032 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.from_torch be migrated to C++?
2024-09-14 01:58:22.033 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.to_torch be migrated to C++?
2024-09-14 01:58:22.033 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.to_device be migrated to C++?
2024-09-14 01:58:22.033 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.from_device be migrated to C++?
2024-09-14 01:58:22.033 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.allocate_tensor_on_device be migrated to C++?
2024-09-14 01:58:22.033 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.copy_host_to_device_tensor be migrated to C++?
2024-09-14 01:58:22.033 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.deallocate be migrated to C++?
2024-09-14 01:58:22.033 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.reallocate be migrated to C++?
2024-09-14 01:58:22.033 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.load_tensor be migrated to C++?
2024-09-14 01:58:22.034 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.dump_tensor be migrated to C++?
2024-09-14 01:58:22.034 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.as_tensor be migrated to C++?
2024-09-14 01:58:22.036 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.avg_pool2d be migrated to C++?
2024-09-14 01:58:22.039 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.conv2d be migrated to C++?
2024-09-14 01:58:22.039 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.avg_pool2d be migrated to C++?
2024-09-14 01:58:22.040 | WARNING  | ttnn.decorators:operation_decorator:768 - Should ttnn.Conv1d be migrated to C++?
>>> import torch
>>>