pytorch / torchtitan

A native PyTorch Library for large model training
BSD 3-Clause "New" or "Revised" License

replace old torch dependency in requirements.txt #372

Closed tianyu-l closed 1 month ago

tianyu-l commented 1 month ago

Stack from ghstack (oldest at bottom):

torch 2.2.0.dev is too stale to be useful. For CI, since we will install nightly anyway, this avoids storing the old version in the docker image.

update: per @wanchaol's comment, adding 2.3.0 as it's the latest stable version.

some comments:

  1. Putting 2.4.0.dev there requires --index-url https://download.pytorch.org/whl/nightly/cu121. We don't want to specify cu121 since different people might have different CUDA support. However, if we remove that, as @awgu previously explored, pip will try to download everything and then select. So we'd rather let users install the latest nightly themselves.
  2. Installing a stable torch version like 2.3.0 will install triton, whereas installing a nightly 2.4.0.dev version will install pytorch-triton. This may be what caused the CI failure when we removed the torch dependency from requirements.txt.
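
To check which of the two triton packages from point 2 ended up in an environment, something like the sketch below could be used. It only queries installed distribution metadata through the standard library; the helper name `installed_triton_flavor` is made up for illustration and is not part of torchtitan or its CI.

```python
from importlib import metadata


def installed_triton_flavor():
    """Return {distribution name: version} for the triton packages that are installed.

    Stable torch wheels depend on "triton", nightly wheels on "pytorch-triton"
    (as described above); an empty dict means neither was found.
    """
    found = {}
    for name in ("triton", "pytorch-triton"):
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # This distribution is not installed in the current environment.
            pass
    return found


if __name__ == "__main__":
    print(installed_triton_flavor() or "no triton distribution found")
```

Running this before and after the nightly reinstall in CI would show whether a leftover triton from the stable install is still present.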

wconstab commented 1 month ago

agree, using 2.3.0 is a bit misleading, as torchtitan will not work with torch 2.3.0 (many of the features/APIs used by torchtitan were updated in or added to torch recently).

we could freeze this to torch 2.4.0 after the release if we want to 'release torchtitan stably', and leave it there for some time. (but our CI would still need to install latest nightly to allow development).

Also, it's a bit unfortunate that we couple together what the docker build installs and what the end-user installs. We actually don't want torch installed in the docker image, since ultimately we'll have to uninstall torch and reinstall the latest nightly every time we run the tests.

One avenue here might be to just remove torch from requirements.txt, and do an import with try/except in torchtitan that gives some helpful info about whether torch is missing (install it) or is the wrong version (which versions are ok)?
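
A rough sketch of what that guarded import could look like is below. This is an illustration only, not code from torchtitan: `MIN_TORCH`, `release_tuple`, and `check_torch` are hypothetical names, and the minimum version is a placeholder. Only the leading numeric release is compared, so a 2.4.0.dev nightly counts as 2.4.0.

```python
import re

# Hypothetical minimum; the real threshold would be whatever torchtitan needs.
MIN_TORCH = (2, 4, 0)


def release_tuple(version):
    """Extract the leading numeric release, ignoring .dev and +cuXXX suffixes."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", str(version))
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # A missing patch component (e.g. "2.4") is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def check_torch(min_version=MIN_TORCH):
    """Import torch and return its version, or raise ImportError with a hint."""
    wanted = ".".join(map(str, min_version))
    try:
        import torch
    except ImportError as exc:
        raise ImportError(
            f"torch is not installed; please install torch >= {wanted} "
            "(a recent nightly build) for your CUDA version."
        ) from exc
    if release_tuple(torch.__version__) < min_version:
        raise ImportError(
            f"found torch {torch.__version__}, but torch >= {wanted} "
            "is required; please upgrade to a recent nightly build."
        )
    return torch.__version__
```

Calling `check_torch()` once at package import time would surface either message before any torch API is touched.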