Closed — tianyu-l closed this 1 month ago
agree, using 2.3.0 is a bit misleading, as torchtitan will not work with torch 2.3.0 (many of the features/APIs used by torchtitan were added or updated in torch only recently).
we could pin this to torch 2.4.0 after the release if we want to "release torchtitan stably", and leave it there for some time (but our CI would still need to install the latest nightly to allow development).
Also, it's a bit unfortunate that we couple what the docker build installs with what the end user installs. We actually don't want torch installed in the docker image, since ultimately we'll have to uninstall torch and reinstall the latest nightly every time we run the tests.
One avenue here might be to just remove torch from requirements.txt, and do the import with try/except in torchtitan, giving some helpful info about whether torch is missing (install it) or the wrong version (which versions are OK).
Stack from ghstack (oldest at bottom):
torch 2.2.0.dev is too stale to be useful. For CI, since we will install nightly anyway, this avoids storing the old version in the docker image.

update: per @wanchaol's comment, adding 2.3.0 as it's the latest stable version.

some comments:
- 2.4.0.dev there requires `--index-url https://download.pytorch.org/whl/nightly/cu121`. We don't want to specify `cu121` since different people might have different CUDA support. However, if we remove that, as @awgu previously explored, pip will try to download every candidate wheel and then select one. So we'd rather let the user install the latest nightly themselves.
- a stable `torch` version like 2.3.0 will install `triton`, whereas installing a nightly 2.4.0.dev version will install `pytorch-triton`. This might be the reason the CI failed when we removed the `torch` dependency from `requirements.txt`.
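One way to see which Triton distribution a given environment actually ended up with is to probe package metadata; a small sketch (the function name is illustrative, and the `triton`/`pytorch-triton` split is taken from the comment above):

```python
from importlib import metadata


def find_triton_dist():
    # Per the comments above: stable torch (e.g. 2.3.0) pulls in "triton",
    # while a nightly 2.4.0.dev build pulls in "pytorch-triton" instead.
    # Returns (name, version) of whichever is installed, or (None, None).
    for name in ("triton", "pytorch-triton"):
        try:
            return name, metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
    return None, None
```

A CI step could log this to make mismatches like the one suspected above easier to diagnose.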