Open TJ-Solergibert opened 8 months ago
I've just found out that it works IF YOU INSTALL the dependencies as described in point 1 of this post. I ran the following to set up the environment:
pip install "torch==2.1.2" tensorboard
python -m pip install .
pip uninstall transformer-engine # I got errors, I'm working with A100s
pip install --upgrade \
"transformers==4.38.2" \
"datasets==2.16.1" \
"accelerate==0.26.1" \
"evaluate==0.4.1" \
"bitsandbytes==0.42.0" \
"trl==0.7.11" \
"peft==0.8.2"
pip install ninja packaging
MAX_JOBS=4 pip install flash-attn --no-build-isolation --upgrade
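For anyone trying to reproduce this environment, a quick sanity check along these lines might help confirm that the pins actually took effect after the install. It is only a sketch: the `PINNED` table is copied from the commands above, while the helper names `installed_version` and `find_mismatches` are made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the pip commands above.
PINNED = {
    "torch": "2.1.2",
    "transformers": "4.38.2",
    "datasets": "2.16.1",
    "accelerate": "0.26.1",
    "evaluate": "0.4.1",
    "bitsandbytes": "0.42.0",
    "trl": "0.7.11",
    "peft": "0.8.2",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package that deviates from its pin to (expected, found)."""
    mismatches = {}
    for name, expected in pinned.items():
        found = lookup(name)
        # A local build tag such as "2.1.2+cu121" still counts as a match.
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```

If the script prints nothing, every pinned package is installed at the expected version. flash-attn is left out on purpose because the command above does not pin it.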
And the complete list of dependencies:
absl-py 2.0.0
accelerate 0.26.1
aiohttp 3.8.5
aiosignal 1.3.1
alignment-handbook 0.4.0.dev0
annotated-types 0.5.0
apex 0.1
appdirs 1.4.4
argon2-cffi 23.1.0
argon2-cffi-bindings 21.2.0
asttokens 2.4.0
astunparse 1.6.3
async-timeout 4.0.3
attrs 23.1.0
audioread 3.0.1
backcall 0.2.0
beautifulsoup4 4.12.2
bitsandbytes 0.42.0
bleach 6.0.0
blis 0.7.11
cachetools 5.3.1
catalogue 2.0.10
certifi 2023.7.22
cffi 1.16.0
charset-normalizer 3.2.0
click 8.1.6
cloudpathlib 0.15.1
cloudpickle 2.2.1
cmake 3.27.6
comm 0.1.4
confection 0.1.3
contourpy 1.1.1
cubinlinker 0.3.0+2.gce0680b
cuda-python 12.2.0rc5+5.g84845d1
cudf 23.8.0
cugraph 23.8.0
cugraph-dgl 23.8.0
cugraph-service-client 23.8.0
cugraph-service-server 23.8.0
cuml 23.8.0
cupy-cuda12x 12.1.0
cycler 0.12.1
cymem 2.0.8
Cython 3.0.3
dask 2023.7.1
dask-cuda 23.8.0
dask-cudf 23.8.0
datasets 2.16.1
debugpy 1.8.0
decorator 5.1.1
deepspeed 0.12.2
defusedxml 0.7.1
dill 0.3.7
distributed 2023.7.1
dm-tree 0.1.8
docker-pycreds 0.4.0
docstring-parser 0.15
einops 0.7.0
evaluate 0.4.1
exceptiongroup 1.1.3
execnet 2.0.2
executing 2.0.0
expecttest 0.1.3
fastjsonschema 2.18.1
fastrlock 0.8.1
filelock 3.12.4
flash-attn 2.5.6
fonttools 4.43.1
frozenlist 1.4.0
fsspec 2023.6.0
gast 0.5.4
gitdb 4.0.11
GitPython 3.1.40
google-auth 2.23.2
google-auth-oauthlib 0.4.6
graphsurgeon 0.4.6
grpcio 1.59.0
hf_transfer 0.1.6
hjson 3.1.0
huggingface-hub 0.21.4
hypothesis 5.35.1
idna 3.4
importlib-metadata 6.8.0
iniconfig 2.0.0
intel-openmp 2021.4.0
ipykernel 6.25.2
ipython 8.16.1
ipython-genutils 0.2.0
ipywidgets 8.1.1
jedi 0.19.1
Jinja2 3.1.2
joblib 1.3.2
json5 0.9.14
jsonschema 4.19.1
jsonschema-specifications 2023.7.1
jupyter 1.0.0
jupyter_client 8.3.1
jupyter-console 6.6.3
jupyter_core 5.3.2
jupyter-tensorboard 0.2.0
jupyterlab 2.3.2
jupyterlab-pygments 0.2.2
jupyterlab-server 1.2.0
jupyterlab-widgets 3.0.9
jupytext 1.15.2
kiwisolver 1.4.5
langcodes 3.3.0
librosa 0.9.2
lit 17.0.6
llvmlite 0.40.1
locket 1.0.0
Markdown 3.4.4
markdown-it-py 3.0.0
MarkupSafe 2.1.3
matplotlib 3.8.0
matplotlib-inline 0.1.6
mdit-py-plugins 0.4.0
mdurl 0.1.2
mistune 3.0.2
mkl 2021.1.1
mkl-devel 2021.1.1
mkl-include 2021.1.1
mock 5.1.0
mpmath 1.3.0
msgpack 1.0.5
multidict 6.0.4
multiprocess 0.70.15
munch 4.0.0
murmurhash 1.0.10
nbclient 0.8.0
nbconvert 7.9.2
nbformat 5.9.2
nest-asyncio 1.5.8
networkx 2.6.3
ninja 1.11.1.1
notebook 6.4.10
numba 0.57.1+1.g5fba9aa8f
numpy 1.26.4
nvfuser 0.0.20+gitunknown
nvidia-cublas-cu11 11.10.3.66
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu11 8.5.0.96
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu11 10.9.0.58
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu11 10.2.10.91
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu11 11.7.4.91
nvidia-cusparse-cu12 12.1.0.106
nvidia-dali-cuda120 1.30.0
nvidia-nccl-cu11 2.14.3
nvidia-nccl-cu12 2.18.1
nvidia-nvjitlink-cu12 12.4.99
nvidia-nvtx-cu11 11.7.91
nvidia-nvtx-cu12 12.1.105
nvidia-pyindex 1.0.9
nvtx 0.2.5
oauthlib 3.2.2
onnx 1.14.0
opencv 4.7.0
packaging 23.1
pandas 1.5.3
pandocfilters 1.5.0
parso 0.8.3
partd 1.4.0
pathy 0.10.2
peft 0.8.2
pexpect 4.8.0
pickleshare 0.7.5
Pillow 9.2.0
pip 23.3.2
platformdirs 3.11.0
pluggy 1.3.0
ply 3.11
polygraphy 0.49.0
pooch 1.7.0
preshed 3.0.9
prettytable 3.9.0
prometheus-client 0.17.1
prompt-toolkit 3.0.39
protobuf 3.20.2
psutil 5.9.4
ptxcompiler 0.8.1+1.g2cb1b35
ptyprocess 0.7.0
pure-eval 0.2.2
py-cpuinfo 9.0.0
pyarrow 11.0.0
pyarrow-hotfix 0.6
pyasn1 0.5.0
pyasn1-modules 0.3.0
pybind11 2.11.1
pybind11-global 2.11.1
pycocotools 2.0+nv0.7.3
pycparser 2.21
pydantic 1.10.13
pydantic_core 2.10.1
Pygments 2.16.1
pylibcugraph 23.8.0
pylibcugraphops 23.8.0
pylibraft 23.8.0
pynvml 11.4.1
pyparsing 3.1.1
pytest 7.4.2
pytest-flakefinder 1.1.0
pytest-rerunfailures 12.0
pytest-shard 0.1.2
pytest-xdist 3.3.1
python-dateutil 2.8.2
python-hostlist 1.23.0
pytorch-quantization 2.1.2
pytz 2023.3
PyYAML 6.0.1
pyzmq 25.1.1
qtconsole 5.5.1
QtPy 2.4.1
raft-dask 23.8.0
referencing 0.30.2
regex 2023.10.3
requests 2.31.0
requests-oauthlib 1.3.1
resampy 0.4.2
responses 0.18.0
rich 13.7.1
rmm 23.8.0
rpds-py 0.10.4
rsa 4.9
safetensors 0.4.2
scikit-learn 1.2.0
scipy 1.11.1
seaborn 0.13.1
Send2Trash 1.8.2
sentencepiece 0.1.99
sentry-sdk 1.39.1
setproctitle 1.3.3
setuptools 69.0.3
shtab 1.7.1
six 1.16.0
smart-open 6.4.0
smmap 5.0.1
sortedcontainers 2.4.0
soundfile 0.12.1
soupsieve 2.5
spacy 3.7.1
spacy-legacy 3.0.12
spacy-loggers 1.0.5
sphinx-glpi-theme 0.3
srsly 2.4.8
stack-data 0.6.3
sympy 1.12
tabulate 0.9.0
tbb 2021.10.0
tblib 2.0.0
tensorboard 2.9.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tensorrt 8.6.1
terminado 0.17.1
thinc 8.2.1
threadpoolctl 3.2.0
thriftpy2 0.4.16
tinycss2 1.2.1
tokenizers 0.15.2
toml 0.10.2
tomli 2.0.1
toolz 0.12.0
torch 2.1.2
tornado 6.3.3
tqdm 4.66.1
traitlets 5.9.0
transformers 4.38.2
treelite 3.2.0
treelite-runtime 3.2.0
triton 2.1.0
trl 0.7.11
typer 0.9.0
types-dataclasses 0.6.6
typing_extensions 4.7.1
tyro 0.7.3
ucx-py 0.33.0
uff 0.6.9
urllib3 1.26.16
wandb 0.16.1
wasabi 1.1.2
wcwidth 0.2.8
weasel 0.3.2
webencodings 0.5.1
Werkzeug 3.0.0
wheel 0.41.2
widgetsnbextension 4.0.9
xdoctest 1.0.2
xgboost 1.7.5
xxhash 3.4.1
yarl 1.9.2
zict 3.0.0
zipp 3.16.2
I'm not able to run Zephyr 7B Gemma with 4 80GB A100s. I get the following error:
After running:
As can be seen, I've only modified num_processes, and I also tested zero3_init_flag: false.
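To make the two changes concrete, here is a rough sketch of them applied to a config held as a Python dict. The key layout (top-level num_processes, zero3_init_flag nested under deepspeed_config) follows the usual accelerate DeepSpeed config format. The starting values in `base`, the target of 4 processes, and the helper name `override_for_gpus` are assumptions for illustration, not the actual file used here.

```python
import copy


def override_for_gpus(config, num_gpus=4, zero3_init=False):
    """Return a copy of an accelerate-style config with the two settings changed."""
    cfg = copy.deepcopy(config)  # leave the caller's dict untouched
    cfg["num_processes"] = num_gpus
    cfg.setdefault("deepspeed_config", {})["zero3_init_flag"] = zero3_init
    return cfg


# Placeholder starting values (assumed, not taken from the real config file).
base = {
    "distributed_type": "DEEPSPEED",
    "num_processes": 8,
    "deepspeed_config": {"zero_stage": 3, "zero3_init_flag": True},
}

patched = override_for_gpus(base)
print(patched["num_processes"], patched["deepspeed_config"]["zero3_init_flag"])
```

Every other key is carried over unchanged, which matches the intent of touching only these two settings.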
I've seen the related issue (#57), but none of the solutions suggested there work for me.
Hope we find a solution soon for the members of the 4 GPU cluster club! 🤗