huggingface / alignment-handbook

Robust recipes to align language models with human and AI preferences
https://huggingface.co/HuggingFaceH4
Apache License 2.0

Not able to run Zephyr 7B Gemma with 4 80GB A100s #132

Open TJ-Solergibert opened 8 months ago

TJ-Solergibert commented 8 months ago

I'm not able to run Zephyr 7B Gemma with 4 80GB A100s. I get the following error:

RuntimeError: The size of tensor a (0) must match the size of tensor b (24576) at non-singleton dimension 1

After running:

ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml scripts/run_sft.py recipes/zephyr-7b-gemma/sft/config_full.yaml

As shown in the config below, I only changed num_processes, and I also tested with zero3_init_flag: false:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

I've seen the related issue (#57), but none of the solutions there worked for me.

Hope we find a solution soon for the members of the 4 GPU cluster club! 🤗

TJ-Solergibert commented 8 months ago

I've just found out that it works if you install the dependencies as described in point 1 of this post. I ran the following to set up the environment:

pip install "torch==2.1.2" tensorboard
python -m pip install .
pip uninstall transformer-engine # it caused errors for me; I'm working with A100s
pip install --upgrade \
  "transformers==4.38.2" \
  "datasets==2.16.1" \
  "accelerate==0.26.1" \
  "evaluate==0.4.1" \
  "bitsandbytes==0.42.0" \
  "trl==0.7.11" \
  "peft==0.8.2"

pip install ninja packaging
MAX_JOBS=4 pip install flash-attn --no-build-isolation --upgrade

And the complete list of dependencies:

absl-py                   2.0.0
accelerate                0.26.1
aiohttp                   3.8.5
aiosignal                 1.3.1
alignment-handbook        0.4.0.dev0
annotated-types           0.5.0
apex                      0.1
appdirs                   1.4.4
argon2-cffi               23.1.0
argon2-cffi-bindings      21.2.0
asttokens                 2.4.0
astunparse                1.6.3
async-timeout             4.0.3
attrs                     23.1.0
audioread                 3.0.1
backcall                  0.2.0
beautifulsoup4            4.12.2
bitsandbytes              0.42.0
bleach                    6.0.0
blis                      0.7.11
cachetools                5.3.1
catalogue                 2.0.10
certifi                   2023.7.22
cffi                      1.16.0
charset-normalizer        3.2.0
click                     8.1.6
cloudpathlib              0.15.1
cloudpickle               2.2.1
cmake                     3.27.6
comm                      0.1.4
confection                0.1.3
contourpy                 1.1.1
cubinlinker               0.3.0+2.gce0680b
cuda-python               12.2.0rc5+5.g84845d1
cudf                      23.8.0
cugraph                   23.8.0
cugraph-dgl               23.8.0
cugraph-service-client    23.8.0
cugraph-service-server    23.8.0
cuml                      23.8.0
cupy-cuda12x              12.1.0
cycler                    0.12.1
cymem                     2.0.8
Cython                    3.0.3
dask                      2023.7.1
dask-cuda                 23.8.0
dask-cudf                 23.8.0
datasets                  2.16.1
debugpy                   1.8.0
decorator                 5.1.1
deepspeed                 0.12.2
defusedxml                0.7.1
dill                      0.3.7
distributed               2023.7.1
dm-tree                   0.1.8
docker-pycreds            0.4.0
docstring-parser          0.15
einops                    0.7.0
evaluate                  0.4.1
exceptiongroup            1.1.3
execnet                   2.0.2
executing                 2.0.0
expecttest                0.1.3
fastjsonschema            2.18.1
fastrlock                 0.8.1
filelock                  3.12.4
flash-attn                2.5.6
fonttools                 4.43.1
frozenlist                1.4.0
fsspec                    2023.6.0
gast                      0.5.4
gitdb                     4.0.11
GitPython                 3.1.40
google-auth               2.23.2
google-auth-oauthlib      0.4.6
graphsurgeon              0.4.6
grpcio                    1.59.0
hf_transfer               0.1.6
hjson                     3.1.0
huggingface-hub           0.21.4
hypothesis                5.35.1
idna                      3.4
importlib-metadata        6.8.0
iniconfig                 2.0.0
intel-openmp              2021.4.0
ipykernel                 6.25.2
ipython                   8.16.1
ipython-genutils          0.2.0
ipywidgets                8.1.1
jedi                      0.19.1
Jinja2                    3.1.2
joblib                    1.3.2
json5                     0.9.14
jsonschema                4.19.1
jsonschema-specifications 2023.7.1
jupyter                   1.0.0
jupyter_client            8.3.1
jupyter-console           6.6.3
jupyter_core              5.3.2
jupyter-tensorboard       0.2.0
jupyterlab                2.3.2
jupyterlab-pygments       0.2.2
jupyterlab-server         1.2.0
jupyterlab-widgets        3.0.9
jupytext                  1.15.2
kiwisolver                1.4.5
langcodes                 3.3.0
librosa                   0.9.2
lit                       17.0.6
llvmlite                  0.40.1
locket                    1.0.0
Markdown                  3.4.4
markdown-it-py            3.0.0
MarkupSafe                2.1.3
matplotlib                3.8.0
matplotlib-inline         0.1.6
mdit-py-plugins           0.4.0
mdurl                     0.1.2
mistune                   3.0.2
mkl                       2021.1.1
mkl-devel                 2021.1.1
mkl-include               2021.1.1
mock                      5.1.0
mpmath                    1.3.0
msgpack                   1.0.5
multidict                 6.0.4
multiprocess              0.70.15
munch                     4.0.0
murmurhash                1.0.10
nbclient                  0.8.0
nbconvert                 7.9.2
nbformat                  5.9.2
nest-asyncio              1.5.8
networkx                  2.6.3
ninja                     1.11.1.1
notebook                  6.4.10
numba                     0.57.1+1.g5fba9aa8f
numpy                     1.26.4
nvfuser                   0.0.20+gitunknown
nvidia-cublas-cu11        11.10.3.66
nvidia-cublas-cu12        12.1.3.1
nvidia-cuda-cupti-cu11    11.7.101
nvidia-cuda-cupti-cu12    12.1.105
nvidia-cuda-nvrtc-cu11    11.7.99
nvidia-cuda-nvrtc-cu12    12.1.105
nvidia-cuda-runtime-cu11  11.7.99
nvidia-cuda-runtime-cu12  12.1.105
nvidia-cudnn-cu11         8.5.0.96
nvidia-cudnn-cu12         8.9.2.26
nvidia-cufft-cu11         10.9.0.58
nvidia-cufft-cu12         11.0.2.54
nvidia-curand-cu11        10.2.10.91
nvidia-curand-cu12        10.3.2.106
nvidia-cusolver-cu11      11.4.0.1
nvidia-cusolver-cu12      11.4.5.107
nvidia-cusparse-cu11      11.7.4.91
nvidia-cusparse-cu12      12.1.0.106
nvidia-dali-cuda120       1.30.0
nvidia-nccl-cu11          2.14.3
nvidia-nccl-cu12          2.18.1
nvidia-nvjitlink-cu12     12.4.99
nvidia-nvtx-cu11          11.7.91
nvidia-nvtx-cu12          12.1.105
nvidia-pyindex            1.0.9
nvtx                      0.2.5
oauthlib                  3.2.2
onnx                      1.14.0
opencv                    4.7.0
packaging                 23.1
pandas                    1.5.3
pandocfilters             1.5.0
parso                     0.8.3
partd                     1.4.0
pathy                     0.10.2
peft                      0.8.2
pexpect                   4.8.0
pickleshare               0.7.5
Pillow                    9.2.0
pip                       23.3.2
platformdirs              3.11.0
pluggy                    1.3.0
ply                       3.11
polygraphy                0.49.0
pooch                     1.7.0
preshed                   3.0.9
prettytable               3.9.0
prometheus-client         0.17.1
prompt-toolkit            3.0.39
protobuf                  3.20.2
psutil                    5.9.4
ptxcompiler               0.8.1+1.g2cb1b35
ptyprocess                0.7.0
pure-eval                 0.2.2
py-cpuinfo                9.0.0
pyarrow                   11.0.0
pyarrow-hotfix            0.6
pyasn1                    0.5.0
pyasn1-modules            0.3.0
pybind11                  2.11.1
pybind11-global           2.11.1
pycocotools               2.0+nv0.7.3
pycparser                 2.21
pydantic                  1.10.13
pydantic_core             2.10.1
Pygments                  2.16.1
pylibcugraph              23.8.0
pylibcugraphops           23.8.0
pylibraft                 23.8.0
pynvml                    11.4.1
pyparsing                 3.1.1
pytest                    7.4.2
pytest-flakefinder        1.1.0
pytest-rerunfailures      12.0
pytest-shard              0.1.2
pytest-xdist              3.3.1
python-dateutil           2.8.2
python-hostlist           1.23.0
pytorch-quantization      2.1.2
pytz                      2023.3
PyYAML                    6.0.1
pyzmq                     25.1.1
qtconsole                 5.5.1
QtPy                      2.4.1
raft-dask                 23.8.0
referencing               0.30.2
regex                     2023.10.3
requests                  2.31.0
requests-oauthlib         1.3.1
resampy                   0.4.2
responses                 0.18.0
rich                      13.7.1
rmm                       23.8.0
rpds-py                   0.10.4
rsa                       4.9
safetensors               0.4.2
scikit-learn              1.2.0
scipy                     1.11.1
seaborn                   0.13.1
Send2Trash                1.8.2
sentencepiece             0.1.99
sentry-sdk                1.39.1
setproctitle              1.3.3
setuptools                69.0.3
shtab                     1.7.1
six                       1.16.0
smart-open                6.4.0
smmap                     5.0.1
sortedcontainers          2.4.0
soundfile                 0.12.1
soupsieve                 2.5
spacy                     3.7.1
spacy-legacy              3.0.12
spacy-loggers             1.0.5
sphinx-glpi-theme         0.3
srsly                     2.4.8
stack-data                0.6.3
sympy                     1.12
tabulate                  0.9.0
tbb                       2021.10.0
tblib                     2.0.0
tensorboard               2.9.0
tensorboard-data-server   0.6.1
tensorboard-plugin-wit    1.8.1
tensorrt                  8.6.1
terminado                 0.17.1
thinc                     8.2.1
threadpoolctl             3.2.0
thriftpy2                 0.4.16
tinycss2                  1.2.1
tokenizers                0.15.2
toml                      0.10.2
tomli                     2.0.1
toolz                     0.12.0
torch                     2.1.2
tornado                   6.3.3
tqdm                      4.66.1
traitlets                 5.9.0
transformers              4.38.2
treelite                  3.2.0
treelite-runtime          3.2.0
triton                    2.1.0
trl                       0.7.11
typer                     0.9.0
types-dataclasses         0.6.6
typing_extensions         4.7.1
tyro                      0.7.3
ucx-py                    0.33.0
uff                       0.6.9
urllib3                   1.26.16
wandb                     0.16.1
wasabi                    1.1.2
wcwidth                   0.2.8
weasel                    0.3.2
webencodings              0.5.1
Werkzeug                  3.0.0
wheel                     0.41.2
widgetsnbextension        4.0.9
xdoctest                  1.0.2
xgboost                   1.7.5
xxhash                    3.4.1
yarl                      1.9.2
zict                      3.0.0
zipp                      3.16.2
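
For anyone comparing their own environment against the pinned versions above, a small check like the following might help. This is only a sketch: the `PINS` dict is copied from the pip commands in this comment, and the helper name `check_pins` is hypothetical (it is not part of alignment-handbook). It only reports mismatches; it does not confirm which package was responsible for the original error.

```python
from importlib import metadata

# Pins copied from the pip commands in this comment.
PINS = {
    "torch": "2.1.2",
    "transformers": "4.38.2",
    "datasets": "2.16.1",
    "accelerate": "0.26.1",
    "evaluate": "0.4.1",
    "bitsandbytes": "0.42.0",
    "trl": "0.7.11",
    "peft": "0.8.2",
}


def check_pins(pins, get_version=metadata.version):
    """Return {package: (expected, installed_or_None)} for every mismatch.

    `get_version` is injectable so the function can be exercised without
    the packages being installed (hypothetical helper, for illustration).
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Local build tags such as "+cu121" are ignored in the comparison.
        if installed is None or installed.split("+")[0] != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_pins(PINS).items():
        print(f"{name}: expected {expected}, found {installed}")
```

Running it in the training environment prints one line per package whose installed version differs from the pin, and nothing if everything matches.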