TJ-Solergibert opened 8 months ago
I've just found out that it works IF YOU INSTALL the dependencies as described in point 1 of this post. I ran the following to set up the environment:
pip install "torch==2.1.2" tensorboard
python -m pip install .
pip uninstall transformer-engine # I got errors, I'm working with A100s
pip install --upgrade \
"transformers==4.38.2" \
"datasets==2.16.1" \
"accelerate==0.26.1" \
"evaluate==0.4.1" \
"bitsandbytes==0.42.0" \
"trl==0.7.11"
pip install ninja packaging
MAX_JOBS=4 pip install flash-attn --no-build-isolation --upgrade
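A quick way to confirm the environment really ended up with these pins, and that transformer-engine is actually gone, is to compare `pip freeze` against them. This is a minimal sketch, not something from the handbook: `check_pins` is a hypothetical helper name, it assumes bash 4+ (associative arrays), and it only checks the packages pinned above (flash-attn, ninja and packaging were installed unpinned).

```bash
#!/usr/bin/env bash
# Hypothetical helper: compare `pip freeze` output (read from stdin)
# against the versions pinned in the setup commands above.
check_pins() {
  # Pins copied from the setup commands; unpinned packages are not checked.
  local -A expected=(
    [torch]=2.1.2
    [transformers]=4.38.2
    [datasets]=2.16.1
    [accelerate]=0.26.1
    [evaluate]=0.4.1
    [bitsandbytes]=0.42.0
    [trl]=0.7.11
  )
  local -A found=()
  local line name version pkg status=0

  # Read every line, including a final one without a trailing newline.
  while IFS= read -r line || [[ -n $line ]]; do
    # Only plain "name==version" lines; skip "pkg @ file://..." entries.
    [[ $line == *==* ]] || continue
    name=${line%%==*}
    version=${line#*==}
    version=${version%%+*}   # drop local tags such as +cu121
    name=${name,,}           # lower-case the name
    name=${name//_/-}        # treat transformer_engine as transformer-engine
    name=${name//./-}
    found[$name]=$version
  done

  for pkg in "${!expected[@]}"; do
    if [[ ${found[$pkg]-} != "${expected[$pkg]}" ]]; then
      echo "MISMATCH $pkg: want ${expected[$pkg]}, got ${found[$pkg]:-missing}"
      status=1
    fi
  done

  # transformer-engine was uninstalled on purpose (errors on A100s).
  if [[ -n ${found[transformer-engine]-} ]]; then
    echo "transformer-engine is still installed (${found[transformer-engine]})"
    status=1
  fi
  return "$status"
}
```

Usage would be something like `pip freeze | check_pins && echo "pins OK"`; any mismatching or missing package is printed and the function returns 1.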
I'm not able to run Zephyr 7B Gemma on 4 A100s (80GB each). I get the following error:
After running:
As can be seen, I've just modified
and I tested zero3_init_flag: false.
I've seen this related issue (#57), but none of the solutions there work for me.
Hope we find a solution soon for the members of the 4 GPU cluster club! 🤗