OswaldHe opened 3 months ago
Sorry, I'm not sure what the issue is and it might be related to your setup (e.g., disk space, RAM). Are there any additional error messages?
Thank you for your response. I increased the RAM to 50GB and it can now generate the training split. However, when training starts, it raises a wandb-related error:
[WARNING|integration_utils.py:81] 2024-08-01 00:17:04,944 >> Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/oswaldhe/AutoCompressors/train.py", line 286, in <module>
[rank0]: main()
[rank0]: File "/home/oswaldhe/AutoCompressors/train.py", line 226, in main
[rank0]: trainer = SubstepTrainer(
[rank0]: File "/home/oswaldhe/AutoCompressors/substep_trainer.py", line 69, in __init__
[rank0]: super().__init__(model,
[rank0]: File "/home/oswaldhe/AutoCompressors/base_trainer.py", line 138, in __init__
[rank0]: super().__init__(model, args, *more_args, **kwargs)
[rank0]: File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/transformers/trainer.py", line 557, in __init__
[rank0]: self.callback_handler = CallbackHandler(
[rank0]: File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/transformers/trainer_callback.py", line 305, in __init__
[rank0]: self.add_callback(cb)
[rank0]: File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/transformers/trainer_callback.py", line 322, in add_callback
[rank0]: cb = callback() if isinstance(callback, type) else callback
[rank0]: File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/transformers/integrations/integration_utils.py", line 673, in __init__
[rank0]: raise RuntimeError("WandbCallback requires wandb to be installed. Run `pip install wandb`.")
[rank0]: RuntimeError: WandbCallback requires wandb to be installed. Run `pip install wandb`.
I have already installed wandb. Here are all the packages I installed, with their versions:
absl-py==1.4.0
accelerate==0.24.1
aiohttp==3.8.5
aiosignal==1.3.1
altair==5.0.1
array-record==0.4.1
async-timeout==4.0.3
attributedict==0.3.0
attrs==23.2.0
audioread==3.0.0
autobridge==0.0.20220512.dev1
blessings==1.7
cached-property==1.5.2
cachetools==5.3.1
certifi==2024.7.4
cffi==1.15.1
chardet==5.2.0
charset-normalizer==3.3.2
click==8.1.4
cmake==3.27.2
codecov==2.1.13
colorama==0.4.6
coloredlogs==15.0.1
colour-runner==0.1.1
conllu==4.5.3
contourpy==1.1.0
coverage==7.3.0
cycler==0.11.0
DataProperty==1.0.1
datasets==2.14.0
decorator==5.1.1
deepdiff==6.3.1
dill==0.3.7
distlib==0.3.7
dm-tree==0.1.8
docker-pycreds==0.4.0
einops==0.8.0
elastic-transport==8.4.0
elasticsearch==8.9.0
etils==1.4.1
evaluate==0.4.0
exceptiongroup==1.1.3
fairscale==0.4.13
filelock==3.12.2
fire==0.5.0
flash-attn==2.6.2
fonttools==4.42.1
frozenlist==1.4.0
fsspec==2023.6.0
gensim==4.3.2
git-python==1.0.3
gitdb==4.0.10
GitPython==3.1.32
google-auth==2.22.0
google-auth-oauthlib==1.0.0
googleapis-common-protos==1.60.0
grpcio==1.57.0
haoda==0.0.20240228.dev1
huggingface-hub==0.17.3
humanfriendly==10.0
idna==3.7
importlib-resources==6.0.1
iniconfig==2.0.0
inspecta==0.1.3
Jinja2==3.1.2
jiwer==3.0.2
joblib==1.3.2
jsonlines==3.1.0
kiwisolver==1.4.5
lazy_loader==0.3
librosa==0.10.1
lit==16.0.6
llvmlite==0.40.1
lm-eval==0.3.0
Markdown==3.4.4
markdown-it-py==3.0.0
MarkupSafe==2.1.3
matplotlib==3.7.2
mbstrdecoder==1.1.3
mdurl==0.1.2
mip==1.15.0
mpmath==1.3.0
msgpack==1.0.5
multidict==6.0.4
multiprocess==0.70.15
networkx==3.1
nltk==3.8.1
numba==0.57.1
numexpr==2.8.5
numpy==1.24.4
nvidia-cublas-cu11==11.10.3.66
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu11==8.5.0.96
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu11==10.9.0.58
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu11==10.2.10.91
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu11==11.7.4.91
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu11==2.14.3
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.5.82
nvidia-nvtx-cu11==11.7.91
nvidia-nvtx-cu12==12.1.105
openai==0.27.9
ordered-set==4.1.0
packaging==23.1
pandas==2.0.3
pathvalidate==3.1.0
peft==0.12.0
Pillow==10.0.0
platformdirs==3.10.0
pluggy==1.2.0
ply==3.11
pooch==1.7.0
portalocker==2.7.0
prettytable==3.8.0
promise==2.3
protobuf==5.27.3
psutil==5.9.5
pyarrow==12.0.1
pybind11==2.11.1
pycountry==22.3.5
pycparser==2.21
pydeck==0.8.0
Pympler==1.0.1
pyproject-api==1.5.4
pytablewriter==1.0.0
pytest==7.4.0
python-dateutil==2.8.2
pytz==2024.1
pytz-deprecation-shim==0.1.0.post0
pyverilog==1.3.0
PyYAML==6.0
rapidfuzz==2.13.7
regex==2023.8.8
requests==2.32.3
requests-oauthlib==1.3.1
responses==0.18.0
rich==13.5.2
rootpath==0.1.1
rouge-score==0.1.2
rsa==4.9
sacrebleu==1.5.0
safetensors==0.4.3
scikit-learn==1.3.0
scipy==1.11.2
sentencepiece==0.1.99
sentry-sdk==2.12.0
seqeval==1.2.2
setproctitle==1.3.3
six==1.16.0
smart-open==6.3.0
smmap==5.0.0
soundfile==0.12.1
soxr==0.3.6
sqlitedict==2.1.0
streamlit==1.26.0
sympy==1.12
tabledata==1.3.1
tapa-fast-cosim==0.0.20220816.dev1
tcolorpy==0.1.3
tenacity==8.2.3
tensorboard==2.14.0
tensorboard-data-server==0.7.1
tensorflow-datasets==4.9.2
tensorflow-metadata==1.14.0
termcolor==2.3.0
texttable==1.6.7
threadpoolctl==3.2.0
tokenizers==0.14.1
toml==0.10.2
tomli==2.0.1
toolz==0.12.0
toposort==1.10
torch==2.4.0
torchvision==0.19.0
tox==4.10.0
tqdm==4.66.1
tqdm-multiprocess==0.0.11
transformers==4.34.0
triton==3.0.0
typepy==1.3.1
typing_extensions==4.12.2
tzdata==2023.3
tzlocal==4.3.1
urllib3==2.2.2
validators==0.21.2
virtualenv==20.24.3
wandb==0.17.5
watchdog==3.0.0
wcwidth==0.2.6
Werkzeug==2.3.7
wrapt==1.15.0
xxhash==3.3.0
yarl==1.9.2
zstandard==0.21.0
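Since wandb appears in the list above but transformers still can't find it, one thing worth checking is whether the interpreter that runs `train.py` is the same one wandb was installed into (a mismatch between conda environments is a common cause of this). A minimal check, run with the exact same `python` you use for training:

```python
import importlib.util
import sys

# Print which interpreter is running, then check whether each module
# can be found by *this* interpreter's import machinery.
print("interpreter:", sys.executable)
for mod in ("wandb", "transformers"):
    spec = importlib.util.find_spec(mod)
    print(f"{mod}: {'found' if spec is not None else 'NOT FOUND'}")
```

If `wandb` shows as NOT FOUND here, the package was likely installed into a different environment (e.g. via a different `pip`), and `pip install wandb` from inside the activated `autocompressor` env should fix it. Note this is just a diagnostic sketch; the module names checked are the ones from the traceback.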
I've encountered this issue as well. It seems to be a problem with insufficient memory on your machine, not the GPU.
When I try to run run/train.sh for OPT-2.7b, it generates the training split for the first 5813 samples, then exits immediately without any error log.
I'm running on an NVIDIA A100 40GB PCIe. What could be the possible issue? Thank you.