princeton-nlp / AutoCompressors

[EMNLP 2023] Adapting Language Models to Compress Long Contexts
https://arxiv.org/abs/2305.14788

torchrun error when generating training split #24

Open OswaldHe opened 3 months ago

OswaldHe commented 3 months ago

When I try to run run/train.sh for OPT-2.7b, it generates the training split for the first 5813 samples and then exits immediately without any error log.

Generating train split:   7%|▋         | 5813/81380 [00:35<03:31, 357.02 examples/s]E0731 23:14:13.108000 140299780256832 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -9) local_rank: 0 (pid: 488431) of binary: /home/oswaldhe/miniconda3/envs/autocompressor/bin/python
Traceback (most recent call last):
  File "/home/oswaldhe/miniconda3/envs/autocompressor/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

I'm running on an NVIDIA A100 40GB PCIe. What could the issue be? Thank you.

CodeCreator commented 3 months ago

Sorry, I'm not sure what the issue is and it might be related to your setup (e.g., disk space, RAM). Are there any additional error messages?
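
For reference, exitcode -9 means the worker process received SIGKILL, which on Linux is most often the kernel OOM killer rather than an error in the script itself; that would also explain why there is no Python traceback. Below is a minimal sketch for checking memory headroom before launching torchrun, assuming psutil is installed (it is not part of this repo's requirements, so treat it purely as an illustration):

import psutil

# Report total and currently available RAM; if "available" is far below what the
# dataset-generation step needs, the kernel OOM killer is the likely culprit.
mem = psutil.virtual_memory()
print(f"total RAM: {mem.total / 1e9:.1f} GB, available: {mem.available / 1e9:.1f} GB")

After a crash, searching the kernel log for "Out of memory" entries (e.g. with dmesg) would confirm whether the process was OOM-killed.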

OswaldHe commented 3 months ago

Thank you for your response. I increased the RAM to 50GB and it can now generate the training split. However, when training starts, it raises a wandb-related error:

[WARNING|integration_utils.py:81] 2024-08-01 00:17:04,944 >> Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/oswaldhe/AutoCompressors/train.py", line 286, in <module>
[rank0]:     main()
[rank0]:   File "/home/oswaldhe/AutoCompressors/train.py", line 226, in main
[rank0]:     trainer = SubstepTrainer(
[rank0]:   File "/home/oswaldhe/AutoCompressors/substep_trainer.py", line 69, in __init__
[rank0]:     super().__init__(model,
[rank0]:   File "/home/oswaldhe/AutoCompressors/base_trainer.py", line 138, in __init__
[rank0]:     super().__init__(model, args, *more_args, **kwargs)
[rank0]:   File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/transformers/trainer.py", line 557, in __init__
[rank0]:     self.callback_handler = CallbackHandler(
[rank0]:   File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/transformers/trainer_callback.py", line 305, in __init__
[rank0]:     self.add_callback(cb)
[rank0]:   File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/transformers/trainer_callback.py", line 322, in add_callback
[rank0]:     cb = callback() if isinstance(callback, type) else callback
[rank0]:   File "/home/oswaldhe/miniconda3/envs/autocompressor/lib/python3.10/site-packages/transformers/integrations/integration_utils.py", line 673, in __init__
[rank0]:     raise RuntimeError("WandbCallback requires wandb to be installed. Run `pip install wandb`.")
[rank0]: RuntimeError: WandbCallback requires wandb to be installed. Run `pip install wandb`.

I have already installed wandb; a quick check for this is sketched after the package list below. Here are all the packages I installed, with their versions:

absl-py==1.4.0
accelerate==0.24.1
aiohttp==3.8.5
aiosignal==1.3.1
altair==5.0.1
array-record==0.4.1
async-timeout==4.0.3
attributedict==0.3.0
attrs==23.2.0
audioread==3.0.0
autobridge==0.0.20220512.dev1
blessings==1.7
cached-property==1.5.2
cachetools==5.3.1
certifi==2024.7.4
cffi==1.15.1
chardet==5.2.0
charset-normalizer==3.3.2
click==8.1.4
cmake==3.27.2
codecov==2.1.13
colorama==0.4.6
coloredlogs==15.0.1
colour-runner==0.1.1
conllu==4.5.3
contourpy==1.1.0
coverage==7.3.0
cycler==0.11.0
DataProperty==1.0.1
datasets==2.14.0
decorator==5.1.1
deepdiff==6.3.1
dill==0.3.7
distlib==0.3.7
dm-tree==0.1.8
docker-pycreds==0.4.0
einops==0.8.0
elastic-transport==8.4.0
elasticsearch==8.9.0
etils==1.4.1
evaluate==0.4.0
exceptiongroup==1.1.3
fairscale==0.4.13
filelock==3.12.2
fire==0.5.0
flash-attn==2.6.2
fonttools==4.42.1
frozenlist==1.4.0
fsspec==2023.6.0
gensim==4.3.2
git-python==1.0.3
gitdb==4.0.10
GitPython==3.1.32
google-auth==2.22.0
google-auth-oauthlib==1.0.0
googleapis-common-protos==1.60.0
grpcio==1.57.0
haoda==0.0.20240228.dev1
huggingface-hub==0.17.3
humanfriendly==10.0
idna==3.7
importlib-resources==6.0.1
iniconfig==2.0.0
inspecta==0.1.3
Jinja2==3.1.2
jiwer==3.0.2
joblib==1.3.2
jsonlines==3.1.0
kiwisolver==1.4.5
lazy_loader==0.3
librosa==0.10.1
lit==16.0.6
llvmlite==0.40.1
lm-eval==0.3.0
Markdown==3.4.4
markdown-it-py==3.0.0
MarkupSafe==2.1.3
matplotlib==3.7.2
mbstrdecoder==1.1.3
mdurl==0.1.2
mip==1.15.0
mpmath==1.3.0
msgpack==1.0.5
multidict==6.0.4
multiprocess==0.70.15
networkx==3.1
nltk==3.8.1
numba==0.57.1
numexpr==2.8.5
numpy==1.24.4
nvidia-cublas-cu11==11.10.3.66
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu11==8.5.0.96
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu11==10.9.0.58
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu11==10.2.10.91
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu11==11.7.4.91
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu11==2.14.3
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.5.82
nvidia-nvtx-cu11==11.7.91
nvidia-nvtx-cu12==12.1.105
openai==0.27.9
ordered-set==4.1.0
packaging==23.1
pandas==2.0.3
pathvalidate==3.1.0
peft==0.12.0
Pillow==10.0.0
platformdirs==3.10.0
pluggy==1.2.0
ply==3.11
pooch==1.7.0
portalocker==2.7.0
prettytable==3.8.0
promise==2.3
protobuf==5.27.3
psutil==5.9.5
pyarrow==12.0.1
pybind11==2.11.1
pycountry==22.3.5
pycparser==2.21
pydeck==0.8.0
Pympler==1.0.1
pyproject-api==1.5.4
pytablewriter==1.0.0
pytest==7.4.0
python-dateutil==2.8.2
pytz==2024.1
pytz-deprecation-shim==0.1.0.post0
pyverilog==1.3.0
PyYAML==6.0
rapidfuzz==2.13.7
regex==2023.8.8
requests==2.32.3
requests-oauthlib==1.3.1
responses==0.18.0
rich==13.5.2
rootpath==0.1.1
rouge-score==0.1.2
rsa==4.9
sacrebleu==1.5.0
safetensors==0.4.3
scikit-learn==1.3.0
scipy==1.11.2
sentencepiece==0.1.99
sentry-sdk==2.12.0
seqeval==1.2.2
setproctitle==1.3.3
six==1.16.0
smart-open==6.3.0
smmap==5.0.0
soundfile==0.12.1
soxr==0.3.6
sqlitedict==2.1.0
streamlit==1.26.0
sympy==1.12
tabledata==1.3.1
tapa-fast-cosim==0.0.20220816.dev1
tcolorpy==0.1.3
tenacity==8.2.3
tensorboard==2.14.0
tensorboard-data-server==0.7.1
tensorflow-datasets==4.9.2
tensorflow-metadata==1.14.0
termcolor==2.3.0
texttable==1.6.7
threadpoolctl==3.2.0
tokenizers==0.14.1
toml==0.10.2
tomli==2.0.1
toolz==0.12.0
toposort==1.10
torch==2.4.0
torchvision==0.19.0
tox==4.10.0
tqdm==4.66.1
tqdm-multiprocess==0.0.11
transformers==4.34.0
triton==3.0.0
typepy==1.3.1
typing_extensions==4.12.2
tzdata==2023.3
tzlocal==4.3.1
urllib3==2.2.2
validators==0.21.2
virtualenv==20.24.3
wandb==0.17.5
watchdog==3.0.0
wcwidth==0.2.6
Werkzeug==2.3.7
wrapt==1.15.0
xxhash==3.3.0
yarl==1.9.2
zstandard==0.21.0
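
One possible explanation, to be confirmed: the deprecation warning above shows that WANDB_DISABLED is set, and in this transformers version the wandb availability check appears to return False whenever that variable is truthy, even if wandb itself imports fine, which would produce exactly this RuntimeError when a wandb callback is still requested via report_to. A small sketch for checking both conditions from the same training environment (the exact internals of transformers' check are an assumption here):

import importlib.util
import os

# wandb is treated as "available" by transformers only if it can be found on the
# path AND WANDB_DISABLED is not set to a truthy value (assumption, inferred from
# the deprecation warning in the log above).
print("wandb importable:", importlib.util.find_spec("wandb") is not None)
print("WANDB_DISABLED:", os.getenv("WANDB_DISABLED"))

If that is the cause, unsetting WANDB_DISABLED and passing --report_to none (as the warning itself suggests) should either enable wandb logging cleanly or disable it without tripping the callback.
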
super-wuliao commented 3 months ago

I've encountered this issue as well. It seems to be a problem with insufficient memory on your end, not related to the GPU.
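
If the bottleneck really is host RAM during the "Generating train split" step, one general mitigation is to stream the dataset instead of materializing it, sketched below with the Hugging Face datasets library. This is only an illustration with a placeholder data file; run/train.sh builds its dataset in its own way and may not expose a streaming option.

from datasets import load_dataset

# Streaming iterates over examples lazily instead of writing the full split to
# disk/RAM up front; "data/train.jsonl" is a placeholder path, not the repo's data.
stream = load_dataset("json", data_files="data/train.jsonl", split="train", streaming=True)
for i, example in enumerate(stream):
    print(example)
    if i >= 2:
        break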