modelscope / ms-swift

Use PEFT or Full-parameter to finetune 350+ LLMs or 90+ MLLMs. (Qwen2.5, GLM4v, Internlm2.5, Yi, Llama3.1, Llava-Video, Internvl2, MiniCPM-V-2.6, Deepseek, Baichuan2, Gemma2, Phi3-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0

llava1d6-mistral-7b-instruct fine-tuning fails in DDP mode #587

Closed Alxemade closed 6 months ago

Alxemade commented 6 months ago

Hi, I get an error when fine-tuning llava1d6-mistral-7b-instruct with DDP, although training without DDP works:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NNODES=1 \
NODE_RANK=0 \
MASTER_ADDR=x.x.x.x \
NPROC_PER_NODE=8 \
swift sft \
    --model_type llava1d6-mistral-7b-instruct \
    --dataset coco-mini-en-2

The error is:

File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1139, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by 
making sure all `forward` function outputs participate in calculating loss. 
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 2: 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 ...
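
The traceback above reports only parameter indices. A minimal single-process sketch of how such indices could be mapped back to parameter names is shown below. `ToyModel` and `params_without_grad` are hypothetical names used for illustration, and the sketch assumes that DDP numbers parameters in `named_parameters()` order over trainable parameters only, which is not verified against this setup.

```python
import torch
import torch.nn as nn


class ToyModel(nn.Module):
    """Hypothetical stand-in: one branch is never used in forward."""

    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 4)
        self.unused = nn.Linear(4, 4)  # never called, so it receives no gradient

    def forward(self, x):
        return self.used(x)


def params_without_grad(model, loss):
    """Run one backward pass and list (index, name) of trainable params with no grad."""
    model.zero_grad(set_to_none=True)
    loss.backward()
    # Assumption: indices count trainable parameters only, in registration order.
    trainable = [(n, p) for n, p in model.named_parameters() if p.requires_grad]
    return [(i, n) for i, (n, p) in enumerate(trainable) if p.grad is None]


model = ToyModel()
loss = model(torch.randn(2, 4)).sum()
missing = params_without_grad(model, loss)
for idx, name in missing:
    print(idx, name)
```

On the real model, the same check after a single non-DDP training step would show which modules are skipped in the forward pass. The error message itself also mentions that setting TORCH_DISTRIBUTED_DEBUG to INFO or DETAIL prints parameter names.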

Setting --ddp_find_unused_parameters true appears to raise a different error.

However, running it this way works:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 swift sft \
     --model_type llava1d6-mistral-7b-instruct \
     --dataset coco-mini-en-2 

The package environment is:

absl-py                       1.4.0
accelerate                    0.22.0
adaseq                        0.6.4
addict                        2.4.0
aiofiles                      23.2.1
aiohttp                       3.8.5
aiosignal                     1.3.1
albumentations                1.3.1
aliyun-python-sdk-core        2.13.36
aliyun-python-sdk-kms         2.16.1
altair                        5.2.0
aniso8601                     9.0.1
annotated-types               0.5.0
antlr4-python3-runtime        4.9.3
anyio                         3.7.1
apex                          0.1
appdirs                       1.4.4
argon2-cffi                   23.1.0
argon2-cffi-bindings          21.2.0
arrow                         1.2.3
astropy                       5.2.2
asttokens                     2.2.1
astunparse                    1.6.3
async-lru                     2.0.4
async-timeout                 4.0.3
attrs                         23.1.0
audioread                     3.0.0
av                            10.0.0
Babel                         2.12.1
backcall                      0.2.0
backports.zoneinfo            0.2.1
basicsr                       1.4.2
beautifulsoup4                4.12.2
bidict                        0.22.1
biopython                     1.81
bitarray                      2.8.1
bitsandbytes                  0.41.1
bitstring                     4.1.1
black                         23.7.0
bleach                        6.0.0
blis                          0.7.10
blobfile                      2.0.2
bmt-clipit                    1.0
boltons                       23.0.0
boto3                         1.28.37
botocore                      1.31.37
Bottleneck                    1.3.7
cachetools                    5.3.1
catalogue                     2.0.9
certifi                       2023.7.22
cffi                          1.15.1
cfgv                          3.4.0
charset-normalizer            2.0.4
chumpy                        0.70
cityscapesScripts             2.2.2
click                         8.1.7
clip                          1.0
cloudpickle                   2.2.1
cmake                         3.27.2
colorama                      0.4.6
coloredlogs                   14.0
comm                          0.1.4
conda                         23.7.2
conda-content-trust           0+unknown
conda-libmamba-solver         23.7.0
conda-package-handling        1.9.0
confection                    0.1.1
ConfigArgParse                1.7
contextlib2                   21.6.0
contourpy                     1.1.0
control-ldm                   0.0.1
cpm-kernels                   1.0.11
crcmod                        1.7
cryptography                  41.0.2
cycler                        0.11.0
cymem                         2.0.7
Cython                        0.29.36
dacite                        1.8.1
dataclasses                   0.6
datasets                      2.18.0
ddpm-guided-diffusion         0.0.0
debugpy                       1.6.7.post1
decorator                     4.4.2
decord                        0.6.0
deepspeed                     0.10.1
defusedxml                    0.7.1
descartes                     1.1.0
detectron2                    0.6
dgl                           1.1.1+cu118
diffusers                     0.25.0
dill                          0.3.6
Distance                      0.1.3
distlib                       0.3.7
dnspython                     2.3.0
docstring_parser              0.16
easydict                      1.10
easyrobust                    0.2.4
edit-distance                 1.0.6
editdistance                  0.6.2
einops                        0.6.1
embeddings                    0.0.8
emoji                         2.8.0
espnet-tts-frontend           0.0.3
et-xmlfile                    1.1.0
eventlet                      0.33.3
exceptiongroup                1.1.3
executing                     1.2.0
expecttest                    0.1.6
face-alignment                1.4.1
fairscale                     0.4.13
fairseq                       0.12.2
faiss                         1.7.2
fastai                        2.7.12
fastapi                       0.110.0
fastcore                      1.5.29
fastdownload                  0.0.7
fastjsonschema                2.18.0
fastprogress                  1.0.3
fasttext                      0.9.2
ffmpeg                        1.4
ffmpeg-python                 0.2.0
ffmpy                         0.3.2
filelock                      3.12.2
fire                          0.5.0
flake8                        6.1.0
Flask                         2.2.5
Flask-Cors                    4.0.0
Flask-RESTful                 0.3.10
Flask-SocketIO                5.3.5
flask-talisman                1.1.0
flatbuffers                   23.5.26
fonttools                     4.42.1
fqdn                          1.5.1
frozenlist                    1.4.0
fsspec                        2023.6.0
ftfy                          6.1.1
funasr                        0.7.5
funtextprocessing             0.1.1
future                        0.18.3
fvcore                        0.1.5.post20221221
g2p                           1.1.20230822
g2p-en                        2.1.0
gast                          0.4.0
google-auth                   2.22.0
google-auth-oauthlib          1.0.0
google-pasta                  0.2.0
gradio                        4.22.0
gradio_client                 0.13.0
greenlet                      2.0.2
grpcio                        1.57.0
h11                           0.14.0
h5py                          3.9.0
hdbscan                       0.8.33
healpy                        1.16.5
hjson                         3.1.0
httpcore                      1.0.4
httpx                         0.27.0
huggingface-hub               0.21.4
humanfriendly                 10.0
hydra-core                    1.3.2
HyperPyYAML                   1.2.1
identify                      2.5.27
idna                          3.4
imageio                       2.31.2
imageio-ffmpeg                0.4.8
imgaug                        0.4.0
importlib-metadata            6.8.0
importlib-resources           6.0.1
inflect                       7.0.0
iniconfig                     2.0.0
iopath                        0.1.9
ipdb                          0.13.13
ipykernel                     6.25.1
ipython                       8.12.2
ipython-genutils              0.2.0
ipywidgets                    8.1.0
isoduration                   20.11.0
isort                         5.12.0
itsdangerous                  2.1.2
jaconv                        0.3.4
jamo                          0.4.1
jedi                          0.19.0
jieba                         0.42.1
Jinja2                        3.1.2
jmespath                      0.10.0
joblib                        1.3.2
json-tricks                   3.17.3
json5                         0.9.14
jsonpatch                     1.32
jsonplus                      0.8.0
jsonpointer                   2.1
jsonschema                    4.19.0
jsonschema-specifications     2023.7.1
jupyter                       1.0.0
jupyter_client                8.3.1
jupyter-console               6.6.3
jupyter_core                  5.3.1
jupyter-events                0.7.0
jupyter-lsp                   2.2.0
jupyter_server                2.7.2
jupyter_server_terminals      0.4.4
jupyterlab                    4.0.5
jupyterlab-pygments           0.2.2
jupyterlab_server             2.24.0
jupyterlab-widgets            3.0.8
kaldi-io                      0.9.8
kaldiio                       2.18.0
kantts                        1.0.1
keras                         2.13.1
kiwisolver                    1.4.5
kornia                        0.7.0
kwsbp                         0.0.6
langcodes                     3.3.0
lap                           0.4.0
libclang                      16.0.6
libmambapy                    1.4.1
librosa                       0.9.2
lightning-utilities           0.9.0
lit                           16.0.6
llvmlite                      0.40.1
lmdb                          1.4.1
lpips                         0.1.4
lxml                          4.9.3
lyft-dataset-sdk              0.0.8
Markdown                      3.4.4
markdown-it-py                3.0.0
MarkupSafe                    2.1.3
matplotlib                    3.5.2
matplotlib-inline             0.1.6
mccabe                        0.7.0
mdurl                         0.1.2
megatron-util                 1.3.2
MinDAEC                       0.0.2
mir-eval                      0.7
mistune                       3.0.1
ml-collections                0.1.1
mmcls                         0.25.0
mmcv-full                     1.7.0
mmdet                         2.28.2
mmdet3d                       1.0.0a1
mmsegmentation                0.30.0
mock                          5.1.0
modelscope                    1.13.1
moviepy                       1.0.3
mpi4py                        3.1.4
mpmath                        1.3.0
ms-swift                      1.8.0.dev0           /cloudfs-data/visiondata/xuchao/vehicle_data/mycode_vehicle/swift-main
msgpack                       1.0.5
multidict                     6.0.4
multiprocess                  0.70.14
MultiScaleDeformableAttention 1.0
munkres                       1.1.4
murmurhash                    1.0.9
mypy-extensions               1.0.0
nara-wpe                      0.0.9
nbclient                      0.8.0
nbconvert                     7.8.0
nbformat                      5.9.2
nerfacc                       0.2.2
nest-asyncio                  1.5.7
networkx                      2.8.4
ninja                         1.11.1
nltk                          3.8.1
nodeenv                       1.8.0
notebook                      7.0.2
notebook_shim                 0.2.3
numba                         0.57.1
numexpr                       2.8.5
numpy                         1.24.3
nuscenes-devkit               1.1.10
nvdiffrast                    0.3.1
oauthlib                      3.2.2
omegaconf                     2.3.0
onnx                          1.14.1
onnxruntime                   1.15.1
onnxsim                       0.4.33
open-clip-torch               2.20.0
opencv-python                 4.8.0.76
opencv-python-headless        4.8.0.76
openpyxl                      3.1.2
opt-einsum                    3.3.0
optimum                       1.17.1
orjson                        3.9.15
oss2                          2.18.1
overrides                     7.4.0
packaging                     23.0
pai-easycv                    0.11.4
paint-ldm                     0.0.0
pandas                        2.0.3
pandocfilters                 1.5.0
panopticapi                   0.1
panphon                       0.20.0
parso                         0.8.3
pathspec                      0.11.2
pathy                         0.10.2
peft                          0.9.0
pexpect                       4.8.0
pickleshare                   0.7.5
Pillow                        10.0.0
pip                           23.2.1
pkgutil_resolve_name          1.3.10
platformdirs                  3.10.0
plotly                        5.16.1
pluggy                        1.0.0
plyfile                       1.0.1
pointnet2                     0.0.0
pooch                         1.7.0
portalocker                   2.7.0
pre-commit                    3.3.3
preshed                       3.0.8
prettytable                   3.8.0
proglog                       0.1.10
prometheus-client             0.17.1
prompt-toolkit                3.0.39
protobuf                      3.20.0
psutil                        5.9.5
ptflops                       0.7
ptyprocess                    0.7.0
pure-eval                     0.2.2
py-cpuinfo                    9.0.0
py-sound-connect              0.2.1
pyarrow                       13.0.0
pyarrow-hotfix                0.6
pyasn1                        0.5.0
pyasn1-modules                0.3.0
pybind11                      2.11.1
pyclipper                     1.3.0.post4
pycocoevalcap                 1.2
pycocotools                   2.0.7
pycodestyle                   2.11.0
pycosat                       0.6.4
pycparser                     2.21
pycryptodome                  3.18.0
pycryptodomex                 3.18.0
pydantic                      1.10.7
pydantic_core                 2.16.3
pyDeprecate                   0.3.2
pydot                         1.4.2
pydub                         0.25.1
pyerfa                        2.0.0.3
pyflakes                      3.1.0
Pygments                      2.16.1
PyMCubes                      0.1.4
pynini                        2.1.5
pynndescent                   0.5.10
pyOpenSSL                     23.2.0
pyparsing                     3.0.9
pypinyin                      0.49.0
pyquaternion                  0.9.9
PySocks                       1.7.1
pysptk                        0.1.18
pytest                        7.4.0
pythainlp                     4.0.2
python-crfsuite               0.9.9
python-dateutil               2.8.2
python-engineio               4.6.1
python-json-logger            2.0.7
python-multipart              0.0.9
python-socketio               5.8.0
pytorch-lightning             1.7.7
pytorch-metric-learning       2.3.0
pytorch-wavelets              1.3.0
pytorch-wpe                   0.0.1
pytorch3d                     0.7.4
pytz                          2023.3
pyvi                          0.1.1
PyWavelets                    1.4.1
PyYAML                        6.0.1
pyzmq                         25.1.1
qtconsole                     5.4.3
QtPy                          2.4.0
qudida                        0.0.4
rapidfuzz                     3.2.0
referencing                   0.30.2
regex                         2023.8.8
requests                      2.31.0
requests-oauthlib             1.3.1
resampy                       0.4.2
rfc3339-validator             0.1.4
rfc3986-validator             0.1.1
rich                          13.5.2
rotary-embedding-torch        0.2.7
rouge                         1.0.1
rouge-score                   0.0.4
rpds-py                       0.10.0
rsa                           4.9
ruamel.yaml                   0.17.21
ruamel.yaml.clib              0.2.6
ruff                          0.3.3
s3transfer                    0.6.2
sacrebleu                     2.3.1
sacremoses                    0.0.53
safetensors                   0.4.2
scikit-image                  0.19.3
scikit-learn                  1.3.0
scipy                         1.10.1
seaborn                       0.12.2
semantic-version              2.10.0
Send2Trash                    1.8.2
sentencepiece                 0.1.99
seqeval                       1.2.2
setuptools                    68.0.0
Shapely                       1.8.4
shellingham                   1.5.4
shotdetect-scenedetect-lgss   0.0.4
shtab                         1.7.1
simplejson                    3.19.1
six                           1.16.0
sklearn-crfsuite              0.3.6
smart-open                    6.3.0
smplx                         0.1.28
sniffio                       1.3.0
sortedcontainers              2.4.0
soundfile                     0.12.1
soupsieve                     2.4.1
sox                           1.4.1
spacy                         3.6.1
spacy-legacy                  3.0.12
spacy-loggers                 1.0.4
speechbrain                   0.5.15
srsly                         2.4.7
stack-data                    0.6.2
stanza                        1.5.0
starlette                     0.36.3
subword-nmt                   0.3.8
sympy                         1.12
tabulate                      0.9.0
taming-transformers-rom1504   0.0.6
tb-nightly                    2.14.0a20230808
tenacity                      8.2.3
tensorboard                   2.13.0
tensorboard-data-server       0.7.1
tensorboardX                  2.6.2
tensorflow                    2.13.0
tensorflow-estimator          2.13.0
tensorflow-io-gcs-filesystem  0.33.0
termcolor                     2.3.0
terminado                     0.17.1
terminaltables                3.1.10
text-unidecode                1.3
text2sql-lgesql               1.3.0
TextGrid                      1.5
tf-slim                       1.1.0
thinc                         8.1.12
thop                          0.1.1.post2209072238
threadpoolctl                 3.2.0
tifffile                      2023.7.10
tiktoken                      0.5.1
timm                          0.5.4
tinycss2                      1.2.1
tinycudann                    1.6
tokenizers                    0.15.2
tomli                         2.0.1
tomlkit                       0.12.0
toolz                         0.12.0
torch                         2.0.1+cu118
torch-complex                 0.4.3
torch-scatter                 2.1.1
torchaudio                    2.0.2+cu118
torchmetrics                  0.11.4
torchsummary                  1.5.1
torchvision                   0.15.2+cu118
tornado                       6.3.3
tqdm                          4.65.0
traitlets                     5.9.0
transformers                  4.37.2
transformers-stream-generator 0.0.4
trimesh                       2.35.39
triton                        2.0.0
trl                           0.8.1
ttsfrd                        0.2.1
typeguard                     2.13.3
typer                         0.9.0
typing                        3.7.4.3
typing_extensions             4.10.0
tyro                          0.7.3
tzdata                        2023.3
ujson                         5.8.0
umap-learn                    0.5.3
unicodecsv                    0.14.1
unicodedata2                  15.0.0
unicore                       0.0.1
Unidecode                     1.3.6
uri-template                  1.3.0
urllib3                       1.26.16
utils                         1.0.1
uvicorn                       0.29.0
videofeatures-clipit          1.0
virtualenv                    20.24.3
wasabi                        1.1.2
wcwidth                       0.2.6
webcolors                     1.13
webencodings                  0.5.1
websocket-client              1.6.2
websockets                    11.0.3
wenetruntime                  1.11.0
Werkzeug                      2.2.3
wget                          3.2
wheel                         0.38.4
widgetsnbextension            4.0.8
wrapt                         1.15.0
xtcocotools                   1.13
xxhash                        3.3.0
yacs                          0.1.8
yapf                          0.30.0
yarl                          1.9.2
zhconv                        1.4.3
zipp                          3.16.2

Jintao-Huang commented 6 months ago

I can reproduce this.

Jintao-Huang commented 6 months ago

Adding --deepspeed default-zero2 resolves it.

Jintao-Huang commented 6 months ago

Alternatively, adding --ddp_find_unused_parameters true also resolves it.

Alxemade commented 6 months ago

With --ddp_find_unused_parameters true set, a different error seems to occur:

File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 157, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 191 has been marked as ready twice. This means that multiple autograd engine  hooks have fired for this particular parameter during this iteration. You can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print parameter names for further debugging.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3211) of binary: /opt/conda/bin/python

Setting --deepspeed default-zero2 works correctly.

Thanks to the author for the patient replies.
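
The second traceback in this thread passes through torch/utils/checkpoint.py, and the error text names reentrant backward passes as one possible cause of "marked as ready twice". PyTorch's `torch.utils.checkpoint.checkpoint` also offers a non-reentrant mode via `use_reentrant=False`. Whether ms-swift exposes that option, and whether it would remove the error for this model, is not confirmed anywhere in this thread. The single-process sketch below only illustrates the call and checks that it yields the same gradients as a plain backward pass; the small `block` module is a hypothetical stand-in.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
block = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))  # toy stand-in
x = torch.randn(4, 8)

# Reference gradients without activation checkpointing.
block.zero_grad(set_to_none=True)
block(x).sum().backward()
ref = [p.grad.clone() for p in block.parameters()]

# Same computation with non-reentrant checkpointing: activations inside
# `block` are recomputed during backward instead of being stored.
block.zero_grad(set_to_none=True)
checkpoint(block, x, use_reentrant=False).sum().backward()
ckpt = [p.grad.clone() for p in block.parameters()]

grads_match = all(torch.allclose(a, b) for a, b in zip(ref, ckpt))
print(grads_match)
```

The --deepspeed default-zero2 route reported above remains the only fix confirmed to work in this thread.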