huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0

[Bug] Missing `_is_using_mup` when resume checkpoint #198

Open xrsrke opened 4 months ago

xrsrke commented 4 months ago

Can't resume from a checkpoint of a model that doesn't use muP. Loading the saved `config.yaml` fails because `model_config` is still a plain dict when `__post_init__` tries to set `_is_using_mup` on it:

Traceback (most recent call last):
  File "/fsx/phuc/projects/reference/nanotron/run_generate.py", line 255, in <module>
    main()
  File "/fsx/phuc/projects/reference/nanotron/run_generate.py", line 71, in main
    config = get_config_from_file((args.ckpt_path / "config.yaml").as_posix())
  File "/fsx/phuc/projects/reference/nanotron/src/nanotron/config/config.py", line 430, in get_config_from_file
    config = get_config_from_dict(
  File "/fsx/phuc/projects/reference/nanotron/src/nanotron/config/config.py", line 391, in get_config_from_dict
    return from_dict(
  File "/fsx/phuc/projects/reference/env/lib/python3.10/site-packages/dacite/core.py", line 64, in from_dict
    value = _build_value(type_=field_type, data=field_data, config=config)
  File "/fsx/phuc/projects/reference/env/lib/python3.10/site-packages/dacite/core.py", line 99, in _build_value
    data = from_dict(data_class=type_, data=data, config=config)
  File "/fsx/phuc/projects/reference/env/lib/python3.10/site-packages/dacite/core.py", line 81, in from_dict
    instance = data_class(**init_values)
  File "<string>", line 8, in __init__
  File "/fsx/phuc/projects/reference/nanotron/src/nanotron/config/config.py", line 199, in __post_init__
    self.model_config._is_using_mup = isinstance(self.init_method, SpectralMupInit)
AttributeError: 'dict' object has no attribute '_is_using_mup'
[2024-06-14 03:38:04,551] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2747571) of binary: /fsx/phuc/projects/reference/env/bin/python
Traceback (most recent call last):
  File "/fsx/phuc/projects/reference/env/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/fsx/phuc/projects/reference/env/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/fsx/phuc/projects/reference/env/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/fsx/phuc/projects/reference/env/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/fsx/phuc/projects/reference/env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fsx/phuc/projects/reference/env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
run_generate.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-14_03:38:04
  host      : ip-26-0-160-103.ec2.internal
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2747571)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
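A minimal sketch of the failure mode and one possible guard. The class and field names below are illustrative stand-ins, not nanotron's actual config classes: when the nested config is deserialized as a plain dict rather than a dataclass instance, attribute assignment raises exactly this `AttributeError`, so converting the dict first (or skipping the assignment) would avoid the crash.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ModelArgs:  # illustrative stand-in for the real model config dataclass
    hidden_size: int = 16
    _is_using_mup: bool = field(default=False, repr=False)

@dataclass
class Config:  # illustrative stand-in for the config being rebuilt from YAML
    model_config: Any
    init_method: str = "standard"

    def __post_init__(self):
        # If model_config arrived as a plain dict (e.g. parsed from
        # config.yaml), convert it before setting private attributes --
        # assigning `some_dict._is_using_mup = ...` raises AttributeError.
        if isinstance(self.model_config, dict):
            public = {k: v for k, v in self.model_config.items()
                      if not k.startswith("_")}
            self.model_config = ModelArgs(**public)
        self.model_config._is_using_mup = (self.init_method == "spectral_mup")

cfg = Config(model_config={"hidden_size": 32})
print(cfg.model_config._is_using_mup)  # False
```

Without the `isinstance` guard, the last line of `__post_init__` reproduces the `AttributeError: 'dict' object has no attribute '_is_using_mup'` from the traceback above.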
bpopeters commented 6 hours ago

I am also encountering this problem when I attempt to run `run_generate.py`.