mosaicml / llm-foundry

LLM training code for Databricks foundation models
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
Apache License 2.0

AutoResume get error when multi-node training using NAS #417

Closed L1aoXingyu closed 1 year ago

L1aoXingyu commented 1 year ago

Environment

Collecting system information...
---------------------------------
System Environment Report        
Created: 2023-07-03 11:26:38 CST
---------------------------------

PyTorch information
-------------------
PyTorch version: 2.0.1+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: 6.0.0-1ubuntu2 (tags/RELEASE_600/final)
CMake version: version 3.26.3
Libc version: glibc-2.27

Python version: 3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.4.119-19-0009.11-x86_64-with-glibc2.27
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA Graphics Device
GPU 1: NVIDIA Graphics Device
GPU 2: NVIDIA Graphics Device
GPU 3: NVIDIA Graphics Device
GPU 4: NVIDIA Graphics Device
GPU 5: NVIDIA Graphics Device
GPU 6: NVIDIA Graphics Device
GPU 7: NVIDIA Graphics Device

Nvidia driver version: 470.82.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              124
On-line CPU(s) list: 0-123
Thread(s) per core:  2
Core(s) per socket:  31
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               106
Model name:          Intel(R) Xeon(R) Platinum 8374C CPU @ 2.70GHz
Stepping:            6
CPU MHz:             2693.648
BogoMIPS:            5387.29
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           48K
L1i cache:           32K
L2 cache:            1280K
L3 cache:            55296K
NUMA node0 CPU(s):   0-61
NUMA node1 CPU(s):   62-123
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb ibrs_enhanced fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 wbnoinvd arat avx512vbmi umip avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq arch_capabilities

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.24.1
[pip3] pytorch-ranger==0.1.1
[pip3] torch==2.0.1+cu118
[pip3] torch-optimizer==0.3.0
[pip3] torchaudio==2.0.2+cu118
[pip3] torchdata==0.6.1
[pip3] torchmetrics==0.11.4
[pip3] torchtext==0.15.2
[pip3] torchvision==0.15.2+cu118
[conda] numpy                     1.24.1                   pypi_0    pypi
[conda] pytorch-ranger            0.1.1                    pypi_0    pypi
[conda] torch                     2.0.1+cu118              pypi_0    pypi
[conda] torch-optimizer           0.3.0                    pypi_0    pypi
[conda] torchaudio                2.0.2+cu118              pypi_0    pypi
[conda] torchdata                 0.6.1                    pypi_0    pypi
[conda] torchmetrics              0.11.4                   pypi_0    pypi
[conda] torchtext                 0.15.2                   pypi_0    pypi
[conda] torchvision               0.15.2+cu118             pypi_0    pypi

Composer information
--------------------
Composer version: 0.15.0
Composer commit hash: None
Host processor model name: Intel(R) Xeon(R) Platinum 8374C CPU @ 2.70GHz
Host processor core count: 62
Number of nodes: 1
Accelerator model name: NVIDIA Graphics Device
Accelerators per node: 1
CUDA Device Count: 8


Expected behavior

I use autoresume, but got this error:

╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /share_nfs/open-llm-foundry/scripts/train/train.py:320 in <module> │
│                                                                              │
│   317 │   │   yaml_cfg = om.load(f)                                          │
│   318 │   cli_cfg = om.from_cli(args_list)                                   │
│   319 │   cfg = om.merge(yaml_cfg, cli_cfg)                                  │
│ ❱ 320 │   main(cfg)                                                          │
│   321                                                                        │
│                                                                              │
│ /share_nfs/open-llm-foundry/scripts/train/train.py:262 in main     │
│                                                                              │
│   259 │                                                                      │
│   260 │   # Build the Trainer                                                │
│   261 │   print('Building trainer...')                                       │
│ ❱ 262 │   trainer = Trainer(                                                 │
│   263 │   │   run_name=cfg.run_name,                                         │
│   264 │   │   seed=cfg.seed,                                                 │
│   265 │   │   model=model,                                                   │
│                                                                              │
│ /usr/lib/python3/dist-packages/composer/trainer/trainer.py:1326 in __init__  │
│                                                                              │
│   1323 │   │   │   │   │   'Multiple concurrent uploads is not currently sup │
│   1324 │   │   │   │   │   'for all `RemoteUploaderDownloader` instances.')  │
│   1325 │   │   │   assert latest_remote_file_name is not None                │
│ ❱ 1326 │   │   │   autoresume_checkpoint_path = self._get_autoresume_checkpo │
│   1327 │   │   │   │   save_folder=save_folder,                              │
│   1328 │   │   │   │   save_latest_filename=save_latest_filename,            │
│   1329 │   │   │   │   save_latest_remote_file_name=latest_remote_file_name, │
│                                                                              │
│ /usr/lib/python3/dist-packages/composer/trainer/trainer.py:1508 in           │
│ _get_autoresume_checkpoint                                                   │
│                                                                              │
│   1505 │   │   │   │   │   │   │   │   │   │   │   '.local_rank0_completed_a │
│   1506 │   │   │   if dist.get_local_rank() == 0:                            │
│   1507 │   │   │   │   os.makedirs(os.path.dirname(signal_file_path), exist_ │
│ ❱ 1508 │   │   │   │   with open(signal_file_path, 'wb') as f:               │
│   1509 │   │   │   │   │   f.write(b'local_rank0_completed_autoresume')      │
│   1510 │   │   │                                                             │
│   1511 │   │   │   # Avoid the collective call until the local rank zero has │
╰──────────────────────────────────────────────────────────────────────────────╯
FileExistsError: [Errno 17] File exists: 
'/share_nfs/model/.local_rank0_completed_autoresume'

Maybe the reason is that all nodes access the same save folder.


L1aoXingyu commented 1 year ago

Maybe this is a race condition in multi-node training when using NFS.
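
For what it's worth, here is a minimal sketch (not Composer's actual implementation) of one way to tolerate a save folder shared by all nodes: create the rank-0 marker file atomically and treat "already exists" as success. touch_autoresume_signal is a hypothetical helper, not a Composer API.

import os

def touch_autoresume_signal(signal_file_path: str) -> None:
    """Create the rank-0 marker, tolerating another node having created it first."""
    os.makedirs(os.path.dirname(signal_file_path), exist_ok=True)
    try:
        # O_CREAT | O_EXCL makes the creation atomic, even on a shared filesystem.
        fd = os.open(signal_file_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return  # another node's local rank 0 already wrote the marker
    with os.fdopen(fd, 'wb') as f:
        f.write(b'local_rank0_completed_autoresume')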

dakinggg commented 1 year ago

Hey @L1aoXingyu, I think you're right that this is a bug when doing autoresume from NFS that is shared by all nodes. If you specify the load_path directly (rather than using autoresume: true), does the run resume ok?
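
For reference, a minimal sketch of switching a run from autoresume to an explicit load_path by rewriting the config with omegaconf; the yaml file names are hypothetical, and the checkpoint path assumes the first run saved to its save_folder with the default latest-rank{rank}.pt naming.

from omegaconf import OmegaConf

# Load the original training yaml, disable autoresume, and point load_path at
# the latest checkpoint written by the first run (one file per rank).
cfg = OmegaConf.load('train_config.yaml')
cfg.autoresume = False
cfg.load_path = './output/test_resume/checkpoints/latest-rank{rank}.pt'
OmegaConf.save(cfg, 'train_config_resume.yaml')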

L1aoXingyu commented 1 year ago

I use load_path but get the following error. BTW, my max_duration is 1ep.

Starting training...
Traceback (most recent call last):
  File "/share_nfs/chengpeng/open-llm-foundry/scripts/train/train.py", line 320, in <module>
    main(cfg)
  File "/share_nfs/chengpeng/open-llm-foundry/scripts/train/train.py", line 309, in main
    trainer.fit()
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 1804, in fit
    self._train_loop()
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 1925, in _train_loop
    self._spin_dataloaders_to_cur_epoch()
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 1875, in _spin_dataloaders_to_cur_epoch
    for _ in dataloader:
  File "/usr/lib/python3/dist-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/usr/lib/python3/dist-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
    return self._process_data(data)
  File "/usr/lib/python3/dist-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
    data.reraise()
  File "/usr/lib/python3/dist-packages/torch/_utils.py", line 542, in reraise
    raise RuntimeError(msg) from None
RuntimeError: Caught JSONDecodeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/lib/python3/dist-packages/torch/utils/data/_utils/fetch.py", line 34, in fetch
    data.append(next(self.dataset_iter))
  File "/usr/lib/python3/dist-packages/streaming/base/dataset.py", line 1265, in __iter__
    epoch, sample_in_epoch = self._resume_incr_epoch(world)
  File "/usr/lib/python3/dist-packages/streaming/base/dataset.py", line 589, in _resume_incr_epoch
    epoch, sample_in_epoch = self._resume(world, presumed_epoch)
  File "/usr/lib/python3/dist-packages/streaming/base/dataset.py", line 555, in _resume
    obj = json.loads(buf.decode('utf-8'))
  File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.10/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 84 (char 83)

@dakinggg Any suggestions?

dakinggg commented 1 year ago

Could you share your yaml for training? The original one, and the new one for resuming.

L1aoXingyu commented 1 year ago

My training yaml is the same for the original run and the resumed run:

data_local: /dataset/home/liaoxingyu/datasets
data_remote: # If blank, files must be present in data_local
max_seq_len: 2048
global_seed: 17

# Run Name
# <data>-gpt-<#params>-<precision>-<arch>-<#bsz>-<#ctxlen>-<#tok>-<#nodes>-<cluster-name>-<etc>
# run_name: py_java_js-gpt-1.1b-amp_bf16-MQA_flash-gbsz192-ctxlen2048-tokn118b-wmup2000ba
run_name: test_resume

# Model
model:
  name: mpt_causal_lm
  init_device: meta
  emb_pdrop: 0.1
  resid_pdrop: 0.1
  d_model: 2048
  expansion_ratio: 4
  n_heads: 16 # Modified 24->16 so that d_head == 128 to satisfy FlashAttention
  n_layers: 24
  max_seq_len: ${max_seq_len}
  vocab_size: 49280
  multiquery_attention: true
  attn_config:
    attn_impl: flash
    attn_pdrop: 0.1

# Tokenizer
tokenizer:
  name: /dataset/home/liaoxingyu/models/starcoderbase
  kwargs:
    model_max_length: ${max_seq_len}

# Dataloaders
train_loader:
  name: text
  dataset:
    shuffle: true
    max_seq_len: ${max_seq_len}
    shuffle_seed: ${global_seed}
    fim_rate: 0.5
    # data mixture
    streams:
      python:
        local: ${data_local}/my-copy-the-stack-python-v1
        proportion: 0.2
        split: train
      java:
        local: ${data_local}/my-copy-the-stack-java-v1
        proportion: 0.25
        split: train
      javascript:
        local: ${data_local}/my-copy-the-stack-javascript-v1
        proportion: 0.55
        split: train
  drop_last: true
  num_workers: 0
  persistent_workers: false

eval_loader:
  name: text
  dataset:
    local: ${data_local}/my-copy-the-stack-java-v1
    remote: ${data_remote}
    split: val_small
    shuffle: false
    max_seq_len: ${max_seq_len}
    shuffle_seed: ${global_seed}
  drop_last: false
  num_workers: 8

# Optimization
scheduler:
  name: cosine_with_warmup
  t_warmup: 2000ba
  alpha_f: 0.1

optimizer:
  name: decoupled_adamw
  lr: 2.0e-4
  betas:
  - 0.9
  - 0.95
  eps: 1.0e-08
  weight_decay: 0.0

algorithms:
  gradient_clipping:
    clipping_type: norm
    clipping_threshold: 1.0

max_duration: 1ep # ~ 95B tokens
eval_interval: 30000ba
eval_first: false
eval_subset_num_batches: -1
global_train_batch_size: 4

# System
seed: ${global_seed}
device_eval_batch_size: 16
# device_train_microbatch_size: 12
device_train_microbatch_size: auto
precision: amp_bf16

# FSDP
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE
  activation_checkpointing: false
  activation_checkpointing_reentrant: false
  activation_cpu_offload: false
  limit_all_gathers: true
  verbose: false

# Logging
progress_bar: true
log_to_console: false
console_log_interval: 20ba

callbacks:
  speed_monitor:
    window_size: 10
  lr_monitor: {}
  memory_monitor: {}
  runtime_estimator: {}

# loggers:
#   wandb:
#     project: llm-foundry-test

# Checkpoint to local filesystem or remote object store
save_interval: 30ba
save_num_checkpoints_to_keep: 100 # Important, this cleans up checkpoints saved to DISK
save_folder: ./output/${run_name}/checkpoints

autoresume: true
# Load from local filesystem or remote object store
# load_path: ./output/${run_name}/checkpoints/latest-rank0.pt
# load_path: s3://my-bucket/my-folder/gpt-125m/checkpoints/latest-rank{rank}.pt

And I can reproduce the error with just 1-GPU training. Could you shed some light on this problem? @dakinggg

L1aoXingyu commented 1 year ago

I just found that if I autoresume from a saved epoch boundary, it succeeds. For example, I train for 100 ep and save every 10 ep. If I kill the training after 40 ep and then autoresume, the model continues training from 40 ep.

L1aoXingyu commented 1 year ago

But since I am pretraining an LLM, I usually train for only 1ep. The program may crash at, say, 4000ba, and I want to autoresume from there. That is what triggers the error above.

L1aoXingyu commented 1 year ago

I just found the root cause. The error comes from streaming/base/dataset.py:

# Get the resume state, if it exists.
name = _get_path(self._shm_prefix_int, RESUME)
try:
    shm = SharedMemory(name=name, create=False)
except FileNotFoundError:
    # There is nothing to resume.
    if not self.num_canonical_nodes:
        self.num_canonical_nodes = world.num_nodes * 64
    self._set_predownload()
    return epoch, 0

# SharedMemory buffers may contain additional null bytes at the end.
buf = bytes(shm.buf)
index = buf.find(b'\0')
buf = buf[:index] if index != -1 else buf

I printed buf and got the following result:

b'{"epoch": 0, "num_canonical_nodes": 64, "sample_in_epoch": 0, "shuffle_seed": 17}7}'

It seems the remaining space was not filled with null bytes. So I changed the index-finding line as follows:

index = buf.find(b'}')
buf = buf[:index+1] if index != -1 else buf

Then I can resume successfully. So I want to know whether this is a bug that needs to be fixed.
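
For anyone hitting the same thing, here is a minimal standalone sketch (not a patch to streaming) that reproduces the parse failure with the buf printed above and shows that json itself can tolerate the stale tail via raw_decode:

import json

# Shared-memory contents with stale trailing bytes left over from an earlier, longer write.
buf = b'{"epoch": 0, "num_canonical_nodes": 64, "sample_in_epoch": 0, "shuffle_seed": 17}7}'

try:
    json.loads(buf.decode('utf-8'))
except json.JSONDecodeError as e:
    print('json.loads rejects the stale tail:', e)  # "Extra data: ..."

# raw_decode() parses the first complete JSON object and reports where it ended,
# ignoring whatever bytes follow.
obj, end = json.JSONDecoder().raw_decode(buf.decode('utf-8'))
print(obj['sample_in_epoch'], end)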

L1aoXingyu commented 1 year ago

Maybe the root cause above is due to my own mistake. I had run into an error about local directory reuse, which I fixed as follows:

import streaming.base.util

def build_dataloader(cfg, tokenizer, device_batch_size):
    if cfg.name == 'text':
        # Clear any stale StreamingDataset shared memory before building the loader.
        streaming.base.util.clean_stale_shared_memory()
        return build_text_dataloader(
            cfg,
            tokenizer,
            device_batch_size,
        )

When I use train_loader and eval_loader simultaneously, the train dataset's shared memory is overridden by the eval dataset's shared memory. When I resume training, this causes the buf problem because of the loading order: the train dataset is loaded first, then the eval dataset, so the train loader's buf gets overwritten when the eval loader's buf is loaded.

So I think load_path has no extra problems, and autoresume still has the same problem when using NAS.

mvpatel2000 commented 1 year ago

@L1aoXingyu we've fixed this issue in Composer dev https://github.com/mosaicml/composer/pull/2363

Thanks for reporting it! I'm going to close this issue for now. If this is blocking, you can try installing from source at the following commit: 15e1b0439d3ad0c3ddb7e2c2cbbda7f424b4b702, but fair warning, dev might be unstable at times.