Maybe this is a race condition in multi-node training when using NFS.
Hey @L1aoXingyu, I think you're right that this is a bug when doing autoresume from NFS that is shared by all nodes. If you specify the load_path directly (rather than using autoresume: true), does the run resume ok?
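(For reference, a minimal hypothetical sketch of the two resume modes at the Composer level; in llm-foundry both are set in the training YAML and passed through to the Trainer. The model, dataloader, and paths below are placeholders, not taken from this run.)

# Hypothetical sketch, not the user's actual script: `model` and `train_dataloader`
# are assumed to be built elsewhere, and the paths are placeholders.
from composer import Trainer

def make_trainer(model, train_dataloader, resume_from_latest=True):
    return Trainer(
        model=model,
        train_dataloader=train_dataloader,
        run_name='test_resume',
        max_duration='1ep',
        save_folder='./output/test_resume/checkpoints',
        save_interval='30ba',
        # autoresume=True makes every run look for the latest checkpoint in save_folder;
        # load_path instead points the run at one explicit checkpoint file.
        autoresume=resume_from_latest,
        load_path=None if resume_from_latest
        else './output/test_resume/checkpoints/latest-rank{rank}.pt',
    )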
I used load_path but got the following error. BTW, my max_duration is 1ep.
Starting training...
Traceback (most recent call last):
  File "/share_nfs/chengpeng/open-llm-foundry/scripts/train/train.py", line 320, in <module>
    main(cfg)
  File "/share_nfs/chengpeng/open-llm-foundry/scripts/train/train.py", line 309, in main
    trainer.fit()
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 1804, in fit
    self._train_loop()
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 1925, in _train_loop
    self._spin_dataloaders_to_cur_epoch()
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 1875, in _spin_dataloaders_to_cur_epoch
    for _ in dataloader:
  File "/usr/lib/python3/dist-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/usr/lib/python3/dist-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
    return self._process_data(data)
  File "/usr/lib/python3/dist-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
    data.reraise()
  File "/usr/lib/python3/dist-packages/torch/_utils.py", line 542, in reraise
    raise RuntimeError(msg) from None
RuntimeError: Caught JSONDecodeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/lib/python3/dist-packages/torch/utils/data/_utils/fetch.py", line 34, in fetch
    data.append(next(self.dataset_iter))
  File "/usr/lib/python3/dist-packages/streaming/base/dataset.py", line 1265, in __iter__
    epoch, sample_in_epoch = self._resume_incr_epoch(world)
  File "/usr/lib/python3/dist-packages/streaming/base/dataset.py", line 589, in _resume_incr_epoch
    epoch, sample_in_epoch = self._resume(world, presumed_epoch)
  File "/usr/lib/python3/dist-packages/streaming/base/dataset.py", line 555, in _resume
    obj = json.loads(buf.decode('utf-8'))
  File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.10/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 84 (char 83)
@dakinggg Any suggestions?
Could you share your yaml for training? The original one, and the new one for resuming.
My training YAML is the same for the original run and the resuming run:
data_local: /dataset/home/liaoxingyu/datasets
data_remote: # If blank, files must be present in data_local
max_seq_len: 2048
global_seed: 17

# Run Name
# <data>-gpt-<#params>-<precision>-<arch>-<#bsz>-<#ctxlen>-<#tok>-<#nodes>-<cluster-name>-<etc>
# run_name: py_java_js-gpt-1.1b-amp_bf16-MQA_flash-gbsz192-ctxlen2048-tokn118b-wmup2000ba
run_name: test_resume

# Model
model:
  name: mpt_causal_lm
  init_device: meta
  emb_pdrop: 0.1
  resid_pdrop: 0.1
  d_model: 2048
  expansion_ratio: 4
  n_heads: 16 # Modified 24->16 so that d_head == 128 to satisfy FlashAttention
  n_layers: 24
  max_seq_len: ${max_seq_len}
  vocab_size: 49280
  multiquery_attention: true
  attn_config:
    attn_impl: flash
    attn_pdrop: 0.1

# Tokenizer
tokenizer:
  name: /dataset/home/liaoxingyu/models/starcoderbase
  kwargs:
    model_max_length: ${max_seq_len}

# Dataloaders
train_loader:
  name: text
  dataset:
    shuffle: true
    max_seq_len: ${max_seq_len}
    shuffle_seed: ${global_seed}
    fim_rate: 0.5
    # data mixture
    streams:
      python:
        local: ${data_local}/my-copy-the-stack-python-v1
        proportion: 0.2
        split: train
      java:
        local: ${data_local}/my-copy-the-stack-java-v1
        proportion: 0.25
        split: train
      javascript:
        local: ${data_local}/my-copy-the-stack-javascript-v1
        proportion: 0.55
        split: train
  drop_last: true
  num_workers: 0
  persistent_workers: false

eval_loader:
  name: text
  dataset:
    local: ${data_local}/my-copy-the-stack-java-v1
    remote: ${data_remote}
    split: val_small
    shuffle: false
    max_seq_len: ${max_seq_len}
    shuffle_seed: ${global_seed}
  drop_last: false
  num_workers: 8

# Optimization
scheduler:
  name: cosine_with_warmup
  t_warmup: 2000ba
  alpha_f: 0.1

optimizer:
  name: decoupled_adamw
  lr: 2.0e-4
  betas:
  - 0.9
  - 0.95
  eps: 1.0e-08
  weight_decay: 0.0

algorithms:
  gradient_clipping:
    clipping_type: norm
    clipping_threshold: 1.0

max_duration: 1ep # ~ 95B tokens
eval_interval: 30000ba
eval_first: false
eval_subset_num_batches: -1
global_train_batch_size: 4

# System
seed: ${global_seed}
device_eval_batch_size: 16
# device_train_microbatch_size: 12
device_train_microbatch_size: auto
precision: amp_bf16

# FSDP
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE
  activation_checkpointing: false
  activation_checkpointing_reentrant: false
  activation_cpu_offload: false
  limit_all_gathers: true
  verbose: false

# Logging
progress_bar: true
log_to_console: false
console_log_interval: 20ba

callbacks:
  speed_monitor:
    window_size: 10
  lr_monitor: {}
  memory_monitor: {}
  runtime_estimator: {}

# loggers:
#   wandb:
#     project: llm-foundry-test

# Checkpoint to local filesystem or remote object store
save_interval: 30ba
save_num_checkpoints_to_keep: 100 # Important, this cleans up checkpoints saved to DISK
save_folder: ./output/${run_name}/checkpoints
autoresume: true

# Load from local filesystem or remote object store
# load_path: ./output/${run_name}/checkpoints/latest-rank0.pt
# load_path: s3://my-bucket/my-folder/gpt-125m/checkpoints/latest-rank{rank}.pt
And I can reproduce the error with just 1-GPU training. Could you shed some light on this problem? @dakinggg
I just found that if I autoresume from a completed epoch, it succeeds. For example, if I train for 100 epochs and save every 10 epochs, then kill the training after 40 epochs, autoresume picks training back up from epoch 40.
But when I pretrain an LLM, I usually train for only 1 epoch. The program may crash at, say, 4000ba, and I want to autoresume from there. This causes the error below.
I just found the root cause. The error comes from streaming/base/dataset.py:
# Get the resume state, if it exists.
name = _get_path(self._shm_prefix_int, RESUME)
try:
    shm = SharedMemory(name=name, create=False)
except FileNotFoundError:
    # There is nothing to resume.
    if not self.num_canonical_nodes:
        self.num_canonical_nodes = world.num_nodes * 64
    self._set_predownload()
    return epoch, 0

# SharedMemory buffers may contain additional null bytes at the end.
buf = bytes(shm.buf)
index = buf.find(b'\0')
buf = buf[:index] if index != -1 else buf
I printed buf and got the following result:
b'{"epoch": 0, "num_canonical_nodes": 64, "sample_in_epoch": 0, "shuffle_seed": 17}7}'
It seems that the remaining space was not filled with null bytes. So I changed the index-finding lines to the following:
index = buf.find(b'}')
buf = buf[:index + 1] if index != -1 else buf
Then I can resume successfully. So I want to know if this is a bug that needs to be fixed.
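The trailing '7}' is consistent with a shorter resume state being written over a longer one in the same SharedMemory block without zero-filling the remainder. A minimal standalone sketch that reproduces the same kind of residue and the same "Extra data" error (the shared memory name and payloads here are made up for illustration, not streaming's internal naming):

import json
from multiprocessing.shared_memory import SharedMemory

# A freshly created block is zero-filled; the name and size are arbitrary for this demo.
shm = SharedMemory(name='resume_state_demo', create=True, size=128)
try:
    state = {'epoch': 0, 'num_canonical_nodes': 64, 'sample_in_epoch': 0}
    longer = json.dumps({**state, 'shuffle_seed': 1717}).encode('utf-8')  # written first
    shorter = json.dumps({**state, 'shuffle_seed': 17}).encode('utf-8')   # written later, 2 bytes shorter
    shm.buf[:len(longer)] = longer
    shm.buf[:len(shorter)] = shorter  # the last 2 bytes of the longer payload survive

    # Same trimming logic as streaming's _resume: cut at the first null byte.
    buf = bytes(shm.buf)
    index = buf.find(b'\0')
    buf = buf[:index] if index != -1 else buf
    print(buf)  # b'{..., "shuffle_seed": 17}7}'  <- stale '7}' left over
    try:
        json.loads(buf)
    except json.JSONDecodeError as e:
        print(e)  # e.g. 'Extra data: line 1 column 84 (char 83)', as in the traceback above
finally:
    shm.close()
    shm.unlink()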
Maybe the root cause above is due to my own mistake. I also hit an error about local directory reuse, which I fixed as follows:
import streaming.base.util

def build_dataloader(cfg, tokenizer, device_batch_size):
    if cfg.name == 'text':
        # Clean up stale shared memory from a previous (crashed) run before
        # building the streaming text dataloader.
        streaming.base.util.clean_stale_shared_memory()
        return build_text_dataloader(
            cfg,
            tokenizer,
            device_batch_size,
        )
When I use the train_loader and eval_loader simultaneously, the train dataset's shared memory is overwritten by the eval dataset's shared memory. When I resume training, this causes the buf problem because of the loading order: the train dataset is loaded first and then the eval loader, so the train loader's buf gets changed when the eval loader's buf is loaded.
So I think load_path has no extra problems, and autoresume still has the same problem when using NAS.
@L1aoXingyu we've fixed this issue in Composer dev: https://github.com/mosaicml/composer/pull/2363
Thanks for reporting it! I'm going to close this issue for now. If this is blocking you, you can try installing from source at commit 15e1b0439d3ad0c3ddb7e2c2cbbda7f424b4b702, but fair warning: dev might be unstable at times.
Expected behavior
I used autoresume but got this error.
Maybe the reason is that all nodes access the same save folder.