Closed: soonjune closed this issue 1 year ago.
Hi @soonjune,
I'm not sure about the immediate cause of your run hanging, but I do see one issue with your YAML: you should be using `hf_causal_lm` rather than `mpt_causal_lm` in order to load an HF checkpoint (whether that is mosaicml/mpt-7b or any other HF model).
See here in the finetuning example: https://github.com/mosaicml/llm-foundry/blob/9027f49153d89e6b0b225af3626311a9b4658dbf/scripts/train/finetune_example/mpt-7b-arc-easy--gpu.yaml#L7-L23
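For reference, a minimal sketch of the model block with only the loader name swapped (everything else copied from your YAML; the remaining fields may need to sit under `config_overrides` for `hf_causal_lm`, so follow the linked example for the exact layout):

```yaml
model:
  name: hf_causal_lm  # was: mpt_causal_lm
  pretrained: true
  pretrained_model_name_or_path: /root/llm-foundry/mpt-7b  # local MPT-7B checkpoint, as in your YAML
  # max_seq_len / attn_config may need to live under config_overrides here --
  # see the linked finetuning example for the exact nesting
```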
I would also recommend using a specific microbatch size like `device_train_microbatch_size: 4` rather than `auto`, as the latter relies on catching OOMs, which I believe currently only works on NVIDIA cards.
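In the System section of your YAML that would look roughly like this (the value 4 is just a starting point; tune it to your GPU memory):

```yaml
device_train_microbatch_size: 4
# device_train_microbatch_size: auto  # avoid `auto` on AMD for now
```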
I would also recommend using `attn_impl: torch` for finetuning our MPT-7B model, as we are still integrating ALiBi support into the AMD + FlashAttention stack. As soon as we complete testing of `triton`-based FlashAttention on AMD, you'll be able to use `attn_impl: triton` just like we do on NVIDIA cards.
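In your model block that would be (a sketch; `torch` replaces the `flash` you currently have):

```yaml
attn_config:
  attn_impl: torch  # switch to `triton` once AMD FlashAttention testing is complete
```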
Thank you for the reply. I found that the hang was due to the ROCm version on the host device. Hope AMD support comes out soon!
I am trying to reproduce the experiment described in https://www.mosaicml.com/blog/amd-mi250
There is a related issue, but the result provided there does not help me reproduce the experiment.
The goal is to finetune mpt-7b into mpt-7b-instruct on 4x AMD MI250. When I run `composer train.py /path/to/yaml_file`, GPU utilization sits at 100% but there is no visible training progress. Can you share your YAML file or more specific steps to reproduce the experiment? The message I get at the end is:
Environment
To reproduce
Steps to reproduce the behavior:
```yaml
# Run Name
run_name:  # If left blank, will be read from env var $RUN_NAME

model:
  name: mpt_causal_lm
  pretrained: true
  pretrained_model_name_or_path: /root/llm-foundry/mpt-7b
  max_seq_len: ${max_seq_len}
  attn_config:
    attn_impl: flash
  # Set this to `true` if using `train_loader.dataset.packing_ratio` below
  loss_fn: torch_crossentropy

# Tokenizer
tokenizer:
  name: /root/llm-foundry/mpt-7b
  kwargs:
    model_max_length: ${max_seq_len}

# Dataloaders
train_loader:
  name: finetuning
  dataset:
    hf_name: json
    hf_kwargs:
      data_dir: /root/llm-foundry/data
    split: train
    max_seq_len: ${max_seq_len}
    allow_pad_trimming: false
    decoder_only_format: true
    # Use `python llmfoundry/data/packing.py --yaml-path /path/to/this/yaml/ ...`
  drop_last: true
  num_workers: 8
  pin_memory: false
  prefetch_factor: 2
  persistent_workers: true
  timeout: 0

eval_loader:
  name: finetuning
  dataset:
    hf_name: json
    hf_kwargs:
      data_dir: /root/llm-foundry/data
    split: test
    max_seq_len: ${max_seq_len}
    allow_pad_trimming: false
    decoder_only_format: true
    # packing_ratio:
  drop_last: true
  num_workers: 8
  pin_memory: false
  prefetch_factor: 2
  persistent_workers: true
  timeout: 0

# Optimization
scheduler:
  name: linear_decay_with_warmup  # linear no warmup is HF default which dolly used
  t_warmup: 50ba  # add some warmup though, seems to help with MPT
  alpha_f: 0

optimizer:
  # Based on Dolly
  name: decoupled_adamw
  lr: 5.0e-6
  betas:

algorithms:
  gradient_clipping:
    clipping_type: norm
    clipping_threshold: 1.0

max_duration: 2ep  # 2-3 epochs seems like the sweet spot
eval_interval: 1ep
# eval_subset_num_batches: -1
eval_first: true
global_train_batch_size: 512  # somewhere in the 6-8 * numgpus range seems good

# System
seed: ${global_seed}
device_eval_batch_size: 8
# device_train_microbatch_size: 4
device_train_microbatch_size: auto
precision: amp_bf16

# FSDP
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE
  activation_checkpointing: false
  activation_checkpointing_reentrant: false
  activation_cpu_offload: false
  limit_all_gathers: true
  verbose: false

# Logging
progress_bar: true
log_to_console: true
console_log_interval: 1ba

callbacks:
  speed_monitor:
    window_size: 10
    gpu_flops_available: true
  lr_monitor: {}
  memory_monitor: {}
  runtime_estimator: {}
```
```bash
cd ~/llm-foundry/scripts/train
PYTHONPATH=$PWD composer train.py finetune_example/mpt-7b-arc-easy--gpu.yaml
```