mosaicml / llm-foundry

LLM training code for Databricks foundation models
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
Apache License 2.0

Reproduction of AMD GPU tests #677

Closed · soonjune closed this issue 9 months ago

soonjune commented 9 months ago

I am trying to reproduce the experiment mentioned in https://www.mosaicml.com/blog/amd-mi250

There is a related issue, but the result provided there does not help me reproduce the experiment.

The goal is to finetune mpt-7b into mpt-7b-instruct on 4x AMD MI250 GPUs. When I run the command composer train.py /path/to/yaml_file, GPU utilization sits at 100%, but there is no training progress. Can you share your YAML file or more specific steps to reproduce the experiment? The message I get at the end is:

rank0[28980][MainThread]: DEBUG: composer.trainer.trainer: Spinning the dataloaders
Deterministic: False
Performance Mode: True
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=158, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 603984 milliseconds before timing out.

Environment

PyTorch version: 2.0.1+git8bfa463
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 5.6.31061-8c743ae5d

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: 16.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.6.0 23243 be997b2f3651a41597d7a41441fff8ade4ac59ac)
CMake version: version 3.26.3
Libc version: glibc-2.31

Python version: 3.8.16 (default, Jun 12 2023, 18:09:05)  [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.4.0-113-generic-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: AMD Instinct MI250X/MI250
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 5.6.31061
MIOpen runtime version: 2.20.0
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   48 bits physical, 48 bits virtual
CPU(s):                          128
On-line CPU(s) list:             0-127
Thread(s) per core:              1
Core(s) per socket:              64
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       AuthenticAMD
CPU family:                      25
Model:                           1
Model name:                      AMD EPYC 7763 64-Core Processor
Stepping:                        1
Frequency boost:                 enabled
CPU MHz:                         2506.523
CPU max MHz:                     2450.0000
CPU min MHz:                     1500.0000
BogoMIPS:                        4890.84
Virtualization:                  AMD-V
L1d cache:                       4 MiB
L1i cache:                       4 MiB
L2 cache:                        64 MiB
L3 cache:                        512 MiB
NUMA node0 CPU(s):               0-63
NUMA node1 CPU(s):               64-127
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca

Versions of relevant libraries:
[pip3] mypy==0.960
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.24.4
[pip3] onnx==1.14.0
[pip3] onnxruntime==1.15.1
[pip3] pytorch-ranger==0.1.1
[pip3] pytorch-triton-rocm==2.0.0
[pip3] torch==2.0.1+git8bfa463
[pip3] torch-optimizer==0.3.0
[pip3] torchmetrics==1.0.3
[pip3] torchvision==0.15.2a0+fa99a53
[pip3] triton==2.0.0
[pip3] triton-pre-mlir==2.0.0
[conda] No relevant packages

To reproduce

Steps to reproduce the behavior:

  1. pull image rocm/pytorch:rocm5.6_ubuntu20.04_py3.8_pytorch_2.0.1
  2. install flash-attention https://github.com/ROCmSoftwarePlatform/flash-attention/#amd-gpurocm-support
  3. Download mpt-7b and dolly_hhrlhf dataset as train.jsonl and test.jsonl
  4. modify yaml (~/llm-foundry/scripts/train/finetune_example/mpt-7b-arc-easy--gpu.yaml) as shown below
    
    max_seq_len: 2048
    global_seed: 17

    # Run Name
    run_name: # If left blank, will be read from env var $RUN_NAME

    model:
      name: mpt_causal_lm
      pretrained: true
      pretrained_model_name_or_path: /root/llm-foundry/mpt-7b
      max_seq_len: ${max_seq_len}
      attn_config:
        attn_impl: flash
        # Set this to true if using train_loader.dataset.packing_ratio below
        attn_uses_sequence_id: false
      loss_fn: torch_crossentropy

    # Tokenizer
    tokenizer:
      name: /root/llm-foundry/mpt-7b
      kwargs:
        model_max_length: ${max_seq_len}

    # Dataloaders
    train_loader:
      name: finetuning
      dataset:
        hf_name: json
        hf_kwargs:
          data_dir: /root/llm-foundry/data
        split: train
        max_seq_len: ${max_seq_len}
        allow_pad_trimming: false
        decoder_only_format: true
        # Use python llmfoundry/data/packing.py --yaml-path /path/to/this/yaml/ ...
        # to profile this run's optimal packing_ratio as it depends on GPU count,
        # batch size, sequence length
        # packing_ratio:
        shuffle: true
      drop_last: true
      num_workers: 8
      pin_memory: false
      prefetch_factor: 2
      persistent_workers: true
      timeout: 0

    eval_loader:
      name: finetuning
      dataset:
        hf_name: json
        hf_kwargs:
          data_dir: /root/llm-foundry/data
        split: test
        max_seq_len: ${max_seq_len}
        allow_pad_trimming: false
        decoder_only_format: true
        # packing_ratio:
        shuffle: false
      drop_last: true
      num_workers: 8
      pin_memory: false
      prefetch_factor: 2
      persistent_workers: true
      timeout: 0

    # Optimization
    scheduler:
      name: linear_decay_with_warmup  # linear no warmup is HF default which dolly used
      t_warmup: 50ba  # add some warmup though, seems to help with MPT
      alpha_f: 0

    optimizer:
      # Based on Dolly
      name: decoupled_adamw
      lr: 5.0e-6
      betas:

    algorithms:
      gradient_clipping:
        clipping_type: norm
        clipping_threshold: 1.0

    max_duration: 2ep  # 2-3 epochs seems like the sweet spot
    eval_interval: 1ep
    # eval_subset_num_batches: -1
    eval_first: true
    global_train_batch_size: 512  # somewhere in the 6-8 * numgpus range seems good

    # System
    seed: ${global_seed}
    device_eval_batch_size: 8
    # device_train_microbatch_size: 4
    device_train_microbatch_size: auto
    precision: amp_bf16

    # FSDP
    fsdp_config:
      sharding_strategy: FULL_SHARD
      mixed_precision: PURE
      activation_checkpointing: false
      activation_checkpointing_reentrant: false
      activation_cpu_offload: false
      limit_all_gathers: true
      verbose: false

    # Logging
    progress_bar: true
    log_to_console: true
    console_log_interval: 1ba

    callbacks:
      speed_monitor:
        window_size: 10
        gpu_flops_available: true
      lr_monitor: {}
      memory_monitor: {}
      runtime_estimator: {}

  5. cd ~/llm-foundry/scripts/train
     PYTHONPATH=$PWD composer train.py finetune_example/mpt-7b-arc-easy--gpu.yaml


Expected behavior

The finetuning run should make progress; instead it hangs with no training progress until the NCCL watchdog timeout shown above.
abhi-mosaic commented 9 months ago

Hi @soonjune,

I'm not sure about the immediate cause of your run hanging, but I think I see one issue with your YAML: you should be using hf_causal_lm rather than mpt_causal_lm in order to load an HF checkpoint (whether that is mosaicml/mpt-7b or any other HF model).

See here in the finetuning example: https://github.com/mosaicml/llm-foundry/blob/9027f49153d89e6b0b225af3626311a9b4658dbf/scripts/train/finetune_example/mpt-7b-arc-easy--gpu.yaml#L7-L23
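For illustration, a rough sketch of that model block, still pointing at your local checkpoint; the exact nesting (in particular config_overrides) follows the linked example and is worth double-checking against your llm-foundry version:

    model:
      name: hf_causal_lm
      pretrained: true
      pretrained_model_name_or_path: /root/llm-foundry/mpt-7b  # or mosaicml/mpt-7b
      config_overrides:
        max_seq_len: ${max_seq_len}
        attn_config:
          attn_impl: flash
          attn_uses_sequence_id: false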

I would also recommend using a specific microbatch size like device_train_microbatch_size: 4, rather than auto, as the latter relies on catching OOMs, which I believe currently only works on NVIDIA cards.
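In the YAML above, that would look something like:

    device_train_microbatch_size: 4  # fixed microbatch size instead of auto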

I would also recommend using attn_impl: torch for finetuning our MPT-7B model, as we are still integrating ALiBi support into the AMD + FlashAttention stack. As soon as we complete testing of triton-based FlashAttention on AMD, you'll be able to use attn_impl: triton just like we do on NVIDIA cards.
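As a sketch, the attention settings (nested under model, or under model.config_overrides if you switch to hf_causal_lm as above) would then read:

    attn_config:
      attn_impl: torch  # plain PyTorch attention path; switch back once triton FlashAttention is validated on AMD
      attn_uses_sequence_id: false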

soonjune commented 9 months ago

Thank you for the reply. I found that the hang was due to the ROCm version on the host machine. Hope AMD support comes out soon!