mosaicml / examples

Fast and flexible reference benchmarks
Apache License 2.0

Finetuning script broken? #420

Closed mscherrmann closed 11 months ago

mscherrmann commented 11 months ago

Hey,

since finetuning after importing the model into transformers is not possible, I tried the finetuning script that you provide. As a first step to test your finetuning framework, I ran the function 'test_classification_script()' from 'tests/test_classification.py'. For this I used a Linux server running Ubuntu with 4 x NVIDIA Tesla P100 (16 GB). For the setup, I followed all the steps that you recommend, i.e.:

I have installed CUDA release 11.7, as the following nvcc output shows:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0
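For reference, the CUDA build that PyTorch itself sees can be cross-checked with a few lines of Python (a minimal sanity check, assuming torch is installed; this is not one of the documented setup steps):

import torch
print(torch.version.cuda)             # e.g. '11.7' for a cu117 build
print(torch.cuda.is_available())      # True if the driver and GPUs are visible
print(torch.cuda.get_device_name(0))  # e.g. 'Tesla P100-PCIE-16GB'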

To test your finetuning script, I simply did the following in the console:

$ python
>>> from tests.test_classification import test_classification_script
>>> test_classification_script()
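Equivalently, the same test can be invoked through pytest from the repository's bert benchmark directory (assuming pytest is installed):

$ python -m pytest tests/test_classification.py::test_classification_script -s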

Here is the complete output:

Training using config:
tokenizer_name: prajjwal1/bert-tiny
max_seq_len: 32
run_name: test
model:
  name: mosaic_bert
  num_labels: 2
  pretrained_model_name: ${tokenizer_name}
  tokenizer_name: ${tokenizer_name}
train_loader:
  split: train
  tokenizer_name: ${tokenizer_name}
  max_seq_len: ${max_seq_len}
  shuffle: true
  drop_last: true
  num_workers: 4
eval_loader:
  split: validation
  tokenizer_name: ${tokenizer_name}
  max_seq_len: ${max_seq_len}
  shuffle: false
  drop_last: false
  num_workers: 4
scheduler:
  name: linear_decay_with_warmup
  t_warmup: 0.5dur
  alpha_f: 0.02
optimizer:
  name: decoupled_adamw
  lr: 0.0002
  betas:
  - 0.9
  - 0.95
  eps: 1.0e-08
  weight_decay: 0.0
max_duration: 8ba
eval_interval: 8ba
eval_subset_num_batches: 2
global_train_batch_size: 4
seed: 17
device_eval_batch_size: 4
device_train_microbatch_size: 2
precision: fp32
progress_bar: false
log_to_console: false
console_log_interval: 1ba
callbacks:
  speed_monitor:
    window_size: 4
  lr_monitor: {}

Initializing model...
n_params=4.4515e+06
Building train loader...
Found cached dataset glue (...)
Loading cached processed dataset at .../cache-qnli-prajjwal1,bert-tiny-tokenization-train.arrow
Building eval loader...
Found cached dataset glue (...)
Loading cached processed dataset at .../huggingface/datasets/glue/qnli/1.0.0.../cache-qnli-prajjwal1,bert-tiny-tokenization-validation.arrow
/usr/lib/python3/dist-packages/composer/callbacks/speed_monitor.py:120: UserWarning: gpu_flop count not found for None with precision: fp32; MFU cannot be calculated and reported. gpu_flops_available can be manually overridden by setting gpu_flops_available in SpeedMonitor.
  warnings.warn(
Logging config...
tokenizer_name: prajjwal1/bert-tiny
max_seq_len: 32
run_name: test
model:
  name: mosaic_bert
  num_labels: 2
  pretrained_model_name: ${tokenizer_name}
  tokenizer_name: ${tokenizer_name}
train_loader:
  split: train
  tokenizer_name: ${tokenizer_name}
  max_seq_len: ${max_seq_len}
  shuffle: true
  drop_last: true
  num_workers: 4
eval_loader:
  split: validation
  tokenizer_name: ${tokenizer_name}
  max_seq_len: ${max_seq_len}
  shuffle: false
  drop_last: false
  num_workers: 4
scheduler:
  name: linear_decay_with_warmup
  t_warmup: 0.5dur
  alpha_f: 0.02
optimizer:
  name: decoupled_adamw
  lr: 0.0002
  betas:
  - 0.9
  - 0.95
  eps: 1.0e-08
  weight_decay: 0.0
max_duration: 8ba
eval_interval: 8ba
eval_subset_num_batches: 2
global_train_batch_size: 4
seed: 17
device_eval_batch_size: 4
device_train_microbatch_size: 2
precision: fp32
progress_bar: false
log_to_console: false
console_log_interval: 1ba
callbacks:
  speed_monitor:
    window_size: 4
  lr_monitor: {}
n_gpus: 1
device_train_batch_size: 4

Starting training...
Traceback (most recent call last):
  File "<string>", line 21, in _fwd_kernel
KeyError: ('2-.-0-.-0-d82511111ad128294e9d31a6ac684238-7929002797455b30efce6e41eddc6b57-3aa563e00c5c695dd945e23b09a86848-d962222789c30252d492a16cca3bf467-ff946bd4b3b4a4cbdf8cedc6e1c658e0-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.float16, torch.float16, torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 'fp32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), ('matrix', False, 64, True, True, True, 128, 128), (True, True, True, True, True, True, True, (False,), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (False, False), (True, False), (True, False), (True, False), (True, False), (False, False), (False, False)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/examples/examples/benchmarks/bert/tests/test_classification.py", line 14, in test_classification_script
    main(config)
  File "/examples/examples/benchmarks/bert/sequence_classification.py", line 317, in main
    trainer.fit()
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 1766, in fit
    self._train_loop()
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 1940, in _train_loop
    total_loss_dict = self._train_batch(use_grad_scaling)
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 2115, in _train_batch
    optimizer.step(closure=lambda **kwargs: self._train_microbatches(
  File "/usr/lib/python3/dist-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/torch/optim/optimizer.py", line 140, in wrapper
    out = func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/composer/optim/decoupled_weight_decay.py", line 288, in step
    loss = closure()
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 2115, in <lambda>
    optimizer.step(closure=lambda **kwargs: self._train_microbatches(
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 2213, in _train_microbatches
    microbatch_loss_dict = self._train_microbatch(use_grad_scaling, current_batch_size, is_final_microbatch)
  File "/usr/lib/python3/dist-packages/composer/trainer/trainer.py", line 2276, in _train_microbatch
    self.state.outputs = self.state.model(self.state.batch)
  File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/lib/python3/dist-packages/composer/models/huggingface.py", line 314, in forward
    output = self.model(**batch)  # type: ignore (thirdparty)
  File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/examples/examples/benchmarks/bert/src/bert_layers.py", line 1009, in forward
    outputs = self.bert(
  File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/examples/examples/benchmarks/bert/src/bert_layers.py", line 677, in forward
    encoder_outputs = self.encoder(
  File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/examples/examples/benchmarks/bert/src/bert_layers.py", line 514, in forward
    hidden_states = layer_module(hidden_states,
  File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/examples/examples/benchmarks/bert/src/bert_layers.py", line 395, in forward
    attention_output = self.attention(hidden_states, cu_seqlens, seqlen,
  File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/examples/examples/benchmarks/bert/src/bert_layers.py", line 307, in forward
    self_output = self.self(input_tensor, cu_seqlens, max_s, indices,
  File "/usr/lib/python3/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/examples/examples/benchmarks/bert/src/bert_layers.py", line 237, in forward
    attention = flash_attn_qkvpacked_func(qkv, bias)
  File "/examples/examples/benchmarks/bert/src/flash_attn_triton.py", line 1021, in forward
    o, lse, ctx.softmax_scale = _flash_attn_forward(
  File "/examples/examples/benchmarks/bert/src/flash_attn_triton.py", line 826, in _flash_attn_forward
    _fwd_kernel[grid](  # type: ignore
  File "/usr/lib/python3/dist-packages/triton/runtime/jit.py", line 106, in launcher
    return self.run(*args, grid=grid, **kwargs)
  File "/usr/lib/python3/dist-packages/triton/runtime/autotuner.py", line 86, in run
    return self.fn.run(*args, num_warps=config.num_warps, num_stages=config.num_stages, **kwargs, **config.kwargs)
  File "/usr/lib/python3/dist-packages/triton/runtime/autotuner.py", line 200, in run
    return self.fn.run(*args, **kwargs)
  File "<string>", line 41, in _fwd_kernel
  File "/usr/lib/python3/dist-packages/triton/compiler.py", line 1268, in compile
    return CompiledKernel(name, so_cache_manager._make_path(so_name), fn_cache_manager.cache_dir, device)
  File "/usr/lib/python3/dist-packages/triton/compiler.py", line 1301, in __init__
    mod, func, n_regs, n_spills = _triton.code_gen.load_binary(metadata["name"], self.asm["cubin"], self.shared, device)
RuntimeError: CUDA: Error- invalid source

Note that in the output above I replaced paths containing my personal information with (...).

Also note that the commands

yield the same error message.

Did I do something wrong, or is this an error in the code? I would be incredibly grateful for any guidance, as I urgently need to finetune my model and the issue above is currently preventing me from doing so.

Thank you very much!

dakinggg commented 11 months ago

I believe that triton flash attention will not work on P100s. Could you try uninstalling flash_attn_triton before running anything? I think then it will fall back to torch attention properly instead of trying to use flash attention and failing.
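A quick way to confirm this hardware constraint is to check the GPU's compute capability; Triton generally targets Volta-class GPUs (sm_70) and newer, while the P100 is Pascal (sm_60), so the compiled cubin fails to load and surfaces as "RuntimeError: CUDA: Error- invalid source". A minimal check using only torch (the sm_70 threshold is an assumption about Triton's support matrix, not something stated in this repo):

import torch
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: sm_{major}{minor}")  # a P100 prints sm_60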

mscherrmann commented 11 months ago

Thank you for your quick response! Unfortunately I do not have a flash_attn_triton package installed. I can only find flash_attn, but uninstalling it doesn't help.

dakinggg commented 11 months ago

Apologies, I think I got the package wrong: it's actually triton that you want to uninstall. flash_attn_triton is a file in our repo. We have a try/except around importing it, which would disable the Triton attention implementation, but I guess for you the import succeeds and it only fails once the kernel actually runs. So I want to make that import fail so that Triton is disabled.
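The guard in question looks roughly like this (a paraphrased sketch of the pattern, not the exact code in src/bert_layers.py):

try:
    import flash_attn_triton  # imports triton under the hood
    flash_attn_qkvpacked_func = flash_attn_triton.flash_attn_qkvpacked_func
except ImportError:
    flash_attn_qkvpacked_func = None  # forward() checks for None and uses torch attention

With triton uninstalled, the inner import raises ImportError and the model takes the torch attention path instead.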

mscherrmann commented 11 months ago

That also did not work for me, unfortunately. However, I just switched to pretraining hf-bert, and that works fine.
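For reference, switching implementations amounts to changing the model block in the YAML config; a sketch based on the config dump above (treating hf_bert as the assumed name for the HuggingFace-backed model in this repo):

model:
  name: hf_bert                        # HuggingFace BERT instead of mosaic_bert
  num_labels: 2
  pretrained_model_name: ${tokenizer_name}
  tokenizer_name: ${tokenizer_name}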

Thank you for your help!