RuntimeError: Tensors must be CUDA and dense

Apologies for the miss, here are all the details required. It turns out the error Tensors must be CUDA and dense occurs when I use a very small size of max_seq_len. I increased it to 6144, and now the error is Triton Error [CUDA]: invalid argument. Same error occurs for a max_seq_len of 512.

Environment

Collecting system information...
---------------------------------
System Environment Report        
Created: 2023-12-06 07:39:30 GMT
---------------------------------

PyTorch information
-------------------
PyTorch version: 2.0.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Oracle Linux Server 8.8 (x86_64)
GCC version: (conda-forge gcc 9.5.0-17) 9.5.0
Clang version: Could not collect
CMake version: version 3.26.3
Libc version: glibc-2.28

Python version: 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-106.131.4.el8uek.x86_64-x86_64-with-glibc2.28
Is CUDA available: True
CUDA runtime version: 11.7.64
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA A10
GPU 1: NVIDIA A10

Nvidia driver version: 535.86.10
cuDNN version: Probably one of the following:
/usr/lib64/libcudnn.so.8.9.5
/usr/lib64/libcudnn_adv_infer.so.8.9.5
/usr/lib64/libcudnn_adv_train.so.8.9.5
/usr/lib64/libcudnn_cnn_infer.so.8.9.5
/usr/lib64/libcudnn_cnn_train.so.8.9.5
/usr/lib64/libcudnn_ops_infer.so.8.9.5
/usr/lib64/libcudnn_ops_train.so.8.9.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              64
On-line CPU(s) list: 0-63
Thread(s) per core:  2
Core(s) per socket:  32
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               106
Model name:          Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
Stepping:            6
CPU MHz:             1064.043
CPU max MHz:         3400.0000
CPU min MHz:         800.0000
BogoMIPS:            5200.00
Virtualization:      VT-x
L1d cache:           48K
L1i cache:           32K
L2 cache:            1280K
L3 cache:            49152K
NUMA node0 CPU(s):   0-63
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect nt_good wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid fsrm md_clear pconfig flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] pytorch-ranger==0.1.1
[pip3] torch==2.0.1+cu117
[pip3] torch-optimizer==0.3.0
[pip3] torchdata==0.6.1
[pip3] torchmetrics==0.11.4
[pip3] torchtext==0.15.2
[pip3] torchvision==0.15.2
[conda] numpy                     1.24.4                   pypi_0    pypi
[conda] pytorch-ranger            0.1.1                    pypi_0    pypi
[conda] torch                     2.0.1+cu117              pypi_0    pypi
[conda] torch-optimizer           0.3.0                    pypi_0    pypi
[conda] torchdata                 0.6.1                    pypi_0    pypi
[conda] torchmetrics              0.11.4                   pypi_0    pypi
[conda] torchtext                 0.15.2                   pypi_0    pypi
[conda] torchvision               0.15.2                   pypi_0    pypi

Composer information
--------------------
Composer version: 0.15.1
Composer commit hash: None
Host processor model name: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
Host processor core count: 32
Number of nodes: 1
Accelerator model name: NVIDIA A10
Accelerators per node: 1
CUDA Device Count: 2

To reproduce

Steps to reproduce the behavior:

composer scripts/train/train.py /home/abc/mpt-finetuning/config/multitask-finetuning.yaml --run_name test-run

Expected behavior

Training should proceed without error

Actual Behavior

[W socket.cpp:426] [c10d] The server socket cannot be initialized on [::]:52375 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:52375 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:52375 (errno: 97 - Address family not supported by protocol).
Initializing model...
/home/abc/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b-instruct/1ec8e55b71f455075b8076b9918a1457f273918b/configuration_mpt.py:97: UserWarning: alibi is turned on, setting `learned_pos_emb` to `False.`
  warnings.warn(f'alibi is turned on, setting `learned_pos_emb` to `False.`')
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:04<00:00,  2.48s/it]
Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.
cfg.n_params=6.65e+09
Building train loader...
Using pad_token, but it is not set yet.
No preprocessor was supplied and no preprocessing function is registered for dataset name "json". No additional preprocessing will be applied. If the dataset is already formatted correctly, you can ignore this message.
Found cached dataset json (/home/abc/.cache/huggingface/datasets/json/default-a05e2c23af93e2d7/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Loading cached processed dataset at /home/abc/.cache/huggingface/datasets/json/default-a05e2c23af93e2d7/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-3470419f1e1cecbc.arrow
Loading cached processed dataset at /home/abc/.cache/huggingface/datasets/json/default-a05e2c23af93e2d7/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-4dd571d2bd1d9d7e.arrow
Building eval loader...
No preprocessor was supplied and no preprocessing function is registered for dataset name "json". No additional preprocessing will be applied. If the dataset is already formatted correctly, you can ignore this message.
Found cached dataset json (/home/abc/.cache/huggingface/datasets/json/default-b001653540d7ae9c/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Loading cached processed dataset at /home/abc/.cache/huggingface/datasets/json/default-b001653540d7ae9c/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-5f518f1d4f06991f.arrow
Loading cached processed dataset at /home/abc/.cache/huggingface/datasets/json/default-b001653540d7ae9c/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-1abe8fe12e7f31cc.arrow
Building trainer...
2023-12-06 07:00:01,168: rank0[131909][MainThread]: INFO: composer.utils.reproducibility: Setting seed to 17
2023-12-06 07:00:01,168: rank0[131909][MainThread]: INFO: composer.trainer.trainer: Run name: llm
/home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/composer/trainer/trainer.py:1029: UserWarning: Setting both `progress_bar` and `log_to_console` both to True is not recommended and willlead to duplicate logs and weird formatting issues. Please set one of them to False for a better logging experience.
  warnings.warn(
                                                                                                                                                                                   /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/composer/callbacks/speed_monitor.py:120: UserWarning: gpu_flop count not found for None with precision: amp_bf16; MFU cannot be calculated and reported. gpu_flops_available can be manuallyoverridden by setting gpu_flops_available in SpeedMonitor.
  warnings.warn(
/home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/composer/callbacks/memory_monitor.py:86: UserWarning: The memory monitor only works on CUDA devices, but the model is on cpu.
  warnings.warn(f'The memory monitor only works on CUDA devices, but the model is on {model_device.type}.')
2023-12-06 07:00:01,367: rank0[131909][MainThread]: INFO: composer.trainer.trainer: Stepping schedulers every batch. To step schedulers every epoch, set `step_schedulers_every_batch=False`.
2023-12-06 07:00:04,045: rank0[131909][MainThread]: INFO: composer.trainer.trainer: Setting seed to 17
2023-12-06 07:00:04,045: rank0[131909][MainThread]: INFO: composer.utils.reproducibility: Setting seed to 17
Logging config...
max_seq_len: 6144
global_seed: 17
run_name: llm
model:
  name: hf_causal_lm
  pretrained: true
  pretrained_model_name_or_path: mosaicml/mpt-7b-instruct
  config_overrides:
    max_seq_len: ${max_seq_len}
    attn_config:
      attn_impl: triton
      attn_uses_sequence_id: false
tokenizer:
  name: mosaicml/mpt-7b-instruct
  kwargs:
    model_max_length: ${max_seq_len}
dataset:
  hf_name: json
  max_seq_len: ${max_seq_len}
  allow_pad_trimming: false
  decoder_only_format: true
  shuffle: true
train_loader:
  name: finetuning
  dataset:
    hf_name: json
    max_seq_len: ${max_seq_len}
    allow_pad_trimming: false
    decoder_only_format: true
    shuffle: true
    hf_kwargs:
      data_dir: /home/abc/mpt-finetuning/datasets/xyz/json-ds/train
    split: train
  drop_last: true
  num_workers: 8
  pin_memory: false
  prefetch_factor: 2
  persistent_workers: true
  timeout: 0
eval_loader:
  name: finetuning
  dataset:
    hf_name: json
    max_seq_len: ${max_seq_len}
    allow_pad_trimming: false
    decoder_only_format: true
    shuffle: true
    hf_kwargs:
      data_dir: /home/abc/mpt-finetuning/datasets/xyz/json-ds/validation
    split: validation
  drop_last: true
  num_workers: 8
  pin_memory: false
  prefetch_factor: 2
  persistent_workers: true
  timeout: 0
scheduler:
  name: linear_decay_with_warmup
  t_warmup: 50ba
  alpha_f: 0
optimizer:
  name: decoupled_adamw
  lr: 5.0e-06
  betas:
  - 0.9
  - 0.999
  eps: 1.0e-08
  weight_decay: 0
algorithms:
  gradient_clipping:
    clipping_type: norm
    clipping_threshold: 1.0
max_duration: 9ep
eval_interval: 1ep
eval_subset_num_batches: -1
global_train_batch_size: 2
seed: ${global_seed}
device_eval_batch_size: 1
device_train_microbatch_size: 1
precision: amp_bf16
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE
  activation_checkpointing: true
  activation_checkpointing_reentrant: false
  activation_cpu_offload: true
  limit_all_gathers: true
  verbose: false
progress_bar: true
log_to_console: true
console_log_interval: 1ba
callbacks:
  speed_monitor:
    window_size: 10
  lr_monitor: {}
  memory_monitor: {}
  runtime_estimator: {}
save_interval: 1ep
save_folder: /home/abc/mpt-finetuning/finetuned-models/multitask-finetuning/{run_name}
--run_name: null
test-run: null
dist_timeout: 600.0
n_gpus: 2
device_train_batch_size: 1
device_train_grad_accum: 1
n_params: 6649286656

Starting training...
2023-12-06 07:00:04,049: rank0[131909][MainThread]: INFO: composer.trainer.trainer: Using precision Precision.AMP_BF16
******************************
Config:
enabled_algorithms/GradientClipping: true
node_name: unknown because NODENAME environment variable not set
num_gpus_per_node: 2
num_nodes: 1
rank_zero_seed: 17

******************************
******************************
Config:
enabled_algorithms/GradientClipping: true
node_name: unknown because NODENAME environment variable not set
num_gpus_per_node: 2
num_nodes: 1
rank_zero_seed: 17

******************************
2023-12-06 07:00:04,051: rank0[131909][MainThread]: DEBUG: composer.trainer.trainer: Spinning the dataloaders
                                                                                                                                                                                   ╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮                                                                                 
│ in _bwd_kernel:21                                                                                │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
KeyError: 
('2-.-0-.-0-7d1eb0d2fed8ff2032dccb99c2cc311a-394352f6a8351feaac334fbb8cc63fa4-46c7c5d46afed8316facd72e7e581bec-eeb54539cec859823cd0f9e632a7b8c5-39e3c68a052760cc345a9147b0d68f7d-5c5
e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-4ac47e74762ba6a774cceea0e1e75ae6-13b7ffc189bd9fba7696034bbcfee151', (torch.bfloat16, torch.bfloat16, torch.bfloat16, 
torch.float32, torch.bfloat16, torch.float32, torch.bfloat16, torch.bfloat16, torch.float32, torch.float32, 'fp32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 
'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), ('vector', True, 128, 
False, True, True, True, 128, 128), (True, True, True, True, True, True, True, True, True, True, (False,), (True, False), (True, False), (True, False), (True, False), (True, 
False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, 
False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, 
False), (True, False), (True, False)))

During handling of the above exception, another exception occurred:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/abc/mpt-finetuning/llm-foundry/scripts/train/train.py:326 in <module>                    │
│                                                                                                  │
│   323 │   │   yaml_cfg = om.load(f)                                                              │
│   324 │   cli_cfg = om.from_cli(args_list)                                                       │
│   325 │   cfg = om.merge(yaml_cfg, cli_cfg)                                                      │
│ ❱ 326 │   main(cfg)                                                                              │
│   327                                                                                            │
│                                                                                                  │
│ /home/abc/mpt-finetuning/llm-foundry/scripts/train/train.py:315 in main                        │
│                                                                                                  │
│   312 │   │   trainer.eval()                                                                     │
│   313 │                                                                                          │
│   314 │   print('Starting training...')                                                          │
│ ❱ 315 │   trainer.fit()                                                                          │
│   316 │                                                                                          │
│   317 │   print('Done.')                                                                         │
│   318                                                                                            │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/composer/trainer/trainer.p │
│ y:1804 in fit                                                                                    │
│                                                                                                  │
│   1801 │   │   │   self.state.scaler = ClosureGradScaler() if self._use_closures() else GradSca  │
│   1802 │   │                                                                                     │
│   1803 │   │   self.first_batch_complete = False                                                 │
│ ❱ 1804 │   │   self._train_loop()                                                                │
│   1805 │                                                                                         │
│   1806 │   def close(self):                                                                      │
│   1807 │   │   """Shutdown the trainer.                                                          │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/composer/trainer/trainer.p │
│ y:1979 in _train_loop                                                                            │
│                                                                                                  │
│   1976 │   │   │   │   │   │   self.logger.log_metrics({'time/token': self.state.timestamp.toke  │
│   1977 │   │   │   │   │   │   self.logger.log_metrics({'time/token_in_epoch': self.state.times  │
│   1978 │   │   │   │   │                                                                         │
│ ❱ 1979 │   │   │   │   │   total_loss_dict = self._train_batch(use_grad_scaling)                 │
│   1980 │   │   │   │   │                                                                         │
│   1981 │   │   │   │   │   if use_grad_scaling:                                                  │
│   1982 │   │   │   │   │   │   self.state.scaler.update()                                        │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/composer/trainer/trainer.p │
│ y:2163 in _train_batch                                                                           │
│                                                                                                  │
│   2160 │   │   │   │   │   │   │   │   │   │   │   │      closure=lambda loss_dict=total_loss_d  │
│   2161 │   │   │   │   │   │   │   │   │   │   │   │      _train_microbatches(microbatches, los  │
│   2162 │   │   │   │   │   │   else:                                                             │
│ ❱ 2163 │   │   │   │   │   │   │   optimizer.step(closure=lambda loss_dict=total_loss_dict, **k  │
│   2164 │   │   │   │   │   │   │   │   microbatches, loss_dict, **kwargs).item())                │
│   2165 │   │   │   │   else:                                                                     │
│   2166 │   │   │   │   │   self._train_microbatches(microbatches, total_loss_dict)               │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/torch/optim/lr_scheduler.p │
│ y:69 in wrapper                                                                                  │
│                                                                                                  │
│     66 │   │   │   │   instance = instance_ref()                                                 │
│     67 │   │   │   │   instance._step_count += 1                                                 │
│     68 │   │   │   │   wrapped = func.__get__(instance, cls)                                     │
│ ❱   69 │   │   │   │   return wrapped(*args, **kwargs)                                           │
│     70 │   │   │                                                                                 │
│     71 │   │   │   # Note that the returned function here is no longer a bound method,           │
│     72 │   │   │   # so attributes like `__func__` and `__self__` no longer exist.               │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/torch/optim/optimizer.py:2 │
│ 80 in wrapper                                                                                    │
│                                                                                                  │
│   277 │   │   │   │   │   │   │   raise RuntimeError(f"{func} must return None or a tuple of (   │
│   278 │   │   │   │   │   │   │   │   │   │   │      f"but got {result}.")                       │
│   279 │   │   │   │                                                                              │
│ ❱ 280 │   │   │   │   out = func(*args, **kwargs)                                                │
│   281 │   │   │   │   self._optimizer_step_code()                                                │
│   282 │   │   │   │                                                                              │
│   283 │   │   │   │   # call optimizer step post hooks                                           │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/torch/utils/_contextlib.py │
│ :115 in decorate_context                                                                         │
│                                                                                                  │
│   112 │   @functools.wraps(func)                                                                 │
│   113 │   def decorate_context(*args, **kwargs):                                                 │
│   114 │   │   with ctx_factory():                                                                │
│ ❱ 115 │   │   │   return func(*args, **kwargs)                                                   │
│   116 │                                                                                          │
│   117 │   return decorate_context                                                                │
│   118                                                                                            │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/composer/optim/decoupled_w │
│ eight_decay.py:288 in step                                                                       │
│                                                                                                  │
│   285 │   │   loss = None                                                                        │
│   286 │   │   if closure is not None:                                                            │
│   287 │   │   │   with torch.enable_grad():                                                      │
│ ❱ 288 │   │   │   │   loss = closure()                                                           │
│   289 │   │                                                                                      │
│   290 │   │   for group in self.param_groups:                                                    │
│   291 │   │   │   params_with_grad = []                                                          │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/composer/trainer/trainer.p │
│ y:2163 in <lambda>                                                                               │
│                                                                                                  │
│   2160 │   │   │   │   │   │   │   │   │   │   │   │      closure=lambda loss_dict=total_loss_d  │
│   2161 │   │   │   │   │   │   │   │   │   │   │   │      _train_microbatches(microbatches, los  │
│   2162 │   │   │   │   │   │   else:                                                             │
│ ❱ 2163 │   │   │   │   │   │   │   optimizer.step(closure=lambda loss_dict=total_loss_dict, **k  │
│   2164 │   │   │   │   │   │   │   │   microbatches, loss_dict, **kwargs).item())                │
│   2165 │   │   │   │   else:                                                                     │
│   2166 │   │   │   │   │   self._train_microbatches(microbatches, total_loss_dict)               │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/composer/trainer/trainer.p │
│ y:2266 in _train_microbatches                                                                    │
│                                                                                                  │
│   2263 │   │   │                                                                                 │
│   2264 │   │   │   for microbatch_idx, self.state.batch in enumerate(microbatches):              │
│   2265 │   │   │   │   is_final_microbatch = microbatch_idx + 1 == len(microbatches)             │
│ ❱ 2266 │   │   │   │   microbatch_loss_dict = self._train_microbatch(use_grad_scaling, current_  │
│   2267 │   │   │   │                                                                             │
│   2268 │   │   │   │   # Aggregate each loss in microbatch_loss_dict into total_loss_dict        │
│   2269 │   │   │   │   for k, microbatch_loss in microbatch_loss_dict.items():                   │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/composer/trainer/trainer.p │
│ y:2393 in _train_microbatch                                                                      │
│                                                                                                  │
│   2390 │   │   │   else:                                                                         │
│   2391 │   │   │   │   # Scale loss based on the number of samples in the microbatch to maintai  │
│   2392 │   │   │   │   microbatch_loss.mul_(microbatch_num_samples / current_batch_size)         │
│ ❱ 2393 │   │   │   │   microbatch_loss.backward(create_graph=self._backwards_create_graph)       │
│   2394 │   │   │                                                                                 │
│   2395 │   │   │   self.engine.run_event(Event.AFTER_BACKWARD)                                   │
│   2396                                                                                           │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/torch/_tensor.py:487 in    │
│ backward                                                                                         │
│                                                                                                  │
│    484 │   │   │   │   create_graph=create_graph,                                                │
│    485 │   │   │   │   inputs=inputs,                                                            │
│    486 │   │   │   )                                                                             │
│ ❱  487 │   │   torch.autograd.backward(                                                          │
│    488 │   │   │   self, gradient, retain_graph, create_graph, inputs=inputs                     │
│    489 │   │   )                                                                                 │
│    490                                                                                           │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/torch/autograd/__init__.py │
│ :200 in backward                                                                                 │
│                                                                                                  │
│   197 │   # The reason we repeat same the comment below is that                                  │
│   198 │   # some Python versions print out the first line of a multi-line function               │
│   199 │   # calls in the traceback and some print out the last line                              │
│ ❱ 200 │   Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the bac   │
│   201 │   │   tensors, grad_tensors_, retain_graph, create_graph, inputs,                        │
│   202 │   │   allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to ru   │
│   203                                                                                            │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/torch/autograd/function.py │
│ :274 in apply                                                                                    │
│                                                                                                  │
│   271 │   │   │   │   │   │   │      "Function is not allowed. You should only implement one "   │
│   272 │   │   │   │   │   │   │      "of them.")                                                 │
│   273 │   │   user_fn = vjp_fn if vjp_fn is not Function.vjp else backward_fn                    │
│ ❱ 274 │   │   return user_fn(self, *args)                                                        │
│   275 │                                                                                          │
│   276 │   def apply_jvp(self, *args):                                                            │
│   277 │   │   # _forward_cls is defined by derived class                                         │
│                                                                                                  │
│ /home/abc/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b-instruct/1ec8e55b71f │
│ 455075b8076b9918a1457f273918b/flash_attn_triton.py:482 in backward                               │
│                                                                                                  │
│   479 │   │   │   dq = torch.empty_like(q)                                                       │
│   480 │   │   │   dk = torch.empty_like(k)                                                       │
│   481 │   │   │   dv = torch.empty_like(v)                                                       │
│ ❱ 482 │   │   │   _flash_attn_backward(do, q, k, v, o, lse, dq, dk, dv, bias=bias, causal=ctx.   │
│   483 │   │   return (dq, dk, dv, None, None, None)                                              │
│   484 flash_attn_func = FlashAttnFunc.apply                                                      │
│   485                                                                                            │
│                                                                                                  │
│ /home/abc/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b-instruct/1ec8e55b71f │
│ 455075b8076b9918a1457f273918b/flash_attn_triton.py:398 in _flash_attn_backward                   │
│                                                                                                  │
│   395 │   │   bias = bias.expand(batch, nheads, seqlen_q, seqlen_k)                              │
│   396 │   bias_strides = (bias.stride(0), bias.stride(1), bias.stride(2)) if has_bias else (0,   │
│   397 │   grid = lambda META: (triton.cdiv(seqlen_k, META['BLOCK_N']) if META['SEQUENCE_PARALL   │
│ ❱ 398 │   _bwd_kernel[grid](q, k, v, bias, do, dq_accum, dk, dv, lse, delta, softmax_scale, q.   │
│   399 │   dq.copy_(dq_accum)                                                                     │
│   400                                                                                            │
│   401 class FlashAttnQKVPackedFunc(torch.autograd.Function):                                     │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/triton_pre_mlir/runtime/ji │
│ t.py:106 in launcher                                                                             │
│                                                                                                  │
│   103 │   │   memorizes the grid.                                                                │
│   104 │   │   """                                                                                │
│   105 │   │   def launcher(*args, **kwargs):                                                     │
│ ❱ 106 │   │   │   return self.run(*args, grid=grid, **kwargs)                                    │
│   107 │   │   return launcher                                                                    │
│   108                                                                                            │
│   109                                                                                            │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/triton_pre_mlir/runtime/au │
│ totuner.py:73 in run                                                                             │
│                                                                                                  │
│    70 │   │   │   │   # prune configs                                                            │
│    71 │   │   │   │   pruned_configs = self.prune_configs(kwargs)                                │
│    72 │   │   │   │   bench_start = time.time()                                                  │
│ ❱  73 │   │   │   │   timings = {config: self._bench(*args, config=config, **kwargs)             │
│    74 │   │   │   │   │   │      for config in pruned_configs}                                   │
│    75 │   │   │   │   bench_end = time.time()                                                    │
│    76 │   │   │   │   self.bench_time = bench_end - bench_start                                  │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/triton_pre_mlir/runtime/au │
│ totuner.py:73 in <dictcomp>                                                                      │
│                                                                                                  │
│    70 │   │   │   │   # prune configs                                                            │
│    71 │   │   │   │   pruned_configs = self.prune_configs(kwargs)                                │
│    72 │   │   │   │   bench_start = time.time()                                                  │
│ ❱  73 │   │   │   │   timings = {config: self._bench(*args, config=config, **kwargs)             │
│    74 │   │   │   │   │   │      for config in pruned_configs}                                   │
│    75 │   │   │   │   bench_end = time.time()                                                    │
│    76 │   │   │   │   self.bench_time = bench_end - bench_start                                  │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/triton_pre_mlir/runtime/au │
│ totuner.py:63 in _bench                                                                          │
│                                                                                                  │
│    60 │   │   │   │   config.pre_hook(self.nargs)                                                │
│    61 │   │   │   self.hook(args)                                                                │
│    62 │   │   │   self.fn.run(*args, num_warps=config.num_warps, num_stages=config.num_stages,   │
│ ❱  63 │   │   return do_bench(kernel_call)                                                       │
│    64 │                                                                                          │
│    65 │   def run(self, *args, **kwargs):                                                        │
│    66 │   │   self.nargs = dict(zip(self.arg_names, args))                                       │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/triton_pre_mlir/testing.py │
│ :140 in do_bench                                                                                 │
│                                                                                                  │
│   137 │   """                                                                                    │
│   138 │                                                                                          │
│   139 │   # Estimate the runtime of the function                                                 │
│ ❱ 140 │   fn()                                                                                   │
│   141 │   torch.cuda.synchronize()                                                               │
│   142 │   start_event = torch.cuda.Event(enable_timing=True)                                     │
│   143 │   end_event = torch.cuda.Event(enable_timing=True)                                       │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/triton_pre_mlir/runtime/au │
│ totuner.py:62 in kernel_call                                                                     │
│                                                                                                  │
│    59 │   │   │   if config.pre_hook:                                                            │
│    60 │   │   │   │   config.pre_hook(self.nargs)                                                │
│    61 │   │   │   self.hook(args)                                                                │
│ ❱  62 │   │   │   self.fn.run(*args, num_warps=config.num_warps, num_stages=config.num_stages,   │
│    63 │   │   return do_bench(kernel_call)                                                       │
│    64 │                                                                                          │
│    65 │   def run(self, *args, **kwargs):                                                        │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/triton_pre_mlir/runtime/au │
│ totuner.py:200 in run                                                                            │
│                                                                                                  │
│   197 │   def run(self, *args, **kwargs):                                                        │
│   198 │   │   for v, heur in self.values.items():                                                │
│   199 │   │   │   kwargs[v] = heur({**dict(zip(self.arg_names, args)), **kwargs})                │
│ ❱ 200 │   │   return self.fn.run(*args, **kwargs)                                                │
│   201                                                                                            │
│   202                                                                                            │
│   203 def heuristics(values):                                                                    │
│ in _bwd_kernel:43                                                                                │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Triton Error [CUDA]: invalid argument
2023-12-06 07:00:12,097: rank0[131909][MainThread]: DEBUG: composer.core.engine: Closing the engine
2023-12-06 07:00:12,097: rank0[131909][MainThread]: DEBUG: composer.core.engine: Closing callback ProgressBarLogger

train          Epoch   0:    0%|                         | 0/444 [00:07<?, ?ba/s]                                                                                                   
2023-12-06 07:00:12,098: rank0[131909][MainThread]: DEBUG: composer.core.engine: Closing callback ConsoleLogger                                                                     
2023-12-06 07:00:12,098: rank0[131909][MainThread]: DEBUG: composer.core.engine: Closing callback SpeedMonitor
2023-12-06 07:00:12,098: rank0[131909][MainThread]: DEBUG: composer.core.engine: Closing callback LRMonitor
2023-12-06 07:00:12,098: rank0[131909][MainThread]: DEBUG: composer.core.engine: Closing callback MemoryMonitor
2023-12-06 07:00:12,098: rank0[131909][MainThread]: DEBUG: composer.core.engine: Closing callback RuntimeEstimator
2023-12-06 07:00:12,098: rank0[131909][MainThread]: DEBUG: composer.core.engine: Closing callback CheckpointSaver
2023-12-06 07:00:12,098: rank0[131909][MainThread]: DEBUG: composer.core.engine: Post-closing callback ProgressBarLogger
2023-12-06 07:00:12,098: rank0[131909][MainThread]: DEBUG: composer.core.engine: Post-closing callback ConsoleLogger
2023-12-06 07:00:12,098: rank0[131909][MainThread]: DEBUG: composer.core.engine: Post-closing callback SpeedMonitor
2023-12-06 07:00:12,098: rank0[131909][MainThread]: DEBUG: composer.core.engine: Post-closing callback LRMonitor
2023-12-06 07:00:12,098: rank0[131909][MainThread]: DEBUG: composer.core.engine: Post-closing callback MemoryMonitor
2023-12-06 07:00:12,098: rank0[131909][MainThread]: DEBUG: composer.core.engine: Post-closing callback RuntimeEstimator
2023-12-06 07:00:12,098: rank0[131909][MainThread]: DEBUG: composer.core.engine: Post-closing callback CheckpointSaver
ERROR:composer.cli.launcher:Rank 1 crashed with exit code 1.
Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.
Global rank 1 (PID 131910) exited with code 1
----------Begin global rank 1 STDOUT----------
Initializing model...
cfg.n_params=6.65e+09
Building train loader...
No preprocessor was supplied and no preprocessing function is registered for dataset name "json". No additional preprocessing will be applied. If the dataset is already formatted correctly, you can ignore this message.
Building eval loader...
No preprocessor was supplied and no preprocessing function is registered for dataset name "json". No additional preprocessing will be applied. If the dataset is already formatted correctly, you can ignore this message.
Building trainer...
Logging config...
max_seq_len: 6144
global_seed: 17
run_name: llm
model:
  name: hf_causal_lm
  pretrained: true
  pretrained_model_name_or_path: mosaicml/mpt-7b-instruct
  config_overrides:
    max_seq_len: ${max_seq_len}
    attn_config:
      attn_impl: triton
      attn_uses_sequence_id: false
tokenizer:
  name: mosaicml/mpt-7b-instruct
  kwargs:
    model_max_length: ${max_seq_len}
dataset:
  hf_name: json
  max_seq_len: ${max_seq_len}
  allow_pad_trimming: false
  decoder_only_format: true
  shuffle: true
train_loader:
  name: finetuning
  dataset:
    hf_name: json
    max_seq_len: ${max_seq_len}
    allow_pad_trimming: false
    decoder_only_format: true
    shuffle: true
    hf_kwargs:
      data_dir: /home/abc/mpt-finetuning/datasets/xyz/json-ds/train
    split: train
  drop_last: true
  num_workers: 8
  pin_memory: false
  prefetch_factor: 2
  persistent_workers: true
  timeout: 0
eval_loader:
  name: finetuning
  dataset:
    hf_name: json
    max_seq_len: ${max_seq_len}
    allow_pad_trimming: false
    decoder_only_format: true
    shuffle: true
    hf_kwargs:
      data_dir: /home/abc/mpt-finetuning/datasets/xyz/json-ds/validation
    split: validation
  drop_last: true
  num_workers: 8
  pin_memory: false
  prefetch_factor: 2
  persistent_workers: true
  timeout: 0
scheduler:
  name: linear_decay_with_warmup
  t_warmup: 50ba
  alpha_f: 0
optimizer:
  name: decoupled_adamw
  lr: 5.0e-06
  betas:
  - 0.9
  - 0.999
  eps: 1.0e-08
  weight_decay: 0
algorithms:
  gradient_clipping:
    clipping_type: norm
    clipping_threshold: 1.0
max_duration: 9ep
eval_interval: 1ep
eval_subset_num_batches: -1
global_train_batch_size: 2
seed: ${global_seed}
device_eval_batch_size: 1
device_train_microbatch_size: 1
precision: amp_bf16
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE
  activation_checkpointing: true
  activation_checkpointing_reentrant: false
  activation_cpu_offload: true
  limit_all_gathers: true
  verbose: false
progress_bar: true
log_to_console: true
console_log_interval: 1ba
callbacks:
  speed_monitor:
    window_size: 10
  lr_monitor: {}
  memory_monitor: {}
  runtime_estimator: {}
save_interval: 1ep
save_folder: /home/abc/mpt-finetuning/finetuned-models/multitask-finetuning/{run_name}
--run_name: null
test-run: null
dist_timeout: 600.0
n_gpus: 2
device_train_batch_size: 1
device_train_grad_accum: 1
n_params: 6649286656

Starting training...

----------End global rank 1 STDOUT----------
----------Begin global rank 1 STDERR----------
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:52375 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:52375 (errno: 97 - Address family not supported by protocol).
/home/abc/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b-instruct/1ec8e55b71f455075b8076b9918a1457f273918b/configuration_mpt.py:97: UserWarning: alibi is turned on, setting `learned_pos_emb` to `False.`
  warnings.warn(f'alibi is turned on, setting `learned_pos_emb` to `False.`')

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:  50%|█████     | 1/2 [00:03<00:03,  3.91s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:05<00:00,  2.42s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:05<00:00,  2.65s/it]
Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.
Using pad_token, but it is not set yet.
Found cached dataset json (/home/abc/.cache/huggingface/datasets/json/default-a05e2c23af93e2d7/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Loading cached processed dataset at /home/abc/.cache/huggingface/datasets/json/default-a05e2c23af93e2d7/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-3470419f1e1cecbc.arrow
Loading cached processed dataset at /home/abc/.cache/huggingface/datasets/json/default-a05e2c23af93e2d7/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-4dd571d2bd1d9d7e.arrow
Found cached dataset json (/home/abc/.cache/huggingface/datasets/json/default-b001653540d7ae9c/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Loading cached processed dataset at /home/abc/.cache/huggingface/datasets/json/default-b001653540d7ae9c/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-5f518f1d4f06991f.arrow
Loading cached processed dataset at /home/abc/.cache/huggingface/datasets/json/default-b001653540d7ae9c/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-1abe8fe12e7f31cc.arrow
2023-12-06 07:00:01,172: rank1[131910][MainThread]: INFO: composer.utils.reproducibility: Setting seed to 18
2023-12-06 07:00:01,172: rank1[131910][MainThread]: INFO: composer.trainer.trainer: Run name: llm
/home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/composer/trainer/trainer.py:1029: UserWarning: Setting both `progress_bar` and `log_to_console` both to True is not recommended and willlead to duplicate logs and weird formatting issues. Please set one of them to False for a better logging experience.
  warnings.warn(

                                                                                                                       /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/composer/callbacks/speed_monitor.py:120: UserWarning: gpu_flop count not found for None with precision: amp_bf16; MFU cannot be calculated and reported. gpu_flops_available can be manuallyoverridden by setting gpu_flops_available in SpeedMonitor.
  warnings.warn(
/home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/composer/callbacks/memory_monitor.py:86: UserWarning: The memory monitor only works on CUDA devices, but the model is on cpu.
  warnings.warn(f'The memory monitor only works on CUDA devices, but the model is on {model_device.type}.')
2023-12-06 07:00:01,367: rank1[131910][MainThread]: INFO: composer.trainer.trainer: Stepping schedulers every batch. To step schedulers every epoch, set `step_schedulers_every_batch=False`.
2023-12-06 07:00:04,037: rank1[131910][MainThread]: INFO: composer.trainer.trainer: Setting seed to 18
2023-12-06 07:00:04,037: rank1[131910][MainThread]: INFO: composer.utils.reproducibility: Setting seed to 18
2023-12-06 07:00:04,042: rank1[131910][MainThread]: INFO: composer.trainer.trainer: Using precision Precision.AMP_BF16
2023-12-06 07:00:04,051: rank1[131910][MainThread]: DEBUG: composer.trainer.trainer: Spinning the dataloaders
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in _bwd_kernel:21                                                                                │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
KeyError: 
('2-.-0-.-0-7d1eb0d2fed8ff2032dccb99c2cc311a-394352f6a8351feaac334fbb8cc63fa4-46c7c5d46afed8316facd72e7e581bec-eeb54539cec859823cd0f9e632a7b8c5-39e3c68a052760cc345a9147b0d68f7d-5c5
e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-4ac47e74762ba6a774cceea0e1e75ae6-13b7ffc189bd9fba7696034bbcfee151', (torch.bfloat16, torch.bfloat16, torch.bfloat16, 
torch.float32, torch.bfloat16, torch.float32, torch.bfloat16, torch.bfloat16, torch.float32, torch.float32, 'fp32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 
'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), ('vector', True, 128, 
False, True, True, True, 128, 128), (True, True, True, True, True, True, True, True, True, True, (False,), (True, False), (True, False), (True, False), (True, False), (True, 
False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, 
False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, 
False), (True, False), (True, False)))

During handling of the above exception, another exception occurred:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/abc/mpt-finetuning/llm-foundry/scripts/train/train.py:326 in <module>                    │
│                                                                                                  │
│   323 │   │   yaml_cfg = om.load(f)                                                              │
│   324 │   cli_cfg = om.from_cli(args_list)                                                       │
│   325 │   cfg = om.merge(yaml_cfg, cli_cfg)                                                      │
│ ❱ 326 │   main(cfg)                                                                              │
│   327                                                                                            │
│                                                                                                  │
│ /home/abc/mpt-finetuning/llm-foundry/scripts/train/train.py:315 in main                        │
│                                                                                                  │
│   312 │   │   trainer.eval()                                                                     │
│   313 │                                                                                          │
│   314 │   print('Starting training...')                                                          │
│ ❱ 315 │   trainer.fit()                                                                          │
│   316 │                                                                                          │
│   317 │   print('Done.')                                                                         │
│   318                                                                                            │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/composer/trainer/trainer.p │
│ y:1804 in fit                                                                                    │
│                                                                                                  │
│   1801 │   │   │   self.state.scaler = ClosureGradScaler() if self._use_closures() else GradSca  │
│   1802 │   │                                                                                     │
│   1803 │   │   self.first_batch_complete = False                                                 │
│ ❱ 1804 │   │   self._train_loop()                                                                │
│   1805 │                                                                                         │
│   1806 │   def close(self):                                                                      │
│   1807 │   │   """Shutdown the trainer.                                                          │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/composer/trainer/trainer.p │
│ y:1979 in _train_loop                                                                            │
│                                                                                                  │
│   1976 │   │   │   │   │   │   self.logger.log_metrics({'time/token': self.state.timestamp.toke  │
│   1977 │   │   │   │   │   │   self.logger.log_metrics({'time/token_in_epoch': self.state.times  │
│   1978 │   │   │   │   │                                                                         │
│ ❱ 1979 │   │   │   │   │   total_loss_dict = self._train_batch(use_grad_scaling)                 │
│   1980 │   │   │   │   │                                                                         │
│   1981 │   │   │   │   │   if use_grad_scaling:                                                  │
│   1982 │   │   │   │   │   │   self.state.scaler.update()                                        │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/composer/trainer/trainer.p │
│ y:2163 in _train_batch                                                                           │
│                                                                                                  │
│   2160 │   │   │   │   │   │   │   │   │   │   │   │      closure=lambda loss_dict=total_loss_d  │
│   2161 │   │   │   │   │   │   │   │   │   │   │   │      _train_microbatches(microbatches, los  │
│   2162 │   │   │   │   │   │   else:                                                             │
│ ❱ 2163 │   │   │   │   │   │   │   optimizer.step(closure=lambda loss_dict=total_loss_dict, **k  │
│   2164 │   │   │   │   │   │   │   │   microbatches, loss_dict, **kwargs).item())                │
│   2165 │   │   │   │   else:                                                                     │
│   2166 │   │   │   │   │   self._train_microbatches(microbatches, total_loss_dict)               │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/torch/optim/lr_scheduler.p │
│ y:69 in wrapper                                                                                  │
│                                                                                                  │
│     66 │   │   │   │   instance = instance_ref()                                                 │
│     67 │   │   │   │   instance._step_count += 1                                                 │
│     68 │   │   │   │   wrapped = func.__get__(instance, cls)                                     │
│ ❱   69 │   │   │   │   return wrapped(*args, **kwargs)                                           │
│     70 │   │   │                                                                                 │
│     71 │   │   │   # Note that the returned function here is no longer a bound method,           │
│     72 │   │   │   # so attributes like `__func__` and `__self__` no longer exist.               │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/torch/optim/optimizer.py:2 │
│ 80 in wrapper                                                                                    │
│                                                                                                  │
│   277 │   │   │   │   │   │   │   raise RuntimeError(f"{func} must return None or a tuple of (   │
│   278 │   │   │   │   │   │   │   │   │   │   │      f"but got {result}.")                       │
│   279 │   │   │   │                                                                              │
│ ❱ 280 │   │   │   │   out = func(*args, **kwargs)                                                │
│   281 │   │   │   │   self._optimizer_step_code()                                                │
│   282 │   │   │   │                                                                              │
│   283 │   │   │   │   # call optimizer step post hooks                                           │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/torch/utils/_contextlib.py │
│ :115 in decorate_context                                                                         │
│                                                                                                  │
│   112 │   @functools.wraps(func)                                                                 │
│   113 │   def decorate_context(*args, **kwargs):                                                 │
│   114 │   │   with ctx_factory():                                                                │
│ ❱ 115 │   │   │   return func(*args, **kwargs)                                                   │
│   116 │                                                                                          │
│   117 │   return decorate_context                                                                │
│   118                                                                                            │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/composer/optim/decoupled_w │
│ eight_decay.py:288 in step                                                                       │
│                                                                                                  │
│   285 │   │   loss = None                                                                        │
│   286 │   │   if closure is not None:                                                            │
│   287 │   │   │   with torch.enable_grad():                                                      │
│ ❱ 288 │   │   │   │   loss = closure()                                                           │
│   289 │   │                                                                                      │
│   290 │   │   for group in self.param_groups:                                                    │
│   291 │   │   │   params_with_grad = []                                                          │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/composer/trainer/trainer.p │
│ y:2163 in <lambda>                                                                               │
│                                                                                                  │
│   2160 │   │   │   │   │   │   │   │   │   │   │   │      closure=lambda loss_dict=total_loss_d  │
│   2161 │   │   │   │   │   │   │   │   │   │   │   │      _train_microbatches(microbatches, los  │
│   2162 │   │   │   │   │   │   else:                                                             │
│ ❱ 2163 │   │   │   │   │   │   │   optimizer.step(closure=lambda loss_dict=total_loss_dict, **k  │
│   2164 │   │   │   │   │   │   │   │   microbatches, loss_dict, **kwargs).item())                │
│   2165 │   │   │   │   else:                                                                     │
│   2166 │   │   │   │   │   self._train_microbatches(microbatches, total_loss_dict)               │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/composer/trainer/trainer.p │
│ y:2266 in _train_microbatches                                                                    │
│                                                                                                  │
│   2263 │   │   │                                                                                 │
│   2264 │   │   │   for microbatch_idx, self.state.batch in enumerate(microbatches):              │
│   2265 │   │   │   │   is_final_microbatch = microbatch_idx + 1 == len(microbatches)             │
│ ❱ 2266 │   │   │   │   microbatch_loss_dict = self._train_microbatch(use_grad_scaling, current_  │
│   2267 │   │   │   │                                                                             │
│   2268 │   │   │   │   # Aggregate each loss in microbatch_loss_dict into total_loss_dict        │
│   2269 │   │   │   │   for k, microbatch_loss in microbatch_loss_dict.items():                   │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/composer/trainer/trainer.p │
│ y:2393 in _train_microbatch                                                                      │
│                                                                                                  │
│   2390 │   │   │   else:                                                                         │
│   2391 │   │   │   │   # Scale loss based on the number of samples in the microbatch to maintai  │
│   2392 │   │   │   │   microbatch_loss.mul_(microbatch_num_samples / current_batch_size)         │
│ ❱ 2393 │   │   │   │   microbatch_loss.backward(create_graph=self._backwards_create_graph)       │
│   2394 │   │   │                                                                                 │
│   2395 │   │   │   self.engine.run_event(Event.AFTER_BACKWARD)                                   │
│   2396                                                                                           │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/torch/_tensor.py:487 in    │
│ backward                                                                                         │
│                                                                                                  │
│    484 │   │   │   │   create_graph=create_graph,                                                │
│    485 │   │   │   │   inputs=inputs,                                                            │
│    486 │   │   │   )                                                                             │
│ ❱  487 │   │   torch.autograd.backward(                                                          │
│    488 │   │   │   self, gradient, retain_graph, create_graph, inputs=inputs                     │
│    489 │   │   )                                                                                 │
│    490                                                                                           │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/torch/autograd/__init__.py │
│ :200 in backward                                                                                 │
│                                                                                                  │
│   197 │   # The reason we repeat same the comment below is that                                  │
│   198 │   # some Python versions print out the first line of a multi-line function               │
│   199 │   # calls in the traceback and some print out the last line                              │
│ ❱ 200 │   Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the bac   │
│   201 │   │   tensors, grad_tensors_, retain_graph, create_graph, inputs,                        │
│   202 │   │   allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to ru   │
│   203                                                                                            │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/torch/autograd/function.py │
│ :274 in apply                                                                                    │
│                                                                                                  │
│   271 │   │   │   │   │   │   │      "Function is not allowed. You should only implement one "   │
│   272 │   │   │   │   │   │   │      "of them.")                                                 │
│   273 │   │   user_fn = vjp_fn if vjp_fn is not Function.vjp else backward_fn                    │
│ ❱ 274 │   │   return user_fn(self, *args)                                                        │
│   275 │                                                                                          │
│   276 │   def apply_jvp(self, *args):                                                            │
│   277 │   │   # _forward_cls is defined by derived class                                         │
│                                                                                                  │
│ /home/abc/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b-instruct/1ec8e55b71f │
│ 455075b8076b9918a1457f273918b/flash_attn_triton.py:482 in backward                               │
│                                                                                                  │
│   479 │   │   │   dq = torch.empty_like(q)                                                       │
│   480 │   │   │   dk = torch.empty_like(k)                                                       │
│   481 │   │   │   dv = torch.empty_like(v)                                                       │
│ ❱ 482 │   │   │   _flash_attn_backward(do, q, k, v, o, lse, dq, dk, dv, bias=bias, causal=ctx.   │
│   483 │   │   return (dq, dk, dv, None, None, None)                                              │
│   484 flash_attn_func = FlashAttnFunc.apply                                                      │
│   485                                                                                            │
│                                                                                                  │
│ /home/abc/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b-instruct/1ec8e55b71f │
│ 455075b8076b9918a1457f273918b/flash_attn_triton.py:398 in _flash_attn_backward                   │
│                                                                                                  │
│   395 │   │   bias = bias.expand(batch, nheads, seqlen_q, seqlen_k)                              │
│   396 │   bias_strides = (bias.stride(0), bias.stride(1), bias.stride(2)) if has_bias else (0,   │
│   397 │   grid = lambda META: (triton.cdiv(seqlen_k, META['BLOCK_N']) if META['SEQUENCE_PARALL   │
│ ❱ 398 │   _bwd_kernel[grid](q, k, v, bias, do, dq_accum, dk, dv, lse, delta, softmax_scale, q.   │
│   399 │   dq.copy_(dq_accum)                                                                     │
│   400                                                                                            │
│   401 class FlashAttnQKVPackedFunc(torch.autograd.Function):                                     │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/triton_pre_mlir/runtime/ji │
│ t.py:106 in launcher                                                                             │
│                                                                                                  │
│   103 │   │   memorizes the grid.                                                                │
│   104 │   │   """                                                                                │
│   105 │   │   def launcher(*args, **kwargs):                                                     │
│ ❱ 106 │   │   │   return self.run(*args, grid=grid, **kwargs)                                    │
│   107 │   │   return launcher                                                                    │
│   108                                                                                            │
│   109                                                                                            │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/triton_pre_mlir/runtime/au │
│ totuner.py:73 in run                                                                             │
│                                                                                                  │
│    70 │   │   │   │   # prune configs                                                            │
│    71 │   │   │   │   pruned_configs = self.prune_configs(kwargs)                                │
│    72 │   │   │   │   bench_start = time.time()                                                  │
│ ❱  73 │   │   │   │   timings = {config: self._bench(*args, config=config, **kwargs)             │
│    74 │   │   │   │   │   │      for config in pruned_configs}                                   │
│    75 │   │   │   │   bench_end = time.time()                                                    │
│    76 │   │   │   │   self.bench_time = bench_end - bench_start                                  │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/triton_pre_mlir/runtime/au │
│ totuner.py:73 in <dictcomp>                                                                      │
│                                                                                                  │
│    70 │   │   │   │   # prune configs                                                            │
│    71 │   │   │   │   pruned_configs = self.prune_configs(kwargs)                                │
│    72 │   │   │   │   bench_start = time.time()                                                  │
│ ❱  73 │   │   │   │   timings = {config: self._bench(*args, config=config, **kwargs)             │
│    74 │   │   │   │   │   │      for config in pruned_configs}                                   │
│    75 │   │   │   │   bench_end = time.time()                                                    │
│    76 │   │   │   │   self.bench_time = bench_end - bench_start                                  │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/triton_pre_mlir/runtime/au │
│ totuner.py:63 in _bench                                                                          │
│                                                                                                  │
│    60 │   │   │   │   config.pre_hook(self.nargs)                                                │
│    61 │   │   │   self.hook(args)                                                                │
│    62 │   │   │   self.fn.run(*args, num_warps=config.num_warps, num_stages=config.num_stages,   │
│ ❱  63 │   │   return do_bench(kernel_call)                                                       │
│    64 │                                                                                          │
│    65 │   def run(self, *args, **kwargs):                                                        │
│    66 │   │   self.nargs = dict(zip(self.arg_names, args))                                       │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/triton_pre_mlir/testing.py │
│ :140 in do_bench                                                                                 │
│                                                                                                  │
│   137 │   """                                                                                    │
│   138 │                                                                                          │
│   139 │   # Estimate the runtime of the function                                                 │
│ ❱ 140 │   fn()                                                                                   │
│   141 │   torch.cuda.synchronize()                                                               │
│   142 │   start_event = torch.cuda.Event(enable_timing=True)                                     │
│   143 │   end_event = torch.cuda.Event(enable_timing=True)                                       │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/triton_pre_mlir/runtime/au │
│ totuner.py:62 in kernel_call                                                                     │
│                                                                                                  │
│    59 │   │   │   if config.pre_hook:                                                            │
│    60 │   │   │   │   config.pre_hook(self.nargs)                                                │
│    61 │   │   │   self.hook(args)                                                                │
│ ❱  62 │   │   │   self.fn.run(*args, num_warps=config.num_warps, num_stages=config.num_stages,   │
│    63 │   │   return do_bench(kernel_call)                                                       │
│    64 │                                                                                          │
│    65 │   def run(self, *args, **kwargs):                                                        │
│                                                                                                  │
│ /home/abc/miniconda3/envs/torch2-py311/lib/python3.11/site-packages/triton_pre_mlir/runtime/au │
│ totuner.py:200 in run                                                                            │
│                                                                                                  │
│   197 │   def run(self, *args, **kwargs):                                                        │
│   198 │   │   for v, heur in self.values.items():                                                │
│   199 │   │   │   kwargs[v] = heur({**dict(zip(self.arg_names, args)), **kwargs})                │
│ ❱ 200 │   │   return self.fn.run(*args, **kwargs)                                                │
│   201                                                                                            │
│   202                                                                                            │
│   203 def heuristics(values):                                                                    │
│ in _bwd_kernel:43                                                                                │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Triton Error [CUDA]: invalid argument
2023-12-06 07:00:11,856: rank1[131910][MainThread]: DEBUG: composer.core.engine: Closing the engine
2023-12-06 07:00:11,856: rank1[131910][MainThread]: DEBUG: composer.core.engine: Closing callback ProgressBarLogger

2023-12-06 07:00:11,856: rank1[131910][MainThread]: DEBUG: composer.core.engine: Closing callback ConsoleLogger
2023-12-06 07:00:11,856: rank1[131910][MainThread]: DEBUG: composer.core.engine: Closing callback SpeedMonitor
2023-12-06 07:00:11,856: rank1[131910][MainThread]: DEBUG: composer.core.engine: Closing callback LRMonitor
2023-12-06 07:00:11,856: rank1[131910][MainThread]: DEBUG: composer.core.engine: Closing callback MemoryMonitor
2023-12-06 07:00:11,856: rank1[131910][MainThread]: DEBUG: composer.core.engine: Closing callback RuntimeEstimator
2023-12-06 07:00:11,856: rank1[131910][MainThread]: DEBUG: composer.core.engine: Closing callback CheckpointSaver
2023-12-06 07:00:11,856: rank1[131910][MainThread]: DEBUG: composer.core.engine: Post-closing callback ProgressBarLogger
2023-12-06 07:00:11,856: rank1[131910][MainThread]: DEBUG: composer.core.engine: Post-closing callback ConsoleLogger
2023-12-06 07:00:11,856: rank1[131910][MainThread]: DEBUG: composer.core.engine: Post-closing callback SpeedMonitor
2023-12-06 07:00:11,856: rank1[131910][MainThread]: DEBUG: composer.core.engine: Post-closing callback LRMonitor
2023-12-06 07:00:11,856: rank1[131910][MainThread]: DEBUG: composer.core.engine: Post-closing callback MemoryMonitor
2023-12-06 07:00:11,856: rank1[131910][MainThread]: DEBUG: composer.core.engine: Post-closing callback RuntimeEstimator
2023-12-06 07:00:11,856: rank1[131910][MainThread]: DEBUG: composer.core.engine: Post-closing callback CheckpointSaver

----------End global rank 1 STDERR----------
ERROR:composer.cli.launcher:Global rank 0 (PID 131909) exited with code -15

Error Encountered

Triton Error [CUDA]: invalid argument

mosaicml / llm-foundry