mosaicml / llm-foundry

LLM training code for Databricks foundation models
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
Apache License 2.0

Triton Test Failed: GPU SMs must run at 1350 MHz / GPU memory must run at 877 MHz #244

Closed cocobeach closed 1 year ago

cocobeach commented 1 year ago

Hi,

I am trying to run the test suite to check that my setup is correct, and I am down to 31 failed, 4852 passed, etc.

However, the tests that failed are strange.

Here is a partial log; the full log is below:

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================= short test summary info =========================================
FAILED tests/test_data_prep_scripts.py::test_json_script_from_api - FileNotFoundError: Couldn't find a dataset script at /workspace/llm-foundry/jsonl/jsonl.py or any data file in the same directory. Couldn't find 'jsonl' on the Hugging Face Hub either: FileNotFoundEr...
FAILED tests/test_hf_conversion_script.py::test_convert_and_generate_torch - NotADirectoryError: [Errno 20] Not a directory: '/workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/llm_foundry-0.1.0-py3.10.egg/llmfoundry/models/mpt/modeling_mpt.py'
FAILED tests/test_model.py::test_full_forward_and_backward_gpt2_small[False] - FileNotFoundError: [Errno 2] No such file or directory: '.scripts/train/yamls/pretrain/gpt2-small.yaml'
FAILED tests/test_model.py::test_full_forward_and_backward_gpt2_small[True] - FileNotFoundError: [Errno 2] No such file or directory: '.scripts/train/yamls/pretrain/gpt2-small.yaml'
FAILED tests/test_model.py::test_save_from_pretrained - NotADirectoryError: [Errno 20] Not a directory: '/workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/llm_foundry-0.1.0-py3.10.egg/llmfoundry/models/mpt/modeling_mpt.py'
FAILED triton/python/test/regression/test_performance.py::test_matmul[256-256-256-float16] - AssertionError: GPU SMs must run at 1350 MHz
FAILED triton/python/test/regression/test_performance.py::test_matmul[512-512-512-float16] - AssertionError: GPU SMs must run at 1350 MHz
FAILED triton/python/test/regression/test_performance.py::test_matmul[1024-1024-1024-float16] - AssertionError: GPU SMs must run at 1350 MHz
FAILED triton/python/test/regression/test_performance.py::test_matmul[2048-2048-2048-float16] - AssertionError: GPU SMs must run at 1350 MHz
FAILED triton/python/test/regression/test_performance.py::test_matmul[4096-4096-4096-float16] - AssertionError: GPU SMs must run at 1350 MHz
FAILED triton/python/test/regression/test_performance.py::test_matmul[8192-8192-8192-float16] - AssertionError: GPU SMs must run at 1350 MHz
FAILED triton/python/test/regression/test_performance.py::test_matmul[16-1024-1024-float16] - AssertionError: GPU SMs must run at 1350 MHz
FAILED triton/python/test/regression/test_performance.py::test_matmul[16-4096-4096-float16] - AssertionError: GPU SMs must run at 1350 MHz
FAILED triton/python/test/regression/test_performance.py::test_matmul[16-8192-8192-float16] - AssertionError: GPU SMs must run at 1350 MHz
FAILED triton/python/test/regression/test_performance.py::test_matmul[64-1024-1024-float16] - AssertionError: GPU SMs must run at 1350 MHz
FAILED triton/python/test/regression/test_performance.py::test_matmul[64-4096-4096-float16] - AssertionError: GPU SMs must run at 1350 MHz
FAILED triton/python/test/regression/test_performance.py::test_matmul[64-8192-8192-float16] - AssertionError: GPU SMs must run at 1350 MHz
FAILED triton/python/test/regression/test_performance.py::test_matmul[1024-64-1024-float16] - AssertionError: GPU SMs must run at 1350 MHz
FAILED triton/python/test/regression/test_performance.py::test_matmul[4096-64-4096-float16] - AssertionError: GPU SMs must run at 1350 MHz
FAILED triton/python/test/regression/test_performance.py::test_matmul[8192-64-8192-float16] - AssertionError: GPU SMs must run at 1350 MHz
FAILED triton/python/test/regression/test_performance.py::test_elementwise[16384] - AssertionError: GPU memory must run at 877 MHz
FAILED triton/python/test/regression/test_performance.py::test_elementwise[65536] - AssertionError: GPU memory must run at 877 MHz
FAILED triton/python/test/regression/test_performance.py::test_elementwise[262144] - AssertionError: GPU memory must run at 877 MHz
FAILED triton/python/test/regression/test_performance.py::test_elementwise[1048576] - AssertionError: GPU memory must run at 877 MHz
FAILED triton/python/test/regression/test_performance.py::test_elementwise[4194304] - AssertionError: GPU memory must run at 877 MHz
FAILED triton/python/test/regression/test_performance.py::test_elementwise[16777216] - AssertionError: GPU memory must run at 877 MHz
FAILED triton/python/test/regression/test_performance.py::test_elementwise[67108864] - AssertionError: GPU memory must run at 877 MHz
FAILED triton/python/test/unit/operators/test_blocksparse.py::test_softmax[16-256-False] - AssertionError:
FAILED triton/python/test/unit/operators/test_blocksparse.py::test_softmax[32-576-False] - AssertionError:
FAILED triton/python/test/unit/operators/test_blocksparse.py::test_softmax[64-1871-False] - AssertionError:
FAILED triton/python/test/unit/operators/test_blocksparse.py::test_softmax[128-2511-False] - AssertionError:
==================================== 31 failed, 4852 passed, 49 skipped, 174 deselected, 16 xfailed, 60 warnings in 1997.10s (0:33:17) ====================================

My setup is an HP Z840 with a Xeon 2620 v3 and 2x NVIDIA RTX A4000. However, as you can see, during the whole test run the cards barely react: they only see a small load, never run faster than 210 MHz, and I get the error.
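For what it's worth, the failing Triton regression tests assert that the GPU is already running at fixed reference clocks (1350 MHz SM, 877 MHz memory) before they measure anything, so a card idling at 210 MHz fails the assertion regardless of whether the kernels are correct; the reference values correspond to whatever GPU the Triton suite was calibrated on, not necessarily an RTX A4000. Below is a minimal sketch for reading the clocks yourself, assuming the nvidia-ml-py (pynvml) package is installed; locking the clocks (e.g. with nvidia-smi --lock-gpu-clocks) is normally needed before those performance tests mean much.

import pynvml

# Read the current SM and memory clocks, the same quantities the failing
# assertions compare against their 1350 MHz / 877 MHz reference values.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU
sm_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
mem_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_MEM)
print(f"SM clock: {sm_clock} MHz, memory clock: {mem_clock} MHz")
pynvml.nvmlShutdown()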

I tried making this change in flash_attn_interface.py:

def _get_block_size(device, head_dim, is_dropout):
    assert head_dim % 8 == 0 and head_dim <= 64
    return 128 if head_dim <= 32 else 64

I did this because I read that Triton doesn't support a head_dim larger than 64 for sm86 kernels, and I believe the Ampere RTX A4000 falls under that, but it changed absolutely nothing.
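As a sanity check on the sm86 point, here is a small sketch assuming a CUDA build of PyTorch; an Ampere RTX A4000 should report compute capability (8, 6), i.e. sm_86.

import torch

# Query the compute capability of the first GPU; (8, 6) means sm_86.
major, minor = torch.cuda.get_device_capability(0)
print(torch.cuda.get_device_name(0), f"sm_{major}{minor}")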

Also, a quick question: is the code geared towards multi-GPU? I believe it would be, because training itself would take much more time than simply finetuning mpt-7b.

How does it look to the trained eye? Is this error a false positive? Is there another way I can test the GPUs?

Thanks

Full log below:

================================================================================================= FAILURES =================================================================================================____ test_json_script_from_api _____tests/test_data_prep_scripts.py:41: in test_json_script_from_api main_json( scripts/data_prep/convert_dataset_json.py:210: in main dataset = build_hf_dataset(path=args.path, scripts/data_prep/convert_dataset_json.py:103: in build_hf_dataset hf_dataset = hf_datasets.load_dataset('jsonl', llmfoundryenv/lib/python3.10/site-packages/datasets/load.py:1759: in load_dataset builder_instance = load_dataset_builder( llmfoundryenv/lib/python3.10/site-packages/datasets/load.py:1496: in load_dataset_builder dataset_module = dataset_module_factory( llmfoundryenv/lib/python3.10/site-packages/datasets/load.py:1214: in dataset_modulefactory raise FileNotFoundError( E FileNotFoundError: Couldn't find a dataset script at /workspace/llm-foundry/jsonl/jsonl.py or any data file in the same directory. Couldn't find 'jsonl' on the Hugging Face Hub either: FileNotFoundError: Dataset 'jsonl' doesn't exist on the Hub. If the repo is private or gated, make sure to log in with huggingface-cli login. test_convert_and_generate_torch __tests/test_hf_conversion_script.py:65: in test_convert_and_generate_torch main(args) scripts/inference/convert_composer_to_hf.py:397: in main loaded_hf_model.save_pretrained(local_folder_path) llmfoundryenv/lib/python3.10/site-packages/transformers/modeling_utils.py:1752: in save_pretrained custom_object_save(self, save_directory, config=self.config) llmfoundryenv/lib/python3.10/site-packages/transformers/dynamic_module_utils.py:450: in custom_object_save shutil.copy(object_file, dest_file) /usr/lib/python3.10/shutil.py:417: in copy copyfile(src, dst, follow_symlinks=follow_symlinks) /usr/lib/python3.10/shutil.py:254: in copyfile with open(src, 'rb') as fsrc: E NotADirectoryError: [Errno 20] Not a directory: '/workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/llm_foundry-0.1.0-py3.10.egg/llmfoundry/models/mpt/modeling_mpt.py' ------------------------------------------------------------------------------------------- Captured stdout call -------------------------------------------------------------------------------------------You are using config.init_device='cpu', but you can also use config.init_device="meta" with Composer + FSDP for fast initialization. Downloading checkpoint from /tmp/pytest-of-root/pytest-1/test_convert_and_generate_torc0/checkpoint.pt -> /tmp/tmpu8uyjuwo/local-composer-checkpoint.pt Loading checkpoint into CPU RAM... ############################## Saving HF Model Config... 
MPTConfig { "attn_config": { "alibi": false, "alibi_bias_max": 8, "attn_impl": "torch", "attn_pdrop": 0.0, "attn_type": "multihead_attention", "attn_uses_sequence_id": false, "clip_qkv": null, "prefix_lm": false, "qk_ln": false, "softmax_scale": null }, "d_model": 128, "emb_pdrop": 0.0, "embedding_fraction": 1.0, "expansion_ratio": 4, "init_config": { "emb_init_std": null, "emb_init_uniform_lim": null, "fan_mode": "fan_in", "init_div_is_residual": true, "init_gain": 0.0, "init_nonlinearity": "relu", "init_std": null, "name": "kaimingnormal", "verbose": 0 }, "init_device": "cpu", "learned_pos_emb": true, "logit_scale": null, "max_seq_len": 128, "model_type": "mpt", "n_heads": 2, "n_layers": 2, "no_bias": false, "norm_type": "low_precision_layernorm", "resid_pdrop": 0.0, "torch_dtype": "float32", "transformers_version": "4.28.1", "use_cache": false, "verbose": 0, "vocab_size": 50368 }

############################## Saving HF Tokenizer... GPTNeoXTokenizerFast(name_or_path='', vocab_size=50254, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}, clean_up_tokenization_spaces=True) ############################## Saving HF Model Weights... ############################## HF checkpoint folder successfully created at /tmp/pytest-of-root/pytest-1/test_convert_and_generate_torc0/hf-output-folder. Done. ############################## Loading model from /tmp/pytest-of-root/pytest-1/test_convert_and_generate_torc0/hf-output-folder You are using config.init_device='cpu', but you can also use config.init_device="meta" with Composer + FSDP for fast initialization. ------------------------------------------------------------------------------------------- Captured stderr call ------------------------------------------------------------------------------------------- -------------------------------------------------------------------------------------------- Captured log call ---------------------------------------------------------------------------------------------INFO composer.models.huggingface:huggingface.py:111 The number of tokens in the tokenizer is less than the number of tokens in the model. You may want to resize the model embeddings to 50277 from 50368 by calling model.resize_token_embeddings(len(tokenizer)) before calling the HuggingFaceModel constructor. The vocab size is sometimes intentionally set to a multiple of 32 or 64 to improve performance. INFO composer.utils.reproducibility:reproducibility.py:159 Setting seed to 3894649697 INFO composer.trainer.trainer:trainer.py:993 Run name: 1685369687-natural-bonobo INFO torch.distributed.nn.jit.instantiator:instantiator.py:21 Created a temporary directory at /tmp/tmpn8cizn5z INFO torch.distributed.nn.jit.instantiator:instantiator.py:76 Writing /tmp/tmpn8cizn5z/_remote_module_non_scriptable.py INFO composer.trainer.trainer:trainer.py:97 Stepping schedulers every batch. To step schedulers every epoch, set step_schedulers_every_batch=False. 
INFO composer.trainer.trainer:trainer.py:1353 Setting seed to 3894649697 INFO composer.utils.reproducibility:reproducibility.py:159 Setting seed to 3894649697 _ test_full_forward_and_backward_gpt2small[False] tests/test_model.py:201: in test_full_forward_and_backward_gpt2_small with open(confpath) as f: E FileNotFoundError: [Errno 2] No such file or directory: '.scripts/train/yamls/pretrain/gpt2-small.yaml' ____ test_full_forward_and_backward_gpt2_small[True] __tests/test_model.py:201: in test_full_forward_and_backward_gpt2_small with open(conf_path) as f: E FileNotFoundError: [Errno 2] No such file or directory: '.scripts/train/yamls/pretrain/gpt2-small.yaml' ____ test_save_from_pretrained _tests/test_model.py:793: in test_save_from_pretrained mpt.save_pretrained(tmp_path / 'test-save-pretrained') llmfoundryenv/lib/python3.10/site-packages/transformers/modeling_utils.py:1752: in save_pretrained custom_object_save(self, save_directory, config=self.config) llmfoundryenv/lib/python3.10/site-packages/transformers/dynamic_module_utils.py:450: in custom_object_save shutil.copy(object_file, dest_file) /usr/lib/python3.10/shutil.py:417: in copy copyfile(src, dst, follow_symlinks=follow_symlinks) /usr/lib/python3.10/shutil.py:254: in copyfile with open(src, 'rb') as fsrc: E NotADirectoryError: [Errno 20] Not a directory: '/workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/llm_foundry-0.1.0-py3.10.egg/llmfoundry/models/mpt/modeling_mpt.py' ------------------------------------------------------------------------------------------- Captured stdout call -------------------------------------------------------------------------------------------You are using config.init_device='cpu', but you can also use config.initdevice="meta" with Composer + FSDP for fast initialization. 
____ test_matmul[256-256-256-float16] _____triton/python/test/regression/test_performance.py:93: in test_matmul assert abs(cur_sm_clock - ref_sm_clock) < 10, f'GPU SMs must run at {ref_smclock} MHz' E AssertionError: GPU SMs must run at 1350 MHz E assert 1140 < 10 E + where 1140 = abs((210 - 1350)) test_matmul[512-512-512-float16] ___triton/python/test/regression/test_performance.py:93: in test_matmul assert abs(cur_sm_clock - ref_sm_clock) < 10, f'GPU SMs must run at {ref_smclock} MHz' E AssertionError: GPU SMs must run at 1350 MHz E assert 1140 < 10 E + where 1140 = abs((210 - 1350)) test_matmul[1024-1024-1024-float16] ____triton/python/test/regression/test_performance.py:93: in test_matmul assert abs(cur_sm_clock - ref_sm_clock) < 10, f'GPU SMs must run at {ref_sm_clock} MHz' E AssertionError: GPU SMs must run at 1350 MHz E assert 1140 < 10 E + where 1140 = abs((210 - 1350)) _ test_matmul[2048-2048-2048-float16] __triton/python/test/regression/test_performance.py:93: in test_matmul assert abs(cur_sm_clock - ref_sm_clock) < 10, f'GPU SMs must run at {ref_sm_clock} MHz' E AssertionError: GPU SMs must run at 1350 MHz E assert 1140 < 10 E + where 1140 = abs((210 - 1350)) _ test_matmul[4096-4096-4096-float16] __triton/python/test/regression/test_performance.py:93: in test_matmul assert abs(cur_sm_clock - ref_sm_clock) < 10, f'GPU SMs must run at {ref_sm_clock} MHz' E AssertionError: GPU SMs must run at 1350 MHz E assert 1140 < 10 E + where 1140 = abs((210 - 1350)) _ test_matmul[8192-8192-8192-float16] __triton/python/test/regression/test_performance.py:93: in test_matmul assert abs(cur_sm_clock - ref_sm_clock) < 10, f'GPU SMs must run at {ref_sm_clock} MHz' E AssertionError: GPU SMs must run at 1350 MHz E assert 1140 < 10 E + where 1140 = abs((210 - 1350)) ____ test_matmul[16-1024-1024-float16] _triton/python/test/regression/test_performance.py:93: in test_matmul assert abs(cur_sm_clock - ref_sm_clock) < 10, f'GPU SMs must run at {ref_sm_clock} MHz' E AssertionError: GPU SMs must run at 1350 MHz E assert 1140 < 10 E + where 1140 = abs((210 - 1350)) ____ test_matmul[16-4096-4096-float16] _____triton/python/test/regression/test_performance.py:93: in test_matmul assert abs(cur_sm_clock - ref_sm_clock) < 10, f'GPU SMs must run at {ref_sm_clock} MHz' E AssertionError: GPU SMs must run at 1350 MHz E assert 1140 < 10 E + where 1140 = abs((210 - 1350)) ____ test_matmul[16-8192-8192-float16] _____triton/python/test/regression/test_performance.py:93: in test_matmul assert abs(cur_sm_clock - ref_sm_clock) < 10, f'GPU SMs must run at {ref_sm_clock} MHz' E AssertionError: GPU SMs must run at 1350 MHz E assert 1140 < 10 E + where 1140 = abs((210 - 1350)) ____ test_matmul[64-1024-1024-float16] _____triton/python/test/regression/test_performance.py:93: in test_matmul assert abs(cur_sm_clock - ref_sm_clock) < 10, f'GPU SMs must run at {ref_sm_clock} MHz' E AssertionError: GPU SMs must run at 1350 MHz E assert 1140 < 10 E + where 1140 = abs((210 - 1350)) ____ test_matmul[64-4096-4096-float16] _____triton/python/test/regression/test_performance.py:93: in test_matmul assert abs(cur_sm_clock - ref_sm_clock) < 10, f'GPU SMs must run at {ref_sm_clock} MHz' E AssertionError: GPU SMs must run at 1350 MHz E assert 1140 < 10 E + where 1140 = abs((210 - 1350)) ____ test_matmul[64-8192-8192-float16] _____triton/python/test/regression/test_performance.py:93: in test_matmul assert abs(cur_sm_clock - ref_sm_clock) < 10, f'GPU SMs must run at {ref_sm_clock} MHz' E AssertionError: GPU SMs must run at 1350 MHz E assert 1140 < 
10 E + where 1140 = abs((210 - 1350)) ____ test_matmul[1024-64-1024-float16] _____triton/python/test/regression/test_performance.py:93: in test_matmul assert abs(cur_sm_clock - ref_sm_clock) < 10, f'GPU SMs must run at {ref_sm_clock} MHz' E AssertionError: GPU SMs must run at 1350 MHz E assert 1140 < 10 E + where 1140 = abs((210 - 1350)) ____ test_matmul[4096-64-4096-float16] _____triton/python/test/regression/test_performance.py:93: in test_matmul assert abs(cur_sm_clock - ref_sm_clock) < 10, f'GPU SMs must run at {ref_sm_clock} MHz' E AssertionError: GPU SMs must run at 1350 MHz E assert 1140 < 10 E + where 1140 = abs((210 - 1350)) ____ test_matmul[8192-64-8192-float16] _____triton/python/test/regression/test_performance.py:93: in test_matmul assert abs(cur_sm_clock - ref_sm_clock) < 10, f'GPU SMs must run at {ref_smclock} MHz' E AssertionError: GPU SMs must run at 1350 MHz E assert 1140 < 10 E + where 1140 = abs((210 - 1350)) test_elementwise[16384] __triton/python/test/regression/test_performance.py:155: in test_elementwise assert abs(cur_mem_clock - ref_mem_clock) < 10, f'GPU memory must run at {ref_memclock} MHz' E AssertionError: GPU memory must run at 877 MHz E assert 472 < 10 E + where 472 = abs((405 - 877)) ____ test_elementwise[65536] __triton/python/test/regression/test_performance.py:155: in test_elementwise assert abs(cur_mem_clock - ref_mem_clock) < 10, f'GPU memory must run at {ref_memclock} MHz' E AssertionError: GPU memory must run at 877 MHz E assert 472 < 10 E + where 472 = abs((405 - 877)) ____ test_elementwise[262144] _triton/python/test/regression/test_performance.py:155: in test_elementwise assert abs(cur_mem_clock - ref_mem_clock) < 10, f'GPU memory must run at {ref_mem_clock} MHz' E AssertionError: GPU memory must run at 877 MHz E assert 472 < 10 E + where 472 = abs((405 - 877)) ____ test_elementwise[1048576] _____triton/python/test/regression/test_performance.py:155: in test_elementwise assert abs(cur_mem_clock - ref_mem_clock) < 10, f'GPU memory must run at {ref_mem_clock} MHz' E AssertionError: GPU memory must run at 877 MHz E assert 472 < 10 E + where 472 = abs((405 - 877)) ____ test_elementwise[4194304] _____triton/python/test/regression/test_performance.py:155: in test_elementwise assert abs(cur_mem_clock - ref_mem_clock) < 10, f'GPU memory must run at {ref_mem_clock} MHz' E AssertionError: GPU memory must run at 877 MHz E assert 472 < 10 E + where 472 = abs((405 - 877)) ____ test_elementwise[16777216] triton/python/test/regression/test_performance.py:155: in test_elementwise assert abs(cur_mem_clock - ref_mem_clock) < 10, f'GPU memory must run at {ref_mem_clock} MHz' E AssertionError: GPU memory must run at 877 MHz E assert 472 < 10 E + where 472 = abs((405 - 877)) ____ test_elementwise[67108864] ____triton/python/test/regression/test_performance.py:155: in test_elementwise assert abs(cur_mem_clock - ref_mem_clock) < 10, f'GPU memory must run at {ref_mem_clock} MHz' E AssertionError: GPU memory must run at 877 MHz E assert 472 < 10 E + where 472 = abs((405 - 877)) ____ test_softmax[16-256-False] ____triton/python/test/unit/operators/test_blocksparse.py:115: in test_softmax triton.testing.assert_almost_equal(da_tri, da_ref) llmfoundryenv/lib/python3.10/site-packages/triton_pre_mlir/testing.py:90: in assert_almost_equal npt.assert_array_almost_equal(x, y, err_msg=err_msg, decimal=decimal) /usr/lib/python3.10/contextlib.py:79: in inner return func(*args, kwds) /usr/lib/python3.10/contextlib.py:79: in inner return func(*args, *kwds) E AssertionError: E Arrays 
are not almost equal to 2 decimals E
E x and y nan location mismatch: E x: array([[[[ 0.00e+00, 0.00e+00, 0.00e+00, ..., 0.00e+00, 0.00e+00, E 0.00e+00], E [ 0.00e+00, 0.00e+00, 0.00e+00, ..., 0.00e+00, 0.00e+00,... E y: array([[[[ nan, nan, nan, ..., nan, nan, E nan], E [ nan, nan, nan, ..., nan, nan,... ____ test_softmax[32-576-False] ____triton/python/test/unit/operators/test_blocksparse.py:115: in test_softmax triton.testing.assert_almost_equal(da_tri, da_ref) llmfoundryenv/lib/python3.10/site-packages/triton_pre_mlir/testing.py:90: in assert_almost_equal npt.assert_array_almost_equal(x, y, err_msg=err_msg, decimal=decimal) /usr/lib/python3.10/contextlib.py:79: in inner return func(
args,
kwds) /usr/lib/python3.10/contextlib.py:79: in inner return func(*args, kwds) E AssertionError: E Arrays are not almost equal to 2 decimals E
E x and y nan location mismatch: E x: array([[[[ 0.00e+00, 0.00e+00, 0.00e+00, ..., 0.00e+00, 0.00e+00, E 0.00e+00], E [ 0.00e+00, 0.00e+00, 0.00e+00, ..., 0.00e+00, 0.00e+00,... E y: array([[[[ nan, nan, nan, ..., nan, nan, E nan], E [ nan, nan, nan, ..., nan, nan,... _ test_softmax[64-1871-False] __triton/python/test/unit/operators/test_blocksparse.py:115: in test_softmax triton.testing.assert_almost_equal(da_tri, da_ref) llmfoundryenv/lib/python3.10/site-packages/triton_pre_mlir/testing.py:90: in assert_almost_equal npt.assert_array_almost_equal(x, y, err_msg=err_msg, decimal=decimal) /usr/lib/python3.10/contextlib.py:79: in inner return func(*args, *kwds) /usr/lib/python3.10/contextlib.py:79: in inner return func(args,
kwds) E AssertionError: E Arrays are not almost equal to 2 decimals E
E x and y nan location mismatch: E x: array([[[[ 0.00e+00, 0.00e+00, 0.00e+00, ..., 0.00e+00, 0.00e+00, E 0.00e+00], E [ 0.00e+00, 0.00e+00, 0.00e+00, ..., 0.00e+00, 0.00e+00,... E y: array([[[[ nan, nan, nan, ..., nan, nan, E nan], E [ nan, nan, nan, ..., nan, nan,... ___ test_softmax[128-2511-False] ___triton/python/test/unit/operators/test_blocksparse.py:115: in test_softmax triton.testing.assert_almost_equal(da_tri, da_ref) llmfoundryenv/lib/python3.10/site-packages/triton_pre_mlir/testing.py:90: in assert_almost_equal npt.assert_array_almost_equal(x, y, err_msg=err_msg, decimal=decimal) /usr/lib/python3.10/contextlib.py:79: in inner return func(*args, *kwds) /usr/lib/python3.10/contextlib.py:79: in inner return func(args, **kwds) E AssertionError: E Arrays are not almost equal to 2 decimals E
E x and y nan location mismatch: E x: array([[[[ 0.00e+00, 0.00e+00, 0.00e+00, ..., 0.00e+00, 0.00e+00, E 0.00e+00], E [ 0.00e+00, 0.00e+00, 0.00e+00, ..., 0.00e+00, 0.00e+00,... E y: array([[[[ nan, nan, nan, ..., nan, nan, E nan], E [ nan, nan, nan, ..., nan, nan,... ============================================================================================= warnings summary =============================================================================================tests/test_dataloader.py::test_correct_padding[True-facebook/opt-125m] /workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/llm_foundry-0.1.0-py3.10.egg/llmfoundry/data/data.py:97: UserWarning: The provided tokenizer adds special tokens, but you also specified bos_text. This may result in duplicated special tokens. Please be sure this is what you intend.

tests/test_hf_mpt_gen.py::test_init_hfhub_mpt_cpu /root/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b/4ff95c4aec5c04ba509ddf517c56720541a7a487/attention.py:157: UserWarning: Using attn_impl: torch. If your model does not use alibi or prefix_lm we recommend using attn_impl: flash otherwise we recommend using attn_impl: triton. warnings.warn('Using attn_impl: torch. If your model does not use alibi or ' + 'prefix_lm we recommend using attn_impl: flash otherwise ' + 'we recommend using attn_impl: triton.')

tests/test_init_fn.py::test_emb_init[emb_init_cfg2] /workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/llm_foundry-0.1.0-py3.10.egg/llmfoundry/models/utils/param_init_fns.py:116: UserWarning: Embedding layer initialized to 0. warnings.warn(f'Embedding layer initialized to 0.')

tests/test_init_fn.py::test_emb_init[emb_init_cfg5] /workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/llm_foundry-0.1.0-py3.10.egg/llmfoundry/models/utils/param_init_fns.py:133: UserWarning: Embedding layer initialized to 0. warnings.warn(f'Embedding layer initialized to 0.')

tests/test_init_fn.py::test_emb_init[emb_init_cfg6] /workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/llm_foundry-0.1.0-py3.10.egg/llmfoundry/models/utils/param_init_fns.py:130: UserWarning: Embedding layer initialized to 1. warnings.warn(f'Embedding layer initialized to {lim[0]}.')

tests/test_model.py::test_generation_kwargs_dont_crash[True-generation_kwargs2] tests/test_model.py::test_generation_kwargs_dont_crash[False-generation_kwargs2] /workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/transformers/generation/utils.py:1313: UserWarning: Using max_length's default (20) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using max_new_tokens to control the maximum length of the generation. warnings.warn(

tests/test_onnx.py::test_onnx_export /workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/llm_foundry-0.1.0-py3.10.egg/llmfoundry/models/mpt/modeling_mpt.py:300: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!

tests/test_onnx.py::test_onnx_export /workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/llm_foundry-0.1.0-py3.10.egg/llmfoundry/models/mpt/modeling_mpt.py:326: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!

tests/test_onnx.py::test_onnx_export /workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/llm_foundry-0.1.0-py3.10.egg/llmfoundry/models/mpt/modeling_mpt.py:201: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!

tests/test_onnx.py::test_onnx_export /workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/llm_foundry-0.1.0-py3.10.egg/llmfoundry/models/layers/attention.py:76: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!

tests/test_onnx.py::test_onnx_export /workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/llm_foundry-0.1.0-py3.10.egg/llmfoundry/models/layers/attention.py:77: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!

tests/test_onnx.py::test_onnx_export /workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/llm_foundry-0.1.0-py3.10.egg/llmfoundry/models/layers/attention.py:80: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!

tests/test_onnx.py::test_onnx_export /workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/llm_foundry-0.1.0-py3.10.egg/llmfoundry/models/layers/attention.py:81: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!

tests/test_onnx.py::test_onnx_export /workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/llm_foundry-0.1.0-py3.10.egg/llmfoundry/models/layers/attention.py:102: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!

tests/test_onnx.py::test_onnx_export /workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/llm_foundry-0.1.0-py3.10.egg/llmfoundry/models/layers/attention.py:103: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!

triton/python/test/unit/language/test_core.py: 38 warnings /workspace/llm-foundry/triton/python/test/unit/language/test_core.py:210: RuntimeWarning: overflow encountered in cast z_ref = z_ref.astype(dtype_z)

triton/python/test/unit/language/test_core.py::test_atomic_rmw[add-uint32-min_neg] triton/python/test/unit/language/test_core.py::test_atomic_rmw[max-uint32-min_neg] triton/python/test/unit/language/test_core.py::test_atomic_rmw[min-uint32-min_neg] /workspace/llm-foundry/triton/python/test/unit/language/test_core.py:636: RuntimeWarning: overflow encountered in scalar negative x[idx] = -np.max(np.abs(x)) - 1

triton/python/test/unit/language/test_random.py::test_randint[10-16045690984503095482] triton/python/test/unit/language/test_random.py::test_randint[4,53-16045690984503095482] triton/python/test/unit/language/test_random.py::test_randint[10000-16045690984503095482] /workspace/llm-foundry/triton/python/test/unit/language/test_random.py:56: DeprecationWarning: NumPy will stop allowing conversion of out-of-bound Python integers to integer arrays. The conversion of 16045690984503095482 to uint32 will fail in the future. For the old behavior, usually: np.array(value).astype(dtype)` will give the desired result (the cast overflows). res.append(np.array(n, dtype=self._dtype))

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ========================================================================================= short test summary info ==========================================================================================FAILED tests/test_data_prep_scripts.py::test_json_script_from_api - FileNotFoundError: Couldn't find a dataset script at /workspace/llm-foundry/jsonl/jsonl.py or any data file in the same directory. Couldn't find 'jsonl' on the Hugging Face Hub either: FileNotFoundEr... FAILED tests/test_hf_conversion_script.py::test_convert_and_generate_torch - NotADirectoryError: [Errno 20] Not a directory: '/workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/llm_foundry-0.1.0-py3.10.egg/llmfoundry/models/mpt/modeling_mpt.py' FAILED tests/test_model.py::test_full_forward_and_backward_gpt2_small[False] - FileNotFoundError: [Errno 2] No such file or directory: '.scripts/train/yamls/pretrain/gpt2-small.yaml' FAILED tests/test_model.py::test_full_forward_and_backward_gpt2_small[True] - FileNotFoundError: [Errno 2] No such file or directory: '.scripts/train/yamls/pretrain/gpt2-small.yaml' FAILED tests/test_model.py::test_save_from_pretrained - NotADirectoryError: [Errno 20] Not a directory: '/workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/llm_foundry-0.1.0-py3.10.egg/llmfoundry/models/mpt/modeling_mpt.py' FAILED triton/python/test/regression/test_performance.py::test_matmul[256-256-256-float16] - AssertionError: GPU SMs must run at 1350 MHz FAILED triton/python/test/regression/test_performance.py::test_matmul[512-512-512-float16] - AssertionError: GPU SMs must run at 1350 MHz FAILED triton/python/test/regression/test_performance.py::test_matmul[1024-1024-1024-float16] - AssertionError: GPU SMs must run at 1350 MHz FAILED triton/python/test/regression/test_performance.py::test_matmul[2048-2048-2048-float16] - AssertionError: GPU SMs must run at 1350 MHz FAILED triton/python/test/regression/test_performance.py::test_matmul[4096-4096-4096-float16] - AssertionError: GPU SMs must run at 1350 MHz FAILED triton/python/test/regression/test_performance.py::test_matmul[8192-8192-8192-float16] - AssertionError: GPU SMs must run at 1350 MHz FAILED triton/python/test/regression/test_performance.py::test_matmul[16-1024-1024-float16] - AssertionError: GPU SMs must run at 1350 MHz FAILED triton/python/test/regression/test_performance.py::test_matmul[16-4096-4096-float16] - AssertionError: GPU SMs must run at 1350 MHz FAILED triton/python/test/regression/test_performance.py::test_matmul[16-8192-8192-float16] - AssertionError: GPU SMs must run at 1350 MHz FAILED triton/python/test/regression/test_performance.py::test_matmul[64-1024-1024-float16] - AssertionError: GPU SMs must run at 1350 MHz FAILED triton/python/test/regression/test_performance.py::test_matmul[64-4096-4096-float16] - AssertionError: GPU SMs must run at 1350 MHz FAILED triton/python/test/regression/test_performance.py::test_matmul[64-8192-8192-float16] - AssertionError: GPU SMs must run at 1350 MHz FAILED triton/python/test/regression/test_performance.py::test_matmul[1024-64-1024-float16] - AssertionError: GPU SMs must run at 1350 MHz FAILED triton/python/test/regression/test_performance.py::test_matmul[4096-64-4096-float16] - AssertionError: GPU SMs must run at 1350 MHz FAILED triton/python/test/regression/test_performance.py::test_matmul[8192-64-8192-float16] - AssertionError: GPU SMs must run at 1350 MHz FAILED 
triton/python/test/regression/test_performance.py::test_elementwise[16384] - AssertionError: GPU memory must run at 877 MHz FAILED triton/python/test/regression/test_performance.py::test_elementwise[65536] - AssertionError: GPU memory must run at 877 MHz FAILED triton/python/test/regression/test_performance.py::test_elementwise[262144] - AssertionError: GPU memory must run at 877 MHz FAILED triton/python/test/regression/test_performance.py::test_elementwise[1048576] - AssertionError: GPU memory must run at 877 MHz FAILED triton/python/test/regression/test_performance.py::test_elementwise[4194304] - AssertionError: GPU memory must run at 877 MHz FAILED triton/python/test/regression/test_performance.py::test_elementwise[16777216] - AssertionError: GPU memory must run at 877 MHz FAILED triton/python/test/regression/test_performance.py::test_elementwise[67108864] - AssertionError: GPU memory must run at 877 MHz FAILED triton/python/test/unit/operators/test_blocksparse.py::test_softmax[16-256-False] - AssertionError: FAILED triton/python/test/unit/operators/test_blocksparse.py::test_softmax[32-576-False] - AssertionError: FAILED triton/python/test/unit/operators/test_blocksparse.py::test_softmax[64-1871-False] - AssertionError: FAILED triton/python/test/unit/operators/test_blocksparse.py::test_softmax[128-2511-False] - AssertionError: ==================================================== 31 failed, 4852 passed, 49 skipped, 174 deselected, 16 xfailed, 60 warnings in 1997.10s (0:33:17) =====================================================

tginart commented 1 year ago

Not sure if this is helpful, but I was only able to get Triton's flash attention to work on an A100. I tried H100, A10, A6000... and nope.

cocobeach commented 1 year ago

I gave up on the tests and went ahead with a training run. It worked, but after training I get a traceback that seems to originate from this line in PyTorch's FSDP code (fully_sharded_data_parallel.py calling into _optim_utils.py):

File "/workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/torch/distributed/fsdp/_optim_utils.py", line 1637, in _all_gather_optim_state for name, non_tensor_value in object_state.non_tensors.items(): AttributeError: 'int' object has no attribute 'items'

This line tries to iterate over the items of object_state.non_tensors, but it hits an AttributeError because object_state.non_tensors is an integer, and integers don't have an items() method.
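As a toy illustration (a hypothetical stand-in class, not the real FSDP code): if non_tensors ends up holding a bare int, for example an optimizer step count, instead of a dict, iterating it this way raises exactly this error.

class FakeObjectState:
    # Hypothetical stand-in for the per-rank optimizer state object in FSDP.
    def __init__(self, non_tensors):
        self.non_tensors = non_tensors

object_state = FakeObjectState(non_tensors=10)  # an int where a dict is expected
try:
    for name, non_tensor_value in object_state.non_tensors.items():
        pass
except AttributeError as err:
    print(err)  # 'int' object has no attribute 'items'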

Any ideas? CUDA is 11.7, PyTorch is 2.0.1 (the cu117 build, per the env below).

this is the full env: aiohttp==3.8.4 aiosignal==1.3.1 antlr4-python3-runtime==4.9.3 apache-libcloud==3.7.0 appdirs==1.4.4 argcomplete==3.0.8 arrow==1.2.3 async-timeout==4.0.2 attrs==23.1.0 backoff==2.2.1 bcrypt==4.0.1 boto3==1.26.142 botocore==1.29.142 Brotli==1.0.9 certifi==2023.5.7 cffi==1.15.1 charset-normalizer==3.1.0 circuitbreaker==1.4.0 click==8.1.3 cmake==3.26.3 coloredlogs==15.0.1 composer==0.14.1 contourpy==1.0.7 coolname==2.2.0 cryptography==39.0.2 cycler==0.11.0 datasets==2.10.1 decorator==5.1.1 dill==0.3.6 docker==6.1.2 docker-pycreds==0.4.0 einops==0.5.0 exceptiongroup==1.1.1 filelock==3.12.0 flash-attn==1.0.3.post0 flatbuffers==23.5.26 fonttools==4.39.4 frozenlist==1.3.3 fsspec==2023.5.0 gitdb==4.0.10 GitPython==3.1.31 gql==3.4.1 graphql-core==3.2.3 huggingface-hub==0.14.1 humanfriendly==10.0 idna==3.4 importlib-metadata==6.6.0 iniconfig==2.0.0 Jinja2==3.1.2 jmespath==1.0.1 kiwisolver==1.4.4 lit==16.0.5 llm-foundry==0.1.0 markdown-it-py==2.2.0 MarkupSafe==2.1.2 matplotlib==3.7.1 mdurl==0.1.2 mosaicml-cli==0.4.4 mosaicml-streaming==0.4.1 mpmath==1.3.0 multidict==6.0.4 multiprocess==0.70.14 networkx==3.1 numpy==1.24.3 nvidia-cublas-cu11==11.10.3.66 nvidia-cuda-cupti-cu11==11.7.101 nvidia-cuda-nvrtc-cu11==11.7.99 nvidia-cuda-runtime-cu11==11.7.99 nvidia-cudnn-cu11==8.5.0.96 nvidia-cufft-cu11==10.9.0.58 nvidia-curand-cu11==10.2.10.91 nvidia-cusolver-cu11==11.4.0.1 nvidia-cusparse-cu11==11.7.4.91 nvidia-nccl-cu11==2.14.3 nvidia-nvtx-cu11==11.7.91 oci==2.103.0 omegaconf==2.3.0 onnx==1.13.1 onnxruntime==1.14.1 packaging==22.0 pandas==2.0.1 paramiko==3.2.0 pathtools==0.1.2 Pillow==9.5.0 pluggy==1.0.0 prompt-toolkit==3.0.38 protobuf==3.20.3 psutil==5.9.5 py-cpuinfo==9.0.0 pyarrow==12.0.0 pycparser==2.21 Pygments==2.15.1 PyNaCl==1.5.0 pyOpenSSL==23.1.1 pyparsing==3.0.9 pytest==7.3.1 python-dateutil==2.8.2 python-snappy==0.6.1 pytorch-ranger==0.1.1 pytz==2023.3 PyYAML==6.0 questionary==1.10.0 regex==2023.5.5 requests==2.31.0 responses==0.18.0 rich==13.3.5 ruamel.yaml==0.17.28 ruamel.yaml.clib==0.2.7 s3transfer==0.6.1 scipy==1.10.1 sentencepiece==0.1.97 sentry-sdk==1.24.0 setproctitle==1.3.2 six==1.16.0 slack-sdk==3.21.3 smmap==5.0.0 sympy==1.12 tabulate==0.9.0 tokenizers==0.13.3 tomli==2.0.1 torch==2.0.1 torch-optimizer==0.3.0 torchdata==0.6.1 torchmetrics==0.11.3 torchtext==0.15.2 torchvision==0.15.2 tqdm==4.65.0 transformers==4.28.1 triton==2.0.0 triton-pre-mlir @ git+https://github.com/vchiley/triton.git@2dd3b957698a39bbca615c02a447a98482c144a3#subdirectory=python typing_extensions==4.6.2 tzdata==2023.3 urllib3==1.26.16 validators==0.20.0 wandb==0.15.3 wcwidth==0.2.6 websocket-client==1.5.2 websockets==10.4 xentropy-cuda-lib @ git+https://github.com/HazyResearch/flash-attention.git@33e0860c9c5667fded5af674882e731909096a7f#subdirectory=csrc/xentropy xxhash==3.2.0 yarl==1.9.2 zipp==3.15.0 zstd==1.5.5.1

cocobeach commented 1 year ago

And this is the run and the final error. I am running in 8 bits, but still tweaking the other values:

SystemExit: 143 wandb: wandb: Run history: wandb: loss/train/total ███▆▄▃▃▂▂▁ wandb: lr-DecoupledAdamW/group0 ▁▂▃▃▄▅▆▆▇█ wandb: memory/active_mem ▁█████████ wandb: memory/alloc_retries ▁▁▁▁▁▁▁▁▁▁ wandb: memory/allocated_mem ▁█████████ wandb: memory/inactive_mem ▁█████████ wandb: memory/reserved_mem ▁▁▁▁▁▁▁▁▁▁ wandb: metrics/train/LanguageCrossEntropy ███▆▄▃▃▂▂▁ wandb: metrics/train/LanguagePerplexity ███▄▃▂▂▂▁▁ wandb: time/batch ▁▂▃▃▄▅▆▆▇█ wandb: time/batch_in_epoch ▁▂▃▃▄▅▆▆▇█ wandb: time/epoch ▁ wandb: time/remaining_estimate █▇▆▅▅▄▃▂▁ wandb: time/sample ▁▂▃▃▄▅▆▆▇█ wandb: time/sample_in_epoch ▁▂▃▃▄▅▆▆▇█ wandb: time/token ▁▂▃▃▄▅▆▆▇█ wandb: time/token_in_epoch ▁▂▃▃▄▅▆▆▇█ wandb: time/total ▁▂▃▃▄▅▆▆▇█ wandb: time/train ▁▂▃▃▄▅▆▆▇█ wandb: time/val ▁▁▁▁▁▁▁▁▁▁ wandb: trainer/device_train_microbatch_size ▁▁▁▁▁▁▁▁▁▁ wandb: wandb: Run summary: wandb: loss/train/total 9.69673 wandb: lr-DecoupledAdamW/group0 5e-05 wandb: memory/active_mem 1.9783 wandb: memory/alloc_retries 0 wandb: memory/allocated_mem 1.9783 wandb: memory/inactive_mem 1.0269 wandb: memory/reserved_mem 8.5732 wandb: metrics/train/LanguageCrossEntropy 9.69674 wandb: metrics/train/LanguagePerplexity 16264.45312 wandb: time/batch 9 wandb: time/batch_in_epoch 9 wandb: time/epoch 0 wandb: time/remaining_estimate 0.0 wandb: time/sample 2304 wandb: time/sample_in_epoch 2304 wandb: time/token 2359296 wandb: time/token_in_epoch 2359296 wandb: time/total 0.03553 wandb: time/train 0.03553 wandb: time/val 0.0 wandb: trainer/device_train_microbatch_size 8 wandb: wandb: 🚀 View run llm at: https://wandb.ai/maxtensor/llm-foundry-scripts_train/runs/f9gaps6p wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20230529_213452-f9gaps6p/logs Global rank 0 (PID 28598) exited with code 1 Global rank 1 (PID 28599) exited with code 1 ----------Begin global rank 1 STDOUT---------- Initializing model... cfg.n_params=1.25e+08 Building train loader... Building eval loader... Building trainer... Logging config... data_local: my-copy-c4 data_remote: null max_seq_len: 1024 global_seed: 10 run_name: llm model: name: mpt_causal_lm init_device: meta d_model: 768 n_heads: 12 n_layers: 12 expansion_ratio: 4 max_seq_len: ${max_seq_len} vocab_size: 50368 attn_config: attn_impl: triton tokenizer: name: EleutherAI/gpt-neox-20b kwargs: model_max_length: ${max_seq_len} train_loader: name: text dataset: local: ${data_local} remote: ${data_remote} split: train_small shuffle: true max_seq_len: ${max_seq_len} shuffle_seed: ${global_seed} drop_last: true num_workers: 6 eval_loader: name: text dataset: local: ${data_local} remote: ${data_remote} split: val_small shuffle: false max_seq_len: ${max_seq_len} shuffle_seed: ${global_seed} drop_last: false num_workers: 6 scheduler: name: cosine_with_warmup t_warmup: 100ba alpha_f: 0.1 optimizer: name: decoupled_adamw lr: 0.0006 betas:

Starting training...

----------End global rank 1 STDOUT---------- ----------Begin global rank 1 STDERR---------- /workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/composer/callbacks/speed_monitor.py:120: UserWarning: gpu_flop count not found for None with precision: amp_bf16; MFU cannot be calculated and reported. gpu_flops_available can be manuallyoverridden by setting gpu_flops_available in SpeedMonitor. warnings.warn( [batch=1/10]: Train time/epoch: 0 Train time/batch: 0 Train time/sample: 0 Train time/batch_in_epoch: 0 Train time/sample_in_epoch: 0 Train time/token: 0 Train time/token_in_epoch: 0 Train memory/allocated_mem: 1.4586 Train memory/active_mem: 1.4586 Train memory/inactive_mem: 0.8357 Train memory/reserved_mem: 8.5794 Train memory/alloc_retries: 0 Train trainer/device_train_microbatch_size: 8 Train loss/train/total: 11.6153 Train metrics/train/LanguageCrossEntropy: 11.6153 Train metrics/train/LanguagePerplexity: 110783.0234 Train time/train: 0.0073 Train time/val: 0.0000 Train time/total: 0.0073 Train lr-DecoupledAdamW/group0: 0.0000 [batch=2/10]: Train time/batch: 1 Train time/sample: 256 Train time/batch_in_epoch: 1 Train time/sample_in_epoch: 256 Train time/token: 262144 Train time/token_in_epoch: 262144 Train memory/allocated_mem: 1.9580 Train memory/active_mem: 1.9580 Train memory/inactive_mem: 1.0283 Train memory/reserved_mem: 8.5794 Train memory/alloc_retries: 0 Train trainer/device_train_microbatch_size: 8 Train loss/train/total: 11.6203 Train metrics/train/LanguageCrossEntropy: 11.6203 Train metrics/train/LanguagePerplexity: 111336.3203 Train time/train: 0.0104 Train time/val: 0.0000 Train time/total: 0.0104 Train lr-DecoupledAdamW/group0: 0.0000 Train time/remaining_estimate: 0.0248 [batch=3/10]: Train time/batch: 2 Train time/sample: 512 Train time/batch_in_epoch: 2 Train time/sample_in_epoch: 512 Train time/token: 524288 Train time/token_in_epoch: 524288 Train memory/allocated_mem: 1.9580 Train memory/active_mem: 1.9580 Train memory/inactive_mem: 1.0283 Train memory/reserved_mem: 8.5794 Train memory/alloc_retries: 0 Train trainer/device_train_microbatch_size: 8 Train loss/train/total: 11.6143 Train metrics/train/LanguageCrossEntropy: 11.6143 Train metrics/train/LanguagePerplexity: 110665.7031 Train time/train: 0.0135 Train time/val: 0.0000 Train time/total: 0.0135 Train lr-DecoupledAdamW/group0: 0.0000 Train time/remaining_estimate: 0.0218 [batch=4/10]: Train time/batch: 3 Train time/sample: 768 Train time/batch_in_epoch: 3 Train time/sample_in_epoch: 768 Train time/token: 786432 Train time/token_in_epoch: 786432 Train memory/allocated_mem: 1.9580 Train memory/active_mem: 1.9580 Train memory/inactive_mem: 1.0283 Train memory/reserved_mem: 8.5794 Train memory/alloc_retries: 0 Train trainer/device_train_microbatch_size: 8 Train loss/train/total: 11.0220 Train metrics/train/LanguageCrossEntropy: 11.0220 Train metrics/train/LanguagePerplexity: 61206.2227 Train time/train: 0.0167 Train time/val: 0.0000 Train time/total: 0.0167 Train lr-DecoupledAdamW/group0: 0.0000 Train time/remaining_estimate: 0.0187 [batch=5/10]: Train time/batch: 4 Train time/sample: 1024 Train time/batch_in_epoch: 4 Train time/sample_in_epoch: 1024 Train time/token: 1048576 Train time/token_in_epoch: 1048576 Train memory/allocated_mem: 1.9580 Train memory/active_mem: 1.9580 Train memory/inactive_mem: 1.0283 Train memory/reserved_mem: 8.5794 Train memory/alloc_retries: 0 Train trainer/device_train_microbatch_size: 8 Train loss/train/total: 10.5514 Train metrics/train/LanguageCrossEntropy: 10.5514 Train 
metrics/train/LanguagePerplexity: 38232.6797 Train time/train: 0.0198 Train time/val: 0.0000 Train time/total: 0.0198 Train lr-DecoupledAdamW/group0: 0.0000 Train time/remaining_estimate: 0.0156 [batch=6/10]: Train time/batch: 5 Train time/sample: 1280 Train time/batch_in_epoch: 5 Train time/sample_in_epoch: 1280 Train time/token: 1310720 Train time/token_in_epoch: 1310720 Train memory/allocated_mem: 1.9580 Train memory/active_mem: 1.9580 Train memory/inactive_mem: 1.0283 Train memory/reserved_mem: 8.5794 Train memory/alloc_retries: 0 Train trainer/device_train_microbatch_size: 8 Train loss/train/total: 10.3442 Train metrics/train/LanguageCrossEntropy: 10.3442 Train metrics/train/LanguagePerplexity: 31076.6367 Train time/train: 0.0230 Train time/val: 0.0000 Train time/total: 0.0230 Train lr-DecoupledAdamW/group0: 0.0000 Train time/remaining_estimate: 0.0125 [batch=7/10]: Train time/batch: 6 Train time/sample: 1536 Train time/batch_in_epoch: 6 Train time/sample_in_epoch: 1536 Train time/token: 1572864 Train time/token_in_epoch: 1572864 Train memory/allocated_mem: 1.9580 Train memory/active_mem: 1.9580 Train memory/inactive_mem: 1.0283 Train memory/reserved_mem: 8.5794 Train memory/alloc_retries: 0 Train trainer/device_train_microbatch_size: 8 Train loss/train/total: 10.1905 Train metrics/train/LanguageCrossEntropy: 10.1905 Train metrics/train/LanguagePerplexity: 26649.7637 Train time/train: 0.0261 Train time/val: 0.0000 Train time/total: 0.0261 Train lr-DecoupledAdamW/group0: 0.0000 Train time/remaining_estimate: 0.0094 [batch=8/10]: Train time/batch: 7 Train time/sample: 1792 Train time/batch_in_epoch: 7 Train time/sample_in_epoch: 1792 Train time/token: 1835008 Train time/token_in_epoch: 1835008 Train memory/allocated_mem: 1.9580 Train memory/active_mem: 1.9580 Train memory/inactive_mem: 1.0283 Train memory/reserved_mem: 8.5794 Train memory/alloc_retries: 0 Train trainer/device_train_microbatch_size: 8 Train loss/train/total: 10.0649 Train metrics/train/LanguageCrossEntropy: 10.0649 Train metrics/train/LanguagePerplexity: 23502.5723 Train time/train: 0.0293 Train time/val: 0.0000 Train time/total: 0.0293 Train lr-DecoupledAdamW/group0: 0.0000 Train time/remaining_estimate: 0.0063 [batch=9/10]: Train time/batch: 8 Train time/sample: 2048 Train time/batch_in_epoch: 8 Train time/sample_in_epoch: 2048 Train time/token: 2097152 Train time/token_in_epoch: 2097152 Train memory/allocated_mem: 1.9580 Train memory/active_mem: 1.9580 Train memory/inactive_mem: 1.0283 Train memory/reserved_mem: 8.5794 Train memory/alloc_retries: 0 Train trainer/device_train_microbatch_size: 8 Train loss/train/total: 9.8342 Train metrics/train/LanguageCrossEntropy: 9.8342 Train metrics/train/LanguagePerplexity: 18660.9160 Train time/train: 0.0324 Train time/val: 0.0000 Train time/total: 0.0324 Train lr-DecoupledAdamW/group0: 0.0000 Train time/remaining_estimate: 0.0031 [batch=10/10]: Train time/batch: 9 Train time/sample: 2304 Train time/batch_in_epoch: 9 Train time/sample_in_epoch: 2304 Train time/token: 2359296 Train time/token_in_epoch: 2359296 Train memory/allocated_mem: 1.9580 Train memory/active_mem: 1.9580 Train memory/inactive_mem: 1.0283 Train memory/reserved_mem: 8.5794 Train memory/alloc_retries: 0 Train trainer/device_train_microbatch_size: 8 Train loss/train/total: 9.6967 Train metrics/train/LanguageCrossEntropy: 9.6967 Train metrics/train/LanguagePerplexity: 16264.4531 Train time/train: 0.0355 Train time/val: 0.0000 Train time/total: 0.0355 Train lr-DecoupledAdamW/group0: 0.0001 Train 
time/remaining_estimate: 0.0000 Traceback (most recent call last): File "/workspace/llm-foundry/scripts/train/train.py", line 254, in main(cfg) File "/workspace/llm-foundry/scripts/train/train.py", line 243, in main trainer.fit() File "/workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/composer/trainer/trainer.py", line 1766, in fit self._train_loop() File "/workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/composer/trainer/trainer.py", line 1996, in _train_loop self.engine.run_event(Event.BATCH_CHECKPOINT) File "/workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/composer/core/engine.py", line 293, in run_event self._run_nonlogger_callbacks(event) File "/workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/composer/core/engine.py", line 475, in _run_nonlogger_callbacks self._run_callbacks(event, callbacks) File "/workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/composer/core/engine.py", line 467, in _run_callbacks cb.run_event(event, self.state, self.logger) File "/workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/composer/core/callback.py", line 96, in run_event return event_cb(state, logger) File "/workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/composer/callbacks/checkpoint_saver.py", line 346, in batch_checkpoint self._save_checkpoint( File "/workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/composer/callbacks/checkpoint_saver.py", line 384, in _save_checkpoint saved_path = checkpoint.save_checkpoint( File "/workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/composer/utils/checkpoint.py", line 518, in save_checkpoint 'state': state.state_dict(), File "/workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/composer/core/state.py", line 802, in state_dict fsdp_get_optim_state_dict(self.model, optimizer, state_dict_type=self.fsdp_state_dict_type) File "/workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/composer/core/state.py", line 127, in fsdp_get_optim_state_dict optim_state_dict = FSDP.optim_state_dict(model, optim) # type: ignore File "/workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1753, in optim_state_dict return FullyShardedDataParallel._optim_state_dict_impl( File "/workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1154, in _optim_state_dict_impl return _optim_state_dict( File "/workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/torch/distributed/fsdp/_optim_utils.py", line 1455, in _optim_state_dict _gather_orig_param_state( File "/workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/torch/distributed/fsdp/_optim_utils.py", line 1690, in _gather_orig_param_state gathered_state = _all_gather_optim_state(fsdp_state, optim_state) File "/workspace/llm-foundry/llmfoundryenv/lib/python3.10/site-packages/torch/distributed/fsdp/_optim_utils.py", line 1637, in _all_gather_optim_state for name, non_tensor_value in object_state.non_tensors.items(): AttributeError: 'int' object has no attribute 'items'

----------End global rank 1 STDERR---------- ERROR:composer.cli.launcher:Global rank 0 (PID 28598) exited with code 1 (llmfoundryenv) root@fe708568bac8:/workspace/llm-foundry/scripts#

vchiley commented 1 year ago

The

for name, non_tensor_value in object_state.non_tensors.items():
AttributeError: 'int' object has no attribute 'items'

issue is a known issue when using torch 2; it is fixed in Composer's dev branch and will be included in the next release of Composer.

vchiley commented 1 year ago

@tginart I've been able to run the Triton implementation of flash attention (attn_impl: triton) on A100s and H100s since this was merged. I think people have run it on A10s, but I haven't, so I cannot verify. I don't know of anyone else who has tried running it on an A6000.
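For anyone reading along, the attention implementation is selected through the model config's attn_config dict (the keys are visible in the MPTConfig dump earlier in this thread). A minimal sketch, assuming the mosaicml/mpt-7b checkpoint from the Hub that the test suite above already pulls with trust_remote_code:

from transformers import AutoConfig

# Load the MPT config and switch the attention implementation.
config = AutoConfig.from_pretrained('mosaicml/mpt-7b', trust_remote_code=True)
config.attn_config['attn_impl'] = 'triton'  # 'torch', 'flash', or 'triton'
print(config.attn_config['attn_impl'])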

vchiley commented 1 year ago

Triton Test Failing

I've messed about with Triton, but am no expert. I'm not sure why the tests are or are not passing. We only use it for attention, which we test here.