huggingface / alignment-handbook

Robust recipes to align language models with human and AI preferences
https://huggingface.co/HuggingFaceH4
Apache License 2.0

[process exited with code 1 (0x00000001)] #63

Open patchie opened 10 months ago

patchie commented 10 months ago

Just wanted to report a crash while training.

Error message: [process exited with code 1 (0x00000001)]

Command I used to start the process:

ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 scripts/run_sft.py recipes/zephyr-7b-beta/sft/config_lora.yaml --load_in_4bit=true --gradient_accumulation_steps=1024 --per_device_eval_batch_size=1 --per_device_train_batch_size=1
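For context, those overrides roughly correspond to a QLoRA-style setup: the 4-bit-quantized Mistral-7B base model stays frozen and only LoRA adapters are trained, with an effective batch size of 1 sample × 1024 accumulation steps. Below is a minimal sketch of the equivalent Python, for orientation only; it is not the handbook's run_sft.py (which builds everything from the YAML recipe), and it assumes the trl/transformers versions visible in the log below, a tiny stand-in dataset, and the recipe's approximate LoRA settings.

```python
# Illustrative sketch of what the CLI overrides translate to (not the real run_sft.py).
import torch
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer

model_id = "mistralai/Mistral-7B-v0.1"

# --load_in_4bit=true -> quantize the frozen base model with bitsandbytes (QLoRA-style)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# --per_device_train_batch_size=1 --gradient_accumulation_steps=1024
#  -> single-sample memory footprint per step, effective batch size of 1024
training_args = TrainingArguments(
    output_dir="data/zephyr-7b-sft-lora",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=1024,
    num_train_epochs=1,
    bf16=True,
)

# LoRA adapters are the only trainable weights; r=64 on q/k/v/o is consistent with
# the ~54.5M trainable parameters reported in the log below (values assumed from the recipe).
peft_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.1, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Tiny stand-in dataset; the real script uses UltraChat with the chat template applied.
train_dataset = Dataset.from_dict(
    {"text": ["<|user|>\nHi</s>\n<|assistant|>\nHello!</s>\n"]}
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    tokenizer=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```

Note that newer trl releases moved several of these arguments (dataset_text_field, max_seq_length, packing) into SFTConfig, so the exact keyword arguments depend on the installed version.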

Explanation: I ran the process for several days; then my wife disconnected my laptop from the power source and moved the PC from the living room to another room (as it was so noisy), and then it seemed to crash. I'm not sure whether it was triggered by the power disconnect or whether it just happened around that time.

I will just try to run it again.
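In case it crashes again mid-run, the underlying transformers Trainer can resume from the last saved checkpoint instead of starting from step 0, provided checkpoints are being written periodically. Whether run_sft.py exposes these options directly on the command line depends on the recipe version, so the sketch below only shows the generic TrainingArguments pattern, with hypothetical values.

```python
# Sketch only: periodic checkpointing plus resume, using standard Trainer options.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="data/zephyr-7b-sft-lora",
    save_strategy="steps",   # write a checkpoint every `save_steps` optimizer steps
    save_steps=10,           # illustrative value; each step here takes ~1h, so keep it small
    save_total_limit=2,      # keep disk usage bounded
)

# ...build the SFTTrainer as usual, then resume from the newest checkpoint in output_dir:
# trainer.train(resume_from_checkpoint=True)
```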

Log:

2023-11-27 15:41:40 - INFO - main - Load pretrained model
2023-11-27 15:41:40 - INFO - main - Model loaded!
/usr/local/lib/python3.11/dist-packages/trl/trainer/sft_trainer.py:145: UserWarning: You passed a model_id to the SFTTrainer. This will automatically create an AutoModelForCausalLM or a PeftModel (if you passed a peft_config) for you.
  warnings.warn(
[INFO|configuration_utils.py:717] 2023-11-27 15:41:40,964 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-v0.1/snapshots/5e9c98b96d071dce59368012254c55b0ec6f8658/config.json
[INFO|configuration_utils.py:777] 2023-11-27 15:41:40,964 >> Model config MistralConfig {
  "_name_or_path": "mistralai/Mistral-7B-v0.1",
  "architectures": [ "MistralForCausalLM" ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "mistral",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "rms_norm_eps": 1e-05,
  "rope_theta": 10000.0,
  "sliding_window": 4096,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.35.0",
  "use_cache": false,
  "vocab_size": 32000
}

[INFO|modeling_utils.py:3121] 2023-11-27 15:41:40,972 >> loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-v0.1/snapshots/5e9c98b96d071dce59368012254c55b0ec6f8658/pytorch_model.bin.index.json
[INFO|modeling_utils.py:3184] 2023-11-27 15:41:40,974 >> Will use torch_dtype=torch.bfloat16 as defined in model's config object
[INFO|modeling_utils.py:1222] 2023-11-27 15:41:40,974 >> Instantiating MistralForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:791] 2023-11-27 15:41:40,976 >> Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2,
  "use_cache": false
}

[INFO|modeling_utils.py:3257] 2023-11-27 15:41:41,631 >> Detected 4-bit loading: activating 4-bit loading for this model
Loading checkpoint shards: 100%|██████████| 2/2 [00:09<00:00, 4.75s/it]
[INFO|modeling_utils.py:3950] 2023-11-27 15:41:51,332 >> All model checkpoint weights were used when initializing MistralForCausalLM.

[INFO|modeling_utils.py:3958] 2023-11-27 15:41:51,332 >> All the weights of MistralForCausalLM were initialized from the model checkpoint at mistralai/Mistral-7B-v0.1. If your task is similar to the task the model of the checkpoint was trained on, you can already use MistralForCausalLM for predictions without further training.
[INFO|configuration_utils.py:751] 2023-11-27 15:41:51,488 >> loading configuration file generation_config.json from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-v0.1/snapshots/5e9c98b96d071dce59368012254c55b0ec6f8658/generation_config.json
[INFO|configuration_utils.py:791] 2023-11-27 15:41:51,488 >> Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2
}

[INFO|training_args.py:1784] 2023-11-27 15:41:51,646 >> PyTorch: setting up devices
/usr/local/lib/python3.11/dist-packages/trl/trainer/sft_trainer.py:247: UserWarning: You passed a tokenizer with padding_side not equal to right to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding tokenizer.padding_side = 'right' to your code.
  warnings.warn(
[INFO|trainer.py:593] 2023-11-27 15:41:52,619 >> Using auto half precision backend
2023-11-27 15:41:52 - INFO - main - Train
[INFO|trainer.py:1723] 2023-11-27 15:41:53,614 >> Running training
[INFO|trainer.py:1724] 2023-11-27 15:41:53,614 >> Num examples = 207,865
[INFO|trainer.py:1725] 2023-11-27 15:41:53,614 >> Num Epochs = 1
[INFO|trainer.py:1726] 2023-11-27 15:41:53,614 >> Instantaneous batch size per device = 1
[INFO|trainer.py:1729] 2023-11-27 15:41:53,614 >> Total train batch size (w. parallel, distributed & accumulation) = 1,024
[INFO|trainer.py:1730] 2023-11-27 15:41:53,614 >> Gradient Accumulation steps = 1024
[INFO|trainer.py:1731] 2023-11-27 15:41:53,614 >> Total optimization steps = 202
[INFO|trainer.py:1732] 2023-11-27 15:41:53,616 >> Number of trainable parameters = 54,525,952
  0%|          | 0/202 [00:00<?, ?it/s]
[WARNING|tokenization_utils_base.py:3831] 2023-11-27 15:41:54,956 >> Token indices sequence length is longer than the specified maximum sequence length for this model (2377 > 2048). Running this sequence through the model will result in indexing errors
[WARNING|logging.py:314] 2023-11-27 15:41:55,018 >> You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
[WARNING|logging.py:329] 2023-11-27 15:41:55,763 >> The input hidden states seems to be silently casted in float32, this might be related to the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in torch.bfloat16.
[W reducer.cpp:1346] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
  0%|▌         | 1/202 [4:36:47<927:14:16, 16607.25s/it]
{'loss': 1.1453, 'learning_rate': 1.9998790632601496e-05, 'epoch': 0.0}
{'loss': 1.1416, 'learning_rate': 1.9969780438256295e-05, 'epoch': 0.02}
{'loss': 1.1346, 'learning_rate': 1.987930439740757e-05, 'epoch': 0.05}
{'loss': 1.1079, 'learning_rate': 1.9729118706714377e-05, 'epoch': 0.07}
{'loss': 1.0977, 'learning_rate': 1.95201310753273e-05, 'epoch': 0.1}
{'loss': 1.0881, 'learning_rate': 1.925360460617242e-05, 'epoch': 0.12}
{'loss': 1.0713, 'learning_rate': 1.8931150161867917e-05, 'epoch': 0.15}
{'loss': 1.0523, 'learning_rate': 1.855471662881164e-05, 'epoch': 0.17}
{'loss': 1.0533, 'learning_rate': 1.8126579138282502e-05, 'epoch': 0.2}
{'loss': 1.0427, 'learning_rate': 1.764932531574648e-05, 'epoch': 0.22}
{'loss': 1.0307, 'learning_rate': 1.7125839641475074e-05, 'epoch': 0.25}
{'loss': 1.0395, 'learning_rate': 1.65592860169994e-05, 'epoch': 0.27}
{'loss': 1.0268, 'learning_rate': 1.595308864276666e-05, 'epoch': 0.3}
{'loss': 1.0304, 'learning_rate': 1.531091132257275e-05, 'epoch': 0.32}
{'loss': 1.0264, 'learning_rate': 1.4636635319853274e-05, 'epoch': 0.34}
{'loss': 1.0232, 'learning_rate': 1.3934335899667526e-05, 'epoch': 0.37}
{'loss': 1.0094, 'learning_rate': 1.3208257698153677e-05, 'epoch': 0.39}
{'loss': 1.0238, 'learning_rate': 1.2462789068320016e-05, 'epoch': 0.42}
{'loss': 1.013, 'learning_rate': 1.1702435557223988e-05, 'epoch': 0.44}
{'loss': 1.022, 'learning_rate': 1.0931792674840718e-05, 'epoch': 0.47}
{'loss': 1.0153, 'learning_rate': 1.0155518119203511e-05, 'epoch': 0.49}
{'loss': 1.0143, 'learning_rate': 9.378303625685196e-06, 'epoch': 0.52}
{'loss': 1.0191, 'learning_rate': 8.604846610560771e-06, 'epoch': 0.54}
{'loss': 1.0176, 'learning_rate': 7.839821780235168e-06, 'epoch': 0.57}
{'loss': 1.0169, 'learning_rate': 7.0878528777274814e-06, 'epoch': 0.59}
 60%|██████    | 121/202 [142:29:15<78:43:12, 3498.68s/it]
[process exited with code 1 (0x00000001)]
You can now close this terminal with Ctrl+D, or press Enter to restart.
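Side note on the warnings in the log above: they did not stop training, but the padding_side one can be addressed exactly as the message suggests, and the "2377 > 2048" one is usually benign because samples are truncated/packed to max_seq_length before reaching the model. A hedged sketch for standalone use (not a change to the handbook script, which handles this via the recipe config):

```python
# Illustrative: silencing the two tokenizer warnings seen during the run.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# SFTTrainer warns when padding_side != "right" while training in half precision.
tokenizer.padding_side = "right"

# The "2377 > 2048" warning fires when a raw sample exceeds tokenizer.model_max_length;
# with max_seq_length=2048 the trainer chunks/truncates sequences, so no indexing error occurs.
tokenizer.model_max_length = 2048
```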

patchie commented 9 months ago

I ran it again, and this time it seems to have finished successfully.

Adding the last lines of the log from the second run, in case it helps with troubleshooting.

{'loss': 1.0238, 'learning_rate': 1.2462789068320016e-05, 'epoch': 0.42}
{'loss': 1.013, 'learning_rate': 1.1702435557223988e-05, 'epoch': 0.44}
{'loss': 1.022, 'learning_rate': 1.0931792674840718e-05, 'epoch': 0.47}
{'loss': 1.0153, 'learning_rate': 1.0155518119203511e-05, 'epoch': 0.49}
{'loss': 1.0143, 'learning_rate': 9.378303625685196e-06, 'epoch': 0.52}
{'loss': 1.0191, 'learning_rate': 8.604846610560771e-06, 'epoch': 0.54}
{'loss': 1.0176, 'learning_rate': 7.839821780235168e-06, 'epoch': 0.57}
{'loss': 1.0169, 'learning_rate': 7.0878528777274814e-06, 'epoch': 0.59}
{'loss': 1.0168, 'learning_rate': 6.35348473717345e-06, 'epoch': 0.62}
{'loss': 1.0117, 'learning_rate': 5.64115581524629e-06, 'epoch': 0.64}
{'loss': 1.0106, 'learning_rate': 4.955171365513603e-06, 'epoch': 0.67}
 67%|██████▋   | 136/202 [150:33:44<70:43:02, 3857.31s/it]
[INFO|trainer.py:3158] 2023-12-10 17:22:46,485 >> Running Evaluation
[INFO|trainer.py:3160] 2023-12-10 17:22:46,486 >> Num examples = 23110
[INFO|trainer.py:3163] 2023-12-10 17:22:46,486 >> Batch size = 1
{'eval_loss': 1.0159717798233032, 'eval_runtime': 19243.2251, 'eval_samples_per_second': 1.201, 'eval_steps_per_second': 1.201, 'epoch': 0.67}
 67%|██████▋   | 136/202 [156:04:41<70:43:02, 3857.31s/it]
[INFO|trainer.py:1955] 2023-12-10 22:43:29,715 >>

Training completed. Do not forget to share your model on huggingface.co/models =)

{'train_runtime': 561882.257, 'train_samples_per_second': 0.37, 'train_steps_per_second': 0.0, 'train_loss': 1.0438810963841045, 'epoch': 0.67}
 67%|██████▋   | 136/202 [156:04:41<75:44:37, 4131.48s/it]
train metrics
  epoch                    = 0.67
  train_loss               = 1.0439
  train_runtime            = 6 days, 12:04:42.25
  train_samples            = 207865
  train_samples_per_second = 0.37
  train_steps_per_second   = 0.0
2023-12-10 22:43:29 - INFO - main - Evaluate
[INFO|trainer.py:3158] 2023-12-10 22:43:29,739 >> Running Evaluation
[INFO|trainer.py:3160] 2023-12-10 22:43:29,739 >> Num examples = 23110
[INFO|trainer.py:3163] 2023-12-10 22:43:29,739 >> Batch size = 1
 67%|██████▋   | 15431/23110 [5:22:04<2:40:16, 1.25s/it]
eval metrics
  epoch                   = 0.67
  eval_loss               = 1.016
  eval_runtime            = 5:22:05.99
  eval_samples            = 23110
  eval_samples_per_second = 1.196
  eval_steps_per_second   = 1.196
2023-12-11 04:05:35 - INFO - main - Save model
[INFO|trainer.py:2881] 2023-12-11 04:05:35,784 >> Saving model checkpoint to data/zephyr-7b-sft-lora
[INFO|tokenization_utils_base.py:2428] 2023-12-11 04:05:39,111 >> tokenizer config file saved in data/zephyr-7b-sft-lora/tokenizer_config.json
[INFO|tokenization_utils_base.py:2437] 2023-12-11 04:05:39,115 >> Special tokens file saved in data/zephyr-7b-sft-lora/special_tokens_map.json
[INFO|trainer.py:2881] 2023-12-11 04:05:39,299 >> Saving model checkpoint to data/zephyr-7b-sft-lora
[INFO|tokenization_utils_base.py:2428] 2023-12-11 04:05:41,961 >> tokenizer config file saved in data/zephyr-7b-sft-lora/tokenizer_config.json
[INFO|tokenization_utils_base.py:2437] 2023-12-11 04:05:41,966 >> Special tokens file saved in data/zephyr-7b-sft-lora/special_tokens_map.json
events.out.tfevents.1702263935.17694.1: 100%|██████████| 359/359 [00:01<00:00, 189B/s]
events.out.tfevents.1701096113.9499.0: 100%|██████████| 8.50k/8.50k [00:01<00:00, 4.45kB/s]
events.out.tfevents.1701681021.4007.0: 100%|██████████| 4.65k/4.65k [00:01<00:00, 2.40kB/s]
events.out.tfevents.1701682727.17694.0: 100%|██████████| 9.59k/9.59k [00:01<00:00, 4.94kB/s]
training_args.bin: 100%|██████████| 4.66k/4.66k [00:00<00:00, 27.4kB/s]
tokenizer.model: 100%|██████████| 493k/493k [00:00<00:00, 692kB/s]
adapter_model.safetensors: 100%|██████████| 218M/218M [00:19<00:00, 11.2MB/s]
Upload 7 LFS files: 100%|██████████| 7/7 [00:20<00:00, 2.87s/it]
2023-12-11 04:06:06 - INFO - main - Model saved to data/zephyr-7b-sft-lora
[INFO|modelcard.py:452] 2023-12-11 04:06:06,770 >> Dropping the following result as it does not have all the necessary fields: {'dataset': {'name': 'HuggingFaceH4/ultrachat_200k', 'type': 'HuggingFaceH4/ultrachat_200k'}}
[INFO|configuration_utils.py:461] 2023-12-11 04:06:06,779 >> Configuration saved in data/zephyr-7b-sft-lora/config.json
2023-12-11 04:06:06 - INFO - main - Pushing to hub...
[INFO|trainer.py:2881] 2023-12-11 04:06:06,779 >> Saving model checkpoint to data/zephyr-7b-sft-lora
[INFO|tokenization_utils_base.py:2428] 2023-12-11 04:06:09,653 >> tokenizer config file saved in data/zephyr-7b-sft-lora/tokenizer_config.json
[INFO|tokenization_utils_base.py:2437] 2023-12-11 04:06:09,659 >> Special tokens file saved in data/zephyr-7b-sft-lora/special_tokens_map.json
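For anyone picking this up later: the run saves a LoRA adapter (adapter_model.safetensors, ~218M) rather than full model weights, so it has to be loaded back on top of the Mistral base model for inference. A minimal sketch, assuming the adapter is still in data/zephyr-7b-sft-lora (or the pushed Hub repo) and using the Zephyr-style chat format:

```python
# Sketch: load the trained LoRA adapter on top of the base model and generate.
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_path = "data/zephyr-7b-sft-lora"  # or the Hub repo id it was pushed to

# AutoPeftModelForCausalLM reads adapter_config.json, loads the base model it points to,
# and attaches the adapter weights.
model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_path, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(adapter_path)

prompt = "<|user|>\nWhat is LoRA fine-tuning?</s>\n<|assistant|>\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```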