I attempted to fine-tune a 6 billion parameter model using 8 A100 GPUs, but the training process encountered interruptions. On the first attempt, it stopped at 0.15 epochs, and on the second attempt, where I started from 2 epochs, it oddly skipped some epochs, jumping from 0.15 directly to 1, and then stopped at 2.25. For more detailed information, you can check this WandB link - https://wandb.ai/neural-network-018/huggingface/runs/8xmy6gtd/

Configs -

Model arguments

model_name_or_path: 01-ai/Yi-6B model_revision: main torch_dtype: bfloat16 use_flash_attention_2: false trust_remote_code: true

Data training arguments

dataset_mixer: communityai/apt-chat-micro-dataset-llm-v2-714k: 0.4 dataset_splits:

train
test preprocessing_num_workers: 12

SFT trainer config

bf16: true do_eval: true evaluation_strategy: epoch gradient_accumulation_steps: 4 gradient_checkpointing: false hub_model_id: apt-chat-yi-6B-sft-full hub_strategy: every_save learning_rate: 0.00002 log_level: info logging_steps: 50 logging_strategy: steps lr_scheduler_type: cosine max_seq_length: 4096 max_steps: -1 num_train_epochs: 2 output_dir: data/apt-chat-yi-6B-sft-full overwrite_output_dir: true per_device_eval_batch_size: 1 per_device_train_batch_size: 1 push_to_hub: true remove_unused_columns: true report_to:

wandb save_strategy: "no" save_total_limit: null seed: 42 tf32: true

LOGS -

INFO:root:Using nproc_per_node=8. [2023-11-14 02:09:37,658] torch.distributed.run: [WARNING] [2023-11-14 02:09:37,658] torch.distributed.run: [WARNING] [2023-11-14 02:09:37,658] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. [2023-11-14 02:09:37,658] torch.distributed.run: [WARNING] [2023-11-14 02:09:45,328] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2023-11-14 02:09:45,584] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) /usr/local/lib/python3.10/dist-packages/trl/trainer/ppo_config.py:141: UserWarning: The optimize_cuda_cache arguement will be deprecated soon, please use optimize_device_cache instead. warnings.warn( [2023-11-14 02:09:45,607] [INFO] [comm.py:637:init_distributed] cdb=None 2023-11-14 02:09:45 - WARNING - main - Process rank: 7, device: cuda:7, n_gpu: 1 distributed training: True, 16-bits training: False [2023-11-14 02:09:45,646] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2023-11-14 02:09:45,793] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2023-11-14 02:09:45,832] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2023-11-14 02:09:45,834] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2023-11-14 02:09:45,835] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) /usr/local/lib/python3.10/dist-packages/trl/trainer/ppo_config.py:141: UserWarning: The optimize_cuda_cache arguement will be deprecated soon, please use optimize_device_cache instead. warnings.warn( [2023-11-14 02:09:45,864] [INFO] [comm.py:637:init_distributed] cdb=None [2023-11-14 02:09:45,908] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) /usr/local/lib/python3.10/dist-packages/trl/trainer/ppo_config.py:141: UserWarning: The optimize_cuda_cache arguement will be deprecated soon, please use optimize_device_cache instead. warnings.warn( 2023-11-14 02:09:45 - WARNING - main - Process rank: 5, device: cuda:5, n_gpu: 1 distributed training: True, 16-bits training: False [2023-11-14 02:09:45,939] [INFO] [comm.py:637:init_distributed] cdb=None [2023-11-14 02:09:45,939] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl 2023-11-14 02:09:45 - WARNING - main - Process rank: 0, device: cuda:0, n_gpu: 1 distributed training: True, 16-bits training: False 2023-11-14 02:09:45 - INFO - main - Model parameters ModelArguments(base_model_revision=None, model_name_or_path='01-ai/Yi-6B', model_revision='main', model_code_revision=None, torch_dtype='bfloat16', trust_remote_code=True, use_flash_attention_2=False, use_peft=False, lora_r=16, lora_alpha=32, lora_dropout=0.05, lora_target_modules=None, lora_modules_to_save=None, load_in_8bit=False, load_in_4bit=False, bnb_4bit_quant_type='nf4', use_bnb_nested_quant=False) 2023-11-14 02:09:45 - INFO - main - Data parameters DataArguments(chat_template=None, dataset_mixer={'communityai/apt-chat-micro-dataset-llm-v2-714k': 0.4}, dataset_splits=['train', 'test'], max_train_samples=None, max_eval_samples=None, preprocessing_num_workers=12, truncation_side=None) 2023-11-14 02:09:45 - INFO - main - Training/evaluation parameters SFTConfig( _n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=0, dataloader_pin_memory=True, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=None, disable_tqdm=False, dispatch_batches=None, do_eval=True, do_predict=False, do_train=False, eval_accumulation_steps=None, eval_delay=0, eval_steps=None, evaluation_strategy=epoch, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, gradient_accumulation_steps=4, gradient_checkpointing=False, gradient_checkpointing_kwargs=None, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_always_push=False, hub_model_id=apt-chat-yi-6B-sft-full, hub_private_repo=False, hub_strategy=every_save, hub_token=, ignore_data_skip=False, include_inputs_for_metrics=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=2e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=info, log_level_replica=warning, log_on_each_node=True, logging_dir=data/apt-chat-yi-6B-sft-full/runs/Nov14_02-09-45_6191edb408fa, logging_first_step=True, logging_nan_inf_filter=True, logging_steps=50, logging_strategy=steps, lr_scheduler_type=cosine, max_grad_norm=1.0, max_seq_length=4096, max_steps=-1, metric_for_best_model=None, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, num_train_epochs=2, optim=adamw_torch, optim_args=None, output_dir=data/apt-chat-yi-6B-sft-full, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=1, per_device_train_batch_size=1, prediction_loss_only=False, push_to_hub=True, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=True, report_to=['wandb'], resume_from_checkpoint=None, run_name=data/apt-chat-yi-6B-sft-full, save_on_each_node=False, save_safetensors=True, save_steps=500, save_strategy=no, save_total_limit=None, seed=42, skip_memory_metrics=True, split_batches=False, tf32=True, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.0, warmup_steps=0, weight_decay=0.0, ) 2023-11-14 02:09:48 - INFO - main - Training on the following datasets and their proportions: ['train : 285436', 'test : 500'] ++++++++++++++++++++++++++++++++++++++ YiTokenizer(name_or_path='01-ai/Yi-6B', vocab_size=64000, model_max_length=4096, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|startoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '', 'pad_token': ''}, clean_up_tokenization_spaces=False), added_tokens_decoder={ 0: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True), 1: AddedToken("<|startoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True), 2: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True), } 2023-11-14 02:09:53 - INFO - main - Load pretrained model neftune_noise_alpha - 5.0 training_args - 2023-11-14 02:09:53 - INFO - main - Model loaded! neftune_noise_alpha - 5.0 training_args - SFTConfig( _n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=0, dataloader_pin_memory=True, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=None, disable_tqdm=False, dispatch_batches=None, do_eval=True, do_predict=False, do_train=False, eval_accumulation_steps=None, eval_delay=0, eval_steps=None, evaluation_strategy=epoch, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, gradient_accumulation_steps=4, gradient_checkpointing=False, gradient_checkpointing_kwargs=None, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_always_push=False, hub_model_id=apt-chat-yi-6B-sft-full, hub_private_repo=False, hub_strategy=every_save, hub_token=, ignore_data_skip=False, include_inputs_for_metrics=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=2e-05, length_column_name=length, load_best_model_at_end=False, local_rank=5, log_level=info, log_level_replica=warning, log_on_each_node=True, logging_dir=data/apt-chat-yi-6B-sft-full/runs/Nov14_02-09-45_6191edb408fa, logging_first_step=True, logging_nan_inf_filter=True, logging_steps=50, logging_strategy=steps, lr_scheduler_type=cosine, max_grad_norm=1.0, max_seq_length=4096, max_steps=-1, metric_for_best_model=None, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, num_train_epochs=2, optim=adamw_torch, optim_args=None, output_dir=data/apt-chat-yi-6B-sft-full, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=1, per_device_train_batch_size=1, prediction_loss_only=False, push_to_hub=True, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=True, report_to=['wandb'], resume_from_checkpoint=None, run_name=data/apt-chat-yi-6B-sft-full, save_on_each_node=False, save_safetensors=True, save_steps=500, save_strategy=no, save_total_limit=None, seed=42, skip_memory_metrics=True, split_batches=False, tf32=True, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.0, warmup_steps=0, weight_decay=0.0, ) /usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py:145: UserWarning: You passed a model_id to the SFTTrainer. This will automatically create an AutoModelForCausalLM or a PeftModel (if you passed a peft_config) for you. warnings.warn( [INFO|configuration_utils.py:717] 2023-11-14 02:09:53,295 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--01-ai--Yi-6B/snapshots/5978aa81cd0fb25852004e7a86c71435b3f8de31/config.json [INFO|configuration_utils.py:717] 2023-11-14 02:09:53,384 >> loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--01-ai--Yi-6B/snapshots/5978aa81cd0fb25852004e7a86c71435b3f8de31/config.json [INFO|configuration_utils.py:777] 2023-11-14 02:09:53,386 >> Model config YiConfig { "_name_or_path": "01-ai/Yi-6B", "architectures": [ "YiForCausalLM" ], "auto_map": { "AutoConfig": "01-ai/Yi-6B--configuration_yi.YiConfig", "AutoModel": "01-ai/Yi-6B--modeling_yi.YiModel", "AutoModelForCausalLM": "01-ai/Yi-6B--modeling_yi.YiForCausalLM" }, "bos_token_id": 1, "eos_token_id": 2, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 11008, "max_position_embeddings": 4096, "model_type": "Yi", "num_attention_heads": 32, "num_hidden_layers": 32, "num_key_value_heads": 4, "pad_token_id": 0, "rms_norm_eps": 1e-05, "rope_theta": 5000000.0, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.35.0", "use_cache": true, "vocab_size": 64000 }

[INFO|modeling_utils.py:3121] 2023-11-14 02:09:53,499 >> loading weights file model.safetensors from cache at /root/.cache/huggingface/hub/models--01-ai--Yi-6B/snapshots/5978aa81cd0fb25852004e7a86c71435b3f8de31/model.safetensors.index.json [INFO|modeling_utils.py:1222] 2023-11-14 02:09:53,501 >> Instantiating YiForCausalLM model under default dtype torch.bfloat16. [INFO|configuration_utils.py:791] 2023-11-14 02:09:53,503 >> Generate config GenerationConfig { "bos_token_id": 1, "eos_token_id": 2, "pad_token_id": 0 }

[2023-11-14 02:10:02,797] [INFO] [config.py:972:print] DeepSpeedEngine configuration: [2023-11-14 02:10:02,797] [INFO] [config.py:976:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2023-11-14 02:10:02,797] [INFO] [config.py:976:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} [2023-11-14 02:10:02,797] [INFO] [config.py:976:print] amp_enabled .................. False [2023-11-14 02:10:02,797] [INFO] [config.py:976:print] amp_params ................... False [2023-11-14 02:10:02,797] [INFO] [config.py:976:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2023-11-14 02:10:02,797] [INFO] [config.py:976:print] bfloat16_enabled ............. True [2023-11-14 02:10:02,797] [INFO] [config.py:976:print] checkpoint_parallel_write_pipeline False [2023-11-14 02:10:02,797] [INFO] [config.py:976:print] checkpoint_tag_validation_enabled True [2023-11-14 02:10:02,797] [INFO] [config.py:976:print] checkpoint_tag_validation_fail False [2023-11-14 02:10:02,797] [INFO] [config.py:976:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f3e8d053e50> [2023-11-14 02:10:02,797] [INFO] [config.py:976:print] communication_data_type ...... None [2023-11-14 02:10:02,797] [INFO] [config.py:976:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2023-11-14 02:10:02,797] [INFO] [config.py:976:print] curriculum_enabled_legacy .... False [2023-11-14 02:10:02,797] [INFO] [config.py:976:print] curriculum_params_legacy ..... False [2023-11-14 02:10:02,797] [INFO] [config.py:976:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2023-11-14 02:10:02,797] [INFO] [config.py:976:print] data_efficiency_enabled ...... False [2023-11-14 02:10:02,797] [INFO] [config.py:976:print] dataloader_drop_last ......... False [2023-11-14 02:10:02,798] [INFO] [config.py:976:print] disable_allgather ............ False [2023-11-14 02:10:02,798] [INFO] [config.py:976:print] dump_state ................... False [2023-11-14 02:10:02,798] [INFO] [config.py:976:print] dynamic_loss_scale_args ...... None [2023-11-14 02:10:02,798] [INFO] [config.py:976:print] eigenvalue_enabled ........... False [2023-11-14 02:10:02,798] [INFO] [config.py:976:print] eigenvalue_gas_boundary_resolution 1 [2023-11-14 02:10:02,798] [INFO] [config.py:976:print] eigenvalue_layer_name ........ bert.encoder.layer [2023-11-14 02:10:02,798] [INFO] [config.py:976:print] eigenvalue_layer_num ......... 0 [2023-11-14 02:10:02,798] [INFO] [config.py:976:print] eigenvalue_max_iter .......... 100 [2023-11-14 02:10:02,798] [INFO] [config.py:976:print] eigenvalue_stability ......... 1e-06 [2023-11-14 02:10:02,798] [INFO] [config.py:976:print] eigenvalue_tol ............... 0.01 [2023-11-14 02:10:02,798] [INFO] [config.py:976:print] eigenvalue_verbose ........... False [2023-11-14 02:10:02,798] [INFO] [config.py:976:print] elasticity_enabled ........... False [2023-11-14 02:10:02,798] [INFO] [config.py:976:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2023-11-14 02:10:02,798] [INFO] [config.py:976:print] fp16_auto_cast ............... None [2023-11-14 02:10:02,798] [INFO] [config.py:976:print] fp16_enabled ................. False [2023-11-14 02:10:02,798] [INFO] [config.py:976:print] fp16_master_weights_and_gradients False [2023-11-14 02:10:02,798] [INFO] [config.py:976:print] global_rank .................. 0 [2023-11-14 02:10:02,798] [INFO] [config.py:976:print] grad_accum_dtype ............. None [2023-11-14 02:10:02,798] [INFO] [config.py:976:print] gradient_accumulation_steps .. 4 [2023-11-14 02:10:02,798] [INFO] [config.py:976:print] gradient_clipping ............ 0.0 [2023-11-14 02:10:02,798] [INFO] [config.py:976:print] gradient_predivide_factor .... 1.0 [2023-11-14 02:10:02,798] [INFO] [config.py:976:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2023-11-14 02:10:02,798] [INFO] [config.py:976:print] initial_dynamic_scale ........ 1 [2023-11-14 02:10:02,798] [INFO] [config.py:976:print] load_universal_checkpoint .... False [2023-11-14 02:10:02,798] [INFO] [config.py:976:print] loss_scale ................... 1.0 [2023-11-14 02:10:02,798] [INFO] [config.py:976:print] memory_breakdown ............. False [2023-11-14 02:10:02,798] [INFO] [config.py:976:print] mics_hierarchial_params_gather False [2023-11-14 02:10:02,798] [INFO] [config.py:976:print] mics_shard_size .............. -1 [2023-11-14 02:10:02,798] [INFO] [config.py:976:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False [2023-11-14 02:10:02,798] [INFO] [config.py:976:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2023-11-14 02:10:02,798] [INFO] [config.py:976:print] optimizer_legacy_fusion ...... False [2023-11-14 02:10:02,798] [INFO] [config.py:976:print] optimizer_name ............... None [2023-11-14 02:10:02,798] [INFO] [config.py:976:print] optimizer_params ............. None [2023-11-14 02:10:02,798] [INFO] [config.py:976:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0} [2023-11-14 02:10:02,798] [INFO] [config.py:976:print] pld_enabled .................. False [2023-11-14 02:10:02,798] [INFO] [config.py:976:print] pld_params ................... False [2023-11-14 02:10:02,799] [INFO] [config.py:976:print] prescale_gradients ........... False [2023-11-14 02:10:02,799] [INFO] [config.py:976:print] scheduler_name ............... None [2023-11-14 02:10:02,799] [INFO] [config.py:976:print] scheduler_params ............. None [2023-11-14 02:10:02,799] [INFO] [config.py:976:print] seq_parallel_communication_data_type torch.float32 [2023-11-14 02:10:02,799] [INFO] [config.py:976:print] sparse_attention ............. None [2023-11-14 02:10:02,799] [INFO] [config.py:976:print] sparse_gradients_enabled ..... False [2023-11-14 02:10:02,799] [INFO] [config.py:976:print] steps_per_print .............. inf [2023-11-14 02:10:02,799] [INFO] [config.py:976:print] train_batch_size ............. 32 [2023-11-14 02:10:02,799] [INFO] [config.py:976:print] train_micro_batch_size_per_gpu 1 [2023-11-14 02:10:02,799] [INFO] [config.py:976:print] use_node_local_storage ....... False [2023-11-14 02:10:02,799] [INFO] [config.py:976:print] wall_clock_breakdown ......... False [2023-11-14 02:10:02,799] [INFO] [config.py:976:print] weight_quantization_config ... None [2023-11-14 02:10:02,799] [INFO] [config.py:976:print] world_size ................... 8 [2023-11-14 02:10:02,799] [INFO] [config.py:976:print] zero_allow_untested_optimizer True [2023-11-14 02:10:02,799] [INFO] [config.py:976:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=True stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True [2023-11-14 02:10:02,799] [INFO] [config.py:976:print] zero_enabled ................. True [2023-11-14 02:10:02,799] [INFO] [config.py:976:print] zero_force_ds_cpu_optimizer .. True [2023-11-14 02:10:02,799] [INFO] [config.py:976:print] zero_optimization_stage ...... 3 [2023-11-14 02:10:02,799] [INFO] [config.py:962:print_user_config] json = { "train_batch_size": 32, "train_micro_batch_size_per_gpu": 1, "gradient_accumulation_steps": 4, "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "none", "nvme_path": null }, "offload_param": { "device": "none", "nvme_path": null }, "stage3_gather_16bit_weights_on_model_save": true }, "steps_per_print": inf, "bf16": { "enabled": true }, "fp16": { "enabled": false }, "zero_allow_untested_optimizer": true } [INFO|trainer.py:1723] 2023-11-14 02:10:02,799 >> Running training [INFO|trainer.py:1724] 2023-11-14 02:10:02,799 >> Num examples = 285,436 [INFO|trainer.py:1725] 2023-11-14 02:10:02,799 >> Num Epochs = 2 [INFO|trainer.py:1726] 2023-11-14 02:10:02,799 >> Instantaneous batch size per device = 1 [INFO|trainer.py:1729] 2023-11-14 02:10:02,799 >> Total train batch size (w. parallel, distributed & accumulation) = 32 [INFO|trainer.py:1730] 2023-11-14 02:10:02,799 >> Gradient Accumulation steps = 4 [INFO|trainer.py:1731] 2023-11-14 02:10:02,799 >> Total optimization steps = 17,840 [INFO|trainer.py:1732] 2023-11-14 02:10:02,801 >> Number of trainable parameters = 6,061,035,520 [INFO|integration_utils.py:718] 2023-11-14 02:10:02,802 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true" wandb: Currently logged in as: developer-team018 (neural-network-018). Use wandb login --relogin to force relogin wandb: Tracking run with wandb version 0.16.0 wandb: Run data is saved locally in /workspace/alignment-handbook/wandb/run-20231114_021003-8xmy6gtd wandb: Run wandb offline to turn off syncing. wandb: Syncing run robust-plasma-26 wandb: ⭐️ View project at https://wandb.ai/neural-network-018/huggingface wandb: 🚀 View run at https://wandb.ai/neural-network-018/huggingface/runs/8xmy6gtd 0%| | 0/17840 [00:00<?, ?it/s][WARNING|tokenization_utils_base.py:3831] 2023-11-14 02:10:23,512 >> Token indices sequence length is longer than the specified maximum sequence length for this model (6114 > 4096). Running this sequence through the model will result in indexing errors {'loss': 1.7024, 'learning_rate': 1.9999999844947046e-05, 'epoch': 0.0}
{'loss': 1.1507, 'learning_rate': 1.999961237011484e-05, 'epoch': 0.01}
{'loss': 1.0928, 'learning_rate': 1.9998449510510744e-05, 'epoch': 0.01}
{'loss': 1.0793, 'learning_rate': 1.999651151133954e-05, 'epoch': 0.02}
{'loss': 1.0867, 'learning_rate': 1.999379852284651e-05, 'epoch': 0.02}
{'loss': 1.0857, 'learning_rate': 1.999031075535873e-05, 'epoch': 0.03}
{'loss': 1.0721, 'learning_rate': 1.9986048479268788e-05, 'epoch': 0.03}
{'loss': 1.0923, 'learning_rate': 1.99810120250138e-05, 'epoch': 0.04}
{'loss': 1.0836, 'learning_rate': 1.9975201783049804e-05, 'epoch': 0.04}
{'loss': 1.0769, 'learning_rate': 1.9968618203821487e-05, 'epoch': 0.05}
{'loss': 1.0574, 'learning_rate': 1.9961261797727256e-05, 'epoch': 0.06}
{'loss': 1.042, 'learning_rate': 1.9953133135079686e-05, 'epoch': 0.06}
{'loss': 1.0554, 'learning_rate': 1.9944232846061284e-05, 'epoch': 0.07}
{'loss': 1.0735, 'learning_rate': 1.993456162067566e-05, 'epoch': 0.07}
{'loss': 1.0785, 'learning_rate': 1.992412020869401e-05, 'epoch': 0.08}
{'loss': 1.0654, 'learning_rate': 1.9912909419596993e-05, 'epoch': 0.08}
{'loss': 1.0606, 'learning_rate': 1.9900930122511993e-05, 'epoch': 0.09}
{'loss': 1.0664, 'learning_rate': 1.988818324614572e-05, 'epoch': 0.1}
{'loss': 1.0604, 'learning_rate': 1.9874669778712215e-05, 'epoch': 0.1}
{'loss': 1.0674, 'learning_rate': 1.9860390767856244e-05, 'epoch': 0.11}
{'loss': 1.042, 'learning_rate': 1.984534732057208e-05, 'epoch': 0.11}
{'loss': 1.0452, 'learning_rate': 1.9829540603117667e-05, 'epoch': 0.12}
{'loss': 1.0577, 'learning_rate': 1.9812971840924222e-05, 'epoch': 0.12}
{'loss': 1.0471, 'learning_rate': 1.979564231850122e-05, 'epoch': 0.13}
{'loss': 1.0704, 'learning_rate': 1.977755337933682e-05, 'epoch': 0.13}
{'loss': 1.0282, 'learning_rate': 1.9758706425793702e-05, 'epoch': 0.14}
{'loss': 1.0515, 'learning_rate': 1.973910291900036e-05, 'epoch': 0.15}
{'loss': 1.0548, 'learning_rate': 1.97187443787378e-05, 'epoch': 0.15}
8%|██▌ | 1368/17840 [1:50:57<19:16:41, 4.21s/it][INFO|trainer.py:3158] 2023-11-14 04:01:02,181 >> Running Evaluation [INFO|trainer.py:3160] 2023-11-14 04:01:02,182 >> Num examples = 500 [INFO|trainer.py:3163] 2023-11-14 04:01:02,182 >> Batch size = 1

0%| | 0/63 [00:00<?, ?it/s] 3%|█▍ | 2/63 [00:00<00:12, 5.02it/s] 5%|██ | 3/63 [00:00<00:12, 4.76it/s] 6%|██▊ | 4/63 [00:00<00:15, 3.89it/s] 8%|███▍ | 5/63 [00:01<00:16, 3.50it/s] 10%|████▏ | 6/63 [00:01<00:17, 3.30it/s] 11%|████▉ | 7/63 [00:01<00:17, 3.20it/s] 13%|█████▌ | 8/63 [00:02<00:17, 3.12it/s]

                                                                         {'eval_loss': 1.0247304439544678, 'eval_runtime': 4.5889, 'eval_samples_per_second': 108.959, 'eval_steps_per_second': 13.729, 'epoch': 0.15}

8%|██▌ | 1368/17840 [1:51:02<19:16:41, 4.21s/it] 14%|██████▎ | 9/63 [00:02<00:17, 3.14it/s] {'loss': 0.9636, 'learning_rate': 1.9697632383321755e-05, 'epoch': 1.0}
{'loss': 0.9026, 'learning_rate': 1.96757685694803e-05, 'epoch': 1.01}
{'loss': 0.8808, 'learning_rate': 1.965315463222695e-05, 'epoch': 1.01}
{'loss': 0.8712, 'learning_rate': 1.9629792324729302e-05, 'epoch': 1.02}
{'loss': 0.8967, 'learning_rate': 1.960568345817306e-05, 'epoch': 1.03}
{'loss': 0.8676, 'learning_rate': 1.9580829901621666e-05, 'epoch': 1.03}
{'loss': 0.8723, 'learning_rate': 1.9555233581871366e-05, 'epoch': 1.04}
{'loss': 0.9122, 'learning_rate': 1.9528896483301866e-05, 'epoch': 1.04}
{'loss': 0.8687, 'learning_rate': 1.9501820647722458e-05, 'epoch': 1.05}
{'loss': 0.8726, 'learning_rate': 1.947400817421375e-05, 'epoch': 1.05}
{'loss': 0.8505, 'learning_rate': 1.944546121896493e-05, 'epoch': 1.06}
{'loss': 0.8458, 'learning_rate': 1.9416181995106585e-05, 'epoch': 1.07}
{'loss': 0.8721, 'learning_rate': 1.9386172772539162e-05, 'epoch': 1.07}
{'loss': 0.8676, 'learning_rate': 1.9355435877756957e-05, 'epoch': 1.08}
{'loss': 0.8826, 'learning_rate': 1.9323973693667762e-05, 'epoch': 1.08}
{'loss': 0.8607, 'learning_rate': 1.929178865940815e-05, 'epoch': 1.09}
{'loss': 0.8561, 'learning_rate': 1.925888327015434e-05, 'epoch': 1.09}
{'loss': 0.8687, 'learning_rate': 1.9225260076928783e-05, 'epoch': 1.1}
{'loss': 0.874, 'learning_rate': 1.919092168640239e-05, 'epoch': 1.1}
{'loss': 0.8563, 'learning_rate': 1.915587076069243e-05, 'epoch': 1.11}
{'loss': 0.8445, 'learning_rate': 1.9120110017156172e-05, 'epoch': 1.12}
{'loss': 0.8646, 'learning_rate': 1.908364222818019e-05, 'epoch': 1.12}
{'loss': 0.8479, 'learning_rate': 1.9046470220965457e-05, 'epoch': 1.13}
{'loss': 0.8788, 'learning_rate': 1.9008596877308157e-05, 'epoch': 1.13}
{'loss': 0.9, 'learning_rate': 1.8970025133376252e-05, 'epoch': 1.14}
{'loss': 0.8791, 'learning_rate': 1.893075797948188e-05, 'epoch': 1.14}
{'loss': 0.9254, 'learning_rate': 1.889079845984951e-05, 'epoch': 1.15}
15%|█████ | 2736/17840 [3:42:25<17:42:31, 4.22s/it][INFO|trainer.py:3158] 2023-11-14 05:52:30,316 >> Running Evaluation [INFO|trainer.py:3160] 2023-11-14 05:52:30,317 >> Num examples = 500 [INFO|trainer.py:3163] 2023-11-14 05:52:30,317 >> Batch size = 1

0%| | 0/63 [00:00<?, ?it/s] 3%|█▍ | 2/63 [00:00<00:10, 6.07it/s] 5%|██ | 3/63 [00:00<00:14, 4.20it/s] 6%|██▊ | 4/63 [00:01<00:16, 3.63it/s] 8%|███▍ | 5/63 [00:01<00:17, 3.37it/s] 10%|████▏ | 6/63 [00:01<00:17, 3.23it/s] 11%|████▉ | 7/63 [00:02<00:17, 3.16it/s] 13%|█████▌ | 8/63 [00:02<00:17, 3.06it/s]

{'eval_loss': 1.0676991939544678, 'eval_runtime': 4.5191, 'eval_samples_per_second': 110.641, 'eval_steps_per_second': 13.941, 'epoch': 1.15} 15%|█████ | 2736/17840 [3:42:30<17:42:31, 4.22s/it] 14%|██████▎ | 9/63 [00:02<00:17, 3.09it/s] [INFO|trainer.py:1955] 2023-11-14 05:52:34,837 >>

Training completed. Do not forget to share your model on huggingface.co/models =)

{'train_runtime': 13352.0365, 'train_samples_per_second': 42.755, 'train_steps_per_second': 1.336, 'train_loss': 0.9719247023264567, 'epoch': 1.15} 15%|█████ | 2736/17840 [3:42:30<20:28:20, 4.88s/it] train metrics epoch = 1.15 train_loss = 0.9719 train_runtime = 3:42:32.03 train_samples = 285436 train_samples_per_second = 42.755 train_steps_per_second = 1.336 2023-11-14 05:52:34 - INFO - main - Evaluate [INFO|trainer.py:3158] 2023-11-14 05:52:34,843 >> Running Evaluation [INFO|trainer.py:3160] 2023-11-14 05:52:34,843 >> Num examples = 500 [INFO|trainer.py:3163] 2023-11-14 05:52:34,844 >> Batch size = 1 14%|██████▎ | 9/63 [00:02<00:16, 3.23it/s] eval metrics epoch = 1.15 eval_loss = 1.0677 eval_runtime = 0:00:04.48 eval_samples = 500 eval_samples_per_second = 111.451 eval_steps_per_second = 14.043 2023-11-14 05:52:39 - INFO - main - Save model [INFO|trainer.py:2881] 2023-11-14 05:52:43,590 >> Saving model checkpoint to data/apt-chat-yi-6B-sft-full [INFO|configuration_utils.py:461] 2023-11-14 05:52:43,592 >> Configuration saved in data/apt-chat-yi-6B-sft-full/config.json [INFO|configuration_utils.py:564] 2023-11-14 05:52:43,592 >> Configuration saved in data/apt-chat-yi-6B-sft-full/generation_config.json [INFO|modeling_utils.py:2201] 2023-11-14 05:52:51,334 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 3 checkpoint shards. You can find where each parameters has been saved in the index located at data/apt-chat-yi-6B-sft-full/model.safetensors.index.json. [INFO|tokenization_utils_base.py:2428] 2023-11-14 05:52:51,336 >> tokenizer config file saved in data/apt-chat-yi-6B-sft-full/tokenizer_config.json [INFO|tokenization_utils_base.py:2437] 2023-11-14 05:52:51,337 >> Special tokens file saved in data/apt-chat-yi-6B-sft-full/special_tokens_map.json [INFO|trainer.py:2881] 2023-11-14 05:52:55,599 >> Saving model checkpoint to data/apt-chat-yi-6B-sft-full [INFO|configuration_utils.py:461] 2023-11-14 05:52:55,601 >> Configuration saved in data/apt-chat-yi-6B-sft-full/config.json [INFO|configuration_utils.py:564] 2023-11-14 05:52:55,601 >> Configuration saved in data/apt-chat-yi-6B-sft-full/generation_config.json [INFO|modeling_utils.py:2201] 2023-11-14 05:53:06,302 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 3 checkpoint shards. You can find where each parameters has been saved in the index located at data/apt-chat-yi-6B-sft-full/model.safetensors.index.json. [INFO|tokenization_utils_base.py:2428] 2023-11-14 05:53:06,303 >> tokenizer config file saved in data/apt-chat-yi-6B-sft-full/tokenizer_config.json [INFO|tokenization_utils_base.py:2437] 2023-11-14 05:53:06,304 >> Special tokens file saved in data/apt-chat-yi-6B-sft-full/special_tokens_map.json 2023-11-14 05:55:20 - INFO - main - Model saved to data/apt-chat-yi-6B-sft-full [INFO|modelcard.py:452] 2023-11-14 05:55:21,054 >> Dropping the following result as it does not have all the necessary fields: {'dataset': {'name': 'communityai/apt-chat-micro-dataset-llm-v2-714k', 'type': 'communityai/apt-chat-micro-dataset-llm-v2-714k'}} [INFO|configuration_utils.py:461] 2023-11-14 05:55:21,057 >> Configuration saved in data/apt-chat-yi-6B-sft-full/config.json 2023-11-14 05:55:21 - INFO - main - Pushing to hub...

huggingface / alignment-handbook

Training Interruptions and Epoch Skipping with 6 Billion Parameter Model on 8 A100 GPUs #24

Model arguments

Data training arguments

SFT trainer config