huggingface / autotrain-advanced

πŸ€— AutoTrain Advanced
https://huggingface.co/autotrain

Autotrain CLI suddenly crashes #643

Open dejankocic opened 2 weeks ago

dejankocic commented 2 weeks ago

Prerequisites

Backend

Local

Interface Used

CLI

CLI Command

autotrain --config configs/llm_finetuning/llama3-70b-sft.yml

UI Screenshots & Parameters

No response

Error Logs

INFO | 2024-05-16 10:42:17 | autotrain.cli.autotrain:main:52 - Using AutoTrain configuration: configs/llm_finetuning/llama3-70b-sft.yml
INFO | 2024-05-16 10:42:17 | autotrain.parser:__post_init__:92 - Running task: lm_training
INFO | 2024-05-16 10:42:17 | autotrain.parser:__post_init__:93 - Using backend: local
WARNING | 2024-05-16 10:42:17 | autotrain.trainers.common:__init__:174 - Parameters not supplied by user and set to default: logging_steps, max_grad_norm, model_ref, evaluation_strategy, lora_dropout, max_completion_length, lora_r, lora_alpha, use_flash_attention_2, disable_gradient_checkpointing, max_prompt_length, warmup_ratio, dpo_beta, auto_find_batch_size, weight_decay, save_total_limit, merge_adapter, prompt_text_column, rejected_text_column, seed, add_eos_token
INFO | 2024-05-16 10:42:17 | autotrain.parser:run:144 - {'model': 'meta-llama/Meta-Llama-3-70B-Instruct', 'project_name': 'autotrain-llama3-70b-math-v1-TEST', 'data_path': 'dejankocic/HelloWorldDataSet', 'train_split': 'train', 'valid_split': None, 'add_eos_token': True, 'block_size': 1024, 'model_max_length': 4096, 'padding': 'right', 'trainer': 'sft', 'use_flash_attention_2': False, 'log': 'tensorboard', 'disable_gradient_checkpointing': False, 'logging_steps': -1, 'evaluation_strategy': 'epoch', 'save_total_limit': 1, 'auto_find_batch_size': False, 'mixed_precision': 'bf16', 'lr': 1e-05, 'epochs': 2, 'batch_size': 1, 'warmup_ratio': 0.1, 'gradient_accumulation': 8, 'optimizer': 'paged_adamw_8bit', 'scheduler': 'cosine', 'weight_decay': 0.0, 'max_grad_norm': 1.0, 'seed': 42, 'chat_template': None, 'quantization': None, 'target_modules': 'all-linear', 'merge_adapter': False, 'peft': True, 'lora_r': 16, 'lora_alpha': 32, 'lora_dropout': 0.05, 'model_ref': None, 'dpo_beta': 0.1, 'max_prompt_length': 128, 'max_completion_length': None, 'prompt_text_column': None, 'text_column': 'text', 'rejected_text_column': None, 'push_to_hub': True, 'username': 'dejankocic', 'token': ''}
INFO | 2024-05-16 10:42:17 | autotrain.backends.local:create:8 - Starting local training...
INFO | 2024-05-16 10:42:17 | autotrain.commands:launch_command:349 - ['accelerate', 'launch', '--num_machines', '1', '--num_processes', '1', '--mixed_precision', 'bf16', '-m', 'autotrain.trainers.clm', '--training_config', 'autotrain-llama3-70b-math-v1-TEST/training_params.json']
INFO | 2024-05-16 10:42:17 | autotrain.commands:launch_command:350 - {'model': 'meta-llama/Meta-Llama-3-70B-Instruct', 'project_name': 'autotrain-llama3-70b-math-v1-TEST', 'data_path': 'dejankocic/HelloWorldDataSet', 'train_split': 'train', 'valid_split': None, 'add_eos_token': True, 'block_size': 1024, 'model_max_length': 4096, 'padding': 'right', 'trainer': 'sft', 'use_flash_attention_2': False, 'log': 'tensorboard', 'disable_gradient_checkpointing': False, 'logging_steps': -1, 'evaluation_strategy': 'epoch', 'save_total_limit': 1, 'auto_find_batch_size': False, 'mixed_precision': 'bf16', 'lr': 1e-05, 'epochs': 2, 'batch_size': 1, 'warmup_ratio': 0.1, 'gradient_accumulation': 8, 'optimizer': 'paged_adamw_8bit', 'scheduler': 'cosine', 'weight_decay': 0.0, 'max_grad_norm': 1.0, 'seed': 42, 'chat_template': None, 'quantization': None, 'target_modules': 'all-linear', 'merge_adapter': False, 'peft': True, 'lora_r': 16, 'lora_alpha': 32, 'lora_dropout': 0.05, 'model_ref': None, 'dpo_beta': 0.1, 'max_prompt_length': 128, 'max_completion_length': None, 'prompt_text_column': None, 'text_column': 'text', 'rejected_text_column': None, 'push_to_hub': True, 'username': 'dejankocic', 'token': ''}
The following values were not passed to accelerate launch and had defaults used instead:
    --dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
INFO | 2024-05-16 10:42:21 | autotrain.trainers.clm.train_clm_sft:train:14 - Starting SFT training...
INFO | 2024-05-16 10:42:25 | autotrain.trainers.clm.utils:process_input_data:352 - Train data: Dataset({ features: ['text'], num_rows: 321 })
INFO | 2024-05-16 10:42:25 | autotrain.trainers.clm.utils:process_input_data:353 - Valid data: None
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO | 2024-05-16 10:42:26 | autotrain.trainers.clm.utils:configure_logging_steps:423 - configuring logging steps
INFO | 2024-05-16 10:42:26 | autotrain.trainers.clm.utils:configure_logging_steps:436 - Logging steps: 25
INFO | 2024-05-16 10:42:26 | autotrain.trainers.clm.utils:configure_training_args:441 - configuring training args
INFO | 2024-05-16 10:42:26 | autotrain.trainers.clm.utils:configure_block_size:504 - Using block size 1024
INFO | 2024-05-16 10:42:26 | autotrain.trainers.clm.train_clm_sft:train:27 - loading model config...
INFO | 2024-05-16 10:42:26 | autotrain.trainers.clm.train_clm_sft:train:35 - loading model...
Loading checkpoint shards:  27%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–        | 8/30 [00:48<02:59, 8.14s/it]
Traceback (most recent call last):
  File "/home/dejan/python39venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/dejan/python39venv/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/home/dejan/python39venv/lib/python3.9/site-packages/accelerate/commands/launch.py", line 1075, in launch_command
    simple_launcher(args)
  File "/home/dejan/python39venv/lib/python3.9/site-packages/accelerate/commands/launch.py", line 681, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/dejan/python39venv/bin/python3.9', '-m', 'autotrain.trainers.clm', '--training_config', 'autotrain-llama3-70b-math-v1-TEST/training_params.json']' died with <Signals.SIGKILL: 9>.
INFO | 2024-05-16 10:43:27 | autotrain.parser:run:149 - Job ID: 398903

Additional Information

The workstation I am running the script on has 128GB of RAM, so I don't think that is the problem. BTW, RAM utilization goes up to about 60GB.

abhishekkrthakur commented 2 weeks ago

it seems like you are using a single gpu. the config that you are using was tested on 8xH100.
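For reference, a quick way to confirm what hardware the launcher actually sees is to query PyTorch from the same environment. This is a minimal sketch (assuming PyTorch with CUDA support is installed in the venv used by autotrain), not part of the original report:

```python
import torch

# Show whether CUDA is usable and how many devices accelerate/autotrain can see.
print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs:", torch.cuda.device_count())

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB VRAM")
```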

dejankocic commented 2 weeks ago

Indeed, it is a single-GPU machine with an RTX 2080 Ti. Is there any available configuration I could change so it runs on this hardware?

hichambht32 commented 1 week ago

I don't think so. Try checking a memory usage estimation tool like this one to find out how much VRAM you actually need: usagegpu
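As a rough illustration of the numbers such a tool would report, here is a back-of-the-envelope estimate for the weights alone (a sketch with approximate figures; it ignores optimizer states, gradients, activations, and is not output from any official estimator):

```python
# Approximate VRAM needed just to hold Meta-Llama-3-70B weights.
num_params = 70e9  # ~70 billion parameters (approximate)
bytes_per_param = {"bf16": 2, "int8": 1, "int4": 0.5}

for dtype, nbytes in bytes_per_param.items():
    gib = num_params * nbytes / 1024**3
    print(f"{dtype}: ~{gib:.0f} GiB for the weights alone")

# bf16: ~130 GiB, int8: ~65 GiB, int4: ~33 GiB -- all far beyond the
# 11 GB of a single RTX 2080 Ti.
```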