dejankocic opened 2 weeks ago
It seems like you are using a single GPU. The config you are using was tested on 8xH100.
Indeed, it is a single-GPU machine with an RTX 2080 Ti. Is there any configuration I could change so it runs on this hardware?
I don't think so. Try checking a memory-usage estimation tool like this one to find out how much VRAM you actually need.
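As a rough sanity check (a back-of-envelope sketch under the assumption of ~70e9 parameters, not a substitute for the estimation tools mentioned above), the weights alone for a 70B model already rule out a single consumer GPU:

```python
# Back-of-envelope VRAM estimate for model weights alone.
# Optimizer state, gradients, activations and KV cache all add more on top.

def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in GiB."""
    return num_params * bytes_per_param / 1024**3

params = 70e9  # approximate parameter count of Meta-Llama-3-70B

for name, bytes_pp in [("bf16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{name}: {weight_memory_gb(params, bytes_pp):.0f} GB")

# bf16 weights alone are ~130 GB; an RTX 2080 Ti has 11 GB of VRAM,
# so even 4-bit quantization (~33 GB) does not fit on this card.
```

Even before counting optimizer state or activations, no quantization level brings the weights under the 11 GB available on a 2080 Ti.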
Prerequisites
Backend
Local
Interface Used
CLI
CLI Command
autotrain --config configs/llm_finetuning/llama3-70b-sft.yml
UI Screenshots & Parameters
No response
Error Logs
INFO | 2024-05-16 10:42:17 | autotrain.cli.autotrain:main:52 - Using AutoTrain configuration: configs/llm_finetuning/llama3-70b-sft.yml
INFO | 2024-05-16 10:42:17 | autotrain.parser:__post_init__:92 - Running task: lm_training
INFO | 2024-05-16 10:42:17 | autotrain.parser:__post_init__:93 - Using backend: local
WARNING | 2024-05-16 10:42:17 | autotrain.trainers.common:__init__:174 - Parameters not supplied by user and set to default: logging_steps, max_grad_norm, model_ref, evaluation_strategy, lora_dropout, max_completion_length, lora_r, lora_alpha, use_flash_attention_2, disable_gradient_checkpointing, max_prompt_length, warmup_ratio, dpo_beta, auto_find_batch_size, weight_decay, save_total_limit, merge_adapter, prompt_text_column, rejected_text_column, seed, add_eos_token
INFO | 2024-05-16 10:42:17 | autotrain.parser:run:144 - {'model': 'meta-llama/Meta-Llama-3-70B-Instruct', 'project_name': 'autotrain-llama3-70b-math-v1-TEST', 'data_path': 'dejankocic/HelloWorldDataSet', 'train_split': 'train', 'valid_split': None, 'add_eos_token': True, 'block_size': 1024, 'model_max_length': 4096, 'padding': 'right', 'trainer': 'sft', 'use_flash_attention_2': False, 'log': 'tensorboard', 'disable_gradient_checkpointing': False, 'logging_steps': -1, 'evaluation_strategy': 'epoch', 'save_total_limit': 1, 'auto_find_batch_size': False, 'mixed_precision': 'bf16', 'lr': 1e-05, 'epochs': 2, 'batch_size': 1, 'warmup_ratio': 0.1, 'gradient_accumulation': 8, 'optimizer': 'paged_adamw_8bit', 'scheduler': 'cosine', 'weight_decay': 0.0, 'max_grad_norm': 1.0, 'seed': 42, 'chat_template': None, 'quantization': None, 'target_modules': 'all-linear', 'merge_adapter': False, 'peft': True, 'lora_r': 16, 'lora_alpha': 32, 'lora_dropout': 0.05, 'model_ref': None, 'dpo_beta': 0.1, 'max_prompt_length': 128, 'max_completion_length': None, 'prompt_text_column': None, 'text_column': 'text', 'rejected_text_column': None, 'push_to_hub': True, 'username': 'dejankocic', 'token': ''}
INFO | 2024-05-16 10:42:17 | autotrain.backends.local:create:8 - Starting local training...
INFO | 2024-05-16 10:42:17 | autotrain.commands:launch_command:349 - ['accelerate', 'launch', '--num_machines', '1', '--num_processes', '1', '--mixed_precision', 'bf16', '-m', 'autotrain.trainers.clm', '--training_config', 'autotrain-llama3-70b-math-v1-TEST/training_params.json']
INFO | 2024-05-16 10:42:17 | autotrain.commands:launch_command:350 - {'model': 'meta-llama/Meta-Llama-3-70B-Instruct', 'project_name': 'autotrain-llama3-70b-math-v1-TEST', 'data_path': 'dejankocic/HelloWorldDataSet', 'train_split': 'train', 'valid_split': None, 'add_eos_token': True, 'block_size': 1024, 'model_max_length': 4096, 'padding': 'right', 'trainer': 'sft', 'use_flash_attention_2': False, 'log': 'tensorboard', 'disable_gradient_checkpointing': False, 'logging_steps': -1, 'evaluation_strategy': 'epoch', 'save_total_limit': 1, 'auto_find_batch_size': False, 'mixed_precision': 'bf16', 'lr': 1e-05, 'epochs': 2, 'batch_size': 1, 'warmup_ratio': 0.1, 'gradient_accumulation': 8, 'optimizer': 'paged_adamw_8bit', 'scheduler': 'cosine', 'weight_decay': 0.0, 'max_grad_norm': 1.0, 'seed': 42, 'chat_template': None, 'quantization': None, 'target_modules': 'all-linear', 'merge_adapter': False, 'peft': True, 'lora_r': 16, 'lora_alpha': 32, 'lora_dropout': 0.05, 'model_ref': None, 'dpo_beta': 0.1, 'max_prompt_length': 128, 'max_completion_length': None, 'prompt_text_column': None, 'text_column': 'text', 'rejected_text_column': None, 'push_to_hub': True, 'username': 'dejankocic', 'token': ''}
The following values were not passed to `accelerate launch` and had defaults used instead:
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
INFO | 2024-05-16 10:42:21 | autotrain.trainers.clm.train_clm_sft:train:14 - Starting SFT training...
INFO | 2024-05-16 10:42:25 | autotrain.trainers.clm.utils:process_input_data:352 - Train data: Dataset({ features: ['text'], num_rows: 321 })
INFO | 2024-05-16 10:42:25 | autotrain.trainers.clm.utils:process_input_data:353 - Valid data: None
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO | 2024-05-16 10:42:26 | autotrain.trainers.clm.utils:configure_logging_steps:423 - configuring logging steps
INFO | 2024-05-16 10:42:26 | autotrain.trainers.clm.utils:configure_logging_steps:436 - Logging steps: 25
INFO | 2024-05-16 10:42:26 | autotrain.trainers.clm.utils:configure_training_args:441 - configuring training args
INFO | 2024-05-16 10:42:26 | autotrain.trainers.clm.utils:configure_block_size:504 - Using block size 1024
INFO | 2024-05-16 10:42:26 | autotrain.trainers.clm.train_clm_sft:train:27 - loading model config...
INFO | 2024-05-16 10:42:26 | autotrain.trainers.clm.train_clm_sft:train:35 - loading model...
Loading checkpoint shards:  27%|███       | 8/30 [00:48<02:59, 8.14s/it]
Traceback (most recent call last):
  File "/home/dejan/python39venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/dejan/python39venv/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/home/dejan/python39venv/lib/python3.9/site-packages/accelerate/commands/launch.py", line 1075, in launch_command
    simple_launcher(args)
  File "/home/dejan/python39venv/lib/python3.9/site-packages/accelerate/commands/launch.py", line 681, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/dejan/python39venv/bin/python3.9', '-m', 'autotrain.trainers.clm', '--training_config', 'autotrain-llama3-70b-math-v1-TEST/training_params.json']' died with <Signals.SIGKILL: 9>.
INFO | 2024-05-16 10:43:27 | autotrain.parser:run:149 - Job ID: 398903
Additional Information
The workstation I am running the script on has 128GB of RAM, so I don't think this is the case. BTW, RAM utilization goes up to 60GB.
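Note that the traceback shows the child process died with SIGKILL (signal 9) partway through loading checkpoint shards; on Linux that often means the kernel OOM killer stepped in. A quick arithmetic check (a hedged sketch assuming ~70e9 parameters and not counting loader overhead) suggests 128 GB of system RAM is itself marginal for staging bf16 weights:

```python
# Rough CPU-RAM sanity check for loading Meta-Llama-3-70B in bf16.
# Assumption: ~70e9 parameters at 2 bytes each; temporary buffers used
# while loading shards are not included, so the real peak is higher.

params = 70e9
bf16_bytes = 2
weights_gb = params * bf16_bytes / 1024**3  # ~130 GB

system_ram_gb = 128  # the workstation described in this issue
print(f"bf16 weights need ~{weights_gb:.0f} GB; machine has {system_ram_gb} GB")

# The full set of weights slightly exceeds total system RAM, so the
# kernel killing the loader mid-shard (SIGKILL) would be consistent
# with running out of memory even before the GPU is involved.
```

This would explain why RAM utilization climbs steadily (60 GB at shard 8 of 30) before the process is killed.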