huggingface / autotrain-advanced

🤗 AutoTrain Advanced
https://huggingface.co/autotrain
Apache License 2.0

[BUG] Error While Trying to Start the Training #595

Closed — pjahoorkar closed this issue 2 months ago

pjahoorkar commented 2 months ago

I get the following error while trying to train the Llama 3 model. I'd appreciate any thoughts. Thanks.

Prerequisites

Backend

Hugging Face Space/Endpoints

Interface Used

UI

CLI Command

No response

UI Screenshots & Parameters

(Screenshots attached: Screenshot 2024-04-23 121251, Screenshot 2024-04-23 121511)

Error Logs

Device 0: NVIDIA A10G - 307.6MiB/22.49GiB


INFO | 2024-04-23 11:12:23 | autotrain.app:handle_form:454 - hardware: Local

INFO | 2024-04-23 11:11:16 | autotrain.app:fetch_params:212 - Task: llm:sft

INFO | 2024-04-23 11:10:40 | autotrain.app::154 - AutoTrain started successfully

WARNING | 2024-04-23 11:10:39 | autotrain.trainers.common:init:170 - Parameters not supplied by user and set to default: model, warmup_ratio, optimizer, scheduler, push_to_hub, tags_column, weight_decay, save_strategy, token, repo_id, batch_size, max_grad_norm, data_path, max_seq_length, seed, save_total_limit, username, gradient_accumulation, logging_steps, lr, train_split, tokens_column, valid_split, evaluation_strategy, epochs, auto_find_batch_size, project_name

WARNING | 2024-04-23 11:10:39 | autotrain.trainers.common:init:170 - Parameters not supplied by user and set to default: adam_beta2, warmup_steps, scheduler, class_image_path, adam_epsilon, checkpoints_total_limit, revision, text_encoder_use_attention_mask, image_path, seed, prior_preservation, xl, adam_beta1, prior_loss_weight, validation_images, prior_generation_precision, tokenizer_max_length, model, logging, push_to_hub, rank, center_crop, allow_tf32, local_rank, num_validation_images, token, validation_prompt, repo_id, scale_lr, checkpointing_steps, sample_batch_size, class_labels_conditioning, class_prompt, max_grad_norm, adam_weight_decay, num_class_images, username, tokenizer, resume_from_checkpoint, lr_power, num_cycles, pre_compute_text_embeddings, validation_epochs, epochs, dataloader_num_workers, project_name

WARNING | 2024-04-23 11:10:39 | autotrain.trainers.common:init:170 - Parameters not supplied by user and set to default: model, push_to_hub, task, numerical_columns, num_trials, token, repo_id, id_column, data_path, time_limit, seed, username, train_split, valid_split, categorical_columns, target_columns, project_name

WARNING | 2024-04-23 11:10:39 | autotrain.trainers.common:init:170 - Parameters not supplied by user and set to default: scheduler, lora_alpha, lora_dropout, max_target_length, target_column, text_column, data_path, seed, save_total_limit, peft, gradient_accumulation, model, warmup_ratio, optimizer, push_to_hub, weight_decay, lora_r, token, repo_id, batch_size, max_grad_norm, quantization, max_seq_length, username, logging_steps, lr, train_split, valid_split, evaluation_strategy, epochs, auto_find_batch_size, project_name

WARNING | 2024-04-23 11:10:39 | autotrain.trainers.common:init:170 - Parameters not supplied by user and set to default: model, warmup_ratio, optimizer, scheduler, push_to_hub, image_column, weight_decay, save_strategy, token, target_column, repo_id, batch_size, max_grad_norm, data_path, seed, save_total_limit, username, gradient_accumulation, logging_steps, lr, train_split, valid_split, evaluation_strategy, epochs, auto_find_batch_size, project_name

WARNING | 2024-04-23 11:10:39 | autotrain.trainers.common:init:170 - Parameters not supplied by user and set to default: model, warmup_ratio, optimizer, scheduler, push_to_hub, weight_decay, save_strategy, token, target_column, repo_id, text_column, batch_size, max_grad_norm, data_path, max_seq_length, seed, save_total_limit, username, gradient_accumulation, logging_steps, lr, train_split, valid_split, evaluation_strategy, epochs, auto_find_batch_size, project_name

WARNING | 2024-04-23 11:10:39 | autotrain.trainers.common:init:170 - Parameters not supplied by user and set to default: trainer, scheduler, use_flash_attention_2, lora_alpha, lora_dropout, merge_adapter, model_ref, text_column, data_path, dpo_beta, add_eos_token, seed, save_total_limit, prompt_text_column, gradient_accumulation, model, warmup_ratio, optimizer, push_to_hub, model_max_length, weight_decay, lora_r, token, repo_id, disable_gradient_checkpointing, rejected_text_column, batch_size, max_grad_norm, username, logging_steps, evaluation_strategy, train_split, valid_split, lr, max_prompt_length, auto_find_batch_size, project_name

INFO | 2024-04-23 11:10:39 | autotrain.app::31 - Starting AutoTrain...

Your installed package nvidia-ml-py is corrupted. Skip patch functions nvmlDeviceGetMemoryInfo. You may get incorrect or incomplete results. Please consider reinstall package nvidia-ml-py via pip3 install --force-reinstall nvidia-ml-py nvitop.

Your installed package nvidia-ml-py is corrupted. Skip patch functions nvmlDeviceGet{Compute,Graphics,MPSCompute}RunningProcesses. You may get incorrect or incomplete results. Please consider reinstall package nvidia-ml-py via pip3 install --force-reinstall nvidia-ml-py nvitop.
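The two nvidia-ml-py warnings above concern the GPU monitoring bindings, not the training job itself. A minimal sketch of how one might verify the bindings after running the reinstall command suggested in the warning (`pip3 install --force-reinstall nvidia-ml-py nvitop`); the check below is an illustration using the standard pynvml interface, not part of AutoTrain:

```python
# Run the suggested fix from the warning first:
#   pip3 install --force-reinstall nvidia-ml-py nvitop
# Then confirm the NVML bindings work again. pynvml is the module shipped by
# nvidia-ml-py, and nvmlDeviceGetMemoryInfo is the call the warning says gets skipped.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # Device 0 = the A10G from the logs
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)    # should no longer be skipped
print(f"GPU 0 memory: {mem.used / 2**20:.1f} MiB used / {mem.total / 2**20:.1f} MiB total")
pynvml.nvmlShutdown()
```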

Additional Information

No response
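For context, the "Parameters not supplied by user and set to default" warnings in the logs are emitted once per task type at startup and only report that unset fields fall back to their defaults. As a rough sketch, a few of the llm:sft fields named in the last of those warnings could be set explicitly through AutoTrain's parameter class; the import path and the example values here are assumptions, only the field names come from the warning:

```python
# Hypothetical sketch: setting a few of the llm:sft fields that the startup warning
# otherwise reports as "set to default". The LLMTrainingParams import path is assumed
# from the autotrain-advanced source layout; verify it against the installed version.
from autotrain.trainers.clm.params import LLMTrainingParams

params = LLMTrainingParams(
    model="meta-llama/Meta-Llama-3-8B",  # example model id, not taken from the issue
    project_name="llama3-sft-demo",      # hypothetical project name
    data_path="my-dataset",              # hypothetical dataset path
    text_column="text",
    trainer="sft",                       # matches the "Task: llm:sft" line in the logs
    lr=2e-4,
    batch_size=2,
    lora_r=16,
    lora_alpha=32,
)
print(params)
```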

abhishekkrthakur commented 2 months ago

What's the error?

abhishekkrthakur commented 2 months ago

Closing the issue since there is no error in the logs.