huggingface / autotrain-advanced

🤗 AutoTrain Advanced
https://huggingface.co/autotrain
Apache License 2.0

start train failed #629

Closed houwenchen closed 2 months ago

houwenchen commented 2 months ago

log info (most recent entries first):

FATAL ERROR: NVIDIA Management Library (NVML) not found.

HINT: The NVIDIA Management Library ships with the NVIDIA display driver (available at https://www.nvidia.com/Download/index.aspx), or can be downloaded as part of the NVIDIA CUDA Toolkit (available at https://developer.nvidia.com/cuda-downloads).

INFO | 2024-05-08 03:59:03 | autotrain.app:handle_form:464 - hardware: local-ui

INFO | 2024-05-08 03:47:01 | autotrain.app:fetch_params:215 - Task: llm:sft

INFO | 2024-05-08 03:23:50 | autotrain.app::157 - AutoTrain started successfully

WARNING | 2024-05-08 03:23:50 | autotrain.trainers.common:init:174 - Parameters not supplied by user and set to default: warmup_ratio, auto_find_batch_size, logging_steps, data_path, gradient_accumulation, max_grad_norm, seed, model, push_to_hub, batch_size, target_column, project_name, text_column, weight_decay, epochs, max_seq_length, valid_split, train_split, optimizer, evaluation_strategy, username, save_total_limit, token, scheduler, lr

WARNING | 2024-05-08 03:23:50 | autotrain.trainers.common:init:174 - Parameters not supplied by user and set to default: warmup_ratio, auto_find_batch_size, logging_steps, data_path, gradient_accumulation, max_grad_norm, seed, model, push_to_hub, batch_size, project_name, weight_decay, epochs, max_seq_length, tokens_column, valid_split, train_split, optimizer, evaluation_strategy, username, save_total_limit, token, tags_column, scheduler, lr

WARNING | 2024-05-08 03:23:50 | autotrain.trainers.common:init:174 - Parameters not supplied by user and set to default: tokenizer, image_path, sample_batch_size, model, checkpointing_steps, scale_lr, project_name, validation_epochs, class_labels_conditioning, epochs, prior_preservation, resume_from_checkpoint, center_crop, text_encoder_use_attention_mask, num_validation_images, rank, lr_power, num_cycles, token, class_prompt, scheduler, class_image_path, allow_tf32, pre_compute_text_embeddings, adam_beta1, max_grad_norm, seed, push_to_hub, validation_prompt, tokenizer_max_length, xl, checkpoints_total_limit, dataloader_num_workers, adam_weight_decay, prior_loss_weight, local_rank, num_class_images, warmup_steps, adam_beta2, username, revision, logging, prior_generation_precision, adam_epsilon, validation_images

WARNING | 2024-05-08 03:23:50 | autotrain.trainers.common:init:174 - Parameters not supplied by user and set to default: categorical_columns, data_path, seed, model, push_to_hub, project_name, target_columns, numerical_columns, id_column, train_split, task, num_trials, username, time_limit, token, valid_split

WARNING | 2024-05-08 03:23:50 | autotrain.trainers.common:init:174 - Parameters not supplied by user and set to default: peft, model, batch_size, target_column, project_name, lora_dropout, epochs, optimizer, evaluation_strategy, save_total_limit, token, max_target_length, scheduler, lr, warmup_ratio, auto_find_batch_size, logging_steps, gradient_accumulation, max_grad_norm, seed, push_to_hub, text_column, weight_decay, max_seq_length, lora_alpha, train_split, quantization, username, lora_r, data_path, valid_split

WARNING | 2024-05-08 03:23:50 | autotrain.trainers.common:init:174 - Parameters not supplied by user and set to default: warmup_ratio, auto_find_batch_size, logging_steps, data_path, gradient_accumulation, max_grad_norm, seed, model, push_to_hub, batch_size, target_column, project_name, weight_decay, image_column, epochs, valid_split, train_split, optimizer, evaluation_strategy, username, save_total_limit, token, scheduler, lr

WARNING | 2024-05-08 03:23:50 | autotrain.trainers.common:init:174 - Parameters not supplied by user and set to default: warmup_ratio, auto_find_batch_size, logging_steps, data_path, gradient_accumulation, max_grad_norm, seed, model, push_to_hub, batch_size, target_column, project_name, text_column, weight_decay, epochs, max_seq_length, valid_split, train_split, optimizer, evaluation_strategy, username, save_total_limit, token, scheduler, lr

WARNING | 2024-05-08 03:23:50 | autotrain.trainers.common:init:174 - Parameters not supplied by user and set to default: use_flash_attention_2, add_eos_token, model, max_prompt_length, batch_size, project_name, lora_dropout, merge_adapter, prompt_text_column, valid_split, optimizer, evaluation_strategy, save_total_limit, token, scheduler, lr, warmup_ratio, auto_find_batch_size, logging_steps, gradient_accumulation, max_grad_norm, seed, push_to_hub, weight_decay, text_column, disable_gradient_checkpointing, rejected_text_column, model_max_length, model_ref, trainer, train_split, lora_alpha, dpo_beta, username, lora_r, data_path

INFO | 2024-05-08 03:23:50 | autotrain.app::33 - Starting AutoTrain...

Your installed package nvidia-ml-py is corrupted. Skip patch functions nvmlDeviceGetMemoryInfo. You may get incorrect or incomplete results. Please consider reinstall package nvidia-ml-py via pip3 install --force-reinstall nvidia-ml-py nvitop.

Your installed package nvidia-ml-py is corrupted. Skip patch functions nvmlDeviceGet{Compute,Graphics,MPSCompute}RunningProcesses. You may get incorrect or incomplete results. Please consider reinstall package nvidia-ml-py via pip3 install --force-reinstall nvidia-ml-py nvitop.

project: https://huggingface.co/spaces/whou/FTModel
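
The fatal error in the log above says that NVML could not be found, which typically happens when no NVIDIA driver is visible to the process (for example, on a CPU-only machine or container). The following is a minimal diagnostic sketch, not part of AutoTrain. It assumes Python 3 and the library names the NVIDIA driver usually installs (`libnvidia-ml.so.1` on Linux, `nvml.dll` on Windows). The helper names `nvml_library_loads` and `diagnose` are hypothetical.

```python
import ctypes
import shutil
import sys


def nvml_library_loads() -> bool:
    """Return True if the NVML shared library can be loaded by the dynamic linker."""
    # Assumed library names: the ones the NVIDIA driver normally installs.
    name = "nvml.dll" if sys.platform.startswith("win") else "libnvidia-ml.so.1"
    try:
        ctypes.CDLL(name)
    except OSError:
        # Library missing or not loadable: consistent with "NVML not found".
        return False
    return True


def diagnose() -> dict:
    """Collect two independent hints about whether an NVIDIA driver is visible."""
    return {
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        "nvml_library_loads": nvml_library_loads(),
    }


if __name__ == "__main__":
    for check, ok in diagnose().items():
        print(f"{check}: {'yes' if ok else 'no'}")
```

If both checks report `no`, the environment most likely has no NVIDIA driver installed or exposed, and reinstalling `nvidia-ml-py` alone would not be expected to resolve the fatal error.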

abhishekkrthakur commented 2 months ago

Hi. Please check out the docs: https://hf.co/docs/autotrain. If you still see the issue, feel free to reopen.