huggingface / autotrain-advanced

🤗 AutoTrain Advanced
https://huggingface.co/autotrain
Apache License 2.0

[BUG] Inference times out even though model finetuning finished successfully #628

Closed bertilmuth closed 2 months ago

bertilmuth commented 2 months ago

Prerequisites

Backend

Hugging Face Space/Endpoints

Interface Used

UI

CLI Command

No response

UI Screenshots & Parameters

[Two screenshots attached to the original issue; images not preserved.]

Error Logs

===== Application Startup at 2024-05-07 14:29:17 =====

==========
== CUDA ==
==========

CUDA Version 12.1.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License. By pulling and using the container, you accept the terms and conditions of this license: https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

Found existing installation: autotrain-advanced 0.7.80.dev0 Uninstalling autotrain-advanced-0.7.80.dev0: Successfully uninstalled autotrain-advanced-0.7.80.dev0 Collecting autotrain-advanced Downloading autotrain_advanced-0.7.79-py3-none-any.whl.metadata (13 kB) Requirement already satisfied: albumentations==1.4.4 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (1.4.4) Requirement already satisfied: codecarbon==2.3.5 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (2.3.5) Requirement already satisfied: datasets~=2.19.0 in ./env/lib/python3.10/site-packages (from datasets[vision]~=2.19.0->autotrain-advanced) (2.19.1) Requirement already satisfied: evaluate==0.4.1 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (0.4.1) Requirement already satisfied: ipadic==1.0.0 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (1.0.0) Requirement already satisfied: jiwer==3.0.3 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (3.0.3) Requirement already satisfied: joblib==1.4.0 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (1.4.0) Requirement already satisfied: loguru==0.7.2 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (0.7.2) Requirement already satisfied: pandas==2.2.2 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (2.2.2) Requirement already satisfied: nltk==3.8.1 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (3.8.1) Requirement already satisfied: optuna==3.6.1 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (3.6.1) Requirement already satisfied: Pillow==10.3.0 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (10.3.0) Requirement already satisfied: protobuf==4.23.4 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (4.23.4) Requirement already satisfied: sacremoses==0.1.1 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (0.1.1) Requirement already satisfied: 
scikit-learn==1.4.2 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (1.4.2) Requirement already satisfied: sentencepiece==0.2.0 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (0.2.0) Requirement already satisfied: tqdm==4.66.2 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (4.66.2) Requirement already satisfied: werkzeug==3.0.2 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (3.0.2) Requirement already satisfied: xgboost==2.0.3 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (2.0.3) Requirement already satisfied: huggingface-hub==0.22.2 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (0.22.2) Requirement already satisfied: requests==2.31.0 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (2.31.0) Requirement already satisfied: einops==0.7.0 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (0.7.0) Requirement already satisfied: invisible-watermark==0.2.0 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (0.2.0) Requirement already satisfied: packaging==24.0 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (24.0) Requirement already satisfied: cryptography==42.0.5 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (42.0.5) Requirement already satisfied: nvitop==1.3.2 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (1.3.2) Requirement already satisfied: tensorboard==2.16.2 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (2.16.2) Requirement already satisfied: peft==0.10.0 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (0.10.0) Requirement already satisfied: trl==0.8.6 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (0.8.6) Requirement already satisfied: tiktoken==0.6.0 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (0.6.0) Requirement already satisfied: transformers==4.40.1 in ./env/lib/python3.10/site-packages (from 
autotrain-advanced) (4.40.1) Requirement already satisfied: accelerate==0.29.3 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (0.29.3) Requirement already satisfied: diffusers==0.27.2 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (0.27.2) Requirement already satisfied: rouge-score==0.1.2 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (0.1.2) Requirement already satisfied: py7zr==0.21.0 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (0.21.0) Requirement already satisfied: fastapi==0.110.2 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (0.110.2) Requirement already satisfied: uvicorn==0.29.0 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (0.29.0) Requirement already satisfied: python-multipart==0.0.9 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (0.0.9) Requirement already satisfied: pydantic==2.7.1 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (2.7.1) Requirement already satisfied: hf-transfer in ./env/lib/python3.10/site-packages (from autotrain-advanced) (0.1.6) Requirement already satisfied: pyngrok==7.1.6 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (7.1.6) Requirement already satisfied: authlib==1.3.0 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (1.3.0) Requirement already satisfied: itsdangerous==2.2.0 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (2.2.0) Requirement already satisfied: seqeval==1.2.2 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (1.2.2) Requirement already satisfied: httpx==0.27.0 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (0.27.0) Requirement already satisfied: pyyaml==6.0.1 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (6.0.1) Requirement already satisfied: bitsandbytes==0.43.1 in ./env/lib/python3.10/site-packages (from autotrain-advanced) (0.43.1) Requirement already satisfied: numpy>=1.17 in 
./env/lib/python3.10/site-packages (from accelerate==0.29.3->autotrain-advanced) (1.26.4) Requirement already satisfied: psutil in ./env/lib/python3.10/site-packages (from accelerate==0.29.3->autotrain-advanced) (5.9.8) Requirement already satisfied: torch>=1.10.0 in ./env/lib/python3.10/site-packages (from accelerate==0.29.3->autotrain-advanced) (2.3.0) Requirement already satisfied: safetensors>=0.3.1 in ./env/lib/python3.10/site-packages (from accelerate==0.29.3->autotrain-advanced) (0.4.3) Requirement already satisfied: scipy>=1.10.0 in ./env/lib/python3.10/site-packages (from albumentations==1.4.4->autotrain-advanced) (1.13.0) Requirement already satisfied: scikit-image>=0.21.0 in ./env/lib/python3.10/site-packages (from albumentations==1.4.4->autotrain-advanced) (0.23.2) Requirement already satisfied: typing-extensions>=4.9.0 in ./env/lib/python3.10/site-packages (from albumentations==1.4.4->autotrain-advanced) (4.9.0) Requirement already satisfied: opencv-python-headless>=4.9.0 in ./env/lib/python3.10/site-packages (from albumentations==1.4.4->autotrain-advanced) (4.9.0.80) Requirement already satisfied: arrow in ./env/lib/python3.10/site-packages (from codecarbon==2.3.5->autotrain-advanced) (1.3.0) Requirement already satisfied: pynvml in ./env/lib/python3.10/site-packages (from codecarbon==2.3.5->autotrain-advanced) (11.5.0) Requirement already satisfied: py-cpuinfo in ./env/lib/python3.10/site-packages (from codecarbon==2.3.5->autotrain-advanced) (9.0.0) Requirement already satisfied: rapidfuzz in ./env/lib/python3.10/site-packages (from codecarbon==2.3.5->autotrain-advanced) (3.9.0) Requirement already satisfied: click in ./env/lib/python3.10/site-packages (from codecarbon==2.3.5->autotrain-advanced) (8.1.7) Requirement already satisfied: prometheus-client in ./env/lib/python3.10/site-packages (from codecarbon==2.3.5->autotrain-advanced) (0.20.0) Requirement already satisfied: cffi>=1.12 in ./env/lib/python3.10/site-packages (from 
cryptography==42.0.5->autotrain-advanced) (1.16.0) Requirement already satisfied: importlib-metadata in ./env/lib/python3.10/site-packages (from diffusers==0.27.2->autotrain-advanced) (7.1.0) Requirement already satisfied: filelock in ./env/lib/python3.10/site-packages (from diffusers==0.27.2->autotrain-advanced) (3.13.1) Requirement already satisfied: regex!=2019.12.17 in ./env/lib/python3.10/site-packages (from diffusers==0.27.2->autotrain-advanced) (2024.4.28) Requirement already satisfied: dill in ./env/lib/python3.10/site-packages (from evaluate==0.4.1->autotrain-advanced) (0.3.8) Requirement already satisfied: xxhash in ./env/lib/python3.10/site-packages (from evaluate==0.4.1->autotrain-advanced) (3.4.1) Requirement already satisfied: multiprocess in ./env/lib/python3.10/site-packages (from evaluate==0.4.1->autotrain-advanced) (0.70.16) Requirement already satisfied: fsspec>=2021.05.0 in ./env/lib/python3.10/site-packages (from fsspec[http]>=2021.05.0->evaluate==0.4.1->autotrain-advanced) (2024.3.1) Requirement already satisfied: responses<0.19 in ./env/lib/python3.10/site-packages (from evaluate==0.4.1->autotrain-advanced) (0.18.0) Requirement already satisfied: starlette<0.38.0,>=0.37.2 in ./env/lib/python3.10/site-packages (from fastapi==0.110.2->autotrain-advanced) (0.37.2) Requirement already satisfied: anyio in ./env/lib/python3.10/site-packages (from httpx==0.27.0->autotrain-advanced) (4.3.0) Requirement already satisfied: certifi in ./env/lib/python3.10/site-packages (from httpx==0.27.0->autotrain-advanced) (2024.2.2) Requirement already satisfied: httpcore==1.* in ./env/lib/python3.10/site-packages (from httpx==0.27.0->autotrain-advanced) (1.0.5) Requirement already satisfied: idna in ./env/lib/python3.10/site-packages (from httpx==0.27.0->autotrain-advanced) (3.7) Requirement already satisfied: sniffio in ./env/lib/python3.10/site-packages (from httpx==0.27.0->autotrain-advanced) (1.3.1) Requirement already satisfied: PyWavelets>=1.1.1 in 
./env/lib/python3.10/site-packages (from invisible-watermark==0.2.0->autotrain-advanced) (1.6.0) Requirement already satisfied: opencv-python>=4.1.0.25 in ./env/lib/python3.10/site-packages (from invisible-watermark==0.2.0->autotrain-advanced) (4.9.0.80) Requirement already satisfied: nvidia-ml-py<12.536.0a0,>=11.450.51 in ./env/lib/python3.10/site-packages (from nvitop==1.3.2->autotrain-advanced) (12.535.161) Requirement already satisfied: cachetools>=1.0.1 in ./env/lib/python3.10/site-packages (from nvitop==1.3.2->autotrain-advanced) (5.3.3) Requirement already satisfied: termcolor>=1.0.0 in ./env/lib/python3.10/site-packages (from nvitop==1.3.2->autotrain-advanced) (2.4.0) Requirement already satisfied: alembic>=1.5.0 in ./env/lib/python3.10/site-packages (from optuna==3.6.1->autotrain-advanced) (1.13.1) Requirement already satisfied: colorlog in ./env/lib/python3.10/site-packages (from optuna==3.6.1->autotrain-advanced) (6.8.2) Requirement already satisfied: sqlalchemy>=1.3.0 in ./env/lib/python3.10/site-packages (from optuna==3.6.1->autotrain-advanced) (2.0.30) Requirement already satisfied: python-dateutil>=2.8.2 in ./env/lib/python3.10/site-packages (from pandas==2.2.2->autotrain-advanced) (2.9.0.post0) Requirement already satisfied: pytz>=2020.1 in ./env/lib/python3.10/site-packages (from pandas==2.2.2->autotrain-advanced) (2024.1) Requirement already satisfied: tzdata>=2022.7 in ./env/lib/python3.10/site-packages (from pandas==2.2.2->autotrain-advanced) (2024.1) Requirement already satisfied: texttable in ./env/lib/python3.10/site-packages (from py7zr==0.21.0->autotrain-advanced) (1.7.0) Requirement already satisfied: pycryptodomex>=3.16.0 in ./env/lib/python3.10/site-packages (from py7zr==0.21.0->autotrain-advanced) (3.20.0) Requirement already satisfied: pyzstd>=0.15.9 in ./env/lib/python3.10/site-packages (from py7zr==0.21.0->autotrain-advanced) (0.15.10) Requirement already satisfied: pyppmd<1.2.0,>=1.1.0 in ./env/lib/python3.10/site-packages (from 
py7zr==0.21.0->autotrain-advanced) (1.1.0) Requirement already satisfied: pybcj<1.1.0,>=1.0.0 in ./env/lib/python3.10/site-packages (from py7zr==0.21.0->autotrain-advanced) (1.0.2) Requirement already satisfied: multivolumefile>=0.2.3 in ./env/lib/python3.10/site-packages (from py7zr==0.21.0->autotrain-advanced) (0.2.3) Requirement already satisfied: inflate64<1.1.0,>=1.0.0 in ./env/lib/python3.10/site-packages (from py7zr==0.21.0->autotrain-advanced) (1.0.0) Requirement already satisfied: brotli>=1.1.0 in ./env/lib/python3.10/site-packages (from py7zr==0.21.0->autotrain-advanced) (1.1.0) Requirement already satisfied: annotated-types>=0.4.0 in ./env/lib/python3.10/site-packages (from pydantic==2.7.1->autotrain-advanced) (0.6.0) Requirement already satisfied: pydantic-core==2.18.2 in ./env/lib/python3.10/site-packages (from pydantic==2.7.1->autotrain-advanced) (2.18.2) Requirement already satisfied: charset-normalizer<4,>=2 in ./env/lib/python3.10/site-packages (from requests==2.31.0->autotrain-advanced) (2.0.4) Requirement already satisfied: urllib3<3,>=1.21.1 in ./env/lib/python3.10/site-packages (from requests==2.31.0->autotrain-advanced) (2.1.0) Requirement already satisfied: absl-py in ./env/lib/python3.10/site-packages (from rouge-score==0.1.2->autotrain-advanced) (2.1.0) Requirement already satisfied: six>=1.14.0 in ./env/lib/python3.10/site-packages (from rouge-score==0.1.2->autotrain-advanced) (1.16.0) Requirement already satisfied: threadpoolctl>=2.0.0 in ./env/lib/python3.10/site-packages (from scikit-learn==1.4.2->autotrain-advanced) (3.5.0) Requirement already satisfied: grpcio>=1.48.2 in ./env/lib/python3.10/site-packages (from tensorboard==2.16.2->autotrain-advanced) (1.63.0) Requirement already satisfied: markdown>=2.6.8 in ./env/lib/python3.10/site-packages (from tensorboard==2.16.2->autotrain-advanced) (3.6) Requirement already satisfied: setuptools>=41.0.0 in ./env/lib/python3.10/site-packages (from tensorboard==2.16.2->autotrain-advanced) 
(69.5.1) Requirement already satisfied: tensorboard-data-server<0.8.0,>=0.7.0 in ./env/lib/python3.10/site-packages (from tensorboard==2.16.2->autotrain-advanced) (0.7.2) Requirement already satisfied: tokenizers<0.20,>=0.19 in ./env/lib/python3.10/site-packages (from transformers==4.40.1->autotrain-advanced) (0.19.1) Requirement already satisfied: tyro>=0.5.11 in ./env/lib/python3.10/site-packages (from trl==0.8.6->autotrain-advanced) (0.8.3) Requirement already satisfied: h11>=0.8 in ./env/lib/python3.10/site-packages (from uvicorn==0.29.0->autotrain-advanced) (0.14.0) Requirement already satisfied: MarkupSafe>=2.1.1 in ./env/lib/python3.10/site-packages (from werkzeug==3.0.2->autotrain-advanced) (2.1.3) Requirement already satisfied: pyarrow>=12.0.0 in ./env/lib/python3.10/site-packages (from datasets~=2.19.0->datasets[vision]~=2.19.0->autotrain-advanced) (16.0.0) Requirement already satisfied: pyarrow-hotfix in ./env/lib/python3.10/site-packages (from datasets~=2.19.0->datasets[vision]~=2.19.0->autotrain-advanced) (0.6) Requirement already satisfied: aiohttp in ./env/lib/python3.10/site-packages (from datasets~=2.19.0->datasets[vision]~=2.19.0->autotrain-advanced) (3.9.5) Requirement already satisfied: Mako in ./env/lib/python3.10/site-packages (from alembic>=1.5.0->optuna==3.6.1->autotrain-advanced) (1.3.3) Requirement already satisfied: pycparser in ./env/lib/python3.10/site-packages (from cffi>=1.12->cryptography==42.0.5->autotrain-advanced) (2.22) Requirement already satisfied: aiosignal>=1.1.2 in ./env/lib/python3.10/site-packages (from aiohttp->datasets~=2.19.0->datasets[vision]~=2.19.0->autotrain-advanced) (1.3.1) Requirement already satisfied: attrs>=17.3.0 in ./env/lib/python3.10/site-packages (from aiohttp->datasets~=2.19.0->datasets[vision]~=2.19.0->autotrain-advanced) (23.2.0) Requirement already satisfied: frozenlist>=1.1.1 in ./env/lib/python3.10/site-packages (from aiohttp->datasets~=2.19.0->datasets[vision]~=2.19.0->autotrain-advanced) (1.4.1) 
Requirement already satisfied: multidict<7.0,>=4.5 in ./env/lib/python3.10/site-packages (from aiohttp->datasets~=2.19.0->datasets[vision]~=2.19.0->autotrain-advanced) (6.0.5) Requirement already satisfied: yarl<2.0,>=1.0 in ./env/lib/python3.10/site-packages (from aiohttp->datasets~=2.19.0->datasets[vision]~=2.19.0->autotrain-advanced) (1.9.4) Requirement already satisfied: async-timeout<5.0,>=4.0 in ./env/lib/python3.10/site-packages (from aiohttp->datasets~=2.19.0->datasets[vision]~=2.19.0->autotrain-advanced) (4.0.3) Requirement already satisfied: networkx>=2.8 in ./env/lib/python3.10/site-packages (from scikit-image>=0.21.0->albumentations==1.4.4->autotrain-advanced) (3.1) Requirement already satisfied: imageio>=2.33 in ./env/lib/python3.10/site-packages (from scikit-image>=0.21.0->albumentations==1.4.4->autotrain-advanced) (2.34.1) Requirement already satisfied: tifffile>=2022.8.12 in ./env/lib/python3.10/site-packages (from scikit-image>=0.21.0->albumentations==1.4.4->autotrain-advanced) (2024.5.3) Requirement already satisfied: lazy-loader>=0.4 in ./env/lib/python3.10/site-packages (from scikit-image>=0.21.0->albumentations==1.4.4->autotrain-advanced) (0.4) Requirement already satisfied: greenlet!=0.4.17 in ./env/lib/python3.10/site-packages (from sqlalchemy>=1.3.0->optuna==3.6.1->autotrain-advanced) (3.0.3) Requirement already satisfied: exceptiongroup>=1.0.2 in ./env/lib/python3.10/site-packages (from anyio->httpx==0.27.0->autotrain-advanced) (1.2.1) Requirement already satisfied: sympy in ./env/lib/python3.10/site-packages (from torch>=1.10.0->accelerate==0.29.3->autotrain-advanced) (1.12) Requirement already satisfied: jinja2 in ./env/lib/python3.10/site-packages (from torch>=1.10.0->accelerate==0.29.3->autotrain-advanced) (3.1.3) Requirement already satisfied: docstring-parser>=0.14.1 in ./env/lib/python3.10/site-packages (from tyro>=0.5.11->trl==0.8.6->autotrain-advanced) (0.16) Requirement already satisfied: rich>=11.1.0 in 
./env/lib/python3.10/site-packages (from tyro>=0.5.11->trl==0.8.6->autotrain-advanced) (13.7.1) Requirement already satisfied: shtab>=1.5.6 in ./env/lib/python3.10/site-packages (from tyro>=0.5.11->trl==0.8.6->autotrain-advanced) (1.7.1) Requirement already satisfied: types-python-dateutil>=2.8.10 in ./env/lib/python3.10/site-packages (from arrow->codecarbon==2.3.5->autotrain-advanced) (2.9.0.20240316) Requirement already satisfied: zipp>=0.5 in ./env/lib/python3.10/site-packages (from importlib-metadata->diffusers==0.27.2->autotrain-advanced) (3.18.1) Requirement already satisfied: markdown-it-py>=2.2.0 in ./env/lib/python3.10/site-packages (from rich>=11.1.0->tyro>=0.5.11->trl==0.8.6->autotrain-advanced) (3.0.0) Requirement already satisfied: pygments<3.0.0,>=2.13.0 in ./env/lib/python3.10/site-packages (from rich>=11.1.0->tyro>=0.5.11->trl==0.8.6->autotrain-advanced) (2.18.0) Requirement already satisfied: mpmath>=0.19 in ./env/lib/python3.10/site-packages (from sympy->torch>=1.10.0->accelerate==0.29.3->autotrain-advanced) (1.3.0) Requirement already satisfied: mdurl~=0.1 in ./env/lib/python3.10/site-packages (from markdown-it-py>=2.2.0->rich>=11.1.0->tyro>=0.5.11->trl==0.8.6->autotrain-advanced) (0.1.2) Downloading autotrain_advanced-0.7.79-py3-none-any.whl (276 kB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 276.0/276.0 kB 18.2 MB/s eta 0:00:00 Installing collected packages: autotrain-advanced Successfully installed autotrain-advanced-0.7.79 Your installed package nvidia-ml-py is corrupted. Skip patch functions nvmlDeviceGet{Compute,Graphics,MPSCompute}RunningProcesses. You may get incorrect or incomplete results. Please consider reinstall package nvidia-ml-py via pip3 install --force-reinstall nvidia-ml-py nvitop. Your installed package nvidia-ml-py is corrupted. Skip patch functions nvmlDeviceGetMemoryInfo. You may get incorrect or incomplete results. Please consider reinstall package nvidia-ml-py via pip3 install --force-reinstall nvidia-ml-py nvitop. 
INFO | 2024-05-07 14:29:34 | autotrain.app::32 - Starting AutoTrain...
WARNING | 2024-05-07 14:29:34 | autotrain.trainers.common:init:174 - Parameters not supplied by user and set to default: valid_split, scheduler, seed, train_split, model, token, gradient_accumulation, batch_size, save_total_limit, warmup_ratio, weight_decay, max_prompt_length, auto_find_batch_size, lora_r, add_eos_token, use_flash_attention_2, rejected_text_column, disable_gradient_checkpointing, project_name, trainer, lora_dropout, lr, data_path, lora_alpha, push_to_hub, model_ref, text_column, prompt_text_column, model_max_length, merge_adapter, dpo_beta, evaluation_strategy, optimizer, username, max_grad_norm, logging_steps
WARNING | 2024-05-07 14:29:34 | autotrain.trainers.common:init:174 - Parameters not supplied by user and set to default: target_column, valid_split, scheduler, project_name, seed, lr, train_split, data_path, epochs, model, push_to_hub, token, gradient_accumulation, text_column, batch_size, warmup_ratio, weight_decay, save_total_limit, evaluation_strategy, max_seq_length, auto_find_batch_size, optimizer, username, max_grad_norm, logging_steps
WARNING | 2024-05-07 14:29:34 | autotrain.trainers.common:init:174 - Parameters not supplied by user and set to default: target_column, valid_split, scheduler, project_name, seed, lr, train_split, data_path, epochs, model, push_to_hub, token, gradient_accumulation, batch_size, warmup_ratio, weight_decay, save_total_limit, evaluation_strategy, auto_find_batch_size, image_column, optimizer, username, max_grad_norm, logging_steps
WARNING | 2024-05-07 14:29:34 | autotrain.trainers.common:init:174 - Parameters not supplied by user and set to default: valid_split, scheduler, seed, train_split, model, token, gradient_accumulation, quantization, batch_size, warmup_ratio, weight_decay, save_total_limit, max_seq_length, auto_find_batch_size, lora_r, logging_steps, target_column, peft, project_name, lora_dropout, lr, data_path, epochs, lora_alpha, push_to_hub, text_column, evaluation_strategy, optimizer, username, max_grad_norm, max_target_length
WARNING | 2024-05-07 14:29:34 | autotrain.trainers.common:init:174 - Parameters not supplied by user and set to default: target_columns, valid_split, numerical_columns, project_name, id_column, seed, task, train_split, data_path, model, time_limit, push_to_hub, token, categorical_columns, num_trials, username
WARNING | 2024-05-07 14:29:34 | autotrain.trainers.common:init:174 - Parameters not supplied by user and set to default: adam_weight_decay, username, scheduler, class_prompt, seed, logging, adam_beta1, adam_epsilon, model, rank, token, image_path, validation_epochs, class_labels_conditioning, num_validation_images, sample_batch_size, scale_lr, lr_power, dataloader_num_workers, text_encoder_use_attention_mask, class_image_path, prior_loss_weight, tokenizer_max_length, local_rank, project_name, validation_prompt, warmup_steps, epochs, resume_from_checkpoint, xl, checkpoints_total_limit, num_cycles, allow_tf32, adam_beta2, center_crop, push_to_hub, validation_images, prior_generation_precision, revision, prior_preservation, num_class_images, checkpointing_steps, tokenizer, pre_compute_text_embeddings, max_grad_norm
WARNING | 2024-05-07 14:29:34 | autotrain.trainers.common:init:174 - Parameters not supplied by user and set to default: username, valid_split, scheduler, project_name, seed, lr, train_split, data_path, epochs, model, push_to_hub, token, gradient_accumulation, batch_size, tokens_column, warmup_ratio, weight_decay, save_total_limit, evaluation_strategy, max_seq_length, auto_find_batch_size, optimizer, tags_column, max_grad_norm, logging_steps
WARNING | 2024-05-07 14:29:34 | autotrain.trainers.common:init:174 - Parameters not supplied by user and set to default: target_column, valid_split, scheduler, project_name, seed, lr, train_split, data_path, epochs, model, push_to_hub, token, gradient_accumulation, text_column, batch_size, warmup_ratio, weight_decay, save_total_limit, evaluation_strategy, max_seq_length, auto_find_batch_size, optimizer, username, max_grad_norm, logging_steps
INFO | 2024-05-07 14:29:35 | autotrain.app::156 - AutoTrain started successfully
INFO | 2024-05-07 14:29:36 | autotrain.app:fetch_params:214 - Task: llm:sft
INFO | 2024-05-07 14:30:06 | autotrain.app:handle_form:463 - hardware: Local
INFO | 2024-05-07 14:30:06 | autotrain.app:handle_form:554 - Task: lm_training
INFO | 2024-05-07 14:30:06 | autotrain.app:handle_form:555 - Column mapping: {'text': 'text'}

Saving the dataset (0/1 shards): 0%| | 0/9 [00:00<?, ? examples/s] Saving the dataset (1/1 shards): 100%|██████████| 9/9 [00:00<00:00, 3624.46 examples/s] Saving the dataset (1/1 shards): 100%|██████████| 9/9 [00:00<00:00, 3390.10 examples/s]

Saving the dataset (0/1 shards): 0%| | 0/9 [00:00<?, ? examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 9/9 [00:00<00:00, 4258.66 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 9/9 [00:00<00:00, 3981.51 examples/s]
WARNING | 2024-05-07 14:30:06 | autotrain.trainers.common:init:174 - Parameters not supplied by user and set to default: valid_split, seed, train_split, quantization, save_total_limit, warmup_ratio, weight_decay, max_prompt_length, max_completion_length, auto_find_batch_size, lora_r, add_eos_token, use_flash_attention_2, padding, disable_gradient_checkpointing, lora_dropout, lora_alpha, model_ref, merge_adapter, dpo_beta, evaluation_strategy, max_grad_norm, logging_steps
INFO | 2024-05-07 14:30:06 | autotrain.backend:create:300 - Starting local training...
INFO | 2024-05-07 14:30:06 | autotrain.commands:launch_command:327 - ['accelerate', 'launch', '--num_machines', '1', '--num_processes', '1', '--mixed_precision', 'fp16', '-m', 'autotrain.trainers.clm', '--training_config', 'autotrain-y6hc2-nrxlz/training_params.json']
INFO | 2024-05-07 14:30:06 | autotrain.commands:launch_command:328 - {'model': 'microsoft/Phi-3-mini-128k-instruct', 'project_name': 'autotrain-y6hc2-nrxlz', 'data_path': 'autotrain-y6hc2-nrxlz/autotrain-data', 'train_split': 'train', 'valid_split': None, 'add_eos_token': True, 'block_size': 1024, 'model_max_length': 4096, 'padding': 'right', 'trainer': 'sft', 'use_flash_attention_2': False, 'log': 'tensorboard', 'disable_gradient_checkpointing': False, 'logging_steps': -1, 'evaluation_strategy': 'epoch', 'save_total_limit': 1, 'auto_find_batch_size': False, 'mixed_precision': 'fp16', 'lr': 3e-05, 'epochs': 3, 'batch_size': 2, 'warmup_ratio': 0.1, 'gradient_accumulation': 4, 'optimizer': 'adamw_torch', 'scheduler': 'linear', 'weight_decay': 0.0, 'max_grad_norm': 1.0, 'seed': 42, 'chat_template': 'none', 'quantization': 'int4', 'target_modules': 'all-linear', 'merge_adapter': False, 'peft': True, 'lora_r': 16, 'lora_alpha': 32, 'lora_dropout': 0.05, 'model_ref': None, 'dpo_beta': 0.1, 'max_prompt_length': 128, 'max_completion_length': None, 'prompt_text_column': 'autotrain_prompt', 'text_column': 'autotrain_text', 'rejected_text_column': 'autotrain_rejected_text', 'push_to_hub': True, 'username': 'bertilmuth', 'token': '*****'}
INFO | 2024-05-07 14:30:06 | autotrain.backend:create:305 - Training PID: 65
The following values were not passed to accelerate launch and had defaults used instead:
--dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
INFO | 2024-05-07 14:30:13 | autotrain.trainers.clm.train_clm_sft:train:14 - Starting SFT training...
INFO | 2024-05-07 14:30:13 | autotrain.trainers.clm.utils:process_input_data:311 - loading dataset from disk
INFO | 2024-05-07 14:30:13 | autotrain.trainers.clm.utils:process_input_data:352 - Train data: Dataset({ features: ['autotrain_text'], num_rows: 9 })
INFO | 2024-05-07 14:30:13 | autotrain.trainers.clm.utils:process_input_data:353 - Valid data: None
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO | 2024-05-07 14:30:13 | autotrain.trainers.clm.utils:configure_logging_steps:423 - configuring logging steps
INFO | 2024-05-07 14:30:13 | autotrain.trainers.clm.utils:configure_logging_steps:436 - Logging steps: 1
INFO | 2024-05-07 14:30:13 | autotrain.trainers.clm.utils:configure_training_args:441 - configuring training args
INFO | 2024-05-07 14:30:13 | autotrain.trainers.clm.utils:configure_block_size:504 - Using block size 1024
INFO | 2024-05-07 14:30:13 | autotrain.trainers.clm.train_clm_sft:train:27 - loading model config...
A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-128k-instruct:

Downloading shards: 0%| | 0/2 [00:00<?, ?it/s] Downloading shards: 50%|█████ | 1/2 [00:05<00:05, 5.28s/it] Downloading shards: 100%|██████████| 2/2 [00:12<00:00, 6.71s/it] Downloading shards: 100%|██████████| 2/2 [00:12<00:00, 6.50s/it]

Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s] Loading checkpoint shards: 50%|█████ | 1/2 [00:08<00:08, 8.51s/it] Loading checkpoint shards: 100%|██████████| 2/2 [00:12<00:00, 5.79s/it] Loading checkpoint shards: 100%|██████████| 2/2 [00:12<00:00, 6.19s/it] INFO | 2024-05-07 14:30:39 | autotrain.trainers.clm.train_clm_sft:train:66 - model dtype: torch.float16 INFO | 2024-05-07 14:30:39 | autotrain.trainers.clm.train_clm_sft:train:79 - creating trainer

Generating train split: 0 examples [00:00, ? examples/s] Generating train split: 2 examples [00:00, 159.62 examples/s] INFO | 2024-05-07 14:30:40 | autotrain.trainers.common:on_train_begin:231 - Starting to train...

0%| | 0/3 [00:00<?, ?it/s]You are not running the flash-attention implementation, expect numerical differences.

0%| | 0/1 [00:00<?, ?it/s]

events.out.tfevents.1715092240.r-bertilmuth-phi-3-vqu08txi-99074-5gv27.66.0: 0%| | 0.00/8.00k [00:00<?, ?B/s] events.out.tfevents.1715092240.r-bertilmuth-phi-3-vqu08txi-99074-5gv27.66.0: 100%|██████████| 8.00k/8.00k [00:00<00:00, 65.9kB/s]

100%|██████████| 1/1 [00:00<00:00, 4.85it/s] 100%|██████████| 1/1 [00:00<00:00, 4.85it/s]

33%|███▎ | 1/3 [00:06<00:12, 6.40s/it]INFO | 2024-05-07 14:30:47 | autotrain.trainers.common:on_log:226 - {'loss': 0.3232, 'grad_norm': 0.5817663073539734, 'learning_rate': 3e-05, 'epoch': 1.0}

{'loss': 0.3232, 'grad_norm': 0.5817663073539734, 'learning_rate': 3e-05, 'epoch': 1.0}

33%|███▎ | 1/3 [00:06<00:12, 6.40s/it] 67%|██████▋ | 2/3 [00:11<00:05, 5.58s/it]INFO | 2024-05-07 14:30:52 | autotrain.trainers.common:on_log:226 - {'loss': 0.3232, 'grad_norm': 0.5791377425193787, 'learning_rate': 1.5e-05, 'epoch': 2.0}

{'loss': 0.3232, 'grad_norm': 0.5791377425193787, 'learning_rate': 1.5e-05, 'epoch': 2.0}

67%|██████▋ | 2/3 [00:11<00:05, 5.58s/it] 100%|██████████| 3/3 [00:16<00:00, 5.32s/it]INFO | 2024-05-07 14:30:57 | autotrain.trainers.common:on_log:226 - {'loss': 0.3018, 'grad_norm': 0.49286404252052307, 'learning_rate': 0.0, 'epoch': 3.0}

{'loss': 0.3018, 'grad_norm': 0.49286404252052307, 'learning_rate': 0.0, 'epoch': 3.0}

100%|██████████| 3/3 [00:16<00:00, 5.32s/it]INFO | 2024-05-07 14:30:57 | autotrain.trainers.common:on_log:226 - {'train_runtime': 16.4376, 'train_samples_per_second': 0.365, 'train_steps_per_second': 0.183, 'train_loss': 0.3160308400789897, 'epoch': 3.0}

{'train_runtime': 16.4376, 'train_samples_per_second': 0.365, 'train_steps_per_second': 0.183, 'train_loss': 0.3160308400789897, 'epoch': 3.0}

100%|██████████| 3/3 [00:16<00:00, 5.32s/it] 100%|██████████| 3/3 [00:16<00:00, 5.48s/it] INFO | 2024-05-07 14:30:57 | autotrain.trainers.clm.utils:post_training_steps:263 - Finished training, saving model... INFO | 2024-05-07 14:30:59 | autotrain.trainers.clm.utils:post_training_steps:293 - Pushing model to hub...

0%| | 0/4 [00:00<?, ?it/s] 25%|██▌ | 1/4 [00:04<00:14, 4.78s/it] 50%|█████ | 2/4 [00:05<00:04, 2.18s/it] 75%|███████▌ | 3/4 [00:05<00:01, 1.29s/it] 100%|██████████| 4/4 [00:05<00:00, 1.16it/s] 100%|██████████| 4/4 [00:05<00:00, 1.40s/it] INFO | 2024-05-07 14:31:07 | autotrain.trainers.common:pause_space:77 - Pausing space...

Additional Information

I am unsure whether the AutoTrain Space is supposed to pause after fine-tuning the model (that is what happens; see the log). I can access the fine-tuned model, but when I use the GUI for inference, it times out (see screenshot). Sending a POST request doesn't work either: it returns an error saying that the model is still loading.
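For reference, the POST attempt described above can be sketched roughly as follows. This is a minimal sketch under assumptions and not the exact request from this report: the `API_URL` pattern, the 503 status, and the `estimated_time` field reflect how the hosted Inference API has typically signalled a model that is still loading, and `http_post` and `query_with_retry` are hypothetical helper names. Retrying only helps if the model eventually loads; it does not explain why loading might never finish.

```python
import json
import time
import urllib.error
import urllib.request

# Assumed endpoint pattern for the hosted Inference API; fill in your own model id.
API_URL = "https://api-inference.huggingface.co/models/{model_id}"


def http_post(url, token, payload, timeout=120):
    """Send one JSON POST request and return (status_code, parsed_body)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # Error responses usually carry a JSON body; fall back to raw text.
        body = err.read().decode("utf-8")
        try:
            return err.code, json.loads(body)
        except json.JSONDecodeError:
            return err.code, {"error": body}


def query_with_retry(send, payload, max_attempts=5, sleep=time.sleep):
    """Call send(payload) until it succeeds or the attempts run out.

    `send` is any callable returning (status_code, body). A 503 response whose
    body contains "estimated_time" is treated as "model still loading"
    (an assumption about the API's behaviour) and retried after a wait.
    """
    for _ in range(max_attempts):
        status, body = send(payload)
        if status == 200:
            return body
        if status == 503 and isinstance(body, dict) and "estimated_time" in body:
            # Wait for the advertised loading time, capped at 60 seconds.
            sleep(min(float(body["estimated_time"]), 60.0))
            continue
        raise RuntimeError(f"request failed with status {status}: {body}")
    raise TimeoutError(f"model still loading after {max_attempts} attempts")
```

To try it against a real model, pass something like `lambda p: http_post(API_URL.format(model_id="<user>/<model>"), "<token>", p)` as `send`, with a payload such as `{"inputs": "Hello"}`. Keeping the transport separate from the retry loop makes the loop easy to exercise without network access.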

abhishekkrthakur commented 2 months ago

It doesn't look like an AutoTrain issue. For website-related issues, please email website@hf.co