huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Can't save fine-tuned model on local machine #26073

Closed 50516017 closed 11 months ago

50516017 commented 12 months ago

System Info

Hi, I want to fine-tune "rinna/japanese-gpt-neox-3.6b-instruction-ppo" on Windows.

However, when I ran training and tried to save, the following error occurred and the model was not saved to output_dir. How should I solve it? I built the environment using WSL2 and installed bitsandbytes from the following repository. Could that be the cause?

https://github.com/jllllll/bitsandbytes-windows-webui

If this repository is causing the problem, should I avoid using bitsandbytes in a Windows environment?

Environment

OS: Ubuntu 22.04 on Windows 11 using WSL2
GPU: NVIDIA GeForce RTX 4060 Ti (16GB)
CPU: AMD Ryzen 5 4500 6-core (16GB)

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 537.13       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4060 Ti     On  | 00000000:01:00.0  On |                  N/A |
|  0%   43C    P8              10W / 165W |    478MiB / 16380MiB |     10%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A       276      G   /Xwayland                                 N/A      |

pip list

Package                   Version
------------------------- ------------
accelerate                0.20.3
adapter-transformers      3.2.1
aiofiles                  23.2.1
aiohttp                   3.8.5
aiosignal                 1.3.1
altair                    5.1.1
anyio                     4.0.0
appdirs                   1.4.4
async-timeout             4.0.3
attrs                     23.1.0
bitsandbytes              0.39.0
certifi                   2023.7.22
charset-normalizer        3.2.0
click                     8.1.7
cmake                     3.27.4.1
contourpy                 1.1.0
ctranslate2               3.19.0
cycler                    0.11.0
datasets                  2.12.0
dill                      0.3.6
docker-pycreds            0.4.0
exceptiongroup            1.1.3
fastapi                   0.95.2
ffmpy                     0.3.1
filelock                  3.12.3
fonttools                 4.42.1
frozenlist                1.4.0
fsspec                    2023.9.0
gitdb                     4.0.10
GitPython                 3.1.35
gradio                    3.31.0
gradio_client             0.5.0
h11                       0.14.0
httpcore                  0.17.3
httpx                     0.24.1
huggingface-hub           0.16.4
idna                      3.4
Jinja2                    3.1.2
jsonschema                4.19.0
jsonschema-specifications 2023.7.1
kiwisolver                1.4.5
linkify-it-py             2.0.2
lit                       16.0.6
loralib                   0.1.1
markdown-it-py            2.2.0
MarkupSafe                2.1.3
matplotlib                3.7.2
mdit-py-plugins           0.3.3
mdurl                     0.1.2
mpmath                    1.3.0
multidict                 6.0.4
multiprocess              0.70.14
networkx                  3.1
numpy                     1.25.2
nvidia-cublas-cu11        11.10.3.66
nvidia-cuda-cupti-cu11    11.7.101
nvidia-cuda-nvrtc-cu11    11.7.99
nvidia-cuda-runtime-cu11  11.7.99
nvidia-cudnn-cu11         8.5.0.96
nvidia-cufft-cu11         10.9.0.58
nvidia-curand-cu11        10.2.10.91
nvidia-cusolver-cu11      11.4.0.1
nvidia-cusparse-cu11      11.7.4.91
nvidia-nccl-cu11          2.14.3
nvidia-nvtx-cu11          11.7.91
orjson                    3.9.5
packaging                 23.1
pandas                    2.1.0
pathtools                 0.1.2
peft                      0.4.0
Pillow                    10.0.0
pip                       23.2.1
protobuf                  3.20.0
psutil                    5.9.5
pyarrow                   13.0.0
pydantic                  1.10.12
pydub                     0.25.1
Pygments                  2.16.1
pyparsing                 3.0.9
python-dateutil           2.8.2
python-multipart          0.0.6
pytz                      2023.3.post1
PyYAML                    6.0.1
referencing               0.30.2
regex                     2023.8.8
requests                  2.31.0
responses                 0.18.0
rpds-py                   0.10.2
safetensors               0.3.3
scipy                     1.10.1
semantic-version          2.10.0
sentencepiece             0.1.99
sentry-sdk                1.30.0
setproctitle              1.3.2
setuptools                59.6.0
six                       1.16.0
smmap                     5.0.0
sniffio                   1.3.0
starlette                 0.27.0
sympy                     1.12
tokenizers                0.13.3
toolz                     0.12.0
torch                     2.0.1
torchaudio                2.0.2
torchvision               0.15.2
tqdm                      4.66.1
transformers              4.33.1
triton                    2.0.0
typing_extensions         4.7.1
tzdata                    2023.3
uc-micro-py               1.0.2
urllib3                   2.0.4
uvicorn                   0.23.2
wandb                     0.15.10
websockets                11.0.3
wheel                     0.41.2
xxhash                    3.3.0
yarl                      1.9.2

Who can help?

@pacman100 : @muellerz

Information

Tasks

Reproduction

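Run the following training code:

import transformers
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_name = "rinna/japanese-gpt-neox-3.6b-instruction-ppo"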

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
config = AutoConfig.from_pretrained(model_name,use_fast=False)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    device_map="auto",
    #torch_dtype=torch.bfloat16,
    load_in_8bit=True
)

eval_steps = 11
save_steps = 33
logging_steps = 3
MICRO_BATCH_SIZE = 2
BATCH_SIZE = 16

trainer = transformers.Trainer(

    model = model,
    data_collator=collator,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    args=transformers.TrainingArguments(
        num_train_epochs=1,
        #learning_rate=3e-5,
        evaluation_strategy="steps",
        save_strategy="steps",
        eval_steps=eval_steps,
        save_steps=save_steps,
        #warmup_ratio=0.15,
        per_device_train_batch_size=MICRO_BATCH_SIZE,
        per_device_eval_batch_size=MICRO_BATCH_SIZE,
        gradient_accumulation_steps=BATCH_SIZE // MICRO_BATCH_SIZE,
        #bf16=True,
        dataloader_num_workers=12,
        logging_steps=logging_steps,
        output_dir="./output",
        #report_to="wandb",
        save_total_limit=1,
        load_best_model_at_end=True,
        greater_is_better=False,
        metric_for_best_model="eval_loss",
        fp16=True,
        auto_find_batch_size=True
    )
)

Expected behavior

error message

bin /lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so
/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('unix')}
  warn(msg)
/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0'), PosixPath('/usr/local/cuda/lib64/libcudart.so')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)
CUDA exception! Error code: no CUDA-capable device is detected
CUDA exception! Error code: initialization error
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0
/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: No GPU detected! Check your CUDA paths. Proceeding to load CPU-only library...
  warn(msg)
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
0
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
/lib/python3.10/site-packages/transformers/models/t5/tokenization_t5.py:283: UserWarning: This sequence already has </s>. In future versions this behavior may lead to duplicated eos tokens being added.

{'loss': 218.7709, 'learning_rate': 2.357142857142857e-05, 'epoch': 0.21}
{'loss': 0.0, 'learning_rate': 1.7142857142857142e-05, 'epoch': 0.42}
{'loss': 0.0, 'learning_rate': 1.0714285714285714e-05, 'epoch': 0.64}
{'eval_loss': nan, 'eval_runtime': 2.0215, 'eval_samples_per_second': 5.442, 'eval_steps_per_second': 2.968, 'epoch': 0.78}
{'loss': 0.0, 'learning_rate': 4.2857142857142855e-06, 'epoch': 0.85}
{'train_runtime': 83.9278, 'train_samples_per_second': 2.693, 'train_steps_per_second': 0.167, 'train_loss': 46.87947736467634, 'epoch': 0.99}
100%|████████████████████████████████████████| 14/14 [01:10<00:00,  5.06s/it]
/lib/python3.10/site-packages/transformers/modeling_utils.py:1825: UserWarning: You are calling `save_pretrained` to a 8-bit converted model you may likely encounter unexepected behaviors. If you want to save 8-bit models, make sure to have `bitsandbytes>0.37.2` installed.

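
One detail worth checking in the log above is the step arithmetic. The sketch below is a simplified model, assuming step-based logging, evaluation, and saving each fire when the global step is a multiple of the configured interval. The total of 14 steps is read from the progress bar; the helper name `trigger_steps` is hypothetical.

```python
def trigger_steps(total_steps, every):
    """Return the 1-based global steps at which an 'every N steps' event fires."""
    return [step for step in range(1, total_steps + 1) if step % every == 0]


TOTAL_STEPS = 14      # read from the "14/14" progress bar in the log
LOGGING_STEPS = 3     # values below are copied from the reproduction script
EVAL_STEPS = 11
SAVE_STEPS = 33
MICRO_BATCH_SIZE = 2
BATCH_SIZE = 16

# Number of micro-batches accumulated before each optimizer step.
grad_accum = BATCH_SIZE // MICRO_BATCH_SIZE

log_at = trigger_steps(TOTAL_STEPS, LOGGING_STEPS)
eval_at = trigger_steps(TOTAL_STEPS, EVAL_STEPS)
save_at = trigger_steps(TOTAL_STEPS, SAVE_STEPS)

print("gradient_accumulation_steps:", grad_accum)
print("logging at steps:", log_at)
print("evaluation at steps:", eval_at)
print("checkpoints at steps:", save_at)
```

Under this assumption the run logs four times and evaluates once, which matches the log, but it never reaches step 33, so no step-based checkpoint would be expected in output_dir. If that is what happened here, lowering save_steps or calling trainer.save_model() after training may be worth trying in addition to the fixes discussed below.
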
amyeroberts commented 12 months ago

cc @younesbelkada

younesbelkada commented 12 months ago

Hi @50516017, thanks a lot for raising this. There are a couple of issues in your script:

1- You are performing pure fine-tuning of the 8-bit model, which is not supported. If you want to train a model with 8-bit weights, you need to attach adapters to it using the peft package. Please have a look at a few examples here: https://github.com/huggingface/peft/tree/main/examples/int8_training
2- You are using bitsandbytes compiled for Windows, and I am not sure how that package will interact with transformers. We only support this bitsandbytes package: https://github.com/TimDettmers/bitsandbytes, so you might run into issues we cannot catch.
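
For readers landing here, the adapter approach from point 1 might look roughly like the sketch below. It is not taken from the linked examples verbatim: the hyperparameter values, the `target_modules` entry (assumed to be the fused attention projection in GPT-NeoX models), and the helper name `attach_lora` are all illustrative assumptions.

```python
# Illustrative LoRA hyperparameters (assumptions, not values from this thread).
LORA_SETTINGS = {
    "r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM",
    # Assumed name of the fused attention projection in GPT-NeoX models.
    "target_modules": ["query_key_value"],
}


def attach_lora(model, settings=None):
    """Wrap an 8-bit base model with trainable LoRA adapters (hypothetical helper)."""
    # Imported lazily so this file can be loaded without peft installed.
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    settings = settings or LORA_SETTINGS
    # Prepare the quantized model for training (e.g. freezes the base weights).
    model = prepare_model_for_kbit_training(model)
    peft_model = get_peft_model(model, LoraConfig(**settings))
    # Report how many parameters will actually be updated.
    peft_model.print_trainable_parameters()
    return peft_model
```

The returned model would be passed to transformers.Trainer in place of the raw 8-bit model. After training, calling save_pretrained on it with a directory of your choice should write only the adapter weights rather than the full base model.
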

Can you print the model and share the result here? Thanks!

50516017 commented 11 months ago

I set the LoRA parameters based on the link and ran the training, and it worked! Thank you very much!

younesbelkada commented 11 months ago

Awesome, @50516017, glad that it worked!