huggingface / autotrain-advanced

πŸ€— AutoTrain Advanced
https://huggingface.co/autotrain
Apache License 2.0

[BUG] LLM-ORPO training always failing due to 'Tokenizer' argument #802

Closed jmparejaz closed 1 week ago

jmparejaz commented 1 week ago

Prerequisites

Backend

Hugging Face Space/Endpoints

Interface Used

UI

CLI Command

No response

UI Screenshots & Parameters

(two screenshots of the UI parameters attached)

Error Logs

INFO: Started server process [3553] INFO: Waiting for application startup. INFO | 2024-11-09 16:40:30 | autotrain.commands:launch_command:523 - ['accelerate', 'launch', '--num_machines', '1', '--num_processes', '1', '--mixed_precision', 'bf16', '-m', 'autotrain.trainers.clm', '--training_config', 'autotrain-ihh1b-19lgj/training_params.json'] INFO | 2024-11-09 16:40:30 | autotrain.commands:launch_command:524 - {'model': 'Qwen/Qwen2.5-7B', 'project_name': 'autotrain-ihh1b-19lgj', 'data_path': 'mlabonne/orpo-dpo-mix-40k', 'train_split': 'train', 'valid_split': None, 'add_eos_token': True, 'block_size': 2048, 'model_max_length': 32768, 'padding': 'right', 'trainer': 'orpo', 'use_flash_attention_2': True, 'log': 'tensorboard', 'disable_gradient_checkpointing': False, 'logging_steps': -1, 'eval_strategy': 'epoch', 'save_total_limit': 1, 'auto_find_batch_size': False, 'mixed_precision': 'bf16', 'lr': 0.0003, 'epochs': 1, 'batch_size': 5, 'warmup_ratio': 0.1, 'gradient_accumulation': 4, 'optimizer': 'adamw_bnb_8bit', 'scheduler': 'cosine', 'weight_decay': 0.05, 'max_grad_norm': 1.0, 'seed': 42, 'chat_template': '{%- if tools %}\n {{- \'<|im_start|>system\n\' }}\n {%- if messages[0][\'role\'] == \'system\' %}\n {{- messages[0][\'content\'] }}\n {%- else %}\n {{- \'You are a helpful assistant.\' }}\n {%- endif %}\n {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within XML tags:\n" }}\n {%- for tool in tools %}\n {{- "\n" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- "\n\n\nFor each function call, return a json object with function name and arguments within XML tags:\n\n{\"name\": , \"arguments\": }\n<|im_end|>\n" }}\n{%- else %}\n {%- if messages[0][\'role\'] == \'system\' %}\n {{- \'<|im_start|>system\n\' + messages[0][\'content\'] + \'<|im_end|>\n\' }}\n {%- else %}\n {{- \'<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n\' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}\n {{- \'<|im_start|>\' + message.role + \'\n\' + message.content + \'<|im_end|>\' + \'\n\' }}\n {%- elif message.role == "assistant" %}\n {{- \'<|im_start|>\' + message.role }}\n {%- if message.content %}\n {{- \'\n\' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- \'\n\n{"name": "\' }}\n {{- tool_call.name }}\n {{- \'", "arguments": \' }}\n {{- tool_call.arguments | tojson }}\n {{- \'}\n\' }}\n {%- endfor %}\n {{- \'<|im_end|>\n\' }}\n {%- elif message.role == "tool" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}\n {{- \'<|im_start|>user\' }}\n {%- endif %}\n {{- \'\n\n\' }}\n {{- message.content }}\n {{- \'\n\' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}\n {{- \'<|im_end|>\n\' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- \'<|im_start|>assistant\n\' }}\n{%- endif %}\n', 'quantization': 'int4', 'target_modules': 'all-linear', 'merge_adapter': True, 'peft': True, 'lora_r': 16, 'lora_alpha': 64, 'lora_dropout': 0.05, 'model_ref': None, 'dpo_beta': 0.1, 'max_prompt_length': 8192, 'max_completion_length': 4096, 'prompt_text_column': 'prompt', 'text_column': 'chosen', 'rejected_text_column': 'rejected', 'push_to_hub': True, 'username': 
'xxxxxx', 'token': '*****', 'unsloth': False, 'distributed_backend': 'ddp'}
INFO | 2024-11-09 16:40:30 | autotrain.app.training_api:lifespan:82 - Started training with PID 3619
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:7860 (Press CTRL+C to quit)
The following values were not passed to accelerate launch and had defaults used instead: --dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
[2024-11-09 16:40:34,843] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO | 2024-11-09 16:40:35 | autotrain.trainers.clm.train_clm_orpo:train:11 - Starting ORPO training...

Generating train split: 100%|██████████| 44245/44245 [00:00<00:00, 78070.03 examples/s]
INFO | 2024-11-09 16:40:41 | autotrain.trainers.clm.utils:process_input_data:546 - Train data: Dataset({ features: ['source', 'chosen', 'rejected', 'prompt', 'question'], num_rows: 44245 })
INFO | 2024-11-09 16:40:41 | autotrain.trainers.clm.utils:process_input_data:547 - Valid data: None
INFO | 2024-11-09 16:40:53 | autotrain.trainers.clm.utils:configure_logging_steps:667 - configuring logging steps
INFO | 2024-11-09 16:40:53 | autotrain.trainers.clm.utils:configure_logging_steps:680 - Logging steps: 25
INFO | 2024-11-09 16:40:53 | autotrain.trainers.clm.utils:configure_training_args:719 - configuring training args
INFO | 2024-11-09 16:40:53 | autotrain.trainers.clm.utils:configure_block_size:797 - Using block size 2048
INFO | 2024-11-09 16:40:54 | autotrain.trainers.clm.utils:get_model:873 - Can use unsloth: False
WARNING | 2024-11-09 16:40:54 | autotrain.trainers.clm.utils:get_model:915 - Unsloth not available, continuing without it...
INFO | 2024-11-09 16:40:54 | autotrain.trainers.clm.utils:get_model:917 - loading model config...
INFO | 2024-11-09 16:40:54 | autotrain.trainers.clm.utils:get_model:925 - loading model...
low_cpu_mem_usage was None, now default to True since model is quantized.
INFO | 2024-11-09 16:41:52 | autotrain.backends.nvcf:_poll_nvcf:119 - xxxxx-autotrain-ihh1b-19lgj: GET - 202 - Polling reqId for completion
INFO | 2024-11-09 16:42:27 | autotrain.backends.nvcf:_poll_nvcf:119 - xxxxxx-autotrain-ihh1b-19lgj: GET - 202 - Polling reqId for completion
INFO | 2024-11-09 16:42:47 | autotrain.backends.nvcf:_poll_nvcf:119 - xxxxxx-autotrain-ihh1b-19lgj: GET - 200 - Polling completed
INFO | 2024-11-09 16:40:30 | autotrain.app.training_api::95 - AUTOTRAIN_USERNAME: XXXXXX
INFO | 2024-11-09 16:40:30 | autotrain.app.training_api::96 - PROJECT_NAME: autotrain-ihh1b-19lgj
INFO | 2024-11-09 16:40:30 | autotrain.app.training_api::97 - TASK_ID: 9
INFO | 2024-11-09 16:40:30 | autotrain.app.training_api::98 - DATA_PATH: mlabonne/orpo-dpo-mix-40k
INFO | 2024-11-09 16:40:30 | autotrain.app.training_api::99 - MODEL: Qwen/Qwen2.5-7B

Downloading shards: 100%|██████████| 4/4 [01:12<00:00, 18.09s/it]
The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use attn_implementation="flash_attention_2" instead.

Loading checkpoint shards: 100%|██████████| 4/4 [00:12<00:00, 3.03s/it]
INFO | 2024-11-09 16:42:21 | autotrain.trainers.clm.utils:get_model:956 - model dtype: torch.float16
INFO | 2024-11-09 16:42:22 | autotrain.trainers.clm.train_clm_orpo:train:39 - creating trainer
ERROR | 2024-11-09 16:42:22 | autotrain.trainers.common:wrapper:215 - train has failed due to an exception: Traceback (most recent call last):
  File "/app/src/autotrain/trainers/common.py", line 212, in wrapper
    return func(*args, **kwargs)
  File "/app/src/autotrain/trainers/clm/__main__.py", line 43, in train
    train_orpo(config)
  File "/app/src/autotrain/trainers/clm/train_clm_orpo.py", line 47, in train
    trainer = ORPOTrainer(
TypeError: ORPOTrainer.__init__() got an unexpected keyword argument 'tokenizer'
ERROR | 2024-11-09 16:42:22 | autotrain.trainers.common:wrapper:216 - ORPOTrainer.__init__() got an unexpected keyword argument 'tokenizer'
INFO | 2024-11-09 16:42:30 | autotrain.app.utils:get_running_jobs:40 - Killing PID: 3619
INFO | 2024-11-09 16:42:30 | autotrain.app.utils:kill_process_by_pid:90 - Sent SIGTERM to process with PID 3619
INFO | 2024-11-09 16:42:30 | autotrain.app.training_api:run_main:56 - No running jobs found. Shutting down the server.
INFO | 2024-11-09 16:42:30 | autotrain.app.training_api:graceful_exit:35 - SIGTERM received. Performing cleanup...
ERROR: Traceback (most recent call last):
  File "/app/env/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/app/env/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/app/env/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/app/env/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/app/env/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/app/src/autotrain/app/training_api.py", line 57, in run_main
    kill_process_by_pid(os.getpid())
  File "/app/src/autotrain/app/utils.py", line 89, in kill_process_by_pid
    os.kill(pid, signal.SIGTERM)
  File "/app/src/autotrain/app/training_api.py", line 36, in graceful_exit
    sys.exit(0)
SystemExit: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/env/lib/python3.10/site-packages/starlette/routing.py", line 700, in lifespan
    await receive()
  File "/app/env/lib/python3.10/site-packages/uvicorn/lifespan/on.py", line 137, in receive
    return await self.receive_queue.get()
  File "/app/env/lib/python3.10/asyncio/queues.py", line 159, in get
    await getter
asyncio.exceptions.CancelledError

Task exception was never retrieved
future: <Task finished name='Task-3' coro=<BackgroundRunner.run_main() done, defined at /app/src/autotrain/app/training_api.py:52> exception=SystemExit(0)>
Traceback (most recent call last):
  File "/app/env/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/app/env/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/app/env/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/app/env/lib/python3.10/site-packages/uvicorn/main.py", line 412, in main
    run(
  File "/app/env/lib/python3.10/site-packages/uvicorn/main.py", line 579, in run
    server.run()
  File "/app/env/lib/python3.10/site-packages/uvicorn/server.py", line 65, in run
    return asyncio.run(self.serve(sockets=sockets))
  File "/app/env/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/app/env/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/app/env/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/app/env/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/app/env/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/app/src/autotrain/app/training_api.py", line 57, in run_main
    kill_process_by_pid(os.getpid())
  File "/app/src/autotrain/app/utils.py", line 89, in kill_process_by_pid
    os.kill(pid, signal.SIGTERM)
  File "/app/src/autotrain/app/training_api.py", line 36, in graceful_exit
    sys.exit(0)
SystemExit: 0

Additional Information

It initially worked: the trainer ran for over 9 hours and reached 95% progress across all 7 epochs, but then it stopped pushing new commits to the repo. I waited 4 hours and then decided to restart the Space.

After that, I tried to replicate my initial training with the same parameters and it never worked, always failing with the same error about the tokenizer argument. Since then I have been trying to make it work by tweaking hyperparameters, models, datasets, GPUs, etc., and it never worked.

abhishekkrthakur commented 1 week ago

fixed in version 0.8.30 and above. thank you for reporting it.

jmparejaz commented 1 week ago

It is breaking with the same error :'( (screenshots attached: autotrain_error1, autotrain_error2)

abhishekkrthakur commented 1 week ago

you need to factory reset your space to make sure you are on the latest version

jmparejaz commented 1 week ago

Yeah, I did that, and tested it multiple times before posting here. I even changed my browser to avoid any cache or cookies or whatever... I highlighted in the first image of my previous message that it is installing version 0.8.31.dev8 of AutoTrain Advanced. But anyway, I am going to factory reset again.
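(For anyone else double-checking which build actually got installed after a reset, a quick sketch of how to read the installed version from inside the Space; this assumes the PyPI distribution name is autotrain-advanced:)

```python
# Print the installed AutoTrain Advanced version (distribution name assumed).
from importlib.metadata import version

print(version("autotrain-advanced"))
```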

abhishekkrthakur commented 1 week ago

trying on my end too as we speak.

abhishekkrthakur commented 1 week ago

oh wait, you are running on DGX Cloud?

jmparejaz commented 1 week ago

AutoTrain with DGX Cloud, using the HF Space image.

abhishekkrthakur commented 1 week ago

ahh... something happened and the latest version was not deployed on DGX Cloud. i've created another deployment just now and it should start working after ~45-60 minutes. it seems i overlooked the screenshot you posted in the OP. apologies for the inconvenience. it will work after the aforementioned time. i've tested it in spaces.

jmparejaz commented 1 week ago

Don't worry Abhishek, thanks for your support! AutoTrain with DGX is amazing!

jmparejaz commented 1 week ago

It is working now with DGX, but metrics are not being reported to TensorBoard (screenshot attached).

It finished the training, but without metrics logging it is like walking in a dark forest :/ (screenshots attached)

abhishekkrthakur commented 1 week ago

sometimes the tensorboard tab takes a while to load. i see you have a "runs" folder so metrics are available. i'll let the team know if this persists.

jmparejaz commented 1 week ago

It persists... TensorBoard is not loading any metrics. I think the metrics are not being written to the runs folder; please check the event file: events.out.tfevents.1731348305.0-sr-06e8e06b-d870-4c1a-b787-fa215ea35067.1973.0.zip
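In case it helps to reproduce, this is how the event file can be checked for scalar entries (a sketch; the path is a placeholder for the unzipped events.out.tfevents.* file from the runs folder):

```python
# Sketch: list the scalar tags stored in a TensorBoard event file.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

event_file = "runs/events.out.tfevents.1731348305.placeholder"  # placeholder path
acc = EventAccumulator(event_file)
acc.Reload()

scalar_tags = acc.Tags().get("scalars", [])
print("scalar tags:", scalar_tags)  # an empty list means no metrics were ever written
for tag in scalar_tags:
    for event in acc.Scalars(tag)[:3]:
        print(tag, "step", event.step, "value", event.value)
```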

In the past I tried to use the argument "report_to": "wandb" but it didn't work. I set the WANDB_API_KEY, WANDB_LOG_MODEL and WANDB_PROJECT secret vars. Is there any way to use the AutoTrain UI with W&B?
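(A hedged sketch of what I mean, not a confirmed feature: the training config logged above has a 'log' field set to 'tensorboard', so I assume switching that field to 'wandb' plus the usual W&B environment variables would be the intended route, but I have not been able to verify it.)

```python
# Hedged sketch, not a confirmed AutoTrain feature: flip the 'log' field of the
# generated training config to 'wandb' and provide the usual W&B credentials.
# The file path mirrors the launch command in the logs above; treating 'wandb'
# as an accepted value for 'log' is an assumption on my side.
import json
import os

os.environ["WANDB_API_KEY"] = "<your-api-key>"   # same secrets mentioned above
os.environ["WANDB_PROJECT"] = "autotrain-orpo"

config_path = "autotrain-ihh1b-19lgj/training_params.json"
with open(config_path) as f:
    params = json.load(f)

params["log"] = "wandb"  # assumption: maps through to the trainer's report_to

with open(config_path, "w") as f:
    json.dump(params, f, indent=2)
```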

jmparejaz commented 1 week ago

I am closing this issue and creating a new one for the TensorBoard problem.