huggingface / autotrain-advanced

🤗 AutoTrain Advanced
https://huggingface.co/autotrain
Apache License 2.0

[BUG] Duplicate flag generation: __main__.py: error: unrecognized arguments: --mixed_precision bf16 -m autotrain.trainers.clm #797

Open unclemusclez opened 1 month ago

unclemusclez commented 1 month ago

Prerequisites

Backend

Local

Interface Used

CLI

CLI Command

autotrain app --host 0.0.0.0 --port 7000

UI Screenshots & Parameters

No response

Error Logs

__main__.py: error: unrecognized arguments: --mixed_precision bf16 -m autotrain.trainers.clm --mixed_precision bf16 -m autotrain.trainers.clm --mixed_precision fp16 -m autotrain.trainers.clm --mixed_precision fp16 -m autotrain.trainers.clm

INFO     | 2024-10-19 23:01:18 | autotrain.commands:launch_command:524 - {'model': 'unsloth/Qwen2.5-Coder-7B-Instruct', 'project_name': 'autotrain-126tb-pvpyu4', 'data_path': 'skratos115/opendevin_DataDevinator', 'train_split': 'train', 'valid_split': None, 'add_eos_token': True, 'block_size': 2048, 'model_max_length': 2048, 'padding': 'right', 'trainer': 'sft', 'use_flash_attention_2': False, 'log': 'tensorboard', 'disable_gradient_checkpointing': False, 'logging_steps': -1, 'eval_strategy': 'epoch', 'save_total_limit': 1, 'auto_find_batch_size': False, 'mixed_precision': 'fp16', 'lr': 1e-06, 'epochs': 1, 'batch_size': 1, 'warmup_ratio': 0.1, 'gradient_accumulation': 4, 'optimizer': 'adamw_torch', 'scheduler': 'linear', 'weight_decay': 0.0, 'max_grad_norm': 1.0, 'seed': 42, 'chat_template': 'none', 'quantization': 'int4', 'target_modules': 'all-linear', 'merge_adapter': False, 'peft': True, 'lora_r': 16, 'lora_alpha': 32, 'lora_dropout': 0.05, 'model_ref': None, 'dpo_beta': 0.1, 'max_prompt_length': 128, 'max_completion_length': None, 'prompt_text_column': 'prompt', 'text_column': 'text', 'rejected_text_column': 'rejected_text', 'push_to_hub': True, 'username': 'unclemusclez', 'token': '*****', 'unsloth': True, 'distributed_backend': 'none'}
INFO     | 2024-10-19 23:01:18 | autotrain.backends.local:create:25 - Training PID: 57326
INFO:     192.168.2.69:65250 - "POST /ui/create_project HTTP/1.1" 200 OK
INFO:     192.168.2.69:65250 - "GET /ui/is_model_training HTTP/1.1" 200 OK
INFO:     192.168.2.69:65250 - "GET /ui/is_model_training HTTP/1.1" 200 OK
INFO:     192.168.2.69:65250 - "GET /ui/accelerators HTTP/1.1" 200 OK
usage: __main__.py [-h] --training_config TRAINING_CONFIG
__main__.py: error: unrecognized arguments: --mixed_precision bf16 -m autotrain.trainers.clm --mixed_precision bf16 -m autotrain.trainers.clm --mixed_precision fp16 -m autotrain.trainers.clm --mixed_precision fp16 -m autotrain.trainers.clm
Traceback (most recent call last):
  File "/usr/local/open-webui/.venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/open-webui/.venv/lib/python3.12/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/open-webui/.venv/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1174, in launch_command
    simple_launcher(args)
  File "/usr/local/open-webui/.venv/lib/python3.12/site-packages/accelerate/commands/launch.py", line 769, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/local/open-webui/.venv/bin/python', '-m', 'autotrain.trainers.clm', '--training_config', 'autotrain-126tb-pvpyu3/training_params.json', '--mixed_precision', 'bf16', '-m', 'autotrain.trainers.clm', '--training_config', 'autotrain-126tb-pvpyu3/training_params.json', '--mixed_precision', 'bf16', '-m', 'autotrain.trainers.clm', '--training_config', 'autotrain-126tb-pvpyu3/training_params.json', '--mixed_precision', 'fp16', '-m', 'autotrain.trainers.clm', '--training_config', 'autotrain-126tb-pvpyu4/training_params.json', '--mixed_precision', 'fp16', '-m', 'autotrain.trainers.clm', '--training_config', 'autotrain-126tb-pvpyu4/training_params.json']' returned non-zero exit status 2.
INFO:     192.168.2.69:65250 - "GET /ui/is_model_training HTTP/1.1" 200 OK
INFO     | 2024-10-19 23:01:34 | autotrain.app.utils:get_running_jobs:40 - Killing PID: 57326
INFO     | 2024-10-19 23:01:34 | autotrain.app.utils:kill_process_by_pid:90 - Sent SIGTERM to process with PID 57326

Additional Information

Running the Local backend, and it seems to double up the flags, appending another copy every time training is run.
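The duplicated `--mixed_precision ... -m autotrain.trainers.clm` pairs in the failing command suggest the launcher is appending to the same argument list across launches instead of building a fresh one. A minimal sketch of that failure mode (the names `BASE_CMD` and `launch_command_buggy` are hypothetical, not AutoTrain's actual code):

```python
# Hypothetical reproduction of the symptom: a module-level list that every
# launch extends in place, so retries accumulate one extra copy of the flags.
BASE_CMD = ["accelerate", "launch"]  # shared state surviving between launches (the bug)

def launch_command_buggy(precision: str) -> list[str]:
    # extend() mutates the shared list; nothing resets it between jobs
    BASE_CMD.extend(["--mixed_precision", precision, "-m", "autotrain.trainers.clm"])
    return list(BASE_CMD)

first = launch_command_buggy("bf16")
second = launch_command_buggy("fp16")  # now carries both the bf16 and fp16 flags
print(second.count("-m"))  # → 2: the trainer module flag appears twice
```

This matches the log above, where each failed-then-retried job adds one more `--mixed_precision ... -m autotrain.trainers.clm` pair, alternating bf16/fp16 as the settings change between attempts.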

abhishekkrthakur commented 1 month ago

is it an issue with the latest version? 🤔

unclemusclez commented 1 month ago

This occurs when a job fails and you try to run it again in the same running instance. At the moment, the only workaround I have found is to shut the application down and start it back up.

This was on a recent origin/main, but I noticed the same issue a week or so ago with a non-updated version as well.

It also seems to be specific to SFT training.
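The restart-fixes-it workaround points at stale in-process state. A sketch of the corresponding fix, assuming the bug is the shared-list pattern above (again with hypothetical names, not AutoTrain's real `launch_command`):

```python
# Build the command from a fresh copy on every launch, so a failed or
# retried job can never leak its flags into the next one.
BASE_CMD = ["accelerate", "launch"]

def launch_command_fixed(precision: str) -> list[str]:
    cmd = list(BASE_CMD)  # fresh copy per call; BASE_CMD itself is never mutated
    cmd += ["--mixed_precision", precision, "-m", "autotrain.trainers.clm"]
    return cmd

# Repeated launches stay clean regardless of how many jobs ran before:
assert launch_command_fixed("bf16").count("-m") == 1
assert launch_command_fixed("fp16").count("-m") == 1
```

Whether AutoTrain's actual `launch_command` builder works this way would need checking in `autotrain/commands.py`, where the log shows the command being assembled (`autotrain.commands:launch_command:524`).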

github-actions[bot] commented 3 days ago

This issue is stale because it has been open for 30 days with no activity.