huggingface / autotrain-advanced

🤗 AutoTrain Advanced
https://huggingface.co/autotrain
Apache License 2.0

[BUG] Error with validation data in !autotrain llm command #591

Closed · EdenBD closed 5 months ago

EdenBD commented 5 months ago

Backend

Colab

Interface Used

CLI

CLI Command

I ran the following CLI command. The `--valid-split` argument is ignored and always stays None:

%time !autotrain llm \
    --train \
    --model ${MODEL_NAME} \
    --project-name ${PROJECT_NAME} \
    --data-path data/ \
    --text-column text \
    --lr ${LEARNING_RATE} \
    --batch-size ${BATCH_SIZE} \
    --epochs ${NUM_EPOCHS} \
    --block-size ${BLOCK_SIZE} \
    --warmup-ratio ${WARMUP_RATIO} \
    --lora-r ${LORA_R} \
    --lora-alpha ${LORA_ALPHA} \
    --lora-dropout ${LORA_DROPOUT} \
    --weight-decay ${WEIGHT_DECAY} \
    --gradient-accumulation ${GRADIENT_ACCUMULATION} \
    --quantization ${QUANTIZATION} \
    --mixed-precision ${MIXED_PRECISION} \
    --valid-split "validation" \
    --log ${LOG} \
    --logging-steps ${LOGGING_STEPS} \
    $( [[ "$PEFT" == "True" ]] && echo "--peft" ) \
    $( [[ "$PUSH_TO_HUB" == "True" ]] && echo "--push-to-hub --token ${HF_TOKEN} --repo-id ${REPO_ID}" )
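
Since `--data-path` points at a local directory, `--valid-split "validation"` presumably has to resolve to a matching file inside it. A minimal pre-flight sketch to list what is actually on disk (a hypothetical helper, not part of autotrain; the csv/jsonl extensions are an assumption):

```python
from pathlib import Path

# Hypothetical pre-flight check, not part of autotrain: list the split
# files that --data-path could plausibly resolve for each split name.
data_dir = Path("data")
for split in ("train", "validation"):
    matches = [data_dir / f"{split}.{ext}" for ext in ("csv", "jsonl")
               if (data_dir / f"{split}.{ext}").exists()]
    print(split, "->", [str(p) for p in matches] or "no matching file found")
```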

UI Screenshots & Parameters

The `--valid-split` value is always reset to None, and `valid_data` is always None (printed from app_params.py).
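
One quick way to confirm this is to inspect the config file the launcher writes (a hypothetical check; the path is taken from the `launch_command` log line below):

```python
import json

# Inspect the config written by the launcher (path taken from the
# launch_command log line below); valid_split comes out as null even
# though --valid-split "validation" was passed on the command line.
with open("my-autotrain-llm/training_params.json") as f:
    params = json.load(f)

print(params["valid_split"])  # -> None
```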

Running the above CLI command prints:

INFO     | 2024-04-21 22:58:13 | autotrain.cli.run_llm:run:345 - Running LLM
WARNING  | 2024-04-21 22:58:13 | autotrain.trainers.common:__init__:176 - Parameters supplied but not used: deploy, train, version, inference, backend, func
Saving the dataset (1/1 shards): 100% 15/15 [00:00<00:00, 5507.23 examples/s]
Saving the dataset (1/1 shards): 100% 15/15 [00:00<00:00, 7495.18 examples/s]
INFO     | 2024-04-21 22:58:13 | autotrain.backend:create:300 - Starting local training...
INFO     | 2024-04-21 22:58:13 | autotrain.commands:launch_command:338 - ['accelerate', 'launch', '--num_machines', '1', '--num_processes', '1', '--mixed_precision', 'fp16', '-m', 'autotrain.trainers.clm', '--training_config', 'my-autotrain-llm/training_params.json']
INFO     | 2024-04-21 22:58:13 | autotrain.commands:launch_command:339 - {'model': 'mistralai/Mistral-7B-v0.1', 'project_name': 'my-autotrain-llm', 'data_path': 'my-autotrain-llm/autotrain-data', 'train_split': 'train', 'valid_split': None, 'add_eos_token': False, 'block_size': 1024, 'model_max_length': 1024, 'padding': None, 'trainer': 'default', 'use_flash_attention_2': False, 'log': 'tensorboard', 'disable_gradient_checkpointing': False, 'logging_steps': 5, 'evaluation_strategy': 'epoch', 'save_total_limit': 1, 'save_strategy': 'epoch', 'auto_find_batch_size': False, 'mixed_precision': 'fp16', 'lr': 0.0002, 'epochs': 2, 'batch_size': 4, 'warmup_ratio': 0.1, 'gradient_accumulation': 4, 'optimizer': 'adamw_torch', 'scheduler': 'linear', 'weight_decay': 0.01, 'max_grad_norm': 1.0, 'seed': 42, 'chat_template': None, 'quantization': 'int4', 'target_modules': None, 'merge_adapter': False, 'peft': True, 'lora_r': 16, 'lora_alpha': 32, 'lora_dropout': 0.05, 'model_ref': None, 'dpo_beta': 0.1, 'max_prompt_length': 128, 'max_completion_length': None, 'prompt_text_column': 'autotrain_prompt', 'text_column': 'autotrain_text', 'rejected_text_column': 'autotrain_rejected_text', 'push_to_hub': True, 'repo_id': 'DiNdIn007/autotrain_model_check3', 'username': None, 'token': '*****'}
The following values were not passed to `accelerate launch` and had defaults used instead:
    `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
INFO     | 2024-04-21 22:58:26 | __main__:process_input_data:76 - loading dataset from disk
INFO     | 2024-04-21 22:58:26 | __main__:process_input_data:117 - Train data: Dataset({
    features: ['autotrain_text'],
    num_rows: 15
})
INFO     | 2024-04-21 22:58:26 | __main__:process_input_data:118 - Valid data: None
INFO     | 2024-04-21 22:58:27 | __main__:train:206 - creating training arguments...
INFO     | 2024-04-21 22:58:27 | __main__:train:220 - Logging steps: 5
INFO     | 2024-04-21 22:58:27 | __main__:train:269 - loading model config...
INFO     | 2024-04-21 22:58:27 | __main__:train:281 - loading model...
`low_cpu_mem_usage` was None, now set to True since model is quantized.
Loading checkpoint shards: 100% 2/2 [01:28<00:00, 44.00s/it]
INFO     | 2024-04-21 22:59:58 | __main__:train:349 - model dtype: torch.float16
INFO     | 2024-04-21 22:59:58 | __main__:train:357 - preparing peft model...
INFO     | 2024-04-21 22:59:58 | __main__:train:415 - Using block size 1024

Error Logs

No error message, but the validation data is set to None:

INFO     | 2024-04-21 22:58:26 | __main__:process_input_data:118 - Valid data: None

Additional Information

How can we include validation data?

I believe we need to update the `valid_split` CLI argument processing, but I'm not sure whether I'm missing a different way of doing it. From the `llm_munge_data` function in utils.py:

    if params.valid_split is not None:
        # Builds the expected on-disk path of the validation file,
        # e.g. data/validation.csv when valid_split == "validation".
        valid_data_path = f"{params.data_path}/{params.valid_split}.{ext_to_use}"
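
For what it's worth, here is a minimal sketch of how that path construction could silently drop the split (a hypothetical reconstruction, not the actual autotrain source; the `resolve_valid_data_path` name and the `os.path.exists` guard are assumptions):

```python
import os

def resolve_valid_data_path(data_path, valid_split, ext_to_use="csv"):
    # Hypothetical reconstruction, not autotrain's actual code.
    if valid_split is None:
        return None
    valid_data_path = f"{data_path}/{valid_split}.{ext_to_use}"
    # If a guard like this rejects a missing file, valid_split would be
    # silently reset to None, which would match the logs above.
    return valid_data_path if os.path.exists(valid_data_path) else None

print(resolve_valid_data_path("data", "validation"))  # None if data/validation.csv is absent
```
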
abhishekkrthakur commented 5 months ago

Duplicate of #462.