huggingface / autotrain-advanced

🤗 AutoTrain Advanced
https://huggingface.co/autotrain
Apache License 2.0

[BUG] AttributeError: 'NoneType' object has no attribute 'map' #619

Closed asifehmad closed 2 months ago

asifehmad commented 2 months ago

Prerequisites

Backend

Local

Interface Used

CLI

CLI Command

autotrain --config /llama3-8b-orpo.yml

Where llama3-8b-orpo.yml:

task: llm
base_model: meta-llama/Meta-Llama-3-8B-Instruct
project_name: llama3-8b-orpo
log: tensorboard
backend: local-cli

data:
  # path can also be a local folder.
  # if a local folder is provided, the training and validation files
  # must be named "train.csv" and "valid.csv" respectively or
  # "train.jsonl" and "valid.jsonl" respectively.
  # validation split will be ignored for llm training.
  path: argilla/distilabel-capybara-dpo-7k-binarized
  train_split: train
  valid_split: train
  chat_template: chatml
  column_mapping:
    text_column: chosen
    rejected_text_column: rejected

params:
  trainer: orpo
  block_size: 1024
  model_max_length: 8192
  max_prompt_length: 512
  epochs: 3
  batch_size: 2
  lr: 3e-5
  peft: true
  quantization: null
  target_modules: all-linear
  padding: right
  optimizer: adamw_torch
  scheduler: linear
  gradient_accumulation: 4
  mixed_precision: bf16

hub:
  username: ${HF_USERNAME}
  token: ${HF_TOKEN}
  push_to_hub: False

UI Screenshots & Parameters

No response

Error Logs

Map: 100%|███████████████████████████████████████████████████████████████████| 7563/7563 [00:02<00:00, 3128.59 examples/s]
ERROR    | 2024-05-04 21:33:56 | autotrain.trainers.common:wrapper:120 - train has failed due to an exception: Traceback (most recent call last):
  File "/autotrain-advanced/src/autotrain/trainers/common.py", line 117, in wrapper
    return func(*args, **kwargs)
  File "/autotrain-advanced/src/autotrain/trainers/clm/__main__.py", line 43, in train
    train_orpo(config)
  File "/autotrain-advanced/src/autotrain/trainers/clm/train_clm_orpo.py", line 19, in train
    train_data, valid_data = utils.process_data_with_chat_template(config, tokenizer, train_data, valid_data)
  File "/autotrain-advanced/src/autotrain/trainers/clm/utils.py", line 412, in process_data_with_chat_template
    valid_data = valid_data.map(
AttributeError: 'NoneType' object has no attribute 'map'

ERROR    | 2024-05-04 21:33:56 | autotrain.trainers.common:wrapper:121 - 'NoneType' object has no attribute 'map'

Additional Information

Hey, Abhishek! Hope you are doing well. I was trying to reproduce your results, but first ran into this validation split error:

ValueError: Unknown split "valid". Should be one of ['train'].

ERROR    | 2024-05-04 21:19:24 | autotrain.trainers.common:wrapper:121 - Unknown split "valid". Should be one of ['train']. 

To work around this, I set valid_split: train just to check. That got past the split error, but then I hit the map error pasted above. Please take a look and suggest a fix, thanks!
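For readers unfamiliar with this class of error: the traceback shows `.map` being called on a `valid_data` that is `None`. The following is only a minimal sketch of the usual guard for an optional validation set, not AutoTrain's actual code. `TinyDataset`, `process_splits`, and `mark_seen` are hypothetical names invented for this illustration.

```python
from typing import Callable, Iterable, Optional, Tuple


class TinyDataset:
    """Hypothetical stand-in for a dataset object that exposes .map()."""

    def __init__(self, rows: Iterable[dict]):
        self.rows = list(rows)

    def map(self, fn: Callable[[dict], dict]) -> "TinyDataset":
        # Apply fn to every row and return a new dataset.
        return TinyDataset(fn(row) for row in self.rows)


def process_splits(
    train: TinyDataset,
    valid: Optional[TinyDataset],
    fn: Callable[[dict], dict],
) -> Tuple[TinyDataset, Optional[TinyDataset]]:
    """Map fn over the training split and, only if present, the validation split."""
    train = train.map(fn)
    # Without this check, valid.map(...) on a missing split raises
    # AttributeError: 'NoneType' object has no attribute 'map'.
    if valid is not None:
        valid = valid.map(fn)
    return train, valid


def mark_seen(row: dict) -> dict:
    # Hypothetical per-row transform, standing in for chat-template formatting.
    return {**row, "seen": True}


train_out, valid_out = process_splits(TinyDataset([{"chosen": "a"}]), None, mark_seen)
```

Here the missing validation split simply passes through as `None` instead of raising.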

abhishekkrthakur commented 2 months ago

Please set valid_split to null. I'll fix the config. Sorry about that.

asifehmad commented 2 months ago

True, it worked! Thanks for the quick response.

I have some other questions. Does this only apply to datasets in this format? I have my own dataset in a .csv file with just two columns, prompt and response. How would you suggest training Llama 3 on it with AutoTrain?

One more: where can I set the sequence length? I am getting this kind of warning. Thanks again!

abhishekkrthakur commented 2 months ago

Some examples and docs are available here: hf.co/docs/autotrain. If your questions still remain unanswered, please let me know and I'm happy to help. :)

asifehmad commented 2 months ago

This doc is very helpful, thanks a lot!