locuslab / tofu

Landing Page for TOFU

Unable to load the dataset from HuggingFace hub, throws a ValueError #1

Closed: archit31uniyal closed this issue 9 months ago

archit31uniyal commented 10 months ago

I have been trying to fine-tune the llama2-7B model following the instructions provided in the repository, and it throws the following ValueError while loading the TOFU dataset.

Error executing job with overrides: ['split=full', 'batch_size=4', 'gradient_accumulation_steps=4', 'model_family=llama2-7b', 'lr=1e-5']
Traceback (most recent call last):
  File "/p/compressionleakage/llm_privacy/tofu/finetune.py", line 137, in <module>
    main()
  File "/p/compressionleakage/.conda/envs/tofu/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/p/compressionleakage/.conda/envs/tofu/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/p/compressionleakage/.conda/envs/tofu/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/p/compressionleakage/.conda/envs/tofu/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/p/compressionleakage/.conda/envs/tofu/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/p/compressionleakage/.conda/envs/tofu/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/p/compressionleakage/.conda/envs/tofu/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/p/compressionleakage/.conda/envs/tofu/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/p/compressionleakage/.conda/envs/tofu/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/p/compressionleakage/llm_privacy/tofu/finetune.py", line 62, in main
    torch_format_dataset = TextDatasetQA(cfg.data_path, tokenizer=tokenizer, model_family = cfg.model_family, max_length=max_length, split=cfg.split)
  File "/p/compressionleakage/llm_privacy/tofu/data_module.py", line 118, in __init__
    self.data = datasets.load_dataset(data_path, split)["train"]
  File "/u/deu9yh/.local/lib/python3.10/site-packages/datasets/load.py", line 1687, in load_dataset
    builder_instance.download_and_prepare(
  File "/u/deu9yh/.local/lib/python3.10/site-packages/datasets/builder.py", line 605, in download_and_prepare
    self._download_and_prepare(
  File "/u/deu9yh/.local/lib/python3.10/site-packages/datasets/builder.py", line 694, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/u/deu9yh/.local/lib/python3.10/site-packages/datasets/builder.py", line 1154, in _prepare_split
    writer.write_table(table)
  File "/u/deu9yh/.local/lib/python3.10/site-packages/datasets/arrow_writer.py", line 508, in write_table
    pa_table = table_cast(pa_table, self._schema)
  File "/u/deu9yh/.local/lib/python3.10/site-packages/datasets/table.py", line 1858, in table_cast
    return cast_table_to_schema(table, schema)
  File "/u/deu9yh/.local/lib/python3.10/site-packages/datasets/table.py", line 1840, in cast_table_to_schema
    raise ValueError(f"Couldn't cast\n{table.schema}\nto\n{features}\nbecause column names don't match")
ValueError: Couldn't cast
question: string
answer: string
paraphrased_answer: string
perturbed_answer: list<item: string>
  child 0, item: string
paraphrased_question: string
to
{'question': Value(dtype='string', id=None), 'answer': Value(dtype='string', id=None)}
because column names don't match
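
To isolate the failure, the call from data_module.py can be reproduced on its own (a minimal sketch; the dataset path "locuslab/TOFU" is my assumption for cfg.data_path, matching this repository):

import datasets

# Values as resolved per the traceback: cfg.split comes from the `split=full` override
data_path = "locuslab/TOFU"  # assumed value of cfg.data_path
split = "full"
data = datasets.load_dataset(data_path, split)["train"]  # raises the ValueError above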

This could be a bug in the code or a problem on the HuggingFace side. The issue needs to be investigated to ensure smooth execution of the codebase.

Thank you.

pratyushmaini commented 10 months ago

Hi Archit,

Thank you for your interest in our work, and apologies for the delay in getting back to you. It looks like you are passing "full" as the "split" argument. However, "full" is the name of the subset (configuration), not a split. Can you confirm that you are loading the dataset the way it is specified in the README?

from datasets import load_dataset

# "full" is the subset (configuration) name, not a split
dataset = load_dataset("locuslab/TOFU", "full")
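
If you also need a specific split after selecting the subset, you can pass it explicitly (a sketch; the TOFU subsets expose a "train" split, which is what data_module.py indexes in the traceback above):

from datasets import load_dataset

# "full" selects the subset; split="train" selects the split within it
dataset = load_dataset("locuslab/TOFU", "full", split="train")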