jondurbin / bagel

A bagel, with everything.

Error when loading dataset #4

Closed by YixinSong-e 10 months ago

YixinSong-e commented 10 months ago
File "/mnt/lustre/share_data/songyixin/bagel/bagel/tune/sft.py", line 712, in train
    data_module = make_data_module(tokenizer=tokenizer, args=args)
  File "/mnt/lustre/share_data/songyixin/bagel/bagel/tune/sft.py", line 632, in make_data_module
    dataset = Dataset.from_parquet(args.dataset, test_size=args.eval_dataset_size)
  File "/mnt/petrelfs/songyixin/miniconda3/envs/lmf/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 1179, in from_parquet
    return ParquetDatasetReader(
  File "/mnt/petrelfs/songyixin/miniconda3/envs/lmf/lib/python3.10/site-packages/datasets/io/parquet.py", line 76, in __init__
    self.builder = Parquet(
  File "/mnt/petrelfs/songyixin/miniconda3/envs/lmf/lib/python3.10/site-packages/datasets/builder.py", line 373, in __init__
    self.config, self.config_id = self._create_builder_config(
  File "/mnt/petrelfs/songyixin/miniconda3/envs/lmf/lib/python3.10/site-packages/datasets/builder.py", line 553, in _create_builder_config
    builder_config = self.BUILDER_CONFIG_CLASS(**config_kwargs)
TypeError: ParquetConfig.__init__() got an unexpected keyword argument 'test_size'

jondurbin commented 10 months ago

I've been using train.py from https://github.com/jondurbin/qlora for the SFT phase. For now I've copied the original over to this repo, and I'll look at minifying it again.

YixinSong-e commented 10 months ago

Very nice work! My SFT training is now running. By the way, if I want to improve the model's MMLU performance, are there any datasets you'd recommend?

jondurbin commented 10 months ago

Good question. You would need to capture the results of the MMLU benchmark, then identify the specific categories/topics the model is underperforming in. Once you've identified the area(s) where the model is lacking, it could be as simple as including an existing dataset that covers the topic(s), or perhaps generating a synthetic dataset of Q/A pairs from Wikipedia articles or the like.
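
As a rough illustration of the "identify the weak categories" step, here is a minimal sketch. It assumes you have already collected per-subject (correct, total) counts from an MMLU run. The function name `weakest_subjects`, the `margin` parameter, and the numbers in `example_results` are all hypothetical and made up for the example; they are not part of this repo or of any benchmark harness.

```python
from typing import Dict, List, Tuple


def weakest_subjects(
    results: Dict[str, Tuple[int, int]],
    margin: float = 0.05,
) -> List[Tuple[str, float]]:
    """Return (subject, accuracy) pairs that score more than `margin`
    below the overall accuracy, sorted worst first.

    `results` maps a subject name to (correct, total) counts.
    """
    total_correct = sum(correct for correct, _ in results.values())
    total_questions = sum(total for _, total in results.values())
    if total_questions == 0:
        return []

    # Overall accuracy across all questions (micro average).
    overall = total_correct / total_questions

    weak = [
        (subject, correct / total)
        for subject, (correct, total) in results.items()
        if total > 0 and correct / total < overall - margin
    ]
    # Lowest accuracy first, so the biggest gaps come up top.
    return sorted(weak, key=lambda pair: pair[1])


# Illustrative, made-up counts; replace with your own benchmark output.
example_results = {
    "college_physics": (40, 100),
    "high_school_biology": (80, 100),
    "world_religions": (85, 100),
    "abstract_algebra": (30, 100),
}

weak = weakest_subjects(example_results)
```

The subjects returned this way are candidates for extra training data, either from an existing dataset on the topic or from synthetic Q/A pairs, as suggested above. The threshold is a judgment call; a fixed "bottom N" cut would work just as well.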

YixinSong-e commented 10 months ago

Thanks! :)