Failed to load dataset for finetune.py

Tonanguyxiro commented 1 month ago

I am trying to run the script finetune.py. I first set export HF_ENDPOINT=https://hf-mirror.com (as a proxy) and then run python finetune.py --task sst2 --model switch-base-8 --benchmark glue --batch_size 64. The dataset fails to load at the line dataset = load_dataset(args.benchmark, args.task, cache_dir=f"{config.BASEDIR}/tmp/"), with the error shown below. How do you prepare the data before running? Do you download the dataset in advance?

Save model to /home/xxx/project-MoE/test-SiDA-MoE/data/sst2/switch-base-8/finetuned
glue sst2 /home/xxx/project-MoE/test-SiDA-MoE/tmp/
Benchmark: glue (type: <class 'str'>)
Task: sst2 (type: <class 'str'>)
Cache directory: /home/xxx/project-MoE/test-SiDA-MoE/tmp/, typr: <class 'str'>
Downloading and preparing dataset None/ax to /home/xxx/.cache/huggingface/datasets/parquet/ax-738ea43827ac551a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...
parquet
Downloading data files: 100%|██████████| 3/3 [00:00<00:00, 1322.85it/s]
Extracting data files: 100%|██████████| 3/3 [00:00<00:00, 284.79it/s]
Traceback (most recent call last):
  File "/home/xxx/project-MoE/test-SiDA-MoE/src/finetune.py", line 193, in <module>
    dataset = load_dataset(args.benchmark
  File "/home/xxx/anaconda3/envs/sida-moe/lib/python3.10/site-packages/datasets/load.py", line 1797, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/xxx/anaconda3/envs/sida-moe/lib/python3.10/site-packages/datasets/builder.py", line 890, in download_and_prepare
    self._download_and_prepare(
  File "/home/xxx/anaconda3/envs/sida-moe/lib/python3.10/site-packages/datasets/builder.py", line 986, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/xxx/anaconda3/envs/sida-moe/lib/python3.10/site-packages/datasets/builder.py", line 1707, in _prepare_split
    split_info = self.info.splits[split_generator.name]
  File "/home/xxx/anaconda3/envs/sida-moe/lib/python3.10/site-packages/datasets/splits.py", line 530, in __getitem__
    instructions = make_file_instructions(
  File "/home/xxx/anaconda3/envs/sida-moe/lib/python3.10/site-packages/datasets/arrow_reader.py", line 112, in make_file_instructions
    name2filenames = {
  File "/home/xxx/anaconda3/envs/sida-moe/lib/python3.10/site-packages/datasets/arrow_reader.py", line 113, in <dictcomp>
    info.name: filenames_for_dataset_split(
  File "/home/xxx/anaconda3/envs/sida-moe/lib/python3.10/site-packages/datasets/naming.py", line 74, in filenames_for_dataset_split
    prefix = filename_prefix_for_split(dataset_name, split)
  File "/home/xxx/anaconda3/envs/sida-moe/lib/python3.10/site-packages/datasets/naming.py", line 55, in filename_prefix_for_split
    if os.path.basename(name) != name:
  File "/home/xxx/anaconda3/envs/sida-moe/lib/python3.10/posixpath.py", line 143, in basename
    p = os.fspath(p)
TypeError: expected str, bytes or os.PathLike object, not NoneType

timlee0212 commented 1 month ago

Hi! Hugging Face changed some dataset paths, which breaks the code. Try using "nyu-mll/glue" instead of "glue" as the benchmark name. Remember to also update the argument validation at finetune.py#18.

Ref: https://huggingface.co/datasets/nyu-mll/glue
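
For illustration, here is a minimal sketch of how the rename and the argument check could be handled together. It is not the actual code in finetune.py: the names BENCHMARK_REPOS, resolve_benchmark, build_parser and load_benchmark are hypothetical, and only the "glue" to "nyu-mll/glue" mapping comes from this thread.

```python
import argparse

# Hypothetical mapping from legacy benchmark names to namespaced Hub repo ids.
# Only the "glue" entry is taken from this thread.
BENCHMARK_REPOS = {"glue": "nyu-mll/glue"}


def resolve_benchmark(name: str) -> str:
    """Return the namespaced repo id for a legacy name, or the name unchanged."""
    return BENCHMARK_REPOS.get(name, name)


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts both the old and the new benchmark spelling."""
    parser = argparse.ArgumentParser()
    allowed = sorted(set(BENCHMARK_REPOS) | set(BENCHMARK_REPOS.values()))
    parser.add_argument("--benchmark", choices=allowed, default="glue")
    parser.add_argument("--task", default="sst2")
    return parser


def load_benchmark(args: argparse.Namespace, cache_dir: str):
    """Load the dataset using the resolved repo id (needs network access)."""
    # Imported lazily so the helpers above work without the datasets package.
    from datasets import load_dataset

    return load_dataset(resolve_benchmark(args.benchmark), args.task, cache_dir=cache_dir)


args = build_parser().parse_args(["--benchmark", "glue", "--task", "sst2"])
print(resolve_benchmark(args.benchmark))  # nyu-mll/glue
```

With this approach the original command line (--benchmark glue) keeps working, while the dataset is requested under its namespaced id.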

Tonanguyxiro commented 2 days ago

Hi! I finally figured out the cause: the previous cache apparently contained some incorrect parameters, and the error disappeared after I cleaned the tmp folder.
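
In case it helps others, here is a rough sketch of the cleanup step, assuming the stale files live in the cache_dir passed to load_dataset ({config.BASEDIR}/tmp/ in this repo). The helper clear_cache and the file name stale_dataset_info.json are hypothetical, and the demo runs on a throwaway directory instead of the real cache.

```python
import shutil
import tempfile
from pathlib import Path


def clear_cache(cache_dir: str) -> Path:
    """Delete cache_dir and everything in it, then recreate it empty."""
    path = Path(cache_dir)
    shutil.rmtree(path, ignore_errors=True)  # no error if the folder is missing
    path.mkdir(parents=True, exist_ok=True)
    return path


# Demonstration on a temporary directory standing in for {config.BASEDIR}/tmp/.
with tempfile.TemporaryDirectory() as base:
    cache = Path(base) / "tmp"
    cache.mkdir()
    (cache / "stale_dataset_info.json").write_text("{}")  # hypothetical stale file
    clear_cache(str(cache))
    remaining = list(cache.iterdir())
    print(remaining)  # []
```

An alternative that avoids deleting files by hand is passing download_mode="force_redownload" to load_dataset, which should rebuild the dataset instead of reusing the cached copy.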

Thanks for your reply and for making this code available for us to learn from.

timlee0212 / SiDA-MoE

Failed to load dataset for finetune.py #1