huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Invalid pattern: '**' can only be an entire path component #6737

Closed JPonsa closed 2 months ago

JPonsa commented 4 months ago

Describe the bug

ValueError: Invalid pattern: '**' can only be an entire path component when loading any dataset

Steps to reproduce the bug

import datasets
ds = datasets.load_dataset("TokenBender/code_instructions_122k_alpaca_style")

Expected behavior

loading the dataset successfully

Environment info

lhoestq commented 3 months ago

I couldn't reproduce the issue on macOS, so I guess it comes from the recent fsspec release on Windows.

Can you try downgrading to fsspec==2023.9.2 for now? It would also be great to investigate this and see whether we need a fix in datasets or in fsspec.

jpaw commented 3 months ago

I had the same issue!
Downgrading fsspec from 2023.10.0 to 2023.9.2 solved it for me.

(env: python 3.11.7, datasets version: 2.15.0, Windows 10 22H2, Build 19045.4170)

Thanks a lot!

azuryl commented 2 months ago

I had the same issue on Ubuntu 20.04 with Python 3.9:

  File "/home/delight-gpu/Workspace2/azuryl/FLAP/main.py", line 112, in <module>
    main()
  File "/home/delight-gpu/Workspace2/azuryl/FLAP/main.py", line 85, in main
    prune_flap(args, model, tokenizer, device)
  File "/home/delight-gpu/Workspace2/azuryl/FLAP/lib/prune.py", line 294, in prune_flap
    dataloader, _ = get_loaders("wikitext2", nsamples=args.nsamples,seed=args.seed,seqlen=model.seqlen,tokenizer=tokenizer)
  File "/home/delight-gpu/Workspace2/azuryl/FLAP/lib/data.py", line 159, in get_loaders
    return get_wikitext2(nsamples, seed, seqlen, tokenizer)
  File "/home/delight-gpu/Workspace2/azuryl/FLAP/lib/data.py", line 79, in get_wikitext2
    traindata = load_dataset('wikitext', 'wikitext-2-raw-v1', split='train')
  File "/home/azuryl/anaconda3/envs/flap/lib/python3.9/site-packages/datasets/load.py", line 1767, in load_dataset
    builder_instance = load_dataset_builder(
  File "/home/azuryl/anaconda3/envs/flap/lib/python3.9/site-packages/datasets/load.py", line 1498, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/home/azuryl/anaconda3/envs/flap/lib/python3.9/site-packages/datasets/load.py", line 1215, in dataset_module_factory
    raise e1 from None
  File "/home/azuryl/anaconda3/envs/flap/lib/python3.9/site-packages/datasets/load.py", line 1192, in dataset_module_factory
    return HubDatasetModuleFactoryWithoutScript(
  File "/home/azuryl/anaconda3/envs/flap/lib/python3.9/site-packages/datasets/load.py", line 765, in get_module
    else get_data_patterns_in_dataset_repository(hfh_dataset_info, self.data_dir)
  File "/home/azuryl/anaconda3/envs/flap/lib/python3.9/site-packages/datasets/data_files.py", line 675, in get_data_patterns_in_dataset_repository
    return _get_data_files_patterns(resolver)
  File "/home/azuryl/anaconda3/envs/flap/lib/python3.9/site-packages/datasets/data_files.py", line 236, in _get_data_files_patterns
    data_files = pattern_resolver(pattern)
  File "/home/azuryl/anaconda3/envs/flap/lib/python3.9/site-packages/datasets/data_files.py", line 486, in _resolve_single_pattern_in_dataset_repository
    glob_iter = [PurePath(filepath) for filepath in fs.glob(PurePath(pattern).as_posix()) if fs.isfile(filepath)]
  File "/home/azuryl/anaconda3/envs/flap/lib/python3.9/site-packages/fsspec/spec.py", line 606, in glob
    pattern = glob_translate(path + ("/" if ends_with_sep else ""))
  File "/home/azuryl/anaconda3/envs/flap/lib/python3.9/site-packages/fsspec/utils.py", line 734, in glob_translate
    raise ValueError(
ValueError: Invalid pattern: '**' can only be an entire path component
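
For anyone trying to see why a given pattern is rejected, here is a minimal sketch of the rule the error message describes: `**` is only accepted when it is a whole path component between `/` separators. The helper `invalid_double_star_components` and the example patterns are hypothetical, written only for illustration. This is not fsspec's actual code, and the real check in `glob_translate` may differ in detail.

```python
def invalid_double_star_components(pattern: str) -> list:
    """Return the path components that mix '**' with other characters.

    Assumption: components are separated by '/', and a component is only
    valid with '**' if it is exactly '**'. This mirrors the wording of the
    error message, not fsspec's implementation.
    """
    return [part for part in pattern.split("/") if "**" in part and part != "**"]


# '**' stands alone between separators, so nothing is flagged.
print(invalid_double_star_components("data/**/train-*.parquet"))  # → []

# '**' is glued to other characters in one component, so it is flagged.
print(invalid_double_star_components("data/train**"))  # → ['train**']
```

If the second kind of pattern is generated internally by an older datasets release, a newer fsspec would reject it, which fits the version mismatch reported in this thread.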

lhoestq commented 2 months ago

On Ubuntu, you just need the latest datasets and fsspec:

pip install -U datasets fsspec

albertvillanova commented 2 months ago

The issue was caused by an incompatibility between the versions of datasets, huggingface-hub and fsspec.
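
To check which versions are actually installed in the active environment before upgrading or downgrading, something like the following sketch may help. It uses only the standard-library `importlib.metadata` module; the helper name `installed_versions` is made up for this example, and the script only reports versions, it does not decide whether a combination is compatible.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# The three packages named in this thread.
versions = installed_versions(["datasets", "huggingface-hub", "fsspec"])
for name, ver in versions.items():
    print(f"{name}: {ver or 'not installed'}")
```

Running this inside the same conda or virtualenv environment that raises the error shows whether an old datasets is paired with a newer fsspec.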

The issue was fixed in:

p0lyMth commented 3 days ago

@albertvillanova, thank you for this solution. I encountered the same issue and had to use:

conda install -c conda-forge huggingface_hub=0.21.2 datasets=2.19.1

Cheers