huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Couldn't find a dataset script at /content/tsv/tsv.py or any data file in the same directory #6187

Open andysingal opened 11 months ago

andysingal commented 11 months ago

Describe the bug

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-48-6a7b3e847019> in <cell line: 7>()
      5 }
      6 
----> 7 csv_datasets_reloaded = load_dataset("tsv", data_files=data_files)
      8 csv_datasets_reloaded

2 frames
/usr/local/lib/python3.10/dist-packages/datasets/load.py in dataset_module_factory(path, revision, download_config, download_mode, dynamic_modules_path, data_dir, data_files, **download_kwargs)
   1489                     raise e1 from None
   1490                 if isinstance(e1, FileNotFoundError):
-> 1491                     raise FileNotFoundError(
   1492                         f"Couldn't find a dataset script at {relative_to_absolute_path(combined_path)} or any data file in the same directory. "
   1493                         f"Couldn't find '{path}' on the Hugging Face Hub either: {type(e1).__name__}: {e1}"

FileNotFoundError: Couldn't find a dataset script at /content/tsv/tsv.py or any data file in the same directory. Couldn't find 'tsv' on the Hugging Face Hub either: FileNotFoundError: Dataset 'tsv' doesn't exist on the Hub

Steps to reproduce the bug

data_files = {
    "train": "/content/PUBHEALTH/train.tsv",
    "validation": "/content/PUBHEALTH/dev.tsv",
    "test": "/content/PUBHEALTH/test.tsv",
}

tsv_datasets_reloaded = load_dataset("tsv", data_files=data_files)
tsv_datasets_reloaded

Expected behavior

The data should load so that it can then be pushed to the Hub.

Environment info

Jupyter notebook, RTX 3090

mariosasko commented 11 months ago

Hi! You can load this dataset with:

data_files = {
    "train": "/content/PUBHEALTH/train.tsv",
    "validation": "/content/PUBHEALTH/dev.tsv",
    "test": "/content/PUBHEALTH/test.tsv",
}

tsv_datasets_reloaded = load_dataset("csv", data_files=data_files, sep="\t")

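There is no packaged builder named "tsv", so load_dataset looks for a local script at tsv/tsv.py and then for a dataset called "tsv" on the Hub, and fails on both. TSV files are handled by the "csv" builder with a tab separator. Supporting your original load_dataset("tsv", ...) call would require defining aliases for the packaged builders, as suggested in https://github.com/huggingface/datasets/issues/5625, which is not implemented yet. We can consider adding this feature if more people request it.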
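
In the meantime, something like the following could stand in for an alias on the user side. This is only a rough sketch: BUILDER_BY_EXTENSION, resolve_builder and load_by_extension are hypothetical names that are not part of the datasets library, and the extension table is an assumption covering a few packaged builders ("csv", "json", "parquet"). It picks the builder name and extra keyword arguments from the file extension, so .tsv files end up going through "csv" with sep="\t".

```python
from pathlib import Path

# Hypothetical mapping (not part of the datasets API): file extension ->
# (packaged builder name, extra keyword arguments for load_dataset).
BUILDER_BY_EXTENSION = {
    ".csv": ("csv", {}),
    ".tsv": ("csv", {"sep": "\t"}),
    ".json": ("json", {}),
    ".jsonl": ("json", {}),
    ".parquet": ("parquet", {}),
}


def resolve_builder(data_files):
    """Return (builder_name, kwargs) shared by every file in data_files."""
    extensions = {Path(p).suffix.lower() for p in data_files.values()}
    if len(extensions) != 1:
        raise ValueError(f"Expected one file extension, got {sorted(extensions)}")
    extension = extensions.pop()
    if extension not in BUILDER_BY_EXTENSION:
        raise ValueError(f"No packaged builder known for {extension!r}")
    name, kwargs = BUILDER_BY_EXTENSION[extension]
    return name, dict(kwargs)  # copy so callers cannot mutate the table


def load_by_extension(data_files):
    """Load local files with the builder inferred from their extension."""
    # Imported lazily so resolve_builder works without datasets installed.
    from datasets import load_dataset

    name, kwargs = resolve_builder(data_files)
    return load_dataset(name, data_files=data_files, **kwargs)
```

With the data_files dictionary from above, load_by_extension(data_files) would end up making the same load_dataset("csv", data_files=data_files, sep="\t") call.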

(Also answered on Discord.)