huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

load_dataset for CSV files not working #743

Open iliemihai opened 3 years ago

iliemihai commented 3 years ago

Similar to #622, I've noticed there is a problem when trying to load a CSV file with datasets.

from datasets import load_dataset
dataset = load_dataset("csv", data_files=["./sample_data.csv"], delimiter="\t", column_names=["title", "text"], script_version="master")

Displayed error: ... ArrowInvalid: CSV parse error: Expected 2 columns, got 1

I should mention that when I tried to read data from https://github.com/lhoestq/transformers/tree/custom-dataset-in-rag-retriever/examples/rag/test_data/my_knowledge_dataset.csv it worked without a problem. I've read that there might be some problems with the \r character, so I've removed it from the custom dataset, but the problem still remains.

I've added a colab reproducing the bug, but unfortunately I cannot provide the dataset. https://colab.research.google.com/drive/1Qzu7sC-frZVeniiWOwzoCe_UHZsrlxu8?usp=sharing

Is there any workaround for this? Thank you
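For reference, a minimal way to reproduce this class of error with the datasets version used in this thread (1.1.x, pyarrow-backed CSV loader) is to load a comma-separated file while passing a tab delimiter. This is only a sketch: repro.csv is a made-up stand-in, not the original dataset, and the exact error may differ on newer datasets releases.

from datasets import load_dataset

# Write a small comma-separated file with two columns.
with open("repro.csv", "w") as f:
    f.write("some title,some text\nanother title,another text\n")

# With delimiter="\t", pyarrow sees each line as a single field while two
# column names are declared, so it should fail with something like
# ArrowInvalid: CSV parse error: Expected 2 columns, got 1
dataset = load_dataset("csv", data_files=["repro.csv"], delimiter="\t",
                       column_names=["title", "text"])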

lhoestq commented 3 years ago

Thank you! Could you provide a csv file that reproduces the error? It doesn't have to be one of your datasets, as long as it reproduces the error. That would help a lot!

iliemihai commented 3 years ago

I think another good example is the following:

from datasets import load_dataset
dataset = load_dataset("csv", data_files=["./sts-dev.csv"], delimiter="\t", column_names=["one", "two", "three", "four", "score", "sentence1", "sentence2"], script_version="master")

Displayed error: CSV parse error: Expected 7 columns, got 6, even though I specified 7 column names. The first four columns in the csv don't have a name, so I've named them by default. The csv file is the dev file from the STSb benchmark dataset.
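A quick way to find the offending rows before handing the file to load_dataset is to count the fields per line with the standard library (a minimal sketch, assuming sts-dev.csv is tab-separated and sits in the working directory):

import csv
from collections import Counter

# Count how many tab-separated fields each row has; rows that differ from the
# most common count are the likely cause of "Expected 7 columns, got 6".
with open("sts-dev.csv", newline="") as f:
    rows = list(csv.reader(f, delimiter="\t"))

counts = Counter(len(row) for row in rows)
expected = counts.most_common(1)[0][0]
print(counts)
for i, row in enumerate(rows, start=1):
    if len(row) != expected:
        print(f"line {i} has {len(row)} fields")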

YipingNUS commented 3 years ago

Hi, it seems I also can't read a csv file. I was trying with a dummy csv with only three rows.

text,label
I hate google,negative
I love Microsoft,positive
I don't like you,negative

I was using the HuggingFace image in Paperspace Gradient (datasets==1.1.3). The following code doesn't work:

from datasets import load_dataset
dataset = load_dataset('csv', script_version="master", data_files=['test_data.csv'], delimiter=",")

It outputs the following:

Using custom data configuration default
Downloading and preparing dataset csv/default-3b6254ff4dd403e5 (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/csv/default-3b6254ff4dd403e5/0.0.0/2960f95a26e85d40ca41a230ac88787f715ee3003edaacb8b1f0891e9f04dda2...
Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-3b6254ff4dd403e5/0.0.0/2960f95a26e85d40ca41a230ac88787f715ee3003edaacb8b1f0891e9f04dda2. Subsequent calls will reuse this data.

But len(dataset) gives 1 and I can't access rows with indexing dataset[0] (it gives KeyError: 0).

However, loading from pandas dataframe is working.

from datasets import Dataset
import pandas as pd
df = pd.read_csv('test_data.csv')
dataset = Dataset.from_pandas(df)
lhoestq commented 3 years ago

This is because load_dataset without split= returns a dictionary mapping split names (train/validation/test) to datasets. You can do

from datasets import load_dataset
dataset = load_dataset('csv', script_version="master", data_files=['test_data.csv'], delimiter=",")
print(dataset["train"][0])

Or if you want to directly get the train split:

from datasets import load_dataset
dataset = load_dataset('csv', script_version="master", data_files=['test_data.csv'], delimiter=",", split="train")
print(dataset[0])
thomwolf commented 3 years ago

Good point

Design question for us, though: should load_dataset, when no split is specified and only one split is present in the dataset (a common use case with CSV/text/JSON datasets), return a Dataset instead of a DatasetDict? I feel like it's often what the user is expecting. It breaks the paradigm of a unique return type a bit, but since this library is designed for widespread DS usage more than CS usage, I would tend to think that UX should take precedence over CS reasons. What do you think?

lhoestq commented 3 years ago

In this case the user expects to get only one dataset object instead of a dictionary of datasets, since only one csv file was specified without any split specification. I'm ok with returning the dataset object if no split specification is given for text/json/csv/pandas.

For the other datasets on the other hand the user doesn't know the splits in advance, so I would keep the dictionary by default. What do you think?

YipingNUS commented 3 years ago

Thanks for your quick response! I'm fine with specifying the split as @lhoestq suggested. My only concern is that when I'm loading from a python dict or pandas, the library returns a dataset instead of a dictionary of datasets when no split is specified. I know they use different functions (Dataset.from_dict or Dataset.from_pandas) while the text/csv files use load_dataset(). However, to the user they do the same task, so we would probably expect them to have the same behavior.

z7ye commented 3 years ago
from datasets import load_dataset
dataset = load_dataset('csv', data_files='./amazon_data/Video_Games_5.csv', delimiter=",", split=['train', 'test'])

I was running the above line, but got this error.

ValueError: Unknown split "test". Should be one of ['train'].

The data is the Amazon product data. I loaded the Video_Games_5.json.gz data into pandas, saved it as a csv file, and then loaded the csv file using the above code. I thought split=['train', 'test'] would split the data into train and test. Did I misunderstand?

Thank you!

lhoestq commented 3 years ago

Hi! The split argument in load_dataset is used to select the splits you want among the available splits. However, when loading a csv with a single file as you did, only a train split is available by default.

Indeed since data_files='./amazon_data/Video_Games_5.csv' is equivalent to data_files={"train": './amazon_data/Video_Games_5.csv'}, you can get a dataset with

from datasets import load_dataset
dataset = load_dataset('csv', data_files='./amazon_data/Video_Games_5.csv', delimiter=",", split="train")

And then to get both a train and test split you can do

dataset = dataset.train_test_split()
print(dataset.keys())
# ['train', 'test']
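If you want to control the proportion and make the split reproducible, train_test_split also accepts test_size and seed arguments (a short sketch, reusing the dataset loaded with split="train" above):

# Hold out 10% of the rows for the test split, with a fixed seed for reproducibility.
dataset = dataset.train_test_split(test_size=0.1, seed=42)
print(dataset["train"].num_rows, dataset["test"].num_rows)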

Also note that a csv dataset may have several available splits if it is defined this way:

from datasets import load_dataset
dataset = load_dataset('csv', data_files={
    "train": './amazon_data/Video_Games_5_train.csv',
    "test": './amazon_data/Video_Games_5_test.csv'
})
print(dataset.keys())
# ['train', 'test']
thomwolf commented 3 years ago

In this case the user expects to get only one dataset object instead of a dictionary of datasets, since only one csv file was specified without any split specification. I'm ok with returning the dataset object if no split specification is given for text/json/csv/pandas.

For the other datasets on the other hand the user doesn't know the splits in advance, so I would keep the dictionary by default. What do you think?

Yes, maybe this would be good. I think having to select 'train' from the resulting object when the user gave no split information is a confusing and unintuitive behavior.

kauvinlucas commented 3 years ago

Similar to #622, I've noticed there is a problem when trying to load a CSV file with datasets.

from datasets import load_dataset
dataset = load_dataset("csv", data_files=["./sample_data.csv"], delimiter="\t", column_names=["title", "text"], script_version="master")

Displayed error: ... ArrowInvalid: CSV parse error: Expected 2 columns, got 1

I'm also facing the same issue when trying to load from a csv file locally:

from nlp import load_dataset
dataset = load_dataset('csv', data_files='sample_data.csv')

Error when executed from Google Colab:

ArrowInvalid                              Traceback (most recent call last)
<ipython-input-34-79a8d4f65ed6> in <module>()
      1 from nlp import load_dataset
----> 2 dataset = load_dataset('csv', data_files='sample_data.csv')

9 frames
/usr/local/lib/python3.7/dist-packages/nlp/load.py in load_dataset(path, name, version, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, save_infos, **config_kwargs)
    547     # Download and prepare data
    548     builder_instance.download_and_prepare(
--> 549         download_config=download_config, download_mode=download_mode, ignore_verifications=ignore_verifications,
    550     )
    551 

/usr/local/lib/python3.7/dist-packages/nlp/builder.py in download_and_prepare(self, download_config, download_mode, ignore_verifications, try_from_hf_gcs, dl_manager, **download_and_prepare_kwargs)
    461                 if not downloaded_from_gcs:
    462                     self._download_and_prepare(
--> 463                         dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
    464                     )
    465                 # Sync info

/usr/local/lib/python3.7/dist-packages/nlp/builder.py in _download_and_prepare(self, dl_manager, verify_infos, **prepare_split_kwargs)
    535             try:
    536                 # Prepare split will record examples associated to the split
--> 537                 self._prepare_split(split_generator, **prepare_split_kwargs)
    538             except OSError:
    539                 raise OSError("Cannot find data file. " + (self.manual_download_instructions or ""))

/usr/local/lib/python3.7/dist-packages/nlp/builder.py in _prepare_split(self, split_generator)
    863 
    864         generator = self._generate_tables(**split_generator.gen_kwargs)
--> 865         for key, table in utils.tqdm(generator, unit=" tables", leave=False):
    866             writer.write_table(table)
    867         num_examples, num_bytes = writer.finalize()

/usr/local/lib/python3.7/dist-packages/tqdm/notebook.py in __iter__(self, *args, **kwargs)
    213     def __iter__(self, *args, **kwargs):
    214         try:
--> 215             for obj in super(tqdm_notebook, self).__iter__(*args, **kwargs):
    216                 # return super(tqdm...) will not catch exception
    217                 yield obj

/usr/local/lib/python3.7/dist-packages/tqdm/std.py in __iter__(self)
   1102                 fp_write=getattr(self.fp, 'write', sys.stderr.write))
   1103 
-> 1104         for obj in iterable:
   1105             yield obj
   1106             # Update and possibly print the progressbar.

/usr/local/lib/python3.7/dist-packages/nlp/datasets/csv/ede98314803c971fef04bcee45d660c62f3332e8a74491e0b876106f3d99bd9b/csv.py in _generate_tables(self, files)
     78                 read_options=self.config.pa_read_options,
     79                 parse_options=self.config.pa_parse_options,
---> 80                 convert_options=self.config.convert_options,
     81             )
     82             yield i, pa_table

/usr/local/lib/python3.7/dist-packages/pyarrow/_csv.pyx in pyarrow._csv.read_csv()

/usr/local/lib/python3.7/dist-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

/usr/local/lib/python3.7/dist-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: CSV parse error: Expected 1 columns, got 8

Version:

nlp==0.4.0
lhoestq commented 3 years ago

Hi @kauvinlucas

You can use the latest version of datasets to do this. To do so, just pip install datasets instead of nlp (the library was renamed) and then:


from datasets import load_dataset
dataset = load_dataset('csv', data_files='sample_data.csv')
Valerieps commented 3 years ago

Hi, I'm having a different problem with loading a local csv.

from datasets import load_dataset  
dataset = load_dataset('csv', data_files='sample.csv')  

gives the error ValueError: Specified named and prefix; you can only specify one.

versions:

Valerieps commented 3 years ago

Oh, I figured it out. According to issue #42387 from pandas, this new version does not accept None for both parameters (which is what the repo I'm testing was doing). Downgrading to Pandas==1.0.4 and Python==3.8 worked.
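If you run into the same ValueError, a quick sanity check before downgrading is to print the installed versions (a minimal sketch; whether a newer datasets release avoids the issue depends on your setup):

import pandas as pd
import datasets

# A regression around names/prefix handling in pandas (see pandas issue #42387)
# is the likely culprit; the versions help decide whether to downgrade pandas
# or upgrade datasets instead.
print("pandas:", pd.__version__)
print("datasets:", datasets.__version__)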

YuhaoT commented 2 years ago

Hi, I got an "OSError: Cannot find data file." error when I tried to use load_dataset with tsv files. I have checked the paths, and they are correct.

versions

data_files = {"train": "train.tsv", "test": "test.tsv"}
datasets = load_dataset("csv", data_files=data_files, delimiter="\t")

The entire Error message is on below:


08/14/2021 16:55:44 - INFO - __main__ -   load a local file for test: /project/media-framing/transformer4/data/unlabel/test.tsv
Using custom data configuration default
Downloading and preparing dataset csv/default-00a4200ae8507533 (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /usr4/cs542sp/hey1/.cache/huggingface/datasets/csv/default-00a4200ae8507533/0.0.0/2960f95a26e85d40ca41a230ac88787f715ee3003edaacb8b1f0891e9f04dda2...
Traceback (most recent call last):
  File "/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/datasets/builder.py", line 592, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/datasets/builder.py", line 944, in _prepare_split
    num_examples, num_bytes = writer.finalize()
  File "/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/datasets/arrow_writer.py", line 307, in finalize
    self.stream.close()
  File "pyarrow/io.pxi", line 132, in pyarrow.lib.NativeFile.close
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: error closing file

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_glue.py", line 484, in <module>
    main()
  File "run_glue.py", line 243, in main
    datasets = load_dataset("csv", data_files=data_files, delimiter="\t")
  File "/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/datasets/load.py", line 610, in load_dataset
    ignore_verifications=ignore_verifications,
  File "/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/datasets/builder.py", line 515, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/datasets/builder.py", line 594, in _download_and_prepare
    raise OSError("Cannot find data file. " + (self.manual_download_instructions or ""))
OSError: Cannot find data file.
lhoestq commented 2 years ago

Hi! It looks like the error stacktrace doesn't match your code snippet.

What error do you get when running this?

data_files = {"train": "train.tsv", "test": "test.tsv"}
datasets = load_dataset("csv", data_files=data_files, delimiter="\t")

Can you check that both tsv files are in the same folder as the current working directory of your shell?

YuhaoT commented 2 years ago

Hi @lhoestq, below is the entire error message after I moved both tsv files to the same directory. It's the same as what I got before.

/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
08/29/2021 22:56:43 - WARNING - __main__ -   Process rank: -1, device: cpu, n_gpu: 0distributed training: False, 16-bits training: False
08/29/2021 22:56:43 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir=/projectnb/media-framing/pred_result/label1/, overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=True, evaluation_strategy=EvaluationStrategy.NO, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=8.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_steps=0, logging_dir=runs/Aug29_22-56-43_scc1, logging_first_step=False, logging_steps=500, save_steps=3000, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level=O1, fp16_backend=auto, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name=/projectnb/media-framing/pred_result/label1/, disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=False, deepspeed=None, label_smoothing_factor=0.0, adafactor=False, _n_gpu=0)
08/29/2021 22:56:43 - INFO - __main__ -   load a local file for train: /project/media-framing/transformer4/temp_train.tsv
08/29/2021 22:56:43 - INFO - __main__ -   load a local file for test: /project/media-framing/transformer4/temp_test.tsv
08/29/2021 22:56:43 - WARNING - datasets.builder -   Using custom data configuration default-df627c23ac0e98ec
Downloading and preparing dataset csv/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /usr4/cs542sp/hey1/.cache/huggingface/datasets/csv/default-df627c23ac0e98ec/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff...
Traceback (most recent call last):
  File "/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/datasets/builder.py", line 693, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/datasets/builder.py", line 1166, in _prepare_split
    num_examples, num_bytes = writer.finalize()
  File "/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/datasets/arrow_writer.py", line 428, in finalize
    self.stream.close()
  File "pyarrow/io.pxi", line 132, in pyarrow.lib.NativeFile.close
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: error closing file

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_glue.py", line 487, in <module>
    main()
  File "run_glue.py", line 244, in main
    datasets = load_dataset("csv", data_files=data_files, delimiter="\t")
  File "/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/datasets/load.py", line 852, in load_dataset
    use_auth_token=use_auth_token,
  File "/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/datasets/builder.py", line 616, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/datasets/builder.py", line 699, in _download_and_prepare
    + str(e)
OSError: Cannot find data file. 
Original error:
error closing file
lhoestq commented 2 years ago

Hi! Can you try running this in a python shell directly?

import os
from datasets import load_dataset

data_files = {"train": "train.tsv", "test": "test.tsv"}
assert all(os.path.isfile(data_file) for data_file in data_files.values()), "Couldn't find files"

datasets = load_dataset("csv", data_files=data_files, delimiter="\t")
print("success !")

This way all the code from run_glue.py doesn't interfere with our tests :)

YuhaoT commented 2 years ago

Hi @lhoestq,

Below is what I got from the terminal after I copied and ran your code. I think the files themselves are fine since there is no assertion error.

Using custom data configuration default-df627c23ac0e98ec
Downloading and preparing dataset csv/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /usr4/cs542sp/hey1/.cache/huggingface/datasets/csv/default-df627c23ac0e98ec/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff...
Traceback (most recent call last):
  File "/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/datasets/builder.py", line 693, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/datasets/builder.py", line 1166, in _prepare_split
    num_examples, num_bytes = writer.finalize()
  File "/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/datasets/arrow_writer.py", line 428, in finalize
    self.stream.close()
  File "pyarrow/io.pxi", line 132, in pyarrow.lib.NativeFile.close
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: error closing file

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test.py", line 7, in <module>
    datasets = load_dataset("csv", data_files=data_files, delimiter="\t")
  File "/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/datasets/load.py", line 852, in load_dataset
    use_auth_token=use_auth_token,
  File "/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/datasets/builder.py", line 616, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/datasets/builder.py", line 699, in _download_and_prepare
    + str(e)
OSError: Cannot find data file. 
Original error:
error closing file
lhoestq commented 2 years ago

Hi, could this be a permission error? I think it fails to close the arrow file that contains the data from your CSVs in the cache.

By default datasets are cached in ~/.cache/huggingface/datasets; could you check that you have the right permissions there? You can also try to change the cache directory by passing cache_dir="path/to/my/cache/dir" to load_dataset.
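For example (a minimal sketch; the cache path is just a placeholder):

from datasets import load_dataset

data_files = {"train": "train.tsv", "test": "test.tsv"}
# Write the Arrow cache files to a directory the current user can write to.
datasets = load_dataset("csv", data_files=data_files, delimiter="\t",
                        cache_dir="/path/to/my/cache/dir")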

YuhaoT commented 2 years ago

Thank you!! @lhoestq

For some reason, I don't have the default path for datasets to cache, maybe because I work on a remote system. The issue was solved after I passed the cache_dir argument to the function. Thank you very much!!

bqcuong commented 1 year ago

Hi, could this be a permission error? I think it fails to close the arrow file that contains the data from your CSVs in the cache.

By default datasets are cached in ~/.cache/huggingface/datasets; could you check that you have the right permissions there? You can also try to change the cache directory by passing cache_dir="path/to/my/cache/dir" to load_dataset.

This is the exact solution I had been looking for the whole afternoon. Thanks a lot! I was trying to run a training on a cluster computing system where the user's home directory is shared between nodes. It always got stuck at dataset loading... The reason might be that the node (with the GPU) can't read/write data in the default cache folder (in my home directory). After using an intermediate cache folder, this issue was resolved for me.
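For cluster setups like this, an alternative to passing cache_dir on every call is to redirect the whole datasets cache to scratch or node-local storage via the HF_DATASETS_CACHE environment variable (a sketch; /scratch/hf_cache is a placeholder path, and the variable has to be set before datasets is imported):

import os

# Must be set before importing datasets so the new cache location is picked up.
os.environ["HF_DATASETS_CACHE"] = "/scratch/hf_cache"

from datasets import load_dataset
dataset = load_dataset("csv", data_files="sample_data.csv")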