Open iliemihai opened 3 years ago
Thank you! Could you provide a CSV file that reproduces the error? It doesn't have to be one of your datasets, as long as it reproduces the error. That would help a lot!
I think another good example is the following:
from datasets import load_dataset
dataset = load_dataset("csv", data_files=["./sts-dev.csv"], delimiter="\t", column_names=["one", "two", "three", "four", "score", "sentence1", "sentence2"], script_version="master")
Displayed error: CSV parse error: Expected 7 columns, got 6
even though I passed 7 column names. The first four columns of the CSV don't have names, so I named them myself. The CSV file is the dev file (sts-dev.csv) from the STSb benchmark dataset.
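A quick way to see where an "Expected N columns, got M" error comes from is to count the fields on each line before handing the file to the loader. The following is only a sketch using the standard library, not what datasets or pyarrow do internally; the helper names count_fields and rows_with_unexpected_width are made up for illustration, and it assumes a tab-delimited file where quote characters are plain text (csv.QUOTE_NONE).

```python
import csv
from collections import Counter


def count_fields(path, delimiter="\t"):
    """Return a Counter mapping 'fields per row' to 'number of rows'."""
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE is an assumption: quote characters are treated as data.
        reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        for row in reader:
            counts[len(row)] += 1
    return counts


def rows_with_unexpected_width(path, expected, delimiter="\t", limit=5):
    """Return up to `limit` (row_number, field_count) pairs that differ from `expected`."""
    bad = []
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        for row_number, row in enumerate(reader, start=1):
            if len(row) != expected:
                bad.append((row_number, len(row)))
                if len(bad) >= limit:
                    break
    return bad
```

For the file above one would call rows_with_unexpected_width("./sts-dev.csv", expected=7) and inspect the reported rows by hand; if every row has the expected width, the mismatch more likely comes from how quotes are interpreted than from the column names.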
Hi, it seems I can't read a CSV file either. I was trying with a dummy CSV with only three rows.
text,label
I hate google,negative
I love Microsoft,positive
I don't like you,negative
I was using the HuggingFace image in Paperspace Gradient (datasets==1.1.3). The following code doesn't work:
from datasets import load_dataset
dataset = load_dataset('csv', script_version="master", data_files=['test_data.csv'], delimiter=",")
It outputs the following:
Using custom data configuration default
Downloading and preparing dataset csv/default-3b6254ff4dd403e5 (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/csv/default-3b6254ff4dd403e5/0.0.0/2960f95a26e85d40ca41a230ac88787f715ee3003edaacb8b1f0891e9f04dda2...
Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-3b6254ff4dd403e5/0.0.0/2960f95a26e85d40ca41a230ac88787f715ee3003edaacb8b1f0891e9f04dda2. Subsequent calls will reuse this data.
But len(dataset) gives 1, and I can't access rows by indexing: dataset[0] gives KeyError: 0.
However, loading from a pandas DataFrame works.
from datasets import Dataset
import pandas as pd
df = pd.read_csv('test_data.csv')
dataset = Dataset.from_pandas(df)
This is because load_dataset without split= returns a dictionary mapping split names (train/validation/test) to datasets.
You can do
from datasets import load_dataset
dataset = load_dataset('csv', script_version="master", data_files=['test_data.csv'], delimiter=",")
print(dataset["train"][0])
Or if you want to directly get the train split:
from datasets import load_dataset
dataset = load_dataset('csv', script_version="master", data_files=['test_data.csv'], delimiter=",", split="train")
print(dataset[0])
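The two symptoms reported above (a length of 1 and KeyError: 0) are exactly what a dictionary keyed by split name produces. The sketch below uses a plain Python dict as a stand-in for the object returned by load_dataset, so it runs without the library; the variable names rows and splits are illustrative only.

```python
# Three rows, mirroring the dummy CSV from the report above.
rows = [
    {"text": "I hate google", "label": "negative"},
    {"text": "I love Microsoft", "label": "positive"},
    {"text": "I don't like you", "label": "negative"},
]

# Stand-in for the returned object: one entry per split name.
splits = {"train": rows}

print(len(splits))  # 1 (one split, not three rows)

try:
    splits[0]
except KeyError as error:
    print("KeyError:", error)  # KeyError: 0

print(len(splits["train"]))  # 3
print(splits["train"][0])    # first row of the train split
```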
Good point
Design question for us, though: should load_dataset, when no split is specified and only one split is present in the dataset (a common use case with CSV/text/JSON datasets), return a Dataset instead of a DatasetDict? I feel like it's often what the user is expecting. This breaks the paradigm of a unique return type a bit, but since this library is designed for widespread use by DS people more than CS people, I would tend to think that UX should take precedence over CS reasons. What do you think?
In this case the user expects to get only one dataset object instead of the dictionary of datasets since only one csv file was specified without any split specifications. I'm ok with returning the dataset object if no split specifications are given for text/json/csv/pandas.
For the other datasets, on the other hand, the user doesn't know the splits in advance, so I would keep the dictionary by default. What do you think?
Thanks for your quick response! I'm fine with specifying the split as @lhoestq suggested. My only concern is that when I'm loading from a Python dict or pandas, the library returns a dataset instead of a dictionary of datasets when no split is specified. I know that they use different functions, Dataset.from_dict or Dataset.from_pandas, while the text/csv files use load_dataset(). However, to the user, they do the same task, and we probably expect them to have the same behavior.
from datasets import load_dataset
dataset = load_dataset('csv', data_files='./amazon_data/Video_Games_5.csv', delimiter=",", split=['train', 'test'])
I was running the above line, but got this error.
ValueError: Unknown split "test". Should be one of ['train'].
The data is Amazon product data. I loaded the Video_Games_5.json.gz data into pandas and saved it as a CSV file, then loaded the CSV file using the above code. I thought split=['train', 'test'] would split the data into train and test. Did I misunderstand?
Thank you!
Hi! The split argument in load_dataset is used to select the splits you want among the available splits.
However, when loading a CSV with a single file as you did, only a train split is available by default.
Indeed, since data_files='./amazon_data/Video_Games_5.csv' is equivalent to data_files={"train": './amazon_data/Video_Games_5.csv'}, you can get a dataset with
from datasets import load_dataset
dataset = load_dataset('csv', data_files='./amazon_data/Video_Games_5.csv', delimiter=",", split="train")
And then to get both a train and test split you can do
dataset = dataset.train_test_split()
print(dataset.keys())
# ['train', 'test']
Also note that a csv dataset may have several available splits if it is defined this way:
from datasets import load_dataset
dataset = load_dataset('csv', data_files={
"train": './amazon_data/Video_Games_5_train.csv',
"test": './amazon_data/Video_Games_5_test.csv'
})
print(dataset.keys())
# ['train', 'test']
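To make the rule concrete, here is a small standard-library sketch of the behavior described in this reply: a bare path is treated as a single train split, and asking for a split that is not available fails. It is an illustration only, not the library's implementation; normalize_data_files and select_splits are hypothetical names, and the error text is copied from the message quoted earlier in the thread.

```python
def normalize_data_files(data_files):
    """Hypothetical helper: treat a bare path (or list of paths) as the train split."""
    if isinstance(data_files, dict):
        return dict(data_files)
    return {"train": data_files}


def select_splits(available, split):
    """Return the requested split names, or fail if one of them is not available."""
    requested = [split] if isinstance(split, str) else list(split)
    for name in requested:
        if name not in available:
            raise ValueError(
                f'Unknown split "{name}". Should be one of {sorted(available)}.'
            )
    return requested


single = normalize_data_files("./amazon_data/Video_Games_5.csv")
print(single)  # {'train': './amazon_data/Video_Games_5.csv'}

both = normalize_data_files({
    "train": "./amazon_data/Video_Games_5_train.csv",
    "test": "./amazon_data/Video_Games_5_test.csv",
})
print(select_splits(both, ["train", "test"]))  # ['train', 'test']
```

Calling select_splits(single, ["train", "test"]) raises the ValueError, because split selects among existing splits and never creates new ones.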
In this case the user expects to get only one dataset object instead of the dictionary of datasets since only one csv file was specified without any split specifications. I'm ok with returning the dataset object if no split specifications are given for text/json/csv/pandas.
For the other datasets, on the other hand, the user doesn't know the splits in advance, so I would keep the dictionary by default. What do you think?
Yes, maybe this would be good. I think having to select 'train' from the resulting object when the user gave no split information is confusing and unintuitive behavior.
Similar to #622, I've noticed there is a problem when trying to load a CSV file with datasets.
from datasets import load_dataset
dataset = load_dataset("csv", data_files=["./sample_data.csv"], delimiter="\t", column_names=["title", "text"], script_version="master")
Displayed error:
... ArrowInvalid: CSV parse error: Expected 2 columns, got 1
I'm also facing the same issue when trying to load from a csv file locally:
from nlp import load_dataset
dataset = load_dataset('csv', data_files='sample_data.csv')
Error when executed from Google Colab:
ArrowInvalid Traceback (most recent call last)
<ipython-input-34-79a8d4f65ed6> in <module>()
1 from nlp import load_dataset
----> 2 dataset = load_dataset('csv', data_files='sample_data.csv')
9 frames
/usr/local/lib/python3.7/dist-packages/nlp/load.py in load_dataset(path, name, version, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, save_infos, **config_kwargs)
547 # Download and prepare data
548 builder_instance.download_and_prepare(
--> 549 download_config=download_config, download_mode=download_mode, ignore_verifications=ignore_verifications,
550 )
551
/usr/local/lib/python3.7/dist-packages/nlp/builder.py in download_and_prepare(self, download_config, download_mode, ignore_verifications, try_from_hf_gcs, dl_manager, **download_and_prepare_kwargs)
461 if not downloaded_from_gcs:
462 self._download_and_prepare(
--> 463 dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
464 )
465 # Sync info
/usr/local/lib/python3.7/dist-packages/nlp/builder.py in _download_and_prepare(self, dl_manager, verify_infos, **prepare_split_kwargs)
535 try:
536 # Prepare split will record examples associated to the split
--> 537 self._prepare_split(split_generator, **prepare_split_kwargs)
538 except OSError:
539 raise OSError("Cannot find data file. " + (self.manual_download_instructions or ""))
/usr/local/lib/python3.7/dist-packages/nlp/builder.py in _prepare_split(self, split_generator)
863
864 generator = self._generate_tables(**split_generator.gen_kwargs)
--> 865 for key, table in utils.tqdm(generator, unit=" tables", leave=False):
866 writer.write_table(table)
867 num_examples, num_bytes = writer.finalize()
/usr/local/lib/python3.7/dist-packages/tqdm/notebook.py in __iter__(self, *args, **kwargs)
213 def __iter__(self, *args, **kwargs):
214 try:
--> 215 for obj in super(tqdm_notebook, self).__iter__(*args, **kwargs):
216 # return super(tqdm...) will not catch exception
217 yield obj
/usr/local/lib/python3.7/dist-packages/tqdm/std.py in __iter__(self)
1102 fp_write=getattr(self.fp, 'write', sys.stderr.write))
1103
-> 1104 for obj in iterable:
1105 yield obj
1106 # Update and possibly print the progressbar.
/usr/local/lib/python3.7/dist-packages/nlp/datasets/csv/ede98314803c971fef04bcee45d660c62f3332e8a74491e0b876106f3d99bd9b/csv.py in _generate_tables(self, files)
78 read_options=self.config.pa_read_options,
79 parse_options=self.config.pa_parse_options,
---> 80 convert_options=self.config.convert_options,
81 )
82 yield i, pa_table
/usr/local/lib/python3.7/dist-packages/pyarrow/_csv.pyx in pyarrow._csv.read_csv()
/usr/local/lib/python3.7/dist-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
/usr/local/lib/python3.7/dist-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowInvalid: CSV parse error: Expected 1 columns, got 8
Version:
nlp==0.4.0
Hi @kauvinlucas
You can use the latest version of datasets to do this.
To do so, just pip install datasets instead of nlp (the library was renamed) and then
from datasets import load_dataset
dataset = load_dataset('csv', data_files='sample_data.csv')
Hi, I'm having a different problem with loading a local CSV.
from datasets import load_dataset
dataset = load_dataset('csv', data_files='sample.csv')
gives the error ValueError: Specified named and prefix; you can only specify one.
versions:
Oh, I figured it out. According to pandas issue #42387, the new pandas version does not accept None for both parameters (which the repo I'm testing was doing). Downgrading to pandas==1.0.4 and Python==3.8 worked.
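Since this failure depended on which pandas version was installed, it helps to attach the exact versions to a report like this one. The snippet below is a minimal sketch using only the standard library (importlib.metadata, available from Python 3.8); the function name report_versions and the default package list are illustrative choices.

```python
import platform
from importlib import metadata


def report_versions(packages=("datasets", "pandas", "pyarrow")):
    """Collect the Python version and the installed version of each package."""
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            report[name] = "not installed"
    return report


for name, version in report_versions().items():
    print(f"{name}: {version}")
```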
Hi,
I got an OSError: Cannot find data file. when I tried to use load_dataset with tsv files. I have checked the paths, and they are correct.
versions
data_files = {"train": "train.tsv", "test": "test.tsv"}
datasets = load_dataset("csv", data_files=data_files, delimiter="\t")
The entire error message is below:
08/14/2021 16:55:44 - INFO - __main__ - load a local file for test: /project/media-framing/transformer4/data/unlabel/test.tsv
Using custom data configuration default
Downloading and preparing dataset csv/default-00a4200ae8507533 (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /usr4/cs542sp/hey1/.cache/huggingface/datasets/csv/default-00a4200ae8507533/0.0.0/2960f95a26e85d40ca41a230ac88787f715ee3003edaacb8b1f0891e9f04dda2...
Traceback (most recent call last):
File "/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/datasets/builder.py", line 592, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/datasets/builder.py", line 944, in _prepare_split
num_examples, num_bytes = writer.finalize()
File "/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/datasets/arrow_writer.py", line 307, in finalize
self.stream.close()
File "pyarrow/io.pxi", line 132, in pyarrow.lib.NativeFile.close
File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: error closing file
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "run_glue.py", line 484, in <module>
main()
File "run_glue.py", line 243, in main
datasets = load_dataset("csv", data_files=data_files, delimiter="\t")
File "/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/datasets/load.py", line 610, in load_dataset
ignore_verifications=ignore_verifications,
File "/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/datasets/builder.py", line 515, in download_and_prepare
dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
File "/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/datasets/builder.py", line 594, in _download_and_prepare
raise OSError("Cannot find data file. " + (self.manual_download_instructions or ""))
OSError: Cannot find data file.
Hi! It looks like the error stack trace doesn't match your code snippet.
What error do you get when running this?
data_files = {"train": "train.tsv", "test": "test.tsv"}
datasets = load_dataset("csv", data_files=data_files, delimiter="\t")
Can you check that both tsv files are in the current working directory of your shell?
Hi @lhoestq, below is the entire error message after I moved both tsv files to the same directory. It's the same as what I got before.
/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
return torch._C._cuda_getDeviceCount() > 0
08/29/2021 22:56:43 - WARNING - __main__ - Process rank: -1, device: cpu, n_gpu: 0distributed training: False, 16-bits training: False
08/29/2021 22:56:43 - INFO - __main__ - Training/evaluation parameters TrainingArguments(output_dir=/projectnb/media-framing/pred_result/label1/, overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=True, evaluation_strategy=EvaluationStrategy.NO, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=8.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_steps=0, logging_dir=runs/Aug29_22-56-43_scc1, logging_first_step=False, logging_steps=500, save_steps=3000, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level=O1, fp16_backend=auto, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name=/projectnb/media-framing/pred_result/label1/, disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=False, deepspeed=None, label_smoothing_factor=0.0, adafactor=False, _n_gpu=0)
08/29/2021 22:56:43 - INFO - __main__ - load a local file for train: /project/media-framing/transformer4/temp_train.tsv
08/29/2021 22:56:43 - INFO - __main__ - load a local file for test: /project/media-framing/transformer4/temp_test.tsv
08/29/2021 22:56:43 - WARNING - datasets.builder - Using custom data configuration default-df627c23ac0e98ec
Downloading and preparing dataset csv/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /usr4/cs542sp/hey1/.cache/huggingface/datasets/csv/default-df627c23ac0e98ec/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff...
Traceback (most recent call last):
File "/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/datasets/builder.py", line 693, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/datasets/builder.py", line 1166, in _prepare_split
num_examples, num_bytes = writer.finalize()
File "/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/datasets/arrow_writer.py", line 428, in finalize
self.stream.close()
File "pyarrow/io.pxi", line 132, in pyarrow.lib.NativeFile.close
File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: error closing file
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "run_glue.py", line 487, in <module>
main()
File "run_glue.py", line 244, in main
datasets = load_dataset("csv", data_files=data_files, delimiter="\t")
File "/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/datasets/load.py", line 852, in load_dataset
use_auth_token=use_auth_token,
File "/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/datasets/builder.py", line 616, in download_and_prepare
dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
File "/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/datasets/builder.py", line 699, in _download_and_prepare
+ str(e)
OSError: Cannot find data file.
Original error:
error closing file
Hi! Can you try running this in a Python shell directly?
import os
from datasets import load_dataset
data_files = {"train": "train.tsv", "test": "test.tsv"}
assert all(os.path.isfile(data_file) for data_file in data_files.values()), "Couldn't find files"
datasets = load_dataset("csv", data_files=data_files, delimiter="\t")
print("success !")
This way none of the code from run_glue.py interferes with our tests :)
Hi @lhoestq,
Below is what I got from the terminal after I copied and ran your code. I think the files themselves are good since there is no assertion error.
Using custom data configuration default-df627c23ac0e98ec
Downloading and preparing dataset csv/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /usr4/cs542sp/hey1/.cache/huggingface/datasets/csv/default-df627c23ac0e98ec/0.0.0/9144e0a4e8435090117cea53e6c7537173ef2304525df4a077c435d8ee7828ff...
Traceback (most recent call last):
File "/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/datasets/builder.py", line 693, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/datasets/builder.py", line 1166, in _prepare_split
num_examples, num_bytes = writer.finalize()
File "/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/datasets/arrow_writer.py", line 428, in finalize
self.stream.close()
File "pyarrow/io.pxi", line 132, in pyarrow.lib.NativeFile.close
File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: error closing file
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "test.py", line 7, in <module>
datasets = load_dataset("csv", data_files=data_files, delimiter="\t")
File "/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/datasets/load.py", line 852, in load_dataset
use_auth_token=use_auth_token,
File "/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/datasets/builder.py", line 616, in download_and_prepare
dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
File "/projectnb2/media-framing/env-trans4/lib/python3.7/site-packages/datasets/builder.py", line 699, in _download_and_prepare
+ str(e)
OSError: Cannot find data file.
Original error:
error closing file
Hi, could this be a permission error? I think it fails to close the Arrow file that contains the data from your CSVs in the cache.
By default datasets are cached in ~/.cache/huggingface/datasets, could you check that you have the right permissions?
You can also try to change the cache directory by passing cache_dir="path/to/my/cache/dir" to load_dataset.
Thank you!! @lhoestq
For some reason, I don't have the default path for datasets to cache, maybe because I work on a remote system. The issue was solved after I passed the cache_dir argument to the function. Thank you very much!!
This is exactly the solution I had been looking for the whole afternoon. Thanks a lot! I was trying to run training on a cluster computing system where the user's home directory is shared between nodes. It always got stuck at dataset loading. The reason might be that the node (with GPU) can't read/write data in the default cache folder (in my home directory). After using an intermediate cache folder, the issue is resolved for me.
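Before falling back to a custom cache_dir, one can check whether a candidate directory is actually writable from the node that runs the job. The following is a sketch under stated assumptions: pick_writable_cache_dir is a hypothetical helper, the default path is the one quoted above, and the check simply tries to create a temporary file, which may not catch every shared-filesystem problem.

```python
import os
import tempfile


def pick_writable_cache_dir(preferred=None):
    """Return the first directory in which a temporary file can be created."""
    default = os.path.join(os.path.expanduser("~"), ".cache", "huggingface", "datasets")
    for candidate in (preferred, default):
        if candidate is None:
            continue
        try:
            os.makedirs(candidate, exist_ok=True)
            # Creating and closing a real file exercises the write path.
            with tempfile.TemporaryFile(dir=candidate):
                pass
            return candidate
        except OSError:
            continue
    # Last resort: a fresh directory under the system temp location.
    return tempfile.mkdtemp(prefix="hf_datasets_cache_")


# Usage (requires the datasets library):
# cache_dir = pick_writable_cache_dir("/path/to/my/cache/dir")
# dataset = load_dataset("csv", data_files=data_files, delimiter="\t", cache_dir=cache_dir)
```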
Similar to #622, I've noticed there is a problem when trying to load a CSV file with datasets.
from datasets import load_dataset
dataset = load_dataset("csv", data_files=["./sample_data.csv"], delimiter="\t", column_names=["title", "text"], script_version="master")
Displayed error:
... ArrowInvalid: CSV parse error: Expected 2 columns, got 1
I should mention that when I tried to read data from
https://github.com/lhoestq/transformers/tree/custom-dataset-in-rag-retriever/examples/rag/test_data/my_knowledge_dataset.csv
it worked without a problem. I've read that there might be some problems with the \r character, so I removed those from the custom dataset, but the problem still remains. I've added a Colab notebook reproducing the bug, but unfortunately I cannot provide the dataset: https://colab.research.google.com/drive/1Qzu7sC-frZVeniiWOwzoCe_UHZsrlxu8?usp=sharing
Is there any workaround for it? Thank you
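One way to check the carriage-return hypothesis without sharing the data is to count line-ending bytes in the file and write a cleaned copy. This is a standard-library sketch, not a confirmed fix for this issue; newline_report and normalize_newlines are hypothetical names, and replacing a lone \r with a space is an assumption that suits \r characters embedded in a text field.

```python
def newline_report(path):
    """Count CRLF pairs, lone carriage returns, and lone line feeds in a file."""
    with open(path, "rb") as handle:
        data = handle.read()
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,
        "lone_cr": data.count(b"\r") - crlf,
        "lone_lf": data.count(b"\n") - crlf,
    }


def normalize_newlines(src, dst, lone_cr_replacement=b" "):
    """Write a copy of `src` with CRLF turned into LF and lone CR replaced."""
    with open(src, "rb") as handle:
        data = handle.read()
    data = data.replace(b"\r\n", b"\n").replace(b"\r", lone_cr_replacement)
    with open(dst, "wb") as handle:
        handle.write(data)
```

If newline_report shows a non-zero lone_cr count, a stray \r inside a field may be ending a row early, which would be consistent with a row that has fewer columns than expected.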