open-spaced-repetition / srs-benchmark

A benchmark for spaced repetition schedulers/algorithms
https://github.com/open-spaced-repetition/fsrs4anki/wiki

Cannot download dataset from huggingface #87

Closed. lars76 closed this issue 6 months ago.

lars76 commented 6 months ago
from datasets import load_dataset

raw_datasets = load_dataset("open-spaced-repetition/FSRS-Anki-20k")

produces the following error:

Generating train split: 720284748 examples [05:41, 2111906.74 examples/s]
Traceback (most recent call last):
  File "/home/nieradzik/.conda/envs/pytorch2.1.1/lib/python3.11/site-packages/datasets/builder.py", line 2011, in _prepare_split_single
    writer.write_table(table)
  File "/home/nieradzik/.conda/envs/pytorch2.1.1/lib/python3.11/site-packages/datasets/arrow_writer.py", line 585, in write_table
    pa_table = table_cast(pa_table, self._schema)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nieradzik/.conda/envs/pytorch2.1.1/lib/python3.11/site-packages/datasets/table.py", line 2295, in table_cast
    return cast_table_to_schema(table, schema)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nieradzik/.conda/envs/pytorch2.1.1/lib/python3.11/site-packages/datasets/table.py", line 2249, in cast_table_to_schema
    raise CastError(
datasets.table.CastError: Couldn't cast
card_id: null
review_th: null
delta_t: null
rating: null
__index_level_0__: null
-- schema metadata --
pandas: '{"index_columns": ["__index_level_0__"], "column_indexes": [{"na' + 780
to
{'card_id': Value(dtype='int64', id=None), 'review_th': Value(dtype='int64', id=None), 'delta_t': Value(dtype='int64', id=None), 'rating': Value(dtype='int64', id=None)}
because column names don't match

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/nieradzik/anki/download.py", line 3, in <module>
    raw_datasets = load_dataset("open-spaced-repetition/FSRS-Anki-20k")
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nieradzik/.conda/envs/pytorch2.1.1/lib/python3.11/site-packages/datasets/load.py", line 2609, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/nieradzik/.conda/envs/pytorch2.1.1/lib/python3.11/site-packages/datasets/builder.py", line 1027, in download_and_prepare
    self._download_and_prepare(
  File "/home/nieradzik/.conda/envs/pytorch2.1.1/lib/python3.11/site-packages/datasets/builder.py", line 1122, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/nieradzik/.conda/envs/pytorch2.1.1/lib/python3.11/site-packages/datasets/builder.py", line 1882, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/home/nieradzik/.conda/envs/pytorch2.1.1/lib/python3.11/site-packages/datasets/builder.py", line 2013, in _prepare_split_single
    raise DatasetGenerationCastError.from_cast_error(
datasets.exceptions.DatasetGenerationCastError: An error occurred while generating the dataset

All the data files must have the same columns, but at some point there are 1 new columns ({'__index_level_0__'})

This happened while the csv dataset builder was generating data using

hf://datasets/open-spaced-repetition/FSRS-Anki-20k/dataset/2/10054.csv (at revision 9440578f519d7113db474c284bba7828fcbeccaf)

Please either edit the data files to have matching columns, or separate them into different configurations (see docs at https://hf.co/docs/hub/datasets-manual-configuration#multiple-configurations)
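
In case anyone needs the data before a fix lands, a possible workaround is to pass an explicit file list that skips the offending CSV. This is only a sketch: it assumes a recent datasets release that accepts hf:// paths in data_files.

from huggingface_hub import list_repo_files
from datasets import load_dataset

REPO = "open-spaced-repetition/FSRS-Anki-20k"

# Keep every CSV in the dataset repo except the one the traceback points at.
files = [
    f for f in list_repo_files(REPO, repo_type="dataset")
    if f.endswith(".csv") and f != "dataset/2/10054.csv"
]

# Use the generic csv builder so the broken file is never read.
raw_datasets = load_dataset(
    "csv",
    data_files=[f"hf://datasets/{REPO}/{f}" for f in files],
)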
L-M-Sherlock commented 6 months ago

Oops. I found that 10054.csv is empty. I will remove it soon.
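
For reference, a quick way to check whether any other CSV is empty or header-only (a sketch, assuming a local clone with the CSVs under dataset/):

from pathlib import Path

for path in sorted(Path("dataset").rglob("*.csv")):
    if path.stat().st_size == 0:
        print("empty:", path)
        continue
    with path.open() as f:
        f.readline()          # skip the header
        if not f.readline():  # no second line means no data rows
            print("header-only:", path)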

L-M-Sherlock commented 6 months ago

Fixed in https://huggingface.co/datasets/open-spaced-repetition/FSRS-Anki-20k/commit/82f31245cdfe986a147c541fbb2ef2fb18a7e692
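
Note that a cached copy of the broken file can keep the error alive locally even after the fix; forcing a fresh download should pick up the new revision (a sketch of the same call as above):

from datasets import load_dataset

raw_datasets = load_dataset(
    "open-spaced-repetition/FSRS-Anki-20k",
    download_mode="force_redownload",  # ignore the cached (broken) copy
)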

lars76 commented 6 months ago

You also have to delete it from the repository; otherwise, the error still occurs: https://huggingface.co/datasets/open-spaced-repetition/FSRS-Anki-20k/blob/main/dataset/2/10054.csv
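
For what it's worth, the file can also be removed directly via huggingface_hub; this is a sketch and assumes write access to the dataset repo:

from huggingface_hub import delete_file

delete_file(
    path_in_repo="dataset/2/10054.csv",
    repo_id="open-spaced-repetition/FSRS-Anki-20k",
    repo_type="dataset",
    commit_message="Remove empty CSV that breaks load_dataset",
)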

L-M-Sherlock commented 6 months ago

Oops. My fault. I deleted the wrong file.