kbressem / medAlpaca

LLM finetuned for medical question answering
GNU General Public License v3.0

Unable to Load USMLE datasets #29

Closed KaleabTessera closed 1 year ago

KaleabTessera commented 1 year ago

Hi, thanks for the repo!

I can't seem to load the USMLE datasets used in eval.

Two of the three links provided in the README don't seem to work (USMLE Self Assessment Step 2, USMLE Self Assessment Step 3).

The first link works, but loading the dataset from Hugging Face fails as follows:

from datasets import load_dataset

part1 = load_dataset('medalpaca/medical_meadow_usmle_self_assessment', split='train')

Error:

Downloading and preparing dataset json/medalpaca--medical_meadow_usmle_self_assessment to /root/.cache/huggingface/datasets/medalpaca___json/medalpaca--medical_meadow_usmle_self_assessment-333492f3a84c0741/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...
Downloading data files: 100%
1/1 [00:00<00:00, 63.10it/s]
Extracting data files: 100%
1/1 [00:00<00:00, 41.63it/s]
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /usr/local/lib/python3.10/dist-packages/datasets/builder.py:1873 in _prepare_split_single        │
│                                                                                                  │
│   1870 │   │   │   │   │   │   │   storage_options=self._fs.storage_options,                     │
│   1871 │   │   │   │   │   │   │   embed_local_files=embed_local_files,                          │
│   1872 │   │   │   │   │   │   )                                                                 │
│ ❱ 1873 │   │   │   │   │   writer.write_table(table)                                             │
│   1874 │   │   │   │   │   num_examples_progress_update += len(table)                            │
│   1875 │   │   │   │   │   if time.time() > _time + config.PBAR_REFRESH_TIME_INTERVAL:           │
│   1876 │   │   │   │   │   │   _time = time.time()                                               │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/datasets/arrow_writer.py:568 in write_table              │
│                                                                                                  │
│   565 │   │   if self.pa_writer is None:                                                         │
│   566 │   │   │   self._build_writer(inferred_schema=pa_table.schema)                            │
│   567 │   │   pa_table = pa_table.combine_chunks()                                               │
│ ❱ 568 │   │   pa_table = table_cast(pa_table, self._schema)                                      │
│   569 │   │   if self.embed_local_files:                                                         │
│   570 │   │   │   pa_table = embed_table_storage(pa_table)                                       │
│   571 │   │   self._num_bytes += pa_table.nbytes                                                 │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/datasets/table.py:2290 in table_cast                     │
│                                                                                                  │
│   2287 │   │   table (`pyarrow.Table`): the casted table                                         │
│   2288 │   """                                                                                   │
│   2289 │   if table.schema != schema:                                                            │
│ ❱ 2290 │   │   return cast_table_to_schema(table, schema)                                        │
│   2291 │   elif table.schema.metadata != schema.metadata:                                        │
│   2292 │   │   return table.replace_schema_metadata(schema.metadata)                             │
│   2293 │   else:                                                                                 │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/datasets/table.py:2248 in cast_table_to_schema           │
│                                                                                                  │
│   2245 │                                                                                         │
│   2246 │   features = Features.from_arrow_schema(schema)                                         │
│   2247 │   if sorted(table.column_names) != sorted(features):                                    │
│ ❱ 2248 │   │   raise ValueError(f"Couldn't cast\n{table.schema}\nto\n{features}\nbecause column  │
│   2249 │   arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.  │
│   2250 │   return pa.Table.from_arrays(arrays, schema=schema)                                    │
│   2251                                                                                           │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: Couldn't cast
image_url: string
no: int64
options: struct<A: string, B: string, C: string, D: string, E: string, F: string, G: string, H: string, I: string>
  child 0, A: string
  child 1, B: string
  child 2, C: string
  child 3, D: string
  child 4, E: string
  child 5, F: string
  child 6, G: string
  child 7, H: string
  child 8, I: string
question: string
image: string
to
{'step1': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'step2': 
Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), 'step3': Sequence(feature=Value(dtype='int64',
id=None), length=-1, id=None)}
because column names don't match

The above exception was the direct cause of the following exception:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <cell line: 2>:2                                                                              │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/datasets/load.py:1797 in load_dataset                    │
│                                                                                                  │
│   1794 │   try_from_hf_gcs = path not in _PACKAGED_DATASETS_MODULES                              │
│   1795 │                                                                                         │
│   1796 │   # Download and prepare data                                                           │
│ ❱ 1797 │   builder_instance.download_and_prepare(                                                │
│   1798 │   │   download_config=download_config,                                                  │
│   1799 │   │   download_mode=download_mode,                                                      │
│   1800 │   │   verification_mode=verification_mode,                                              │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/datasets/builder.py:890 in download_and_prepare          │
│                                                                                                  │
│    887 │   │   │   │   │   │   │   prepare_split_kwargs["max_shard_size"] = max_shard_size       │
│    888 │   │   │   │   │   │   if num_proc is not None:                                          │
│    889 │   │   │   │   │   │   │   prepare_split_kwargs["num_proc"] = num_proc                   │
│ ❱  890 │   │   │   │   │   │   self._download_and_prepare(                                       │
│    891 │   │   │   │   │   │   │   dl_manager=dl_manager,                                        │
│    892 │   │   │   │   │   │   │   verification_mode=verification_mode,                          │
│    893 │   │   │   │   │   │   │   **prepare_split_kwargs,                                       │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/datasets/builder.py:985 in _download_and_prepare         │
│                                                                                                  │
│    982 │   │   │                                                                                 │
│    983 │   │   │   try:                                                                          │
│    984 │   │   │   │   # Prepare split will record examples associated to the split              │
│ ❱  985 │   │   │   │   self._prepare_split(split_generator, **prepare_split_kwargs)              │
│    986 │   │   │   except OSError as e:                                                          │
│    987 │   │   │   │   raise OSError(                                                            │
│    988 │   │   │   │   │   "Cannot find data file. "                                             │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/datasets/builder.py:1746 in _prepare_split               │
│                                                                                                  │
│   1743 │   │   │   gen_kwargs = split_generator.gen_kwargs                                       │
│   1744 │   │   │   job_id = 0                                                                    │
│   1745 │   │   │   with pbar:                                                                    │
│ ❱ 1746 │   │   │   │   for job_id, done, content in self._prepare_split_single(                  │
│   1747 │   │   │   │   │   gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_split_args           │
│   1748 │   │   │   │   ):                                                                        │
│   1749 │   │   │   │   │   if done:                                                              │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/datasets/builder.py:1891 in _prepare_split_single        │
│                                                                                                  │
│   1888 │   │   │   # Ignore the writer's error for no examples written to the file if this erro  │
│   1889 │   │   │   if isinstance(e, SchemaInferenceError) and e.__context__ is not None:         │
│   1890 │   │   │   │   e = e.__context__                                                         │
│ ❱ 1891 │   │   │   raise DatasetGenerationError("An error occurred while generating the dataset  │
│   1892 │   │                                                                                     │
│   1893 │   │   yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_  │
│   1894                                                                                           │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
DatasetGenerationError: An error occurred while generating the dataset

Can you assist with an example of how to download the USMLE datasets?

kbressem commented 1 year ago

You are correct, I had typos in the links. Please note that this dataset will likely not load correctly through the Hugging Face datasets API, because all JSON files are stored in the same repo: the loader tries to merge files with different columns into a single schema, which is why the cast error above occurs.
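As a workaround, you can restrict the load to a single JSON file so the mixed schemas in the repo cannot clash during schema inference. A minimal sketch (the file name "step1.json" is an assumption; check the dataset repo's file listing for the actual names):

from huggingface_hub import hf_hub_download
from datasets import load_dataset

# Assumed file name; verify against the dataset repo's file listing.
repo_id = "medalpaca/medical_meadow_usmle_self_assessment"
filename = "step1.json"

# Download just that one JSON file from the Hub ...
local_path = hf_hub_download(repo_id=repo_id, filename=filename, repo_type="dataset")

# ... and load only this file, so the other JSON files in the repo
# cannot trigger the schema-cast error shown above.
step1 = load_dataset("json", data_files=local_path, split="train")

Depending on how a given file is structured (a list of records vs. a single nested object), you may need to parse it with the standard json module instead of the json builder of datasets.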