Hi, thanks for the repo!
I can't seem to load the USMLE datasets used in eval.
Two of the three links provided in the README don't seem to work (USMLE Self Assessment Step 2, USMLE Self Assessment Step 3).
The first link works, but loading the dataset from Hugging Face fails as follows:
```python
from datasets import load_dataset

part1 = load_dataset('medalpaca/medical_meadow_usmle_self_assessment', split='train')
```
Error:
```
Downloading and preparing dataset json/medalpaca--medical_meadow_usmle_self_assessment to
/root/.cache/huggingface/datasets/medalpaca___json/medalpaca--medical_meadow_usmle_self_assessment-333492f3a84c0741/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...
Downloading data files: 100% 1/1 [00:00<00:00, 63.10it/s]
Extracting data files: 100% 1/1 [00:00<00:00, 41.63it/s]

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1873, in _prepare_split_single
    writer.write_table(table)
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_writer.py", line 568, in write_table
    pa_table = table_cast(pa_table, self._schema)
  File "/usr/local/lib/python3.10/dist-packages/datasets/table.py", line 2290, in table_cast
    return cast_table_to_schema(table, schema)
  File "/usr/local/lib/python3.10/dist-packages/datasets/table.py", line 2248, in cast_table_to_schema
    raise ValueError(f"Couldn't cast\n{table.schema}\nto\n{features}\nbecause column names don't match")
ValueError: Couldn't cast
image_url: string
no: int64
options: struct<A: string, B: string, C: string, D: string, E: string, F: string, G: string, H: string, I: string>
question: string
image: string
to
{'step1': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'step2': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'step3': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}
because column names don't match

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<cell line: 2>", line 2, in <module>
  File "/usr/local/lib/python3.10/dist-packages/datasets/load.py", line 1797, in load_dataset
    builder_instance.download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 890, in download_and_prepare
    self._download_and_prepare(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 985, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1746, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1891, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset")
DatasetGenerationError: An error occurred while generating the dataset
```
Can you assist with an example of how to download the USMLE datasets?
You are correct, I had typos in the links. Please note that this dataset likely will not load correctly through the Hugging Face `datasets` API, as all JSON files are stored in the same repo: the loader infers a single schema across every JSON file it finds, so the question file (with `options`, `question`, `image_url`, ...) cannot be cast to the answer-key schema (`step1`/`step2`/`step3`) shown in your traceback.
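As a workaround, here is a minimal sketch that loads one JSON file at a time, so each call only sees a single schema. The file name below is a placeholder, not the actual name in the repo; list the repo's files first to get the real ones:

```python
from huggingface_hub import list_repo_files
from datasets import load_dataset

repo_id = "medalpaca/medical_meadow_usmle_self_assessment"

# The repo holds several JSON files with different schemas, which is why the
# default loader fails. List the files to see what is actually available.
print(list_repo_files(repo_id, repo_type="dataset"))

# Point load_dataset at a single file so only that file's schema is used.
# "step1.json" is a hypothetical name -- substitute one from the listing above.
step1 = load_dataset(repo_id, data_files="step1.json", split="train")
```

The answer-key file (the one with the `step1`/`step2`/`step3` columns from the traceback) can be loaded the same way with its own `data_files` argument.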