ibrahimethemhamamci / CT-CLIP

Developing Generalist Foundation Models from a Multimodal Dataset for 3D Computed Tomography

Issue with loading the dataset with huggingface #9

Closed jaehwana2z closed 6 months ago

jaehwana2z commented 6 months ago

I get the following error after running the command:

load_dataset("ibrahimhamamci/CT-RATE")

Generating train split: 47149 examples [00:00, 135849.06 examples/s]
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/datasets/builder.py", line 1989, in _prepare_split_single
    writer.write_table(table)
  File "/opt/conda/lib/python3.10/site-packages/datasets/arrow_writer.py", line 584, in write_table
    pa_table = table_cast(pa_table, self._schema)
  File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 2240, in table_cast
    return cast_table_to_schema(table, schema)
  File "/opt/conda/lib/python3.10/site-packages/datasets/table.py", line 2194, in cast_table_to_schema
    raise CastError(
datasets.table.CastError: Couldn't cast
VolumeName: string
Medical material: int64
Arterial wall calcification: int64
Cardiomegaly: int64
Pericardial effusion: int64
Coronary artery wall calcification: int64
Hiatal hernia: int64
Lymphadenopathy: int64
Emphysema: int64
Atelectasis: int64
Lung nodule: int64
Lung opacity: int64
Pulmonary fibrotic sequela: int64
Pleural effusion: int64
Mosaic attenuation pattern: int64
Peribronchial thickening: int64
Consolidation: int64
Bronchiectasis: int64
Interlobular septal thickening: int64
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 2787
to
{'VolumeName': Value(dtype='string', id=None), 'Manufacturer': Value(dtype='string', id=None), 'SeriesDescription': Value(dtype='string', id=None), 'ManufacturerModelName': Value(dtype='string', id=None), 'PatientSex': Value(dtype='string', id=None), 'PatientAge': Value(dtype='string', id=None), 'ReconstructionDiameter': Value(dtype='float64', id=None), 'DistanceSourceToDetector': Value(dtype='float64', id=None), 'DistanceSourceToPatient': Value(dtype='float64', id=None), 'GantryDetectorTilt': Value(dtype='int64', id=None), 'TableHeight': Value(dtype='float64', id=None), 'RotationDirection': Value(dtype='string', id=None), 'ExposureTime': Value(dtype='float64', id=None), 'XRayTubeCurrent': Value(dtype='int64', id=None), 'Exposure': Value(dtype='int64', id=None), 'FilterType': Value(dtype='string', id=None), 'GeneratorPower': Value(dtype='float64', id=None), 'FocalSpots': Value(dtype='string', id=None), 'ConvolutionKernel': Value(dtype='string', id=None), 'PatientPosition': Value(dtype='string', id=None), 'RevolutionTime': Value(dtype='float64', id=None), 'SingleCollimationWidth': Value(dtype='float64', id=None), 'TotalCollimationWidth': Value(dtype='float64', id=None), 'TableSpeed': Value(dtype='float64', id=None), 'TableFeedPerRotation': Value(dtype='float64', id=None), 'SpiralPitchFactor': Value(dtype='float64', id=None), 'DataCollectionCenterPatient': Value(dtype='string', id=None), 'ReconstructionTargetCenterPatient': Value(dtype='string', id=None), 'ExposureModulationType': Value(dtype='string', id=None), 'CTDIvol': Value(dtype='float64', id=None), 'ImagePositionPatient': Value(dtype='string', id=None), 'ImageOrientationPatient': Value(dtype='string', id=None), 'SliceLocation': Value(dtype='float64', id=None), 'SamplesPerPixel': Value(dtype='int64', id=None), 'PhotometricInterpretation': Value(dtype='string', id=None), 'Rows': Value(dtype='int64', id=None), 'Columns': Value(dtype='int64', id=None), 'XYSpacing': Value(dtype='string', id=None), 'RescaleIntercept': Value(dtype='int64', id=None), 'RescaleSlope': Value(dtype='int64', id=None), 'RescaleType': Value(dtype='string', id=None), 'NumberofSlices': Value(dtype='int64', id=None), 'ZSpacing': Value(dtype='float64', id=None), 'StudyDate': Value(dtype='int64', id=None)}
because column names don't match

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "", line 1, in
  File "/opt/conda/lib/python3.10/site-packages/datasets/load.py", line 2582, in load_dataset
    builder_instance.download_and_prepare(
  File "/opt/conda/lib/python3.10/site-packages/datasets/builder.py", line 1005, in download_and_prepare
    self._download_and_prepare(
  File "/opt/conda/lib/python3.10/site-packages/datasets/builder.py", line 1100, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/opt/conda/lib/python3.10/site-packages/datasets/builder.py", line 1860, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/opt/conda/lib/python3.10/site-packages/datasets/builder.py", line 1991, in _prepare_split_single
    raise DatasetGenerationCastError.from_cast_error(
datasets.exceptions.DatasetGenerationCastError: An error occurred while generating the dataset

All the data files must have the same columns, but at some point there are 18 new columns (Pericardial effusion, Coronary artery wall calcification, Mosaic attenuation pattern, Medical material, Lung nodule, Bronchiectasis, Lung opacity, Hiatal hernia, Pleural effusion, Pulmonary fibrotic sequela, Interlobular septal thickening, Atelectasis, Cardiomegaly, Consolidation, Lymphadenopathy, Peribronchial thickening, Emphysema, Arterial wall calcification) and 43 missing columns (DataCollectionCenterPatient, ConvolutionKernel, Rows, CTDIvol, TableHeight, SeriesDescription, RotationDirection, RescaleType, TotalCollimationWidth, Columns, GantryDetectorTilt, TableSpeed, TableFeedPerRotation, SingleCollimationWidth, RevolutionTime, ImageOrientationPatient, ExposureModulationType, SliceLocation, PatientSex, PhotometricInterpretation, NumberofSlices, ManufacturerModelName, DistanceSourceToDetector, XRayTubeCurrent, ReconstructionTargetCenterPatient, DistanceSourceToPatient, RescaleSlope, ZSpacing, SamplesPerPixel, StudyDate, PatientAge, RescaleIntercept, Manufacturer, Exposure, FocalSpots, SpiralPitchFactor, FilterType, ReconstructionDiameter, ExposureTime, GeneratorPower, XYSpacing, ImagePositionPatient, PatientPosition).

This happened while the csv dataset builder was generating data using

hf://datasets/ibrahimhamamci/CT-RATE/dataset/multi_abnormality_labels/train_predicted_labels.csv (at revision 4d92f6d4f805e36e2891359c04302705c314fe43)
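
For readers who hit the same message: the error says that the CSV files in the repository do not share one set of columns, so they cannot be cast to a single schema for one split. A minimal sketch of how one might check this on locally downloaded copies is shown here. It uses only the Python standard library; the helper names `group_csvs_by_header` and `column_diff` are hypothetical and are not part of `datasets`.

```python
import csv
from collections import defaultdict
from pathlib import Path


def group_csvs_by_header(paths):
    """Map each distinct header (tuple of column names) to the files using it."""
    groups = defaultdict(list)
    for path in paths:
        with open(path, newline="", encoding="utf-8") as handle:
            # Only the first row is read; an empty file yields an empty header.
            header = tuple(next(csv.reader(handle), ()))
        groups[header].append(Path(path).name)
    return dict(groups)


def column_diff(reference, other):
    """Return (new, missing) columns of `other` relative to `reference`."""
    ref, oth = set(reference), set(other)
    return sorted(oth - ref), sorted(ref - oth)
```

If `group_csvs_by_header` returns more than one key, the files fall into several incompatible schemas, which matches the "new columns" and "missing columns" wording in the error above.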

sezginerr commented 6 months ago

Dear @jaehwana2z,

The same issue is discussed on Hugging Face: https://huggingface.co/datasets/ibrahimhamamci/CT-RATE/discussions/53 Please see the discussion thread for more information about this and for other ways to download the dataset. The problem should now be fixed for specific dataset configurations (labels, reports, or metadata). Please let me know if you still have issues with this!
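
As a rough illustration of loading one configuration at a time, the sketch below passes a configuration name as the `name` argument of `load_dataset`. It assumes the configuration names are literally `labels`, `reports`, and `metadata`, as listed in the reply above; check the dataset card for the exact names. The wrapper `load_ct_rate` is a hypothetical helper, not part of the CT-CLIP repository.

```python
# Assumed configuration names, taken from the reply above; verify on the dataset card.
ASSUMED_CONFIGS = ("labels", "reports", "metadata")


def load_ct_rate(config_name, repo_id="ibrahimhamamci/CT-RATE"):
    """Load a single CT-RATE configuration instead of every CSV at once."""
    if config_name not in ASSUMED_CONFIGS:
        raise ValueError(
            f"Unknown configuration {config_name!r}; expected one of {ASSUMED_CONFIGS}"
        )
    # Imported lazily so the name check works even without `datasets` installed.
    from datasets import load_dataset

    # Each configuration groups CSV files that share one schema,
    # so the column mismatch between label and metadata files does not arise.
    return load_dataset(repo_id, name=config_name)
```

Calling `load_ct_rate("labels")` would then download and prepare only the label files. It needs network access and may require accepting the dataset's terms on Hugging Face first.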