tensorflow / datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
https://www.tensorflow.org/datasets
Apache License 2.0

[data request] figshare brain tumor dataset #5395

Open BirkhoffLee opened 7 months ago

BirkhoffLee commented 7 months ago

This brain tumor dataset contains 3064 T1-weighted contrast-enhanced images from 233 patients with three kinds of brain tumor: meningioma (708 slices), glioma (1426 slices), and pituitary tumor (930 slices). Due to the repository's file size limit, we split the whole dataset into 4 subsets and archived them in 4 .zip files, each containing 766 slices. The 5-fold cross-validation indices are also provided.
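As a quick sanity check on the figures quoted above, the per-class slice counts and the archive layout can be cross-checked with a few lines of Python. The variable names below are illustrative only and are not part of the dataset.

```python
# Figures copied from the dataset description above; names are illustrative.
CLASS_COUNTS = {"meningioma": 708, "glioma": 1426, "pituitary tumor": 930}
TOTAL_SLICES = 3064
NUM_ARCHIVES = 4
SLICES_PER_ARCHIVE = 766

total = sum(CLASS_COUNTS.values())

# The three class counts should add up to the stated total,
# and so should the four equally sized .zip archives.
assert total == TOTAL_SLICES
assert NUM_ARCHIVES * SLICES_PER_ARCHIVE == TOTAL_SLICES

# Share of each class, useful for judging class imbalance.
class_fractions = {name: count / total for name, count in CLASS_COUNTS.items()}
for name, fraction in class_fractions.items():
    print(f"{name}: {fraction:.1%}")
```

The classes are noticeably imbalanced (glioma has roughly twice as many slices as meningioma), which may matter when choosing evaluation metrics.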


This data is organized in MATLAB data format (.mat file). Each file stores a struct containing the following fields for an image:

  - cjdata.label: 1 for meningioma, 2 for glioma, 3 for pituitary tumor
  - cjdata.PID: patient ID
  - cjdata.image: image data
  - cjdata.tumorBorder: a vector storing the coordinates of discrete points on the tumor border, for example [x1, y1, x2, y2, ...], in which x1, y1 are planar coordinates on the tumor border. It was generated by manually delineating the tumor border, so it can be used to generate a binary image of the tumor mask.
  - cjdata.tumorMask: a binary image with 1s indicating the tumor region
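To illustrate how a border vector of the form [x1, y1, x2, y2, ...] could be turned into a binary mask, here is a minimal pure-Python sketch using the even-odd (ray casting) rule. It is not the authors' method. The function name `border_to_mask` is hypothetical, and the code assumes that x is the column index and y is the row index, which should be verified against cjdata.tumorMask before relying on it.

```python
def border_to_mask(border, height, width):
    """Rasterize a closed polygon given as [x1, y1, x2, y2, ...] into a 0/1 mask.

    Assumption (not confirmed by the dataset description): x is the column
    index and y is the row index. Pixels are tested at integer coordinates
    with the even-odd rule, so pixels lying exactly on the border may fall
    on either side.
    """
    xs = border[0::2]
    ys = border[1::2]
    n = len(xs)
    mask = [[0] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            inside = False
            j = n - 1  # previous vertex, so (j, i) walks every polygon edge
            for i in range(n):
                xi, yi, xj, yj = xs[i], ys[i], xs[j], ys[j]
                # Does this edge cross the horizontal line through `row`?
                if (yi > row) != (yj > row):
                    x_cross = xi + (row - yi) * (xj - xi) / (yj - yi)
                    # Count only crossings to the right of the pixel.
                    if col < x_cross:
                        inside = not inside
                j = i
            mask[row][col] = 1 if inside else 0
    return mask


# Tiny example: a square border on a 6x6 grid.
square = [1, 1, 4, 1, 4, 4, 1, 4]
example_mask = border_to_mask(square, height=6, width=6)
```

This loop costs O(height × width × points) and is meant for illustration only. Since each file already ships cjdata.tumorMask, regenerating the mask is only needed if you want to check or re-derive it.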


This data was used in the following papers:

  1. Cheng, Jun, et al. "Enhanced Performance of Brain Tumor Classification via Tumor Region Augmentation and Partition." PloS one 10.10 (2015).
  2. Cheng, Jun, et al. "Retrieval of Brain Tumors by Adaptive Spatial Pooling and Fisher Vector Representation." PloS one 11.6 (2016).

MATLAB source code is available on GitHub: https://github.com/chengjun583/brainTumorRetrieval

Jun Cheng
School of Biomedical Engineering
Southern Medical University, Guangzhou, China
Email: chengjun583@qq.com

Folks who would also like to see this dataset in tensorflow/datasets, please thumbs-up so the developers can know which requests to prioritize.

And if you'd like to contribute the dataset (thank you!), see our guide to adding a dataset.

BirkhoffLee commented 7 months ago

Here's Python code shared by someone on Kaggle that transforms the raw .mat files into NumPy arrays of brain tumor MRI images: https://www.kaggle.com/code/tasni18/brain-tumor-classification

ccl-core commented 6 months ago

Hello @BirkhoffLee and thank you for raising this issue!

Are you planning to add this dataset to TFDS yourself? If yes, you can follow this guide to adding a dataset.

As an example, you can refer to this recent commit that introduced the Databricks Dolly dataset.

BirkhoffLee commented 6 months ago

I'd love to, but I have a few questions:

  1. Removal of some data. I currently use the dataset in an image classification research project. The original dataset was published in MATLAB format. I have extracted the images as .PNG files (i.e., dropping some of the data in the original dataset). Can I keep it that way in the TFDS repo? To be more specific, I would retain only cjdata.label and cjdata.image.
  2. Training split. The original dataset is not split into training and test sets. How should I handle that in this repo?
  3. Hosting. Does the TFDS / TensorFlow project offer any place to store the dataset files? I do not see other datasets hosted here.
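On question 2, one option worth considering (a sketch only, not an official TFDS recommendation) is to split by patient rather than by slice, since the dataset has multiple slices per patient and each file carries a cjdata.PID. The helper below is hypothetical: `split_by_patient`, `test_fraction`, and `seed` are names introduced here for illustration.

```python
import random
from collections import defaultdict


def split_by_patient(patient_ids, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) with no patient in both sets.

    `patient_ids` holds one patient ID per slice, in slice order.
    Hypothetical helper; the fraction and seed are arbitrary defaults.
    """
    # Group slice indices by the patient they belong to.
    by_patient = defaultdict(list)
    for index, pid in enumerate(patient_ids):
        by_patient[pid].append(index)

    # Shuffle patients (not slices) deterministically.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    # Hold out a fraction of patients, at least one.
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])

    train, test = [], []
    for pid in patients:
        target = test if pid in test_patients else train
        target.extend(by_patient[pid])
    return sorted(train), sorted(test)


# Toy example: 8 slices from 5 patients.
pids = ["a", "a", "b", "c", "c", "c", "d", "e"]
train_idx, test_idx = split_by_patient(pids, test_fraction=0.2, seed=0)
```

Splitting at the patient level avoids having slices from the same patient in both training and test data. The 5-fold cross-validation indices mentioned in the dataset description could be used instead if they suit the project.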

I have another dataset that I would like to add to this repo: https://www.kaggle.com/datasets/sartajbhuvaji/brain-tumor-classification-mri. If the guidelines can be clarified, I'd be able to add it as well.

I'm new to the field, so apologies for any naive questions above. I do wish to contribute to this repo because it makes research a lot easier. Many thanks :-)

mostafamohamedcx8 commented 2 months ago

Hello @BirkhoffLee, were you able to combine the two datasets and work with them together?