huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Add CheXpert dataset for vision #6382

Open SauravMaheshkar opened 10 months ago

SauravMaheshkar commented 10 months ago

Feature request

Name

CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison

Paper

https://arxiv.org/abs/1901.07031

Data

https://stanfordaimi.azurewebsites.net/datasets/8cbd9ed4-2eb9-4565-affc-111cf4f7ebe2

Motivation

CheXpert is one of the fundamental datasets in medical image classification and can serve as a viable pre-training dataset for radiology classification or for small-scale ablation / exploratory studies.

This could also serve as a good pre-training dataset for Kaggle competitions.

Your contribution

Would love to make a PR and pre-process / get this into 🤗

katielink commented 10 months ago

Hey @SauravMaheshkar! Just responded to your email.

For transparency, copying part of my response here: I agree, it would be really great to have this and other BenchMD datasets easily accessible on the hub.

I think the main limiting factor is that the CheXpert dataset is currently hosted on the Stanford AIMI Shared Datasets website, under a license that, IIRC, does not permit redistribution. Thus, I believe we would need to write a dataset loading script that checks authentication with the Stanford AIMI site before downloading and extracting the data.
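A rough sketch of what the gate-then-extract flow could look like, using only the Python standard library. The authentication mechanism of the Stanford AIMI site is not described in this thread, so the sketch swaps it for a simpler check: the user downloads the archive themselves after accepting the license, and the script refuses to proceed without it. The `CHEXPERT_ARCHIVE` environment variable, the function names, the zip format, and the `train.csv` / `Path` column layout are all assumptions for illustration, not a confirmed API or file layout.

```python
import csv
import os
import zipfile
from pathlib import Path
from typing import Dict, Iterator, Optional


class MissingAccessError(RuntimeError):
    """Raised when no locally downloaded copy of the data is available."""


def resolve_archive(archive_path: Optional[str] = None) -> Path:
    """Return the path to a user-supplied archive, or fail with instructions.

    Assumption: the user downloaded the archive manually after accepting the
    license, and points to it via an argument or the (hypothetical)
    CHEXPERT_ARCHIVE environment variable. Nothing is redistributed here.
    """
    candidate = archive_path or os.environ.get("CHEXPERT_ARCHIVE")
    if not candidate:
        raise MissingAccessError(
            "No archive given. Download the dataset from the Stanford AIMI "
            "site with your own account, then pass its local path."
        )
    path = Path(candidate).expanduser()
    if not path.is_file():
        raise MissingAccessError(f"Archive not found at {path}")
    return path


def extract_archive(archive: Path, target_dir: Path) -> Path:
    """Extract the zip archive into target_dir and return that directory."""
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target_dir)
    return target_dir


def iter_examples(root: Path, csv_name: str = "train.csv") -> Iterator[Dict[str, str]]:
    """Yield one dict per CSV row, with the image path resolved against root.

    Assumption: the label file has a "Path" column holding relative image paths.
    """
    with open(root / csv_name, newline="") as handle:
        for row in csv.DictReader(handle):
            example = {key: value for key, value in row.items() if key != "Path"}
            example["image_path"] = str(root / row["Path"])
            yield example
```

A real loading script would replace `resolve_archive` with whatever access check the AIMI site actually supports; the extraction and row iteration would stay roughly the same.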

I've started a HF dataset repo here, in case you want to collaborate on writing up this loading script! I'm also happy to take a stab when I have some more time next week.

charchit7 commented 9 months ago

Hey @katielink I would love to try this out. Please guide me.

Lord-of-Bugs commented 8 months ago

Hi @katielink, I would also love to contribute to this loading script/project if it is still being developed. I'm interested because I would like to gain access to the CheXpert dataset myself and am running into some odd issues, so I'd like to sort them out for myself and potentially for others. Please keep me updated and guide me on this as well!