huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Add COCO datasets #2526

Open NielsRogge opened 3 years ago

NielsRogge commented 3 years ago

Adding a Dataset

Instructions to add a new dataset can be found here.

merveenoyan commented 2 years ago

I'm currently adding it. The entire dataset is quite big, around 30 GB, so I'm adding the splits separately. You can take a look here: https://huggingface.co/datasets/merve/coco

merveenoyan commented 2 years ago

I talked to @lhoestq and it's best if I download this dataset through TensorFlow Datasets instead, so I'll be implementing that one really soon. @NielsRogge

merveenoyan commented 2 years ago

I started adding COCO and will be done by tomorrow EOD. My work so far: https://github.com/merveenoyan/datasets (my fork)

ryx19th commented 2 years ago

Hi Merve @merveenoyan, thank you so much for your great contribution! May I ask about the current progress of your implementation? I see the pull request is still in progress here. Or can I just run the COCO scripts from your fork?

merveenoyan commented 2 years ago

Hello @yixuanren, I had another prioritized project about to be merged, but I'll continue today and finish up soon.

ryx19th commented 2 years ago

Hello @yixuanren, I had another prioritized project about to be merged, but I'll continue today and finish up soon.

It's really nice of you!! I see you've committed another version just now.

merveenoyan commented 2 years ago

@yixuanren we're working on it; it will be available soon. Thanks a lot for your patience!

lombardata commented 1 year ago

Hi @NielsRogge and @merveenoyan, did you find a way to upload a dataset with COCO annotations to the HF Hub? I have a panoptic segmentation dataset in COCO format and would like to share it with the community. Thanks in advance :)

lhoestq commented 1 year ago

The COCO format is not supported out of the box on the HF Hub; you'd need to reformat it into the ImageFolder with metadata format, or write a loading script.
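
For reference, the ImageFolder with metadata format pairs each image with a row of a metadata.jsonl file placed next to the images. A minimal sketch, with hypothetical file names and an object-detection-style objects column:

my_dataset/
    train/
        metadata.jsonl
        0001.jpg
        0002.jpg

where each line of metadata.jsonl references an image by its file_name:

{"file_name": "0001.jpg", "objects": {"bbox": [[10, 20, 30, 40]], "categories": [0]}}
{"file_name": "0002.jpg", "objects": {"bbox": [[5, 5, 50, 60]], "categories": [3]}}

Such a folder can then be loaded with load_dataset("imagefolder", data_dir="my_dataset").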

lombardata commented 1 year ago

The COCO format is not supported out of the box on the HF Hub; you'd need to reformat it into the ImageFolder with metadata format, or write a loading script.

Hi @lhoestq, thank you for your quick reply. I've correctly created a metadata.jsonl file for a dataset with instance segmentation annotations here, but I don't understand how to integrate panoptic annotations with the metadata format of ImageFolder datasets. The "problem" with panoptic annotations is that we have a folder with images, a JSON file with annotations, and another folder with PNG annotations.

I checked all the datasets already published on Hugging Face, and the only one who has uploaded a correct panoptic dataset is @NielsRogge, here and here. Indeed, he managed to have three fields: 1. image (image), 2. label (image), 3. segments_info (list), but I cannot find the corresponding code that uploads a panoptic dataset from these three sources. Could you please share some example code? Thanks!

lhoestq commented 1 year ago

Both were uploaded using ds.push_to_hub() :)

You can get a Dataset from a Python dictionary using ds = Dataset.from_dict(...) and cast the paths to images to the Image() type using ds = ds.cast_column("image", Image()).

from datasets import Dataset, Image

ds = Dataset.from_dict(...)            # dict mapping column names to lists of values
ds = ds.cast_column("image", Image())  # cast image file paths to the Image feature
ds = ds.cast_column("label", Image())  # same for the panoptic segmentation maps
ds.push_to_hub(...)                    # upload the dataset to the Hub
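
Concretely, the dictionary maps column names to lists of values, with the images passed as file paths; a minimal sketch with hypothetical paths:

from datasets import Dataset, Image

# Hypothetical paths, for illustration only
ds = Dataset.from_dict({
    "image": ["images/0001.jpg", "images/0002.jpg"],
    "label": ["panoptic/0001.png", "panoptic/0002.png"],
})
ds = ds.cast_column("image", Image())
ds = ds.cast_column("label", Image())
ds.push_to_hub("username/my-panoptic-dataset")  # hypothetical repo id
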
lombardata commented 1 year ago

Both were uploaded using ds.push_to_hub() :)

You can get a Dataset from a Python dictionary using ds = Dataset.from_dict(...) and cast the paths to images to the Image() type using ds = ds.cast_column("image", Image()).

from datasets import Dataset, Image

ds = Dataset.from_dict(...)            # dict mapping column names to lists of values
ds = ds.cast_column("image", Image())  # cast image file paths to the Image feature
ds = ds.cast_column("label", Image())  # same for the panoptic segmentation maps
ds.push_to_hub(...)                    # upload the dataset to the Hub

Thank you very much @lhoestq, I successfully created an HF dataset here with the two fields 1. image (image) and 2. label (image), following your suggestions. The remaining problem is uploading the segments_info information to the dataset. Is there a function that easily imports the _panoptic_cocoannotation.json file into a segments_info field? I think we must define a list of segments, i.e. a list of lists of this type:

[ { "area": 214858, "bbox": [ 0, 0, 511, 760 ], "category_id": 0, "id": 7895160, "iscrowd": 0 }, { "area": 73067, "bbox": [ 98, 719, 413, 253 ], "category_id": 3, "id": 3289680, "iscrowd": 0 }, { "area": 832, "bbox": [ 53, 0, 101, 16 ], "category_id": 5, "id": 5273720, "iscrowd": 0 }, { "area": 70668, "bbox": [ 318, 60, 191, 392 ], "category_id": 8, "id": 15132390, "iscrowd": 0 }, { "area": 32696, "bbox": [ 0, 100, 78, 872 ], "category_id": 18, "id": 472063, "iscrowd": 0 }, { "area": 76045, "bbox": [ 42, 48, 264, 924 ], "category_id": 37, "id": 16713830, "iscrowd": 0 }, { "area": 27103, "bbox": [ 288, 482, 216, 306 ], "category_id": 47, "id": 16753408, "iscrowd": 0 } ]

and then apply the cast_column function again here, but with a list as the second argument, like:

from datasets import Dataset, Image
ds = ds.cast_column("image", Image())
ds = ds.cast_column("label", Image())
ds = ds.cast_column("segments_info", list)

but I do not see how to transfer the information from the _panoptic_cocoannotation.json file into a list of lists of this type, as @NielsRogge has done here and here. Thank you again for your help and have a good day!

lhoestq commented 1 year ago

You can pass this data directly in .from_dict(); no need to cast anything for this column.

ds = Dataset.from_dict({
    "image": [...],
    "label": [...],
    "segments_info": [...],
})

where segments_info is the list of the segment_info entries of all the examples in the dataset, and is therefore a list of lists of dicts.
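
For instance, for a two-image dataset, segments_info could look like this (values purely illustrative, in the same format as the panoptic JSON above):

segments_info = [
    # segments of the first image
    [{"area": 214858, "bbox": [0, 0, 511, 760], "category_id": 0, "id": 7895160, "iscrowd": 0}],
    # segments of the second image
    [{"area": 832, "bbox": [53, 0, 101, 16], "category_id": 5, "id": 5273720, "iscrowd": 0},
     {"area": 27103, "bbox": [288, 482, 216, 306], "category_id": 47, "id": 16753408, "iscrowd": 0}],
]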

lombardata commented 1 year ago

You can pass this data directly in .from_dict(); no need to cast anything for this column.

ds = Dataset.from_dict({
    "image": [...],
    "label": [...],
    "segments_info": [...],
})

where segments_info is the list of the segment_info entries of all the examples in the dataset, and is therefore a list of lists of dicts.

Thank you for the quick reply @lhoestq, but then how do I generate the segments_info list of lists of dicts starting from a _panoptic_cocoannotation.json file?

lhoestq commented 1 year ago

You read the JSON file and transform the data yourself; I don't think there's an automatic converter anywhere.
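
A rough sketch of such a transformation, assuming the standard panoptic COCO layout (an images list plus an annotations list whose entries carry image_id, file_name and segments_info); the directory names, file names and repo id below are hypothetical:

import json
import os

from datasets import Dataset, Image

IMAGES_DIR = "images"          # hypothetical folder with the input images
PANOPTIC_DIR = "panoptic_png"  # hypothetical folder with the PNG segmentation maps

with open("panoptic_annotations.json") as f:  # hypothetical file name
    coco = json.load(f)

# Join annotations to images via image_id -> file_name
id2file = {img["id"]: img["file_name"] for img in coco["images"]}

images, labels, segments_info = [], [], []
for ann in coco["annotations"]:
    images.append(os.path.join(IMAGES_DIR, id2file[ann["image_id"]]))
    labels.append(os.path.join(PANOPTIC_DIR, ann["file_name"]))  # the PNG label map
    segments_info.append(ann["segments_info"])  # list of dicts for this image

ds = Dataset.from_dict({
    "image": images,
    "label": labels,
    "segments_info": segments_info,
})
ds = ds.cast_column("image", Image())
ds = ds.cast_column("label", Image())
ds.push_to_hub("username/my-panoptic-dataset")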

lombardata commented 1 year ago

You read the JSON file and transform the data yourself; I don't think there's an automatic converter anywhere.

Perfect, I've done it and successfully uploaded a new dataset here, but I have (I hope) one last problem. The dataset currently has 302 images and, when I upload it to the Hub, only the first page of images is displayed correctly. When I look at the second/third/fourth page of items in my dataset, I can see that the segments_info and image_name fields are correctly uploaded, while the images are not (the string "null" is shown everywhere).

I've checked the paths of the images that are not displayed and they exist. Is there a problem with the size of the dataset? How can I upload the whole dataset to the Hub? Thank you again @lhoestq and have a good day!

lhoestq commented 1 year ago

Awesome! Your dataset looks all good 🤗

The null in the viewer is a bug on our side; let me investigate.