NielsRogge opened this issue 3 years ago
I'm currently adding it; the entire dataset is quite big, around 30 GB, so I'm adding splits separately. You can take a look here: https://huggingface.co/datasets/merve/coco
I talked to @lhoestq and it's best if I download this dataset through TensorFlow Datasets instead, so I'll be implementing that one really soon. @NielsRogge
I started adding COCO and will be done by EOD tomorrow. My work so far: https://github.com/merveenoyan/datasets (my fork)
Hi Merve @merveenoyan, thank you so much for your great contribution! May I ask about the current progress of your implementation? I see the pull request is still in progress here. Or can I just run the COCO scripts in your fork repo?
Hello @yixuanren, I had another prioritized project about to be merged, but I'll continue today and finish up soon.
It's really nice of you!! I see you've committed another version just now.
@yixuanren we're working on it, will be available soon, thanks a lot for your patience
Hi @NielsRogge and @merveenoyan, did you find a way to upload a dataset with COCO annotations to the HF Hub? I have a panoptic segmentation dataset in COCO format and would like to share it with the community. Thanks in advance :)
The COCO format is not supported out of the box on the HF Hub - you'd need to reformat it to an ImageFolder with metadata format, or write a loading script.
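For the ImageFolder route, the key piece is a metadata.jsonl file sitting next to the images, where each line is a JSON object keyed by file_name. Below is a minimal sketch of generating such a file from in-memory annotations; the helper name and the shape of annotations_by_file are illustrative assumptions, not part of the datasets API:

```python
import json
from pathlib import Path

def write_metadata_jsonl(image_dir, annotations_by_file, out_name="metadata.jsonl"):
    """Write an ImageFolder-style metadata.jsonl next to the images.

    annotations_by_file maps each image file name to a dict of extra
    metadata columns (e.g. COCO-style objects). "file_name" is the one
    column ImageFolder requires on every line.
    """
    out_path = Path(image_dir) / out_name
    with open(out_path, "w") as f:
        for file_name, meta in annotations_by_file.items():
            f.write(json.dumps({"file_name": file_name, **meta}) + "\n")
    return out_path

# Once metadata.jsonl sits next to the images, the folder loads with:
#   from datasets import load_dataset
#   ds = load_dataset("imagefolder", data_dir=image_dir)
```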
Hi @lhoestq, thank you for your quick reply. I've correctly created a metadata.jsonl file for a dataset with instance segmentation annotations here, but I don't understand how I can integrate panoptic annotations with the metadata format of ImageFolder datasets. The "problem" with panoptic annotations is that we have a folder with images, a JSON file with annotations, and another folder with PNG annotations.
I checked all the datasets already published on Hugging Face, and the only one who has uploaded a correct panoptic dataset is @NielsRogge, here and here. Indeed, he managed to have three fields: 1. image (image) 2. label (image) 3. segments_info (list), but I could not find the corresponding code that allows uploading a panoptic dataset from these 3 sources. Can you please share some example code? Thanks!
Both were uploaded using ds.push_to_hub() :)

You can get a Dataset from a Python dictionary using ds = Dataset.from_dict(...) and cast the paths to images to the Image() type using ds = ds.cast_column("image", Image()):

```python
from datasets import Dataset, Image

ds = Dataset.from_dict(...)
ds = ds.cast_column("image", Image())
ds = ds.cast_column("label", Image())
ds.push_to_hub(...)
```
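For a panoptic layout (an image folder, a PNG label folder, and per-image segment lists), the dictionary handed to Dataset.from_dict(...) could be assembled along these lines. This is a sketch under stated assumptions: build_panoptic_dict is a made-up helper, and matching .jpg images to .png labels by file stem is an assumed naming convention:

```python
from pathlib import Path

def build_panoptic_dict(images_dir, labels_dir, segments_by_stem):
    """Assemble the dict that Dataset.from_dict(...) expects.

    Assumes each image <stem>.jpg has a label <stem>.png, and that
    segments_by_stem maps each stem to its list of segment dicts.
    """
    images, labels, segments = [], [], []
    for image_path in sorted(Path(images_dir).glob("*.jpg")):
        stem = image_path.stem
        images.append(str(image_path))
        labels.append(str(Path(labels_dir) / f"{stem}.png"))
        segments.append(segments_by_stem.get(stem, []))
    return {"image": images, "label": labels, "segments_info": segments}

# The result feeds the snippet above:
#   ds = Dataset.from_dict(build_panoptic_dict(...))
#   ds = ds.cast_column("image", Image())
```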
Thank you very much @lhoestq, I successfully created a HF dataset here with the two fields 1. image (image) and 2. label (image) following your suggestions. Now there still remains the problem of uploading the segments_info information to the dataset. Is there a function that easily imports the _panoptic_cocoannotation.json file into a segments_info field? I think we must define a list of segments, i.e. a list of lists of this type:
```json
[
  { "area": 214858, "bbox": [0, 0, 511, 760],    "category_id": 0,  "id": 7895160,  "iscrowd": 0 },
  { "area": 73067,  "bbox": [98, 719, 413, 253], "category_id": 3,  "id": 3289680,  "iscrowd": 0 },
  { "area": 832,    "bbox": [53, 0, 101, 16],    "category_id": 5,  "id": 5273720,  "iscrowd": 0 },
  { "area": 70668,  "bbox": [318, 60, 191, 392], "category_id": 8,  "id": 15132390, "iscrowd": 0 },
  { "area": 32696,  "bbox": [0, 100, 78, 872],   "category_id": 18, "id": 472063,   "iscrowd": 0 },
  { "area": 76045,  "bbox": [42, 48, 264, 924],  "category_id": 37, "id": 16713830, "iscrowd": 0 },
  { "area": 27103,  "bbox": [288, 482, 216, 306], "category_id": 47, "id": 16753408, "iscrowd": 0 }
]
```
and then apply the cast_column function again, but with a list as the second argument, like:

```python
from datasets import Dataset, Image

ds = ds.cast_column("image", Image())
ds = ds.cast_column("label", Image())
ds = ds.cast_column("segments_info", list)
```
but I do not see how to transfer the information from the _panoptic_cocoannotation.json file into a list of lists of the type shown above,
like @NielsRogge has done here and here. Thank you again for your help and have a good day!
You can pass this data in .from_dict() - no need to cast anything for this column:

```python
ds = Dataset.from_dict({
    "image": [...],
    "label": [...],
    "segments_info": [...],
})
```

where segments_info is the list of the segment_infos of all the examples in the dataset, and is therefore a list of lists of dicts.
Thank you for the quick reply @lhoestq, but then how do I generate the segments_info list of lists of dicts starting from a _panoptic_cocoannotation.json file?
You read the JSON file and transform the data yourself. I don't think there's an automatic converter somewhere
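A sketch of that manual transformation: a COCO panoptic annotation file stores one entry per image under "annotations", each carrying a file_name and a segments_info list, so the list of lists can be rebuilt by indexing on file names. The helper name and the assumption that every image has a matching entry are illustrative, not part of any library:

```python
import json

def segments_info_from_panoptic_json(json_path, file_names):
    """Return one segments_info list per entry of file_names, in order.

    Images with no matching annotation entry get an empty list.
    """
    with open(json_path) as f:
        coco = json.load(f)
    # One annotation entry per image, keyed here by its file_name.
    by_name = {ann["file_name"]: ann["segments_info"] for ann in coco["annotations"]}
    return [by_name.get(name, []) for name in file_names]
```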
Perfect, I've done it and successfully uploaded a new dataset here, but I have (I hope) one last problem. The dataset currently has 302 images and, when I upload it to the hub, only the first page of images is correctly shown. When I look at the second/third/fourth page of items of my dataset, I can see that the fields segments_info and image_name are correctly uploaded, while the images are not (the "null" string is shown everywhere).
I've checked the paths of the images that are not shown and they exist; is there a problem with the size of the dataset? How can I upload the whole dataset to the hub? Thank you again @lhoestq and have a good day!
Awesome! Your dataset looks all good 🤗
The null in the viewer is a bug on our side, let me investigate.
Adding a Dataset
Instructions to add a new dataset can be found here.