Closed: DovydasPociusDroneTeam closed this issue 1 year ago
Hello there, thank you for opening an Issue ! 🙏🏻 The team was notified and they will get back to you asap.
How many images are in your dataset? How large are the images?
This error is the Out of Memory (OOM) killer on your machine acting to ensure the Python process doesn't take up too much RAM and cause instability on your system. This suggests your system isn't able to store all of the images in your dataset in memory, which is required to convert the datasets.
I have about 6000 images (1024×1024). So if it is an OOM problem, any suggestions on how I can get around it?
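For a rough sense of scale (assuming the images are decoded to 8-bit RGB arrays, which is what holding a dataset in memory implies), the footprint can be estimated:

```python
# Back-of-the-envelope RAM estimate for holding every decoded image at once.
# Assumes 8-bit RGB (3 bytes per pixel); real usage adds overhead on top.
num_images = 6000
height, width, channels = 1024, 1024, 3

total_bytes = num_images * height * width * channels
total_gib = total_bytes / (1024 ** 3)
print(f"~{total_gib:.1f} GiB")  # ~17.6 GiB before any Python or processing overhead
```

Close to 18 GiB just for pixel data explains why the OOM killer steps in on machines with less free RAM than that.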
Hi, @DovydasPociusDroneTeam 👋🏻 This is interesting. So the script died, but the output datasets got saved anyway? Would love to learn more.
Hi @DovydasPociusDroneTeam 👋 , we can dig deeper into the process to check for memory leakage but that will take some time.
But as for the other question, about class names being rearranged, I might have an idea. @SkalskiP this is due to sorting classes in alphabetical order, check here.
@hardikdava I'm not sure. We got the input dataset. We divided that dataset into two parts. Saved both parts in YOLO format. Both Output datasets have different class orders. Do I understand the problem correctly?
Is the order different between input and output datasets? Or between both output datasets?
@hardikdava It is somehow a related topic. I think in the future, we should migrate `sv.DetectionDataset.classes` to be a `Dict[int, str]`, not a `List[str]`. We get more and more trouble with breaking indexes.
@SkalskiP Changing `sv.DetectionDataset.classes` into a `Dict[int, str]` was already on my mind. We should definitely do it.
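To illustrate the index-stability argument (a toy sketch, not the actual supervision internals): with a plain list, the class id is just a position, so any subset or reordering silently changes what a stored id means, while a dict keyed by the original ids stays stable.

```python
# With List[str], the class id is a position, so any subset shifts it.
classes_list = ["car", "person", "truck"]   # id 2 means "truck"
subset_list = ["person", "truck"]           # "car" dropped after a split
# Now position 1 means "truck" and id 2 doesn't exist at all.

# With Dict[int, str], ids survive subsetting unchanged.
classes_dict = {0: "car", 1: "person", 2: "truck"}
subset_dict = {k: v for k, v in classes_dict.items() if v != "car"}
print(subset_dict[2])  # still "truck"
```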
I didn't get output from the one 6000-image dataset.
So I tried splitting this dataset into 3 separate datasets: instead of having
train_dataset_images (6000 images)
├ coco.json
├ image1
├ image2
├ image3
└ imageN
I did
train_dataset_images_part1 (2000 images)
├ coco_part1.json
├ image1
├ image2
├ image3
└ imageM
train_dataset_images_part2 (2000 images)
├ coco_part2.json
├ imageM+1
├ imageM+2
├ imageM+3
└ imageN
train_dataset_images_part3 (2000 images)
├ coco_part3.json
├ imageN+1
├ imageN+2
├ imageN+3
└ imageZ
and for every separate dataset with 2000 images I ran `from_coco().as_yolo()`, and I was able to get results without error, but then I checked every output YAML file and saw the "names" array was not the same.
@DovydasPociusDroneTeam, thanks a lot for helping us understand what's happening. Could you help us a bit more and check the `categories` key in `coco_part1.json`, `coco_part2.json`, and `coco_part3.json`?
Please paste `categories` for each JSON here. If the categories are precisely the same in each JSON, then we have a problem.
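A quick check along those lines (the category entries below are made up for illustration; in practice you would `json.load()` each `coco_partN.json` and compare its `categories` key):

```python
# Hypothetical contents of the "categories" key from two part files;
# in practice: json.load(open("coco_part1.json"))["categories"], etc.
part1 = {"categories": [{"id": 0, "name": "car"}, {"id": 1, "name": "person"}]}
part2 = {"categories": [{"id": 0, "name": "person"}, {"id": 1, "name": "car"}]}

def category_order(coco: dict) -> list:
    # The order in which (id, name) pairs appear drives the YOLO "names" array.
    return [(c["id"], c["name"]) for c in coco["categories"]]

if category_order(part1) == category_order(part2):
    print("categories match")
else:
    print("categories differ: the splitter reordered them")
```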
You are right! In my coco_part1.json and coco_part2.json, the categories are not in the same sequence!
Okay... so I used a bad converter from LabelMe to COCO (when I split the dataset into 3 separate parts); I don't know why it mixed up the categories sequence.
Thank you for that info! Looking forward to converting the full dataset without needing to split it into separate parts!
@DovydasPociusDroneTeam 🔥 Awesome that we managed to get to the bottom of this problem.
Looking forward to converting the full dataset without needing to split it into separate parts!
We will need to introduce lazy loading of images to make that happen. It is on our roadmap. I'll pin this issue there to keep track of that problem.
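A lazy-loading approach could look roughly like this (a minimal sketch under my own naming, not the planned supervision API): yield paths with a generator and decode each image only when it is needed.

```python
import os
from typing import Iterator

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")

def iter_image_paths(directory: str) -> Iterator[str]:
    # Yield image paths one at a time so images can be decoded on demand
    # instead of materializing the whole dataset in RAM.
    for name in sorted(os.listdir(directory)):
        if name.lower().endswith(IMAGE_EXTENSIONS):
            yield os.path.join(directory, name)

# Usage sketch: decode one image at a time, e.g. cv2.imread(path), inside
# `for path in iter_image_paths("train_dataset_images"): ...`
```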
I'll close the issue for now.
@SkalskiP Can't we just save the dataset in a pandas DataFrame and then retrieve it batch by batch?
Hi @Killua7362 👋🏻 Could you elaborate?
Hello @SkalskiP, hope you are well. I am new to this community, so I might be wrong here. If I create a dataset of images using Roboflow, will it save a generator object or the whole dataset?
Hi @Killua7362 👋🏻 No worries. I'm happy to explain. For now, you will always load the whole dataset, but we are thinking about adding a generator option.
Can I try adding that option if you don't mind? @SkalskiP
@SkalskiP I use this dataset with 1 label: https://universe.roboflow.com/naumov-igor-segmentation/car-segmetarion
but when I use this script to convert COCO to YOLO, I get 2 labels:
```python
import supervision as sv

sv.DetectionDataset.from_coco(
    images_directory_path=r"C:\Users\loong\Downloads\Car\valid",
    annotations_path=r"C:\Users\loong\Downloads\Car\valid\_annotations.coco.json",
    force_masks=True
).as_yolo(
    images_directory_path=r"C:\Users\loong\Downloads\Car_yolo\val\images",
    annotations_directory_path=r"C:\Users\loong\Downloads\Car_yolo\val\labels",
    data_yaml_path=r"C:\Users\loong\Downloads\Car_yolo\data.yaml"
)
```
and the generated format doesn't seem right either
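One likely cause, worth verifying against the actual `_annotations.coco.json` (this is an assumption on my part, not something confirmed in the thread): Roboflow COCO exports usually include an extra top-level "superclass" category with id 0, named after the project, so a one-class dataset ends up with two entries in `categories` and therefore two names in the YOLO output. The structure typically looks like:

```python
# Illustrative shape of a Roboflow COCO export for a one-class project;
# in practice, inspect the real file with
# json.load(open(r"C:\Users\loong\Downloads\Car\valid\_annotations.coco.json")).
coco = {
    "categories": [
        {"id": 0, "name": "car-segmetarion", "supercategory": "none"},  # superclass
        {"id": 1, "name": "car", "supercategory": "car-segmetarion"},   # real class
    ]
}

names = [c["name"] for c in coco["categories"]]
print(names)  # two labels appear even though the project has one real class
```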
Search before asking
Bug
Getting a "Killed" error while converting a dataset from COCO to YOLO (the code is given below):
I tried to manually split the big dataset into smaller parts (3 parts) and then didn't get the error, but in the .yaml file I got different class positions in the "names" array
and
any suggestions? Thank you in advance!
Environment
Minimal Reproducible Example
Additional
No response
Are you willing to submit a PR?