openvinotoolkit / datumaro

Dataset Management Framework, a Python library and a CLI tool to build, analyze and manage Computer Vision datasets.
https://openvinotoolkit.github.io/datumaro/
MIT License

Splitting test/train/val and representative datasets, and convert to tfrecords #1510

Closed: therealpurplemana closed this issue 1 month ago

therealpurplemana commented 1 month ago

Hi, thanks for your great project. I'm using it to export data from cvat.ai, manipulate, and re-export into Tensorflow format.

In my specific case, I'm combining homogeneous datasets by adding sources to a project which I exported from cvat.ai (so I can prune out incompletely labeled datasets), then I run

!datum transform --project ./tfdata -t split -- -t detection --subset train:.7 --subset val:.15 --subset test:.15

After that, I export it with:

!datum project export -p ./tfdata --format tf_detection_api -o ./final-export-tf_detection_api-detection -- --save-media

(I add --save-masks for the segmentation export.)

This produces a new folder with /annotations and /images subfolders: the annotations are organized into train/test/val.json, and the /images folder holds the corresponding data nicely packaged as TFRecords. Oddly, there's also a default.tfrecord, but it was pretty small so I just deleted it.

Now, I also need a 20% representative dataset from my original dataset -- how do I "undo" the splits in my project? Or am I thinking about this incorrectly?

Currently, I need to delete the project, recreate it, re-add my sources, re-split into 20/80%, and then export again, and copy over the TFRecord.

Curious if there's an easier way to do this either through CLI or Python.

jihyeonyi commented 1 month ago

Hi @therealpurplemana, thank you for your interest in our project. Datumaro offers a version control feature, but it requires you to commit the project first. Alternatively, you could combine all subsets into a single dataset and then re-split them as needed.
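To illustrate the "combine, then re-split" idea, here is a minimal library-agnostic sketch in plain Python. It does not call Datumaro at all: the (item_id, subset) tuples, the merge_subsets and resplit helpers, and the "representative"/"rest" subset names are all hypothetical, chosen only for illustration. It also does a plain seeded random split, not the task-aware split that `-t split -- -t detection` performs, so treat it as a picture of the logic rather than a drop-in replacement.

```python
import random


def merge_subsets(items, merged_name="default"):
    """Undo an earlier split: give every item the same subset label.

    `items` is a list of (item_id, subset_name) pairs (hypothetical layout).
    """
    return [(item_id, merged_name) for item_id, _ in items]


def resplit(items, ratios, seed=0):
    """Randomly reassign items to new subsets.

    `ratios` maps subset name -> fraction; the fractions must sum to 1.
    Plain random split (no stratification by label or annotation type).
    """
    if abs(sum(ratios.values()) - 1.0) > 1e-9:
        raise ValueError("subset ratios must sum to 1")

    # Sort first so the result depends only on the seed, not on input order.
    ids = sorted(item_id for item_id, _ in items)
    random.Random(seed).shuffle(ids)

    names = list(ratios)
    result = []
    start = 0
    for i, name in enumerate(names):
        if i == len(names) - 1:
            end = len(ids)  # last subset takes whatever is left
        else:
            end = min(len(ids), start + round(ratios[name] * len(ids)))
        result.extend((item_id, name) for item_id in ids[start:end])
        start = end
    return result


# Example: items that were already split 70/15/15 (assumed unique IDs).
already_split = (
    [(f"img_{i:03d}", "train") for i in range(70)]
    + [(f"img_{i:03d}", "val") for i in range(70, 85)]
    + [(f"img_{i:03d}", "test") for i in range(85, 100)]
)

pooled = merge_subsets(already_split)
new_split = resplit(pooled, {"representative": 0.2, "rest": 0.8}, seed=42)
```

Because the earlier train/val/test labels are discarded before the new split, the 20% subset is drawn from the whole pool rather than from any one of the old subsets. Fixing the seed keeps the selection reproducible across runs.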

therealpurplemana commented 1 month ago

Thank you.