Closed therealpurplemana closed 6 months ago
Hi @therealpurplemana, thank you for your interest in our project. Datumaro offers a version control feature, but it requires commits of the project. Alternatively, you could combine all subsets into a single dataset and then re-split them as needed.
Thank you.
Hi, thanks for your great project. I'm using it to export data from cvat.ai, manipulate, and re-export into Tensorflow format.
In my specific case, I'm combining homogenius datasets by adding sources to a project which I exported from cvat.ai (so I can prune out incompletely labeled datasets), then I run
!datum transform --project ./tfdata -t split -- -t detection \ --subset train:.7 --subset val:.15 --subset test:.15
After which, I run to export it:
!datum project export -p ./tfdata --format tf_detection_api -o ./final-export-tf_detection_api-detection -- --save-media
(and --save-masks for segmentation export)This produces a new folder with subfolders with /annotations and /images organized into train/test/val.json and respectively in the /images folder nicely packaged as TFRecords. There's also oddly a default.tfrecord but it was pretty small so I just deleted it.
Now, I also need a 20% representative dataset from my original dataset -- how do I "undo" the splits in my project? Or am I thinking about this incorrectly?
Currently, I need to delete the project, recreate it, re-add my sources, re-split into 20/80%, and then export again, and copy over the TFRecord.
Curious if there's an easier way to do this either through CLI or Python.