Implement a two-shot pipeline: the first shot detects bounding boxes, and the second shot performs all classifications.
The first model will be YOLOv8n with a low confidence threshold.
The second model will be a custom CNN, possibly based on something like U-Net.
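
A minimal sketch of the control flow described above, assuming the detector and classifier are passed in as plain callables. `Detection`, `crop`, `two_shot`, and the `det_threshold` default of 0.1 are hypothetical names and values chosen for illustration; the image is a toy nested list rather than a real array.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Image = List[List[int]]            # toy stand-in: 2D grid of pixel values
Box = Tuple[int, int, int, int]    # (x1, y1, x2, y2), exclusive on x2/y2


@dataclass
class Detection:
    box: Box
    score: float                   # detector confidence in [0, 1]


def crop(image: Image, box: Box) -> Image:
    """Cut the box region out of the image."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def two_shot(
    image: Image,
    detect: Callable[[Image], Sequence[Detection]],
    classify: Callable[[Image], Dict[str, str]],
    det_threshold: float = 0.1,    # assumed value; deliberately low to favor recall
) -> List[Tuple[Box, Dict[str, str]]]:
    """Shot 1 proposes boxes; shot 2 labels each surviving crop."""
    results = []
    for det in detect(image):
        if det.score < det_threshold:
            continue               # drop only very weak proposals
        labels = classify(crop(image, det.box))
        results.append((det.box, labels))
    return results
```

In a real implementation, `detect` would wrap YOLOv8n inference (the Ultralytics `predict` call accepts a `conf` argument for the threshold) and `classify` would wrap the second-stage CNN.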
Justification
Currently, shape classification performance is high, but detection performance is low. By reducing the number of classes the first model has to predict, we should be able to improve detection. We can also lower the first model's confidence threshold to increase recall.
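
To make the threshold argument concrete, here is a small worked example with made-up detections (not measurements from this project). `precision_recall` is a hypothetical helper; each detection is a `(confidence, matches a real object)` pair.

```python
def precision_recall(scored, num_objects, threshold):
    """scored: list of (confidence, is_true_positive) pairs from the detector."""
    kept = [hit for conf, hit in scored if conf >= threshold]
    tp = sum(kept)
    precision = tp / len(kept) if kept else 1.0
    recall = tp / num_objects
    return precision, recall


# Made-up detections: (confidence, matches a real object?)
scored = [(0.9, True), (0.7, True), (0.4, True),
          (0.3, False), (0.2, True), (0.1, False)]
num_objects = 5  # ground-truth objects; one is never detected at any threshold

print(precision_recall(scored, num_objects, 0.5))   # (1.0, 0.4)
print(precision_recall(scored, num_objects, 0.15))  # (0.8, 0.8)
```

In this toy data, lowering the threshold doubles recall and lets one false positive through, which is the trade-off to weigh when picking the first model's threshold.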
Since the classification tasks overlap to a degree (the color classifier has to segment the image to some extent anyway), we can combine all the classifiers into a single model, which should improve efficiency while allowing us to use a bigger model.
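
One way the combined classifier could look is a shared trunk with one output head per attribute, sketched here in PyTorch. This is a plain CNN rather than a U-Net, and the class name `MultiHeadClassifier`, the layer widths, and the per-head class counts are all assumptions for illustration.

```python
import torch
from torch import nn


class MultiHeadClassifier(nn.Module):
    """Shared convolutional trunk with one linear head per attribute."""

    def __init__(self, head_sizes: dict[str, int], width: int = 32):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, width, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(width, width * 2, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # global pooling: any crop size works
            nn.Flatten(),
        )
        self.heads = nn.ModuleDict(
            {name: nn.Linear(width * 2, n) for name, n in head_sizes.items()}
        )

    def forward(self, x: torch.Tensor) -> dict[str, torch.Tensor]:
        features = self.trunk(x)       # computed once, reused by every head
        return {name: head(features) for name, head in self.heads.items()}


# Hypothetical attribute names and class counts, for illustration only.
model = MultiHeadClassifier({"shape": 8, "color": 8})
logits = model(torch.randn(4, 3, 64, 64))
print({k: tuple(v.shape) for k, v in logits.items()})
```

Because the trunk runs once per crop, adding another attribute costs only one extra linear layer. Training could sum a cross-entropy loss over the heads.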