google-ai-edge / mediapipe

Cross-platform, customizable ML solutions for live and streaming media.
https://mediapipe.dev
Apache License 2.0
26.71k stars 5.08k forks

Is there a plan to support any version of the EfficientNet architecture for the MediaPipe Model Maker custom object detection solution? #5127

Closed gerardoeg closed 1 month ago

gerardoeg commented 7 months ago

I'm having some issues trying to identify small regions in a high-resolution image when I train a custom model using MediaPipe Model Maker.

See how to create a custom model using MediaPipe Model Maker: https://developers.google.com/mediapipe/solutions/customization/object_detector

Example for Android https://github.com/googlesamples/mediapipe/tree/main/examples/object_detection/android-jetpack-compose

From some small research I did, it seems like another model architecture might work better in this scenario, but I'm not sure if this is the correct approach to follow. Any suggestions?

My use case is identifying small ID (identification card) regions like the gender field or the ID signature. The current model helps a lot to detect my ID's bounding box, but it has some issues identifying the mentioned regions. https://github.com/google/mediapipe/blob/master/mediapipe/model_maker/python/vision/object_detector/model_spec.py#L109-L129

I have tried, in another use case, to create small images from the labeled data (cropping the document to generate smaller images for the small regions) and pass those images in for training. Training went quite well and my custom model was created, but at inference time my small regions couldn't be identified by the model.

Here is an example of the document and the regions I'm trying to detect. In my end-to-end tests, only the document region (blue label) was detected successfully. image

Again, I'm not sure if a new model might resolve this issue. I have also tried the models that accept a bigger image, MOBILENET_V2_I320 and MOBILENET_MULTI_AVG_I384, unsuccessfully :/.
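For reference, this is roughly the training setup I'm using, following the customization guide above (a minimal sketch; the dataset paths are placeholders):

```python
from mediapipe_model_maker import object_detector

# Placeholder paths; Model Maker expects a COCO-format dataset folder here.
train_data = object_detector.Dataset.from_coco_folder('data/train')
validation_data = object_detector.Dataset.from_coco_folder('data/validation')

# Larger-input variants such as MOBILENET_V2_I320 or MOBILENET_MULTI_AVG_I384
# are selected via SupportedModels; see model_spec.py, linked above.
options = object_detector.ObjectDetectorOptions(
    supported_model=object_detector.SupportedModels.MOBILENET_MULTI_AVG_I384,
    hparams=object_detector.HParams(export_dir='exported_model'),
)

model = object_detector.ObjectDetector.create(
    train_data=train_data,
    validation_data=validation_data,
    options=options,
)
model.export_model()  # writes model.tflite under export_dir
```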

kuaashish commented 7 months ago

Hi @joezoug,

Could you please have a look into this issue? Thank you!!

gerardoeg commented 7 months ago

This is an example of the images used for training; the same kind of image is used at inference.

image
joezoug commented 6 months ago

Hi @gerardoeg,

Unfortunately we don't have plans to support EfficientNet anytime in the near future.

Your task seems more like an Optical Character Recognition (OCR) use case, which requires a more specialized, task-specific setup than this general object detection training pipeline. Here are some related docs: https://cloud.google.com/use-cases/ocr, https://www.tensorflow.org/lite/examples/optical_character_recognition/overview.

Quick followup to one of your points from earlier:

I have tried, in another use case, to create small images from the labeled data (cropping the document to generate smaller images for the small regions) and pass those images in for training. Training went quite well and my custom model was created, but at inference time my small regions couldn't be identified by the model.

  • Did you pass in the same smaller images for inference? The example you show is a large image of the entire document. If you train the model on zoomed-in small images of the document, you should also pass in zoomed-in small images of the document during inference.

Overall, the MobileNet (and EfficientNet) models are designed and pretrained on object recognition (not text recognition) datasets such as ImageNet, so it is possible that these models may struggle with your task of text recognition.
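To make the crop-consistency point concrete, here is a minimal inference sketch using the MediaPipe Tasks Python API; the model path, crop file name, and score threshold are placeholders:

```python
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

# Placeholder model path: the .tflite exported by Model Maker.
options = vision.ObjectDetectorOptions(
    base_options=python.BaseOptions(model_asset_path='exported_model/model.tflite'),
    score_threshold=0.3,
)
detector = vision.ObjectDetector.create_from_options(options)

# If training used zoomed-in crops, run detection on the same kind of crop,
# not on the full document image.
crop = mp.Image.create_from_file('document_row_crop.jpg')  # placeholder file
result = detector.detect(crop)
for detection in result.detections:
    print(detection.categories[0].category_name, detection.bounding_box)
```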

gerardoeg commented 6 months ago

Hello! @joezoug, thanks for reviewing my report. As mentioned in my last comment, I generate similar images for training and inference:

This is an example of the images used for training; the same kind of image is used at inference.

The way I create the images is pretty simple. For training: once I have the (labeled) bounding box of the document, I divide it into 6 rows and 1 column and take the region where I know I will find my region of interest, moving the original label into the "smaller" image. For the title I take the first row, and for the other regions I take the last row of the document.

For inference: once I have the bounding box of the document (by inference; this actually works pretty well, congrats! <3), I do the same operation, dividing the document into sections, and pass them into the model for inference in the hope of finding the regions I'm interested in (this is where inference fails, because I always get back only the document detection).
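Here is a rough sketch of that splitting logic (the helper names and coordinates are just illustrative, not from MediaPipe); I use the same operation for training-data generation and at inference time:

```python
from PIL import Image

def split_into_rows(doc_image, doc_box, n_rows=6):
    """Split the document bounding box into n_rows horizontal strips.

    doc_box is (x_min, y_min, x_max, y_max) in pixels. Returns a list of
    (crop, (origin_x, origin_y)) pairs so labels can be remapped.
    """
    x0, y0, x1, y1 = doc_box
    row_height = (y1 - y0) / n_rows
    rows = []
    for i in range(n_rows):
        top = int(y0 + i * row_height)
        bottom = int(y0 + (i + 1) * row_height)
        rows.append((doc_image.crop((int(x0), top, int(x1), bottom)), (int(x0), top)))
    return rows

def remap_label(label_box, crop_origin):
    """Shift a labeled box from document coordinates into crop coordinates."""
    lx0, ly0, lx1, ly1 = label_box
    ox, oy = crop_origin
    return (lx0 - ox, ly0 - oy, lx1 - ox, ly1 - oy)

# Usage: the title label lives in the first row, the other regions in the last.
doc = Image.open('id_card.jpg')  # placeholder file and box
rows = split_into_rows(doc, doc_box=(50, 80, 950, 680))
title_crop, title_origin = rows[0]
```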

About OCR: I might be able to identify some of the regions of interest with pure OCR; I actually use ML Kit to run OCR after region identification. But the task I wanted to solve with ML is just verifying that those regions are present in my document. I'm not really interested in what is written in them, just in something that helps me validate that those regions are present.

Even in a scenario where I successfully solve this task with OCR alone, there is still one region that might be hard to identify, called "signature", which sometimes has no visible characters, just lines.

Question: do you think I should be able to do inference the way I train and use the model, then? Because right now I can run inference, but only one object is returned (recognized), which is the document bounding box.

joezoug commented 6 months ago

Hi @gerardoeg,

Do you think I should be able to do inference the way I train and use the model, then? Because right now I can run inference, but only one object is returned (recognized), which is the document bounding box.

I can't say for certain whether the Object Detector model can detect the header and signature regions. You can run a quick sanity test by training the model only on labeled data for one specific region, like the signature region, and then seeing if it performs well during inference at identifying the signature regions. If not, then I think the task may not be a good fit for this object detector.
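If your labels are in the COCO format that Dataset.from_coco_folder consumes, one way to set up that sanity test without relabeling is to filter the annotation file down to a single category (a rough sketch; the 'signature' category name and file paths are placeholders):

```python
import json

# Placeholder paths; labels.json is the COCO annotation file for the dataset.
with open('data/train/labels.json') as f:
    coco = json.load(f)

# Keep only the 'signature' category and its annotations.
keep_ids = {c['id'] for c in coco['categories'] if c['name'] == 'signature'}
coco['categories'] = [c for c in coco['categories'] if c['id'] in keep_ids]
coco['annotations'] = [a for a in coco['annotations'] if a['category_id'] in keep_ids]

# Output folder must already exist alongside the same images.
with open('data/train_signature_only/labels.json', 'w') as f:
    json.dump(coco, f)
```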

gerardoeg commented 6 months ago

@joezoug thanks for the suggestion; I'll try it for sure and update this issue with my findings.

gerardoeg commented 6 months ago

@joezoug Hello again, I just wanted to add an update to this issue. I ran the suggested sanity test, training a model on only the one object I'm interested in identifying (the signature in this case).

My dataset is pretty small: train_data size: 57, validation_data size: 26.

Do you think the dataset size might be the issue? Or the size of the image, since it doesn't have the 4:3 aspect ratio used in the inference examples (AspectRatio.RATIO_4_3 in camera settings)?

Here is an example of the image used for training and for inference. image

acethespy commented 6 months ago

Hi @gerardoeg,

If you have good training metrics and poor validation metrics, then dataset size may be an issue. But I also want to point out that signature detection is a very difficult problem to solve in OCR, and this object detection pipeline is not well designed for that use case. It's possible that even after increasing the dataset size you may not see the performance you hope for.
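To check that concretely, Model Maker's evaluate() returns the loss and COCO detection metrics for a dataset, so you can compare the two splits (a rough sketch; the exact metric keys may vary by version):

```python
# Compare metrics on both splits; a large gap between them suggests the
# dataset is too small or the model is overfitting.
train_loss, train_metrics = model.evaluate(train_data, batch_size=4)
val_loss, val_metrics = model.evaluate(validation_data, batch_size=4)
print('train:', train_loss, train_metrics)
print('validation:', val_loss, val_metrics)
```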

Some more insight: the object detection model is good at remembering and recognizing patterns in how objects appear. For example, a basketball is often an orange sphere with black lines. But signatures can vary greatly between examples, so it is difficult for the model to learn one or a few shapes that accurately describe all signatures.

github-actions[bot] commented 1 month ago

This issue has been marked stale because it has had no recent activity for 7 days. It will be closed if no further activity occurs. Thank you.

github-actions[bot] commented 1 month ago

This issue was closed due to lack of activity after being marked stale for the past 7 days.