Closed by PawelPeczek-Roboflow 1 week ago
I forked the project and started to develop a new block, but one thing is not clear to me.
Given the following image: https://testsigma.com/blog/wp-content/uploads/What-is-the-OCR-Test-How-to-Create-Automate-It.png
Passing this image to the Google Vision API like this:
```
POST https://vision.googleapis.com/v1/images:annotate?key=[YOUR_API_KEY] HTTP/1.1
Authorization: Bearer [YOUR_ACCESS_TOKEN]
Accept: application/json
Content-Type: application/json

{
  "requests": [
    {
      "image": {
        "source": {
          "imageUri": "https://testsigma.com/blog/wp-content/uploads/What-is-the-OCR-Test-How-to-Create-Automate-It.png"
        }
      },
      "features": [
        {
          "type": "TEXT_DETECTION"
        }
      ]
    }
  ]
}
```
Results in the following response:
```
{
  "responses": [
    {
      "textAnnotations": [
        {
          "locale": "en",
          "description": "OCR test\nOCR",
          "boundingPoly": {
            "vertices": [
              {"x": 265, "y": 261},
              {"x": 940, "y": 261},
              {"x": 940, "y": 324},
              {"x": 265, "y": 324}
            ]
          }
        },
        {
          "description": "OCR",
          "boundingPoly": {
            "vertices": [
              {"x": 265, "y": 281},
              {"x": 382, "y": 282},
              {"x": 382, "y": 321},
              {"x": 265, "y": 320}
            ]
          }
        },
        {
          "description": "test",
          "boundingPoly": {
            "vertices": [
              {"x": 396, "y": 282},
              {"x": 505, "y": 283},
              {"x": 505, "y": 322},
              {"x": 396, "y": 321}
            ]
          }
        },
        {
          "description": "OCR",
          "boundingPoly": {
            "vertices": [
              {"x": 756, "y": 261},
              {"x": 940, "y": 262},
              {"x": 940, "y": 324},
              {"x": 756, "y": 323}
            ]
          }
        }
      ],
      "fullTextAnnotation": {
        ...
      }
    }
  ]
}
```
Should the block output `sv.Detections(...)` with the full-text match only, the word matches only, or both?
Hi @brunopicinin! First of all, thanks for taking on the challenge 💪
Regarding the question - good point. I believe the block should expose two outputs: one that simply dumps the whole recognised text, plus an `sv.Detections(...)` output that denotes each parsed region.
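To make that mapping concrete, here is a minimal sketch (plain Python, function names are mine) of how the word-level `textAnnotations` entries from the sample response could be turned into boxes and labels suitable for `sv.Detections(...)`. The first entry is the full-text block, so it is skipped:

```python
def vertices_to_xyxy(vertices):
    # Google Vision returns quadrilateral vertices; taking the axis-aligned
    # min/max gives the enclosing box in the [x_min, y_min, x_max, y_max]
    # layout that sv.Detections.xyxy expects. Missing coordinates default to 0.
    xs = [v.get("x", 0) for v in vertices]
    ys = [v.get("y", 0) for v in vertices]
    return [min(xs), min(ys), max(xs), max(ys)]


def word_detections(text_annotations):
    # Skip the first entry: it is the full-text match covering all words,
    # not an individual word region.
    boxes, labels = [], []
    for annotation in text_annotations[1:]:
        boxes.append(vertices_to_xyxy(annotation["boundingPoly"]["vertices"]))
        labels.append(annotation["description"])
    return boxes, labels
```

The resulting lists could then be wrapped as, e.g., `sv.Detections(xyxy=np.array(boxes), data={"class_name": np.array(labels)})` - assuming the recognised text goes into the `data` field, as suggested above.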
Created a PR for this issue: https://github.com/roboflow/inference/pull/709
Amazing 💪 taking review now
Posted PR review.
Approved the PR and merged it to main - great, thanks for the contribution 🏅
Google Vision OCR in Workflows
Are you ready to make a meaningful contribution this Hacktoberfest? We are looking to integrate Google Vision OCR into our Workflows ecosystem! This new OCR block will be a valuable addition, addressing a common challenge that many users face.
Join us in expanding our ecosystem and empowering users to effortlessly extract text and structure from their documents. Whether you’re a seasoned contributor or new to open source, your skills and ideas can help make this project a success. Let’s collaborate and bring this essential functionality to life!
🚧 Task description 🏗️
- Call the Google Vision API with the `requests` library - 📖 REST API docs - in particular this may be useful - we only want to enable `TEXT_DETECTION` and `DOCUMENT_TEXT_DETECTION`
- The block should output an `sv.Detections(...)` object - the recognised text should be the label, and additional metadata about structure (like the category of a region) should be added into the `data` field of `sv.Detections(...)`
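As a sketch of the first point, assuming the block exposes an `ocr_type` option mapped onto the two allowed Vision features (the option names here are illustrative, not the final API), the annotate payload could be built like this:

```python
import base64

# Endpoint from the Vision REST docs; the API key or Bearer token is
# supplied separately when the request is sent.
GOOGLE_VISION_URL = "https://vision.googleapis.com/v1/images:annotate"


def build_annotate_request(image_bytes: bytes, ocr_type: str) -> dict:
    # Map the block option onto the two feature types the task allows.
    feature = {
        "text_detection": "TEXT_DETECTION",
        "document_text_detection": "DOCUMENT_TEXT_DETECTION",
    }[ocr_type]
    return {
        "requests": [
            {
                # Inline base64 content instead of an imageUri, so the block
                # works with images already loaded inside the Workflow.
                "image": {"content": base64.b64encode(image_bytes).decode("utf-8")},
                "features": [{"type": feature}],
            }
        ]
    }
```

The payload would then be sent with `requests.post(GOOGLE_VISION_URL, params={"key": api_key}, json=payload)` or with an `Authorization: Bearer` header, matching the request shown earlier in the thread.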
Cheatsheet
- how we construct `sv.Detections(...)` for object-detection predictions, as a reference
- scaffolding for the block:
💻 Code snippet
```python
from typing import List, Literal, Optional, Type, Union

import requests
import supervision as sv
from pydantic import ConfigDict

from inference.core.workflows.execution_engine.entities.base import (
    OutputDefinition,
    WorkflowImageData,
)
from inference.core.workflows.execution_engine.entities.types import (
    OBJECT_DETECTION_PREDICTION_KIND,
    StepOutputImageSelector,
    WorkflowImageSelector,
)
from inference.core.workflows.prototypes.block import (
    BlockResult,
    WorkflowBlock,
    WorkflowBlockManifest,
)


class BlockManifest(WorkflowBlockManifest):
    model_config = ConfigDict(
        json_schema_extra={
            "name": "Google Vision OCR",
            "version": "v1",
            "short_description": "TODO",
            "long_description": "TODO",
            "license": "Apache-2.0",
            "block_type": "model",
        },
        protected_namespaces=(),
    )
    type: Literal["roboflow_core/google_vision_ocr@v1"]
    image: Union[WorkflowImageSelector, StepOutputImageSelector]
    ocr_type: Literal["text_detection", "ocr_text_detection"]

    @classmethod
    def describe_outputs(cls) -> List[OutputDefinition]:
        return [
            OutputDefinition(
                name="predictions", kind=[OBJECT_DETECTION_PREDICTION_KIND]
            ),
        ]

    @classmethod
    def get_execution_engine_compatibility(cls) -> Optional[str]:
        return ">=1.0.0,<2.0.0"


class GoogleVisionOCRBlockV1(WorkflowBlock):
    @classmethod
    def get_manifest(cls) -> Type[WorkflowBlockManifest]:
        return BlockManifest

    def run(
        self,
        image: WorkflowImageData,
        ocr_type: Literal["text_detection", "ocr_text_detection"],
    ) -> BlockResult:
        results = requests.post(...)
        return {
            "predictions": sv.Detections(...),
        }
```
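One detail the scaffold's `run()` leaves open is response handling: per the Vision REST docs, each entry in `responses` may carry an `error` object instead of annotations. A hedged sketch of that check (the helper name is mine):

```python
def extract_text_annotations(response_payload: dict) -> list:
    # The annotate endpoint returns one entry per request in the batch;
    # a failed entry carries an "error" object instead of annotations.
    entry = response_payload["responses"][0]
    if "error" in entry:
        message = entry["error"].get("message", "unknown error")
        raise RuntimeError(f"Google Vision OCR request failed: {message}")
    # An image with no text yields an entry without "textAnnotations".
    return entry.get("textAnnotations", [])
```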