pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

High Level API for OCR tasks #2753

Open harsh2ai opened 4 years ago

harsh2ai commented 4 years ago

🚀 Feature

The PyTorch vision library has many high-level APIs that handle tasks seamlessly under the hood. If it also had a high-level API for OCR tasks, users could avoid downloading lots of third-party libraries.

Motivation

When I was building one such OCR system, I found that there is no generalized way to recognize and extract text from various image/document formats. Even a slight change in the structure of the documents/images tends to make these systems fail, and at times we have to resort to many traditional machine learning techniques, which are highly time-consuming, to get the desired results.

Pitch

OCR is widely applied in industry: millions of businesses use this kind of software daily and spend a lot of money maintaining it. It is used in medical, finance, delivery and many other domains for verification and data entry/storage. But despite having use cases across a wide range of industries, there is no single solution that everyone can use and keep improving over the years; instead, each business relies on the old method of building its own solution.

If such an API were integrated into PyTorch, many businesses could shift to digital platforms and increase their productivity. It would also be useful to school and university students in a wide range of tasks.

Lastly, once integrated into PyTorch it would be freely available for everyone to use (OCR software is very expensive), and it could be improved over the years to come.

Alternatives

At present we only have Tesseract, which is capable of leveraging deep learning for OCR, but Tesseract has its own set of problems and cannot be used everywhere. It also requires a lot of preprocessing with other libraries before it can be used.

vfdev-5 commented 4 years ago

@harsh2ai could you please detail your feature request with concrete examples or API suggestions? Thanks!

harsh2ai commented 4 years ago

@vfdev-5 OCR is used for a wide variety of tasks, including:

  1. Invoice OCR
  2. Passport OCR
  3. Number plate OCR
  4. Text extraction from images
  5. OCR for digital and analogue meters

Having an API which can cover all these tasks would be really handy.

What I propose is something like this

     # Proposed (hypothetical) API; vision.ocr does not exist in torchvision
     job1 = vision.ocr.invoice(path, format="pdf", page_number=None)  # format: "jpg", "png" or "pdf"
     job1.display()  # display the extracted information from the invoice (bounding boxes)
     job1.extract(format="csv", path=output_path)  # format: "csv", "word" or "txt"; output_path is where the result is stored
     job1.save()

The above example is for extracting information from invoices in different input formats (including tabular data, etc.) and saving it to CSV, Word or TXT.

Similar entry points could cover information extraction for the other tasks mentioned above.
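To make the shape of this proposal more concrete, here is a minimal sketch of how such a job-style interface could look as a standalone package. Everything in it (`OcrJob`, `TextRegion`, `fake_recognizer`, the `run`/`extract` methods) is a hypothetical illustration and not an existing torchvision API. The recognizer is a pluggable callable, so a real detection/recognition model could be swapped in for the stand-in used here.

```python
import csv
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, List, Tuple

# Axis-aligned box as (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class TextRegion:
    box: Box
    text: str


# A recognizer is any callable that maps an input path to text regions.
Recognizer = Callable[[Path], List[TextRegion]]


@dataclass
class OcrJob:
    """Hypothetical job object; not part of torchvision."""

    source: Path
    recognizer: Recognizer
    regions: List[TextRegion] = field(default_factory=list)

    def run(self) -> "OcrJob":
        # Delegate the actual OCR to the pluggable recognizer.
        self.regions = self.recognizer(self.source)
        return self

    def extract(self, fmt: str, out_path: Path) -> Path:
        # Write the recognized regions to out_path in the requested format.
        if fmt == "csv":
            with open(out_path, "w", newline="", encoding="utf-8") as fh:
                writer = csv.writer(fh)
                writer.writerow(["left", "top", "right", "bottom", "text"])
                for region in self.regions:
                    writer.writerow([*region.box, region.text])
        elif fmt == "txt":
            lines = "\n".join(region.text for region in self.regions)
            out_path.write_text(lines, encoding="utf-8")
        else:
            raise ValueError(f"unsupported format: {fmt!r}")
        return out_path


def fake_recognizer(path: Path) -> List[TextRegion]:
    # Stand-in for a real model: returns fixed regions regardless of input.
    return [
        TextRegion((10, 10, 120, 30), "Invoice #42"),
        TextRegion((10, 40, 90, 60), "Total: 19.99"),
    ]
```

Usage would then be along the lines of `OcrJob(Path("invoice.png"), fake_recognizer).run().extract("csv", Path("invoice.csv"))`. Keeping the model behind a callable is one possible design choice; it keeps file handling and export independent of any particular OCR backend.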

vfdev-5 commented 4 years ago

@harsh2ai thanks for the details. IMO, it is a bit out of the scope of torchvision, being a very specific high-level application. Anyway, let's see what @fmassa thinks.

oke-aditya commented 4 years ago

Adding a few points. OCR is another task in computer vision (quite common, since bank cheque processing was the first thing that used CV). A few good papers on OCR:

  1. EAST: CVPR 2017 Cited by 594
  2. CRAFT: CVPR 2019 Cited by 86

These two are recent and offer good SOTA results. But they differ, since text detection in wild scenes is different from text detection on handwritten or printed documents. Tesseract is a library that has been worked on for many years. I think the best way to provide this would be torchvision.models.ocr, probably offering these two models. But I guess it might be too early for them.

But providing such a high-level API is a huge overhead in both maintenance and compatibility, so I guess it might not be possible. We don't provide such a high-level API for detection or segmentation models at the moment either.

Are there any highly cited datasets in OCR, @harsh2ai? We could consider adding them to torchvision.datasets, which might ease preprocessing.

harsh2ai commented 4 years ago

@oke-aditya some of the datasets include:

  1. ICDAR 2011
  2. ICDAR 2013
  3. ICDAR 2015
  4. MSRA Dataset
  5. Synth Text in the Wild

There are many others as well, including:

Printed

| Dataset | Year |
|---|---|
| Born-Digital Images (Web and Email) | 2011-2015 |
| COCO-Text | 2017 |
| Text Extraction from Biomedical Literature Figures | 2017 |
| Focused Scene Text | 2013-2015 |
| Text in Videos | 2013-2015 |
| Incidental Scene Text | 2015 |
| The Chars74K dataset | 2009 |
| The Uber Text dataset | 2017 |
| The Street View Text Dataset | 2012 |
| The Street View House Numbers (SVHN) Dataset | 2011 |

Handwritten

| Dataset | Year |
|---|---|
| MNIST | 1998 |
| NIST Special Database 19 | 1995-2016 |
| The EMNIST Dataset | 2017 |
| IAM Handwriting Database | 1999-2002 |
| CASIA Online and Offline Chinese Handwriting Databases | 2007-2010 |
| CROHME: Competition on Recognition of Online Handwritten Mathematical Expressions | 2012-2013 |

Mixed printed and handwritten

| Dataset | Year |
|---|---|
| ETL Character Database | 1973-1984 |

What do you think about it, @oke-aditya? I guess we can begin by adding a few datasets and then look at the preprocessing.
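As an illustration of the kind of preprocessing a dataset wrapper would take over, here is a minimal sketch of an annotation parser. It assumes the ground-truth layout commonly distributed with ICDAR 2015 Incidental Scene Text: one text instance per line as `x1,y1,x2,y2,x3,y3,x4,y4,transcription`, a possible UTF-8 byte-order mark at the start of the file, and `###` marking "do not care" regions. The names `TextAnnotation`, `parse_icdar15_line` and `parse_icdar15` are hypothetical and not part of torchvision.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TextAnnotation:
    quad: List[Tuple[int, int]]  # four (x, y) corner points
    text: str                    # transcription of the region
    ignore: bool                 # True for "do not care" regions


def parse_icdar15_line(line: str) -> TextAnnotation:
    # Drop a possible byte-order mark, then surrounding whitespace.
    cleaned = line.lstrip("\ufeff").strip()
    # Split at most 8 times: the transcription itself may contain commas.
    parts = cleaned.split(",", 8)
    if len(parts) != 9:
        raise ValueError(f"expected 8 coordinates and a transcription: {line!r}")
    coords = [int(p) for p in parts[:8]]
    quad = list(zip(coords[0::2], coords[1::2]))
    text = parts[8]
    return TextAnnotation(quad=quad, text=text, ignore=(text == "###"))


def parse_icdar15(content: str) -> List[TextAnnotation]:
    # Parse a whole ground-truth file, skipping blank lines.
    return [parse_icdar15_line(l) for l in content.splitlines() if l.strip()]
```

A dataset class in the style of torchvision.datasets could call such a parser from `__getitem__` and return the image together with its list of annotations.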

oke-aditya commented 4 years ago

I think we should wait for @fmassa and @pmeier to weigh in on this.

pmeier commented 4 years ago

I agree with @vfdev-5 and @oke-aditya here. torchvision only includes basic building blocks for CV tasks such as classification, detection and so on. OCR is a higher-level application and thus should be in a separate package, which might depend on torchvision.

@harsh2ai In addition to the points made above by the others, adding OCR to torchvision would open it up to a lot more applications. Nothing about OCR is so special that it should be added while other functionality of similar complexity is not.

harsh2ai commented 4 years ago

@pmeier should I take the task then and start working on it?

pmeier commented 4 years ago

Not sure I understand you: start with what?

harsh2ai commented 4 years ago

@pmeier By task I meant working on this high-level API.

oke-aditya commented 4 years ago

Hey @harsh2ai. I am quite unsure whether a PR for OCR would be accepted into torchvision. Can you elaborate a little more on what you plan to do?

CRAFT can be built from the building blocks that torchvision provides (a VGG backbone, with some modifications). EAST needs a separate, specific CNN architecture that torchvision does not support. (It's really not possible to support all architectures.)

Building blocks which have vast and common use cases can be provided by torchvision.

Some small components that I see being used are:

  1. Rotated boxes (might come to torchvision, see #2761)
  2. Locality-aware NMS (needs discussion; there are multiple variants of NMS and this is one of them)

I'm not sure about other components that are re-usable and generalized enough.
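For readers unfamiliar with the NMS variant in item 2 (the EAST paper calls it locality-aware NMS), the idea can be sketched as follows: boxes are visited in row-first order, each box is merged with the previously merged box when their IoU exceeds a threshold (coordinates averaged with the scores as weights, scores summed), and standard NMS then runs on the much smaller merged set. This is a simplified pure-Python sketch that assumes axis-aligned boxes with positive scores; EAST itself works on rotated boxes or quadrangles, and the function names here are illustrative only.

```python
from typing import List, Optional, Tuple

# (x1, y1, x2, y2, score); axis-aligned for simplicity, score assumed > 0.
Box = Tuple[float, float, float, float, float]


def iou(a: Box, b: Box) -> float:
    # Intersection over union of two axis-aligned boxes.
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def weighted_merge(a: Box, b: Box) -> Box:
    # Score-weighted average of the coordinates; the merged score is the sum.
    total = a[4] + b[4]
    x1, y1, x2, y2 = ((a[i] * a[4] + b[i] * b[4]) / total for i in range(4))
    return (x1, y1, x2, y2, total)


def standard_nms(boxes: List[Box], thresh: float) -> List[Box]:
    # Greedy NMS: keep a box only if it does not overlap a kept, higher-scored box.
    keep: List[Box] = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, kept) <= thresh for kept in keep):
            keep.append(box)
    return keep


def locality_aware_nms(boxes: List[Box], thresh: float = 0.3) -> List[Box]:
    # `boxes` is assumed to be in row-first order, so neighbours are adjacent.
    merged: List[Box] = []
    last: Optional[Box] = None
    for box in boxes:
        if last is not None and iou(box, last) > thresh:
            last = weighted_merge(box, last)
        else:
            if last is not None:
                merged.append(last)
            last = box
    if last is not None:
        merged.append(last)
    return standard_nms(merged, thresh)
```

The merging pass is what distinguishes this from plain NMS: dense per-pixel predictions produce many near-duplicate boxes from neighbouring locations, and collapsing them first keeps the quadratic NMS step cheap.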

But a high-level API should rather be a separate package (you can build it on top of torchvision).

harsh2ai commented 4 years ago

@oke-aditya I get you now. I will build a separate package on top of torchvision instead.