Open harsh2ai opened 4 years ago
@harsh2ai could you please detail your feature request with concrete examples or API suggestions? Thanks!
@vfdev-5 OCR is used for a wide variety of tasks, some of which include.
Having an API which can cover all these tasks would be really handy.
What I propose is something like this
job1 = vision.ocr.invoice(path, format="pdf", page_number=None)  # format: "jpg", "png" or "pdf"
job1.display()  # display the extracted information from the invoice (bounding boxes)
job1.extract(path, format="csv")  # format: "csv", "word" or "txt"; path is where the information is finally stored after processing
job1.save()
The above example is for extracting invoice information from different data formats (tabular data, etc.) and saving it to CSV, Word or TXT.
Similarly, the same pattern could be used to extract information for the other tasks mentioned above.
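To make the proposal above a bit more concrete, here is a minimal sketch of what such an interface could look like in plain Python. Everything in it is an assumption for illustration: `TextBox`, `InvoiceJob`, the `recognizer` callable and the CSV column layout are hypothetical names, not part of torchvision or any existing library, and the actual text recognition is left to whatever `recognizer` the caller plugs in. `save()` is omitted because its role in the proposal is not specified.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class TextBox:
    """One recognized region: bounding box (x0, y0, x1, y1) plus its text."""
    box: Tuple[int, int, int, int]
    text: str


# Hypothetical recognizer signature: (path, page_number) -> list of TextBox.
Recognizer = Callable[[str, Optional[int]], List[TextBox]]


@dataclass
class InvoiceJob:
    path: str
    page_number: Optional[int] = None
    boxes: List[TextBox] = field(default_factory=list)

    def display(self) -> str:
        """Return a plain-text listing of the boxes (a real version would draw them)."""
        return "\n".join(f"{b.box}: {b.text}" for b in self.boxes)

    def render(self, format: str = "csv") -> str:
        """Serialize the recognized text as 'txt' or 'csv'."""
        if format == "txt":
            return "\n".join(b.text for b in self.boxes)
        if format == "csv":
            buf = io.StringIO()
            writer = csv.writer(buf, lineterminator="\n")
            writer.writerow(["x0", "y0", "x1", "y1", "text"])
            for b in self.boxes:
                writer.writerow([*b.box, b.text])
            return buf.getvalue()
        raise ValueError(f"unsupported format: {format!r}")

    def extract(self, path: str, format: str = "csv") -> None:
        """Write the serialized result to `path`."""
        with open(path, "w", newline="", encoding="utf-8") as f:
            f.write(self.render(format))


def invoice(path: str, page_number: Optional[int] = None,
            recognizer: Optional[Recognizer] = None) -> InvoiceJob:
    """Run the (pluggable) recognizer on a document and wrap the result."""
    boxes = recognizer(path, page_number) if recognizer else []
    return InvoiceJob(path=path, page_number=page_number, boxes=list(boxes))
```

Keeping the recognizer as a plain callable means the same wrapper could sit on top of any detection or recognition model, which is the part a library would actually have to maintain.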
@harsh2ai thanks for the details. IMO, it is a bit out of the scope of torchvision as a very specific high-level application. Anyway, let's see what @fmassa thinks.
Adding a few points. OCR is another task in computer vision (and quite a common one, since bank cheque processing was the first thing that used CV). A few good papers on OCR
These two are recent and offer good SOTA results. But they differ, since text detection in wild scenes is a different problem from text detection on handwritten or text documents.
Tesseract is a library that has been under development for many years.
I think the best way we could provide it is under torchvision.models.ocr, probably with two of these models. But I guess it might be too early for them.
But providing such a high-level API is a huge overhead in both maintenance and compatibility, so I guess it might not be possible. We don't provide such a high-level API for the detection or segmentation models for now either.
Are there any highly cited datasets in OCR, @harsh2ai? We could consider adding them to torchvision.datasets, which might ease preprocessing.
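As a rough illustration of what an OCR dataset wrapper might involve, here is a minimal map-style dataset sketch (the `__len__` / `__getitem__` protocol that PyTorch data loading works with). The class name `TextLineDataset`, the `labels.tsv` file name and its "image path, tab, transcription" layout are assumptions made for this example and do not correspond to any particular dataset. The sketch uses only the standard library and returns the image path instead of decoding the image.

```python
import csv
from pathlib import Path
from typing import Callable, List, Optional, Tuple


class TextLineDataset:
    """Map-style dataset of (image, transcription) pairs.

    Assumed (hypothetical) layout: <root>/labels.tsv holds one
    "relative/image/path<TAB>transcription" row per sample.
    """

    def __init__(self, root: str, transform: Optional[Callable] = None) -> None:
        self.root = Path(root)
        self.transform = transform
        self.samples: List[Tuple[Path, str]] = []
        with open(self.root / "labels.tsv", newline="", encoding="utf-8") as f:
            for row in csv.reader(f, delimiter="\t"):
                if len(row) != 2:
                    continue  # skip blank or malformed rows
                self.samples.append((self.root / row[0], row[1]))

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, index: int):
        image_path, text = self.samples[index]
        # A real dataset would decode the image here; this sketch hands
        # the path to the optional transform instead.
        sample = self.transform(image_path) if self.transform else image_path
        return sample, text
```

The per-dataset work is mostly in parsing each benchmark's own annotation format into pairs like these; the indexing logic stays the same.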
@oke-aditya some of the datasets include
there are many others as well including
dataset | year |
---|---|
Born-Digital Images (Web and Email) | 2011-2015 |
COCO-Text | 2017 |
Text Extraction from Biomedical Literature Figures | 2017 |
Focused Scene Text | 2013-2015 |
Text in Videos | 2013-2015 |
Incidental Scene Text | 2015 |
The Chars74K dataset | 2009 |
The Uber Text dataset | 2017 |
The Street View Text Dataset | 2012 |
The Street View House Numbers (SVHN) Dataset | 2011 |
dataset | year |
---|---|
MNIST | 1998 |
NIST Special Database 19 | 1995-2016 |
The EMNIST Dataset | 2017 |
IAM Handwriting Database | 1999-2002 |
CASIA Online and Offline Chinese Handwriting Databases | 2007-2010 |
CROHME: Competition on Recognition of Online Handwritten Mathematical Expressions | 2012-2013 |
dataset | year |
---|---|
ETL Character Database | 1973-1984 |
What do you think about it, @oke-aditya? I guess we can add a few datasets to begin with and check how the preprocessing goes.
I think we should wait for @fmassa and @pmeier to think this over.
I agree with @vfdev-5 and @oke-aditya here. torchvision only includes basic building blocks for CV tasks such as classification, detection and so on. OCR is a higher-level application and thus should be in a separate package, which might depend on torchvision.
@harsh2ai In addition to the points made above by the others, adding OCR to torchvision would open it up to a lot more applications. Nothing is special about OCR that means it should be added but not some other functionality of similar complexity.
@pmeier should I take the task then and start working on it?
Not sure I understand you: start with what?
@pmeier By task I meant working on this high-level API.
Hey @harsh2ai. I am quite unsure whether a PR for OCR would go into torchvision. Can you elaborate a little more on what you plan to do?
CRAFT can be built from the building blocks that torchvision provides (a VGG backbone, with some modifications). EAST needs a separate and specific CNN architecture that torchvision does not support. (It's really not possible to support all the architectures.)
Building blocks which have vast and common use cases can be provided by torchvision.
Some small components which I see being used were
But a high-level API should rather be a separate package (which you can build on top of torchvision).
@oke-aditya I get you now, I will build a separate package on top of torchvision instead
🚀 Feature
The PyTorch vision library has many high-level APIs that perform tasks seamlessly under the hood. If there were a high-level API for OCR tasks, downloading lots of third-party libraries could be avoided.
Motivation
When I was building one such OCR system, I found that there is no generalized way to recognize and extract text from various image/document formats. Even a slight change in the format or structure of the documents/images tends to make these systems fail, and at times we have to resort to many traditional machine learning techniques, which are highly time-consuming, to get the desired results.
Pitch
OCR has wide application in industry. Millions of businesses use this software on a daily basis and spend a lot of money maintaining it. It is used in medical, finance, delivery and many other domains for verification and data entry/storage purposes. But despite having use cases across a wide number of industries, there is no single solution that everyone can use and keep improving over the years; instead, they rely on the old method of creating their own solution for this purpose.
If such an API is integrated into PyTorch, many businesses can shift to digital platforms and increase their productivity. Apart from this, it will also be useful to school and university students in a wide range of tasks.
Lastly, once integrated into PyTorch, it will be freely available for everyone to use, which matters because OCR software is very expensive, and it can be improved over the years to come.
Alternatives
At present we only have Tesseract, which is capable of leveraging deep learning for OCR, but Tesseract has its own set of problems and cannot be used everywhere, and it requires a lot of preprocessing with other libraries beforehand in order to use it.