dennyabrain closed this issue 6 months ago.
Check in on Wednesday 11 am.
Tested Tesseract with English and Hindi LSTM models with multiple psm
settings. Hindi OCR does not work with the legacy+LSTM
option. Tesseract still cannot handle multi-column text in images.
Refer: https://muthu.co/all-tesseract-ocr-options/
Tested image pre-processing, which degraded text quality and did not improve OCR.
Relatively good OCR otherwise for English and Hindi on images
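The psm comparison above can be sketched roughly as follows. This is a minimal sketch, not the exact script used in the tests: `pick_best` and its character-count heuristic are our own assumptions, and it presumes `pytesseract` plus the `eng` and `hin` traineddata files are installed.

```python
# Sketch: run Tesseract over several page-segmentation modes (--psm)
# for English+Hindi and keep the mode that recognizes the most text.
import re

def pick_best(results):
    """Given {psm: ocr_text}, return the psm whose output has the most
    word characters - a crude proxy for recognition coverage."""
    def score(text):
        return len(re.findall(r"\w", text))
    return max(results, key=lambda psm: score(results[psm]))

def ocr_all_modes(image_path, psms=("3", "6", "11")):
    # Deferred imports so the helper above stays testable without Tesseract.
    import pytesseract
    from PIL import Image
    img = Image.open(image_path)
    return {
        psm: pytesseract.image_to_string(img, lang="eng+hin",
                                         config=f"--psm {psm}")
        for psm in psms
    }
```

A "best" mode picked this way still needs eyeballing; longer output is not always better output.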
Tested out various models on Hugging Face and looked at LVM (Learned Visual Model) for vision. Will attach the link soon.
It's called disfluency correction for our purpose.
Refer: https://www.semanticscholar.org/search?q=disfluency%20correction&sort=relevance
The technical term for image understanding is Image Captioning
Refer: https://en.wikipedia.org/wiki/Natural_language_generation#Image_captioning
Thanks. I think an added layer of automation that would make for useful claim extraction is if we can detect the entities (people/landmarks) in a picture. So instead of the extracted claim being "a man is standing next to a building", it would say "politician X is standing next to the Taj Mahal". We could create a dataset of persons of interest to facilitate this.
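A minimal sketch of the persons-of-interest matching idea, assuming we already have face/landmark embeddings from some upstream model (the embedding step itself is out of scope here; `match_entity`, the threshold value, and the example names are all hypothetical):

```python
import numpy as np

def match_entity(query, poi_db, threshold=0.8):
    """Return the best-matching name from poi_db (a {name: embedding}
    dict), or None if nothing clears the cosine-similarity threshold."""
    best_name, best_sim = None, threshold
    q = query / np.linalg.norm(query)
    for name, emb in poi_db.items():
        sim = float(q @ (emb / np.linalg.norm(emb)))
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name
```

Returning None below the threshold is what lets the caption fall back to the generic "a man" phrasing instead of mislabeling a stranger as a person of interest.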
Found this nice use of traditional image processing to segment portions from newspaper clippings - https://stackoverflow.com/questions/64241837/use-python-open-cv-for-segmenting-newspaper-article
Should also be useful for memes/posters with multiple text portions. I think these techniques might also be useful to segment portions of an image, and then those individual segments could be used for further matching queries.
Categories I could come up with -
- Newspaper Clippings
- Screenshots - these could be of social media posts, Inshorts news app, WhatsApp message(s), Facebook posts, tweets etc. Some of these also include memes
- Information Posters - posters communicating some kind of information like India's GDP growth, sharing facts about a topic, sharing details about how a political party led to development, sharing info around sports
- Letter(s) - some sort of complaint letters or information letters, letters to the govt regarding some issues
- Other - news headlines (in the dataset, I saw some images repeat)

A question - what tools are others using for similar applications?

"Disfluency correction" is for text output from Automatic Speech Recognition (ASR). It will not capture OCR text errors.
We should look at "OCR correction"
On "Image Captioning" - tried a few existing popular models.
@aatmanvaidya can you try out two things that I believe will be useful pre-processing steps regardless of what model we use:
From my perspective, writing out a rough pipeline that could be followed. Once we have the image, we could follow a process like this:
- Tesseract or EasyOCR
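The EasyOCR branch of that pipeline could look like the sketch below. Assumes the `easyocr` package is installed; the confidence filter and `join_ocr` helper are our own additions, not EasyOCR API.

```python
def read_image(image_path, langs=("en", "hi")):
    # Deferred import: EasyOCR downloads its models on first use.
    import easyocr
    reader = easyocr.Reader(list(langs))
    # readtext returns a list of (bounding_box, text, confidence) tuples
    return reader.readtext(image_path)

def join_ocr(results, min_conf=0.3):
    """Concatenate recognized fragments, dropping low-confidence ones."""
    return " ".join(text for _, text, conf in results if conf >= min_conf)
```

The confidence cut-off matters for our data: meme fonts and watermarks tend to produce low-confidence junk fragments that pollute downstream claim extraction.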
Swair had a long response to this; I am cherry-picking insights and typing them here:
He also said our approach of segmenting relevant portions and indexing it might be interesting/publishable.
@aatmanvaidya @duggalsu can each write a 5-line blurb on the various text extraction models and libraries they used and their conclusions.
References shared on the call:
- https://cdt.org/insights/lost-in-translation-large-language-models-in-non-english-content-analysis/
- https://cvit.iiit.ac.in/

Also mentioned: the "resourcedness gap".
We should use today to test out one of the remaining solutions.
End of Spike Requirements: Put together a self-contained slide deck with all your findings. We would like to keep it handy when we talk about the status of, and possibilities for, the claim extraction work. I think a good way to structure the slides would be to have sections on the problem statements "Extract text from an image" and "Caption an image", and then mention the technique(s) used and the results they gave. Fill it with as many examples as possible - it's best to be able to see them to truly understand. Share the good examples, but also the really bad examples of the tech failing.
GPT-4 Vision does not seem good for any kind of OCR - it will not do OCR on copyrighted articles in English and does not work well for Hindi.
However, it can describe an image in detail, i.e. do "image captioning", very well - better than the previously tested Hugging Face models.
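For reference, a hedged sketch of calling GPT-4 Vision for captioning via the OpenAI Python SDK (v1-style client). The model name, prompt, and `max_tokens` value are assumptions to adjust, and a valid `OPENAI_API_KEY` must be set for the actual call:

```python
def build_caption_request(image_url,
                          prompt="Describe this image in detail."):
    """Build the chat message payload for a vision-capable model."""
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }]

def caption_image(image_url, model="gpt-4-vision-preview"):
    from openai import OpenAI  # reads OPENAI_API_KEY from the env
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=build_caption_request(image_url),
        max_tokens=300,
    )
    return resp.choices[0].message.content
```

Being a paid API, this is also the path where the spike's pricing/system-requirements question comes in.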
Surya - a SOTA tool for multilingual OCR
https://github.com/VikParuchuri/surya
Surya is a multilingual document OCR toolkit. It can do:
- Accurate line-level text detection
- Text recognition (coming soon)
- Table and chart detection (coming soon)
It works on a range of documents and languages (see usage and benchmarks for more details).
The various challenges involved in making sense of an image found on social media are summarized by this image: ![Screenshot 2023-12-04 at 15-13-05 Tech Interventions against Online Harms](https://github.com/tattle-made/kosh-v2/assets/1415361/2c31439e-97fb-4c0b-a16d-3a185c41ad8a)
The images could be a photograph, a manipulated image, a screenshot, a newspaper clipping or a meme. We have to devise a solution to extract claims from these images using a mix of automated and manual methods that can be deployed at population scale.
Some ideas on what type of functionality ML can enable:
This is meant to be a timeboxed 5-day spike with the goal of learning as much as possible about how the state of the art in LLMs and ML can help us with claim extraction. We would like to include a working prototype so that we get a good sense of system requirements and prices. As such, evaluating paid proprietary solutions like ChatGPT could also be part of this.