tattle-made / kosh-v2


Claim extraction from Images #72

Closed dennyabrain closed 6 months ago

dennyabrain commented 7 months ago

The various challenges involved in making sense of an image found on social media are summarized in the attached screenshot, "Tech Interventions against Online Harms".

The image could be a photograph, a manipulated image, a screenshot, a newspaper clipping, or a meme. We have to devise a solution, deployable at population scale, that extracts claims out of these images using a mix of automated and manual methods.

Some ideas on what type of functionality ML can enable:

  1. Extract text from images using Tesseract or Google Cloud Vision
  2. Use multimodal model(s) to describe the image and see how well they do for our use case
  3. Extract entity names from images
  4. Fine-tune multilingual and multimodal models for our context

This is meant to be a time-bound 5-day spike with the goal of learning as much as possible about how the state of the art in LLMs and ML can help us with claim extraction. We would like to include a working prototype so that we get a good sense of system requirements and pricing. As such, evaluating paid proprietary solutions like ChatGPT could also be part of this.

dennyabrain commented 7 months ago

Tuesday, Wednesday Spike

Check in on Wednesday 11 am.

duggalsu commented 7 months ago

Tested Tesseract with English and Hindi LSTM models with multiple psm settings. Hindi OCR does not work with the legacy+LSTM option. Tesseract still cannot handle multi-column text from images. Refer: https://muthu.co/all-tesseract-ocr-options/

Tested image pre-processing, which degraded text quality and did not improve OCR.

Relatively good OCR otherwise for English and Hindi on images
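A minimal pytesseract sketch of the kind of configuration tested (the image path is a placeholder, and the psm values are just ones worth comparing):

```python
# Sketch of the Tesseract configuration described above. Assumes
# pytesseract plus the `eng` and `hin` traineddata files are installed;
# the image path is a placeholder.
import pytesseract
from PIL import Image

image = Image.open("sample_clipping.png")

# --oem 1 selects the LSTM engine; psm controls page segmentation.
# No psm setting handled multi-column layouts in our tests.
for psm in (3, 4, 6, 11):
    config = f"--oem 1 --psm {psm}"
    text = pytesseract.image_to_string(image, lang="eng+hin", config=config)
    print(f"psm={psm}:\n{text}\n")
```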

aatmanvaidya commented 7 months ago

Tested out various models on Hugging Face and looked at Large Vision Models (LVM) for vision. Will attach links soon

dennyabrain commented 7 months ago

Wednesday:

duggalsu commented 7 months ago

It's called disfluency correction for our purpose. Refer: https://www.semanticscholar.org/search?q=disfluency%20correction&sort=relevance

duggalsu commented 7 months ago

The technical term for image understanding is image captioning. Refer: https://en.wikipedia.org/wiki/Natural_language_generation#Image_captioning
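For a quick baseline, a minimal captioning sketch with the transformers image-to-text pipeline, using the vit-gpt2 model listed later in this thread (the image path is a placeholder):

```python
# Minimal image-captioning sketch; the image path is a placeholder.
from transformers import pipeline

captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
print(captioner("sample_poster.jpg"))
# e.g. [{'generated_text': 'a man standing next to a building'}]
```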

dennyabrain commented 7 months ago

Thanks. I think an added layer of automation that would make for useful claim extraction is detecting the entities (people/landmarks) in a picture. So instead of the extracted claim being "a man is standing next to a building", it would say "politician X is standing next to the Taj Mahal". We could create a dataset of persons of interest to facilitate this.

dennyabrain commented 7 months ago

Found this nice use of traditional image processing to segment portions from newspaper clippings - https://stackoverflow.com/questions/64241837/use-python-open-cv-for-segmenting-newspaper-article

Should also be useful for memes/posters with multiple text portions. I think these techniques might also be useful for segmenting portions of an image, and then those individual segments could be used for further matching queries.
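A rough sketch of the contour-based approach from that answer (kernel size and area threshold are guesses that would need tuning per image type):

```python
# Binarize, dilate to merge words into blocks, then crop the bounding
# boxes of the resulting contours. Thresholds are untuned guesses.
import cv2

image = cv2.imread("newspaper_clipping.png")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
_, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# A large rectangular kernel joins characters and words into text blobs.
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (9, 9))
dilated = cv2.dilate(thresh, kernel, iterations=4)

contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for i, c in enumerate(contours):
    x, y, w, h = cv2.boundingRect(c)
    if w * h > 5000:  # skip tiny specks
        cv2.imwrite(f"segment_{i}.png", image[y:y + h, x:x + w])
```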

aatmanvaidya commented 7 months ago

Identify the 5 most popular categories of images

Categories I could come up with -

  1. Newspaper Clippings
  2. Screenshots - these could be of social media posts, the Inshorts news app, WhatsApp messages, Facebook posts, tweets, etc. Some of these also include memes
  3. Information posters - posters communicating some kind of information, like India's GDP growth, facts about a topic, details about how a political party led to development, or info around sports
  4. Letters - complaint letters or information letters, letters to the government regarding some issue
  5. Other - news headlines

(In the dataset, I saw some images repeat)

Extract Text from Images (Vision Encoder Decoder Models)

  1. nougat-base - a Donut-based model to extract text from images. Works well for short English text, fails when the text is long (newspaper clippings etc.). Doesn't work for Indic languages.
  2. A few other models - perform poorly on both English and Hindi text in images.
  3. Transformer-based OCRs - decent text extraction for short English text, performed poorly for Hindi text in images. Some gibberish pops out.
  4. Awesome Transformer Based OCR - https://github.com/EriCongMa/awesome-transformer-ocr
  5. LayoutLM - https://huggingface.co/impira/layoutlm-document-qa - this is more for image understanding, but can also sometimes extract text - doesn't work for Indic languages.
  6. Azure AI Vision - https://learn.microsoft.com/en-us/azure/ai-services/computer-vision/overview-ocr - this supports Hindi as per the Microsoft article.
  7. A question - what tools are others using for similar applications?
  8. EasyOCR - https://github.com/JaidedAI/EasyOCR (see the sketch after this list)
  9. Keras OCR - https://github.com/faustomorales/keras-ocr
  10. Multilingual OCR for Indic Scripts
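For point 8, a minimal EasyOCR sketch for the Hindi + English case (the image path is a placeholder; the first run downloads the detection and recognition models):

```python
import easyocr

reader = easyocr.Reader(["hi", "en"])  # Hindi + English
results = reader.readtext("screenshot.png")  # list of (bbox, text, confidence)
for bbox, text, confidence in results:
    print(f"{confidence:.2f}  {text}")
```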

Detect the entities (people/landmarks) in a picture

  1. VisualBERT - https://github.com/huggingface/transformers/tree/main/examples/research_projects/visual_bert
  2. https://huggingface.co/nlpconnect/vit-gpt2-image-captioning (Aurora has also found this)
  3. https://huggingface.co/google/vit-base-patch16-224 - prints out different objects in an image.
  4. https://huggingface.co/openai/clip-vit-base-patch32 - this is the CLIP model by OpenAI. The only drawback I found is that we have to input the prediction labels, and it then computes which of the labels has the highest chance of being in the photo (see the sketch after this list).
  5. GIT (GenerativeImage2Text) based models - describe what the image is about
  6. Vision-and-Language Transformer (ViLT) - https://huggingface.co/dandelin/vilt-b32-finetuned-vqa
    • The best part of this model is that you can ask it questions about the image, like "What is on the top of the tower?", "What is the man eating?" etc.
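For point 4, a sketch of the zero-shot label scoring CLIP does (the labels and image path are illustrative placeholders, not a tested label set):

```python
# Score an image against candidate labels with CLIP. The drawback noted
# above: the labels must be supplied up front.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a newspaper clipping", "a social media screenshot", "a poster", "a meme"]
image = Image.open("sample.png")
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{p:.2f}  {label}")
```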

Gibberish Text Detection

  1. https://stackoverflow.com/questions/68867789/python-pytesseract-module-returning-gibberish-from-an-image
  2. https://stackoverflow.com/questions/57377470/tesseract-showing-gibberish
  3. https://stackoverflow.com/questions/39835546/how-to-remove-gibberish-that-exhibits-no-pattern-using-python-nltk
  4. https://medium.com/analytics-vidhya/text-processing-tools-i-wish-i-knew-earlier-a6960e16a9c9
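A naive heuristic sketch in the spirit of these links: score each line by the fraction of characters that are letters or spaces, and drop lines below a cutoff (the 0.7 threshold is an arbitrary starting point, not a tested value):

```python
def looks_gibberish(line: str, threshold: float = 0.7) -> bool:
    # Flag lines where too few characters are letters/spaces.
    # isalpha() also accepts Devanagari, so Hindi text passes.
    if not line.strip():
        return True
    clean = sum(ch.isalpha() or ch.isspace() for ch in line)
    return clean / len(line) < threshold

ocr_output = ["The minister said the project is complete", "~~#@!% |||j kq zx", ""]
print([line for line in ocr_output if not looks_gibberish(line)])
```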

Large Vision Models (LVM)

  1. Sequential Modeling Enables Scalable Learning for Large Vision Models.
  2. LVM-Med
  3. LayoutLMV2

Other

  1. https://huggingface.co/blog/vision_language_pretraining
  2. https://huggingface.co/docs/transformers/main/en/model_doc/vision-encoder-decoder
  3. An encoder-decoder based framework for hindi image caption generation
  4. A Scaled Encoder Decoder Network for Image Captioning in Hindi

GPT4-Vision

  1. The documentation itself has good examples of how to use the Vision API - https://platform.openai.com/docs/guides/vision (see the sketch after this list)
  2. GPT-4-Vision Interesting Uses and Examples Thread (2023) - A great code example on how to use GPT4-Vision
  3. https://tmmtt.medium.com/how-to-use-gpt-4-vision-api-ba6b57af569c
  4. Various use case examples of GPT4 Vision with python code - https://github.com/Anil-matcha/GPT-4-Vision-Chatbot/tree/main
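A hedged sketch of the Vision API call from the docs in point 1 (model name as of the docs at the time; the prompt and image URL are placeholders, and OPENAI_API_KEY must be set):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image and transcribe any text in it."},
            {"type": "image_url", "image_url": {"url": "https://example.com/sample.png"}},
        ],
    }],
    max_tokens=500,
)
print(response.choices[0].message.content)
```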

SAM

  1. https://github.com/kadirnar/segment-anything-video
  2. https://blog.roboflow.com/how-to-use-segment-anything-model-sam/
  3. https://colab.research.google.com/github/roboflow-ai/notebooks/blob/main/notebooks/how-to-segment-anything-with-sam.ipynb
  4. Many CV models (even for segment) - https://github.com/roboflow/notebooks
  5. https://www.youtube.com/watch?v=D-D6ZmadzPE
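A minimal sketch of SAM's automatic mask generation, following the Roboflow notebook above (the checkpoint file must be downloaded separately; paths are placeholders):

```python
import cv2
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("poster.png"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # dicts with 'segmentation', 'bbox', 'area'
print(f"{len(masks)} segments found")
```
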
duggalsu commented 7 months ago

"Disfluency correction" is for text output from Automatic Speech Recognition (ASR). It will not capture OCR text errors

We should look at "OCR correction"

"Image Captioning" Tried a few existing popular models

dennyabrain commented 7 months ago

@aatmanvaidya can you try out two things that I believe will be useful pre-processing steps regardless of what model we use:

  1. Segmenting images to split an image into its components - these could be pictures (in an information poster), text blobs (in newspaper clippings), etc.
  2. Face detection and saving the face to a different file (see the sketch after this list)
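A sketch of point 2 using OpenCV's bundled Haar cascade (parameters are defaults, not tuned values; the image path is a placeholder):

```python
# Detect faces and write each one to its own file.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)
image = cv2.imread("poster.png")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for i, (x, y, w, h) in enumerate(faces):
    cv2.imwrite(f"face_{i}.png", image[y:y + h, x:x + w])
```
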
aatmanvaidya commented 7 months ago

Summary

From my perspective, a rough pipeline that could be followed:

Once we have the image, we could follow a process like this

  1. Identify text regions using simple image processing techniques.
    • Text extraction tools cannot extract text properly when it is laid out in columns (newspaper clippings are a popular example)
  2. Extract text from the identified text portions using Tesseract or EasyOCR
  3. Remove gibberish from the text (a self-contained sketch of these three steps follows)
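A sketch of those three steps chained together, reusing the contour segmentation and gibberish heuristic from earlier comments (all thresholds are untuned guesses):

```python
import cv2
import pytesseract

def extract_claim_text(image_path: str) -> list[str]:
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    _, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (9, 9))
    dilated = cv2.dilate(thresh, kernel, iterations=4)  # step 1: find text blocks
    contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    texts = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if w * h < 5000:
            continue  # ignore tiny regions
        crop = image[y:y + h, x:x + w]
        text = pytesseract.image_to_string(crop, lang="eng+hin")  # step 2: OCR
        clean = sum(ch.isalpha() or ch.isspace() for ch in text)
        if text.strip() and clean / len(text) > 0.7:  # step 3: drop gibberish
            texts.append(text.strip())
    return texts

print(extract_claim_text("newspaper_clipping.png"))
```
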
dennyabrain commented 7 months ago

Swair had a long response to this; I am cherry-picking insights and typing them here:

  1. For our image segmentation task, Swair recommended Meta's SAM model
  2. He said paying for GPT-4 Vision could be an interesting exercise to compare performance. His back-of-the-napkin calculation was that it should cost about $3.50 per 1,000 images.

He also said our approach of segmenting relevant portions and indexing them might be interesting/publishable.

dennyabrain commented 7 months ago

@aatmanvaidya @duggalsu can each write a 5-line blurb summarizing the various text extraction models and libraries they used and their conclusions.

dennyabrain commented 7 months ago

References shared on the call:
https://cdt.org/insights/lost-in-translation-large-language-models-in-non-english-content-analysis/
https://cvit.iiit.ac.in/

aatmanvaidya commented 7 months ago

Summary of the CDT report

dennyabrain commented 7 months ago

We should use today to test out the remaining solutions:

  1. I have the GPT-4 keys, so we should be able to try out GPT-4 Vision
  2. Let's see how the SAM model from above performs
  3. Also try out Google Cloud Vision, especially to check how it performs on Indic languages (see the sketch after this list)
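For point 3, a sketch of the Google Cloud Vision OCR call (requires GOOGLE_APPLICATION_CREDENTIALS to be configured; the image path is a placeholder):

```python
from google.cloud import vision

client = vision.ImageAnnotatorClient()
with open("hindi_clipping.png", "rb") as f:
    image = vision.Image(content=f.read())

# document_text_detection is the dense-text variant; Devanagari is among
# the supported scripts, which is exactly what we want to verify.
response = client.document_text_detection(image=image)
print(response.full_text_annotation.text)
```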

End of Spike Requirements: Put together a self-contained slide deck with all your findings. We would like to keep it handy when we talk about the status of, and possibilities for, claim extraction work. I think a good way to structure the slides would be to have sections on the problem statements "Extract text from image" and "Caption an image", and then mention the technique(s) used and the results they gave. Fill it with as many examples as possible; it's best to be able to see them to truly understand. Share the good examples but also the really bad examples of the tech failing.

duggalsu commented 7 months ago

GPT-4 Vision does not seem good for any kind of OCR - it will not do OCR for copyrighted articles in English, and it does not work well for Hindi.

However, it can describe the image in detail, i.e., do "image captioning" very well - better than the previously tested Hugging Face models.

aatmanvaidya commented 6 months ago

https://github.com/VikParuchuri/surya

Surya - a SOTA tool for multilingual OCR

Surya is a multilingual document OCR toolkit. It can do:

  • Accurate line-level text detection
  • Text recognition (coming soon)
  • Table and chart detection (coming soon)

It works on a range of documents and languages (see usage and benchmarks for more details).