run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: How to solve the problem of garbled characters caused by pictures in the read doc #15636

Open cxycxm opened 2 weeks ago

cxycxm commented 2 weeks ago

Question

How can I fix the garbled characters that appear when reading a doc that contains pictures?

dosubot[bot] commented 2 weeks ago

To fix garbled characters caused by pictures in a document, you can run Optical Character Recognition (OCR) on the images to extract their text. LlamaIndex supports OCR using the pytesseract library. Here is an example of how you can handle this:

  1. Install the required libraries (pytesseract also needs the Tesseract OCR engine installed on the system):

    pip install pytesseract Pillow requests

  2. Define a process_image function that extracts text from an image:

    from io import BytesIO

    import pytesseract
    import requests
    from PIL import Image


    def process_image(link):
        """Download the image at `link` and return the text OCR finds in it."""
        text = ""
        try:
            # A timeout keeps a stalled download from hanging the reader.
            response = requests.get(link, timeout=30)
            if response.status_code == 200 and response.content:
                try:
                    image = Image.open(BytesIO(response.content))
                    text = pytesseract.image_to_string(image)
                except OSError:
                    print(f"Error processing image at {link}: Unable to open or read the image content.")
            else:
                print(f"Error fetching image at {link}: HTTP status code {response.status_code}.")
        except requests.exceptions.RequestException as e:
            print(f"Request error while fetching image at {link}: {e}")
        return text

This function fetches the image from the provided link, runs pytesseract on it, and returns the extracted text, or an empty string if the download or the image processing fails [1].
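If the pictures are embedded in the document itself rather than reachable by a link, the same OCR step can be applied to the embedded image bytes. The following is a minimal sketch, assuming the document is a .docx file (a ZIP archive that stores its pictures under word/media/). The helper names extract_docx_images, ocr_image_bytes, and ocr_docx_images are hypothetical and are not part of LlamaIndex.

```python
import zipfile
from io import BytesIO
from typing import Callable, Dict

# Assumption: these suffixes cover the image types of interest.
IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tiff")


def extract_docx_images(docx_file) -> Dict[str, bytes]:
    """Return {archive_name: raw_bytes} for images stored under word/media/."""
    images = {}
    with zipfile.ZipFile(docx_file) as archive:
        for name in archive.namelist():
            if name.startswith("word/media/") and name.lower().endswith(IMAGE_SUFFIXES):
                images[name] = archive.read(name)
    return images


def ocr_image_bytes(data: bytes) -> str:
    """OCR one image with pytesseract; return "" if the image cannot be opened."""
    # Imported lazily so the extraction helper works without the OCR stack.
    import pytesseract
    from PIL import Image

    try:
        image = Image.open(BytesIO(data))
    except OSError:
        return ""
    return pytesseract.image_to_string(image)


def ocr_docx_images(docx_file, ocr: Callable[[bytes], str] = ocr_image_bytes) -> Dict[str, str]:
    """Map each embedded image name to the text that `ocr` extracts from it."""
    return {name: ocr(data) for name, data in extract_docx_images(docx_file).items()}
```

Calling ocr_docx_images("report.docx") (a hypothetical file name) would return one text entry per embedded picture, which can then be added to the document text in place of the unreadable image data.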

Additionally, make sure your document reader decodes text as UTF-8 and handles decoding errors explicitly, since text decoded with the wrong encoding also shows up as garbled characters [2][3].
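One way to handle decoding explicitly is to try UTF-8 first and fall back in a controlled way. This is a minimal sketch using only the standard library. The decode_text helper is hypothetical, and the cp1252 fallback is an assumption to replace with whatever legacy encoding your files actually use.

```python
from typing import Sequence


def decode_text(raw: bytes, encodings: Sequence[str] = ("utf-8-sig", "cp1252")) -> str:
    """Decode bytes by trying each encoding in order.

    "utf-8-sig" decodes plain UTF-8 and also strips a leading byte order mark.
    If every candidate fails, undecodable bytes are replaced with U+FFFD
    instead of raising, so the rest of the text stays readable.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")


def read_text_file(path: str) -> str:
    """Read a file as bytes and decode it with decode_text."""
    with open(path, "rb") as handle:
        return decode_text(handle.read())
```

Reading the file as bytes and decoding afterwards keeps the choice of encoding in one place, rather than relying on the platform default used by open() in text mode.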
