run-llama / llama_parse

Parse files for optimal RAG
https://www.llamaindex.ai
MIT License
2.41k stars, 245 forks

Unable to read vertically oriented Traditional Chinese text #295

Open tkcoding opened 1 month ago

tkcoding commented 1 month ago

Issue: Parsing a document with vertically oriented Chinese text returns no extracted content.

Code to reproduce:

from llama_parse import LlamaParse
from llama_parse.utils import Language

# Parse with the OCR language set to Traditional Chinese.
parser = LlamaParse(language=Language.TRADITIONAL_CHINESE, verbose=True)
doc = parser.get_json_result("sample_chinese.pdf")
# Print the per-page results of the first document.
print(doc[0]["pages"])

File: sample_chinese.pdf

Output:

[{'page': 1, 'text': '', 'md': '', 'images': [], 'items': []},
 {'page': 2, 'text': '', 'md': '', 'images': [], 'items': []},
 {'page': 3, 'text': '', 'md': '', 'images': [], 'items': []},
 {'page': 4, 'text': '', 'md': '', 'images': [], 'items': []},
 {'page': 5, 'text': '', 'md': '', 'images': [], 'items': []},
 {'page': 6, 'text': '', 'md': '', 'images': [], 'items': []}]

Expected output: the Traditional Chinese text, extracted in the correct reading order.

hexapode commented 1 month ago

Thanks for reporting.

It seems we are not able to extract the images from this PDF to send them to the OCR model. We are now looking into a fix.

tkcoding commented 1 month ago

@hexapode

Alternatively, it seems that GPT-4o can translate it. Is there a way to set up a fallback, so that when the image OCR returns no results for a page that is not actually blank, the document is redirected to gpt4o_mode=True to return the results?

That said, the GPT-4o output is wrong as well: it duplicates much of the text at different locations.

from llama_parse import LlamaParse

parser = LlamaParse(gpt4o_mode=True, verbose=True)
doc = parser.get_json_result("sample_chinese.pdf")
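
Until something like this exists in the library, the fallback asked about above could be approximated on the client side. The following is only a minimal sketch, not part of llama_parse: `page_is_empty` and `parse_with_fallback` are hypothetical helper names, and the sketch assumes the JSON result has the same shape as the output shown earlier in this thread (a list of documents, each with a `pages` list of dicts carrying `text`, `md`, and `items`). It cannot tell a genuinely blank page from a failed OCR pass, so it only falls back when every page comes back empty.

```python
from typing import Any, Dict, List

Page = Dict[str, Any]


def page_is_empty(page: Page) -> bool:
    """Return True when a page dict carries no text, markdown, or items."""
    has_text = bool(str(page.get("text", "")).strip())
    has_md = bool(str(page.get("md", "")).strip())
    has_items = bool(page.get("items"))
    return not (has_text or has_md or has_items)


def parse_with_fallback(path: str, primary: Any, fallback: Any) -> List[dict]:
    """Parse with `primary`; if every page is empty, retry with `fallback`.

    Both parsers only need a `get_json_result(path)` method that returns
    a list of documents shaped like the output quoted in this thread.
    """
    result = primary.get_json_result(path)
    pages = [page for doc in result for page in doc.get("pages", [])]
    # Keep the primary result as soon as at least one page has content.
    if pages and not all(page_is_empty(page) for page in pages):
        return result
    # Otherwise assume OCR failed for the whole file and try the fallback.
    return fallback.get_json_result(path)


# Hypothetical wiring with the two parsers from this thread:
#   primary = LlamaParse(language=Language.TRADITIONAL_CHINESE, verbose=True)
#   fallback = LlamaParse(gpt4o_mode=True, verbose=True)
#   doc = parse_with_fallback("sample_chinese.pdf", primary, fallback)
```

Note that this parses the file a second time whenever the first pass is empty, and it does nothing about the duplicated text in the gpt4o_mode output.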