run-llama / llama_parse

Parse files for optimal RAG
https://www.llamaindex.ai
MIT License

Error parsing books in other languages even when the language parameter is set in LlamaParse #262

Open JianXiao2021 opened 2 days ago

JianXiao2021 commented 2 days ago

This is an excerpt of my book (only one page, in Russian): Dolgan_ Язык норильских долган (Ubrjatova) 3.pdf

I put the PDF file in the 'tst' folder, and my code is:

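from llama_index.core import SimpleDirectoryReader
from llama_parse import LlamaParse

# Parser configured for plain-text output, with Russian as the document language
ru_parser1 = LlamaParse(result_type="text", language="ru")
documents = SimpleDirectoryReader("./tst", file_extractor={".pdf": ru_parser1}).load_data()
print(documents[0].text)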

Instead of Russian text I got something like this:

IPEIUCIOBVE
        B mpezaraeMo#tBHHaHD trarejefpadore OIECHBaETCH FBHK
TpYIH AqrAH , KOYeBaBux 4O KOHua 40 -X TOHOB                                 paitoneOoJBxx

It seems that even though I set the language parameter to Russian, LlamaParse still recognized the text as English.

When I tried another Russian book with the same code, documents[0].text was empty; no text was extracted from the PDF file: Orok_ Язык ороков (ульта) (Petrova) 22.pdf
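To tell the two symptoms apart programmatically (Latin look-alike output versus no output at all), a minimal sketch like the following could classify the extracted text. It uses only the Python standard library. The helper names `cyrillic_ratio` and `diagnose` are hypothetical, and the 0.5 threshold is an arbitrary assumption, not something LlamaParse provides.

```python
import unicodedata


def cyrillic_ratio(text: str) -> float:
    """Return the fraction of alphabetic characters that are Cyrillic."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    # unicodedata.name() returns e.g. "CYRILLIC CAPITAL LETTER YA" for "Я"
    cyrillic = sum(1 for ch in letters if "CYRILLIC" in unicodedata.name(ch, ""))
    return cyrillic / len(letters)


def diagnose(text: str, threshold: float = 0.5) -> str:
    """Classify extracted text as 'empty', 'cyrillic', or 'non-cyrillic'.

    The threshold is an assumed cutoff chosen for illustration only.
    """
    if not text.strip():
        return "empty"
    return "cyrillic" if cyrillic_ratio(text) >= threshold else "non-cyrillic"
```

Applied to the results above, `diagnose(documents[0].text)` would be expected to report the first book as non-Cyrillic and the second as empty, which makes it easier to check a whole folder of scans at once.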

Does LlamaParse not yet support OCR for this kind of scanned foreign-language PDF document? Or did I miss something?