infiniflow / ragflow

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.
https://ragflow.io
Apache License 2.0
21.35k stars, 2.09k forks

[Feature Request]: Russian language support #1987

Open netandreus opened 2 months ago

netandreus commented 2 months ago

Is there an existing issue for the same feature request?

Describe the feature you'd like

We have a lot of scientific materials that exist only in Russian (physics, psychology, etc.), and we would like to build a knowledge base and a chatbot on top of them. Do you plan to support the Russian language? Is there any way to add it myself?

Cricricrikets commented 2 months ago

Yes, I'm also interested in this!

Said-Apollo commented 2 months ago

When creating a knowledge base, there is an option to activate "Layout Analysis". Since this uses a visual language model (for images, or when a chunk contains too little text), it might work for Russian, although the results could definitely be improved.

Maybe you could try changing the "threshold" that determines when the visual model is used to interpret the text.

netandreus commented 2 months ago

@Said-Apollo I tried to do it, but it removed almost all the spaces in the text.

Here is a test document in Russian: dogovor_oferta.pdf

And here are the parsing results:

Screenshot 2024-08-19 at 15 46 27

Said-Apollo commented 2 months ago

When I added the PDF to my knowledge base, it produced only a single chunk with a few words. However, after converting the PDF to a docx file, it produced around 18 chunks.

Looking closer at the results, they seem somewhat correct to me (although I'm no expert in Russian). Unfortunately, the file is not shown next to the chunks; I guess this is not supported for docx files yet. Maybe you could try this workaround until Russian is supported? If you have lots of PDFs and are on Linux, I would recommend simply running this command in a terminal:

lowriter --convert-to docx *.pdf
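For anyone who wants to script the workaround, here is a minimal Python sketch that mirrors the command above, one file at a time. It assumes `lowriter` is on the PATH; the helper names `build_convert_command` and `convert_all` are made up for this example, and whether a given PDF converts cleanly depends on the LibreOffice installation.

```python
import subprocess
from pathlib import Path


def build_convert_command(pdf_path, out_dir):
    """Build the same lowriter call as in the thread, plus an output directory."""
    return ["lowriter", "--convert-to", "docx", "--outdir", str(out_dir), str(pdf_path)]


def convert_all(src_dir, out_dir):
    """Convert every *.pdf in src_dir; return the PDFs that produced no .docx."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        result = subprocess.run(
            build_convert_command(pdf, out_dir),
            capture_output=True,
            text=True,
        )
        # Treat a non-zero exit code or a missing output file as a failure.
        if result.returncode != 0 or not (out_dir / (pdf.stem + ".docx")).exists():
            failed.append(pdf)
    return failed
```

Calling `convert_all("pdfs", "docx_out")` would then return the list of files that need a closer look, which is handy when the batch is large.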

Hyperb0t commented 1 month ago

I made a quick-and-dirty fix for the spaces problem, for Russian only. The Deepdoc component, and its pdf_parser class in particular, no longer removes spaces in Russian text. The fix was merged in v0.11 and is already in the demo.

You can observe it if you turn off layout recognition and parse the example "dogovor_oferta.pdf" or any other Russian PDF document.

(Screenshots: turn-off-layot-recog, spaces)

Unfortunately, it still removes spaces if you leave layout recognition turned on. I think this happens when stored text chunks are returned from the backend via the REST API, not during parsing. I am going to resolve this problem as well, probably by changing the rmSpace function.

Hyperb0t commented 1 month ago

There are a lot of places in the project (link1, link2) where a string is processed differently depending on whether it matches the [0-9a-zA-Z...] regex for English. One of these differences is the removal of space characters: if the regex considers the string English, spaces are kept; otherwise they are removed.
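To make the effect concrete, here is a toy Python sketch of that kind of check. It is not the project's code: `join_tokens` is a hypothetical helper, and the character class `[0-9a-zA-Z]` is a simplified stand-in for the longer regex mentioned above.

```python
import re

# Simplified stand-in pattern: ASCII letters and digits only (an assumption).
ASCII_TOKEN = re.compile(r"[0-9a-zA-Z]+")


def join_tokens(tokens):
    """Keep spaces only when every token looks like ASCII alphanumeric text."""
    if all(ASCII_TOKEN.fullmatch(t) for t in tokens):
        return " ".join(tokens)
    # Anything else is treated as a language written without spaces.
    return "".join(tokens)


print(join_tokens(["hello", "world"]))     # hello world
print(join_tokens(["договор", "оферты"]))  # договороферты
```

Cyrillic words fail the ASCII check, so they are glued together exactly as in the parsing results reported earlier in this thread.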

If we want to support multiple languages, this logic should be changed, because matching the [0-9a-zA-Z...] regex is not the only case where spaces should be kept. There are other non-Latin languages, and groups of languages with other alphabets and writing systems, where spaces are needed.

My proposal is to replace the [0-9a-zA-Z...] regex with something like a "should_remove_spaces(str)" function and use it in the rmSpace(str) function. As far as I know, spaces should only be removed in Chinese and Japanese.

I can add the fasttext-langdetect dependency and use it for that. If the language is not recognized as Chinese or Japanese, spaces should not be removed. This Python library (fasttext-langdetect) could also be useful in the future for other multilingual tasks.
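A rough sketch of how the proposal could look, under some assumptions: `rm_space` is a hypothetical stand-in for the project's rmSpace, the `detect_lang` parameter is an optional callable returning a language code (for example, a thin wrapper around a language-detection library), and the fallback is a simple Unicode-range heuristic with a deliberately small, non-exhaustive set of ranges.

```python
import re

# Han ideographs, Hiragana, Katakana (illustrative ranges, not exhaustive).
CJ = "\u4e00-\u9fff\u3040-\u309f\u30a0-\u30ff"
CJ_CHAR = re.compile(f"[{CJ}]")
BETWEEN_CJ = re.compile(f"(?<=[{CJ}])\\s+(?=[{CJ}])")


def should_remove_spaces(text, detect_lang=None):
    """Return True only for text that looks Chinese or Japanese."""
    if detect_lang is not None:
        # detect_lang is assumed to return a code such as "zh", "ja" or "ru".
        return detect_lang(text) in {"zh", "ja"}
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    cj_count = sum(1 for ch in letters if CJ_CHAR.match(ch))
    return cj_count / len(letters) > 0.5


def rm_space(text, detect_lang=None):
    """Drop whitespace between Chinese/Japanese characters, keep it elsewhere."""
    if should_remove_spaces(text, detect_lang):
        return BETWEEN_CJ.sub("", text)
    return text
```

With this shape, `rm_space("договор оферты")` leaves the text untouched, while `rm_space("深度 文档 理解")` collapses it to `深度文档理解`. Keeping the detector injectable means the heuristic can be swapped for a real language-detection model without touching the callers.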