netandreus opened this issue 3 months ago
Yes, I'm also interested in this!
When creating a knowledge base, there is an option to enable "Layout Analysis". Since this uses a visual language model (for images, or when a chunk does not contain enough text), it might work for Russian as well (although it is definitely improvable).
You could also try changing the threshold that determines when the visual model should be used to interpret the text.
@Said-Apollo I tried to do it, but it removed almost all the spaces in the text.
Here is a test document in Russian: dogovor_oferta.pdf
And here are parsing results:
When I input the PDF into my knowledge base, it only gives me a single chunk with a few words. However, after converting the PDF to a DOCX file, I got around 18 chunks.
Looking closer at the results, they seem mostly correct to me (although I'm not a Russian expert). Unfortunately, the source file is not shown next to them; I guess this is not supported for DOCX files yet. Maybe you could use this workaround until Russian is also supported. If you have lots of PDFs and are on Linux, I recommend simply running this command in a terminal:
lowriter --convert-to docx *.pdf
I made a quick-and-dirty fix for the spaces problem, for the Russian language only. The Deepdoc component, and in particular its pdf_parser class, no longer removes spaces in Russian text. The fix was merged in v0.11 and is already in the demo.
You can observe it if you turn off layout recognition and parse the example "dogovor_oferta.pdf" or any other Russian PDF document.
Unfortunately, spaces are still removed if layout recognition is left turned on. I think this happens when the stored text chunks are returned from the backend via the REST API, not during parsing. I am going to resolve this problem as well, probably by changing the rmSpace function.
There are many places in the project (link1, link2) where a string is processed differently depending on whether it matches the [0-9a-zA-Z...] regex for English. One of these differences is space removal: if the regex considers the string English, spaces are kept; otherwise they are removed.
If we want to support multiple languages, this logic should be changed, because matching the [0-9a-zA-Z...] regex is not the only case where spaces should be kept. There are other non-Latin languages, and groups of languages with other alphabets and writing systems, where spaces are needed:
My proposal is to replace the [0-9a-zA-Z...] regex with something like a should_remove_spaces(str) function and use it inside rmSpace(str). To my knowledge, spaces should only be removed for Chinese and Japanese.
I can add and use the fasttext-langdetect dependency for that: if the language is not detected as Chinese or Japanese, spaces are not removed. This Python library (fasttext-langdetect) could also be useful for other multilingual tasks in the future.
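A minimal sketch of what this could look like. The function names (should_remove_spaces, rm_space) mirror the proposal and are hypothetical; the project's actual rmSpace works differently, and a Unicode-range check stands in for fasttext-langdetect here so the example is self-contained:

```python
import re

# Scripts conventionally written without inter-word spaces.
# In a real patch this check could be replaced by fasttext-langdetect,
# keeping spaces unless the detected language is Chinese or Japanese.
CJK_RE = re.compile(
    r"[\u4e00-\u9fff"   # CJK Unified Ideographs
    r"\u3040-\u309f"    # Hiragana
    r"\u30a0-\u30ff]"   # Katakana
)

def should_remove_spaces(text: str) -> bool:
    """Return True only when the text is predominantly Chinese/Japanese."""
    if not text:
        return False
    cjk_chars = len(CJK_RE.findall(text))
    return cjk_chars / len(text) > 0.5

def rm_space(text: str) -> str:
    """Space-aware variant: strip spaces only for spaceless scripts,
    so Russian (and other spaced non-Latin scripts) stays intact."""
    if should_remove_spaces(text):
        return text.replace(" ", "")
    return text
```

With this approach, Russian text like "Договор оферта" keeps its spaces, while predominantly Chinese text still has them removed, instead of the current behavior where anything failing the [0-9a-zA-Z...] regex loses its spaces.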
Is there an existing issue for the same feature request?
Describe the feature you'd like
We have a lot of scientific materials that exist only in Russian (physics, psychology, etc.), and we would like to build a knowledge base and a chatbot on top of them. Do you plan to support the Russian language? Is there any way to add it myself?