infiniflow / ragflow

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.
https://ragflow.io
Apache License 2.0

Fix parsing of spaces in Russian-language PDFs (#1987) #2427

Closed Hyperb0t closed 1 week ago

Hyperb0t commented 2 weeks ago

What problem does this PR solve?

#1987

When scanning PDF files character by character, the parser excluded spaces if the string did not match a regex. Text from Russian documents needs those spaces, but it does not match the regex because it uses a different alphabet. As a result, Russian PDFs were parsed incorrectly and were almost unusable as a source. This PR fixes that by adding the Russian alphabet to the regex.
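The behaviour described above can be illustrated with a minimal sketch. It is not the parser's actual code: the function name `join_chars` and both patterns are hypothetical stand-ins, assuming the check looks at the character just before the space.

```python
import re

# Hypothetical patterns; the real regexes in the parser are not shown in this PR text.
LATIN_ONLY = re.compile(r"[0-9a-zA-Z,.?;:!%]")
WITH_CYRILLIC = re.compile(r"[0-9a-zA-Zа-яА-ЯёЁ,.?;:!%]")


def join_chars(chars, keep_space_after):
    """Concatenate characters one by one, keeping a space only when the
    character before it matches `keep_space_after`."""
    out = ""
    for ch in chars:
        if ch == " ":
            # Assumed rule: a space survives only after a matching character.
            if out and keep_space_after.match(out[-1]):
                out += " "
            continue
        out += ch
    return out


sample = list("Привет мир")
print(join_chars(sample, LATIN_ONLY))     # -> Приветмир
print(join_chars(sample, WITH_CYRILLIC))  # -> Привет мир
```

Note that `ё` and `Ё` sit outside the contiguous `а-я` / `А-Я` code point ranges, so the sketch lists them explicitly.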

There might be similar problems with other languages that use different alphabets. I additionally tested a PDF in Spanish, and the old [a-zA-Z...] regex parses it correctly, with spaces preserved.
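For the other-alphabets concern, one possible direction (not what this PR does) is to replace the per-alphabet character class with a Unicode-aware check. The sketch below is an assumption-laden illustration: `keeps_space_after` and `join_chars_unicode` are hypothetical names, and the punctuation set is a guess at what the existing regex allows.

```python
def keeps_space_after(prev_char: str) -> bool:
    """Return True if a space following `prev_char` should be kept.

    str.isalnum() is Unicode-aware, so letters and digits from any script
    (Cyrillic, Greek, accented Latin, ...) qualify without listing each alphabet.
    """
    return prev_char.isalnum() or prev_char in ",.?;:!%"


def join_chars_unicode(chars):
    """Concatenate characters, keeping spaces after any letter, digit,
    or the assumed punctuation set."""
    out = ""
    for ch in chars:
        if ch == " ":
            if out and keeps_space_after(out[-1]):
                out += " "
            continue
        out += ch
    return out


for text in ("Привет мир", "Καλημέρα κόσμε", "está bien"):
    print(join_chars_unicode(list(text)))
```

Whether such a broad check is safe for scripts that do not separate words with spaces would need testing against real documents.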

Type of change