infiniflow / ragflow

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.
https://ragflow.io
Apache License 2.0

[Question]: Bad Parsing #1656

Closed Said-Apollo closed 1 month ago

Said-Apollo commented 2 months ago

Describe your problem

Hi there, during my testing it became increasingly clear that something is quite wrong with the parsing/OCR method. For example, when I input a 30-page scientific paper with some default parameters (using Paper as the ChunkingStrategy, and once Laws), the results were quite bad. An example image showing a chunk and the corresponding original passage is attached. I guess a fix for this was already attempted in issue #1407, but the performance is still quite bad.

image

For many chunks, the beginning and end are usually quite bad, but I also noticed that a lot of chunks are entirely bad, no matter which method I use.

Said-Apollo commented 2 months ago

I just saw that I cloned the repo around 2 weeks ago, and the parser was updated a few hours afterwards. I will have a look at it and write again.

Said-Apollo commented 2 months ago

After using the adapted script, the results got much better. However, the following points still need improvement:

image

image

Said-Apollo commented 2 months ago

Hi, this issue is still not completely fixed. See the example below (German text of an EU law):

image

Upon closer inspection, if one looks at the first three words on the left side (lichen oder Auftragsverarbeiter), where "lichen" is part of the word "Verantwortlichen", we can find them again in the 3rd row of the right side (marked in yellow). It's as if the sentences/words are mixed up. I updated the DeepDoc code last week, which improved the overall quality, but it is still not good enough.
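
The manual check described here (finding where a chunk's opening words sit in the original text) could be automated roughly as follows. This is only a minimal sketch: `locate_fragment` is a hypothetical helper, not part of RAGFlow or DeepDoc, and the sample `original` text is a made-up placeholder rather than the actual law text from the screenshot.

```python
from typing import Optional, Tuple


def locate_fragment(original: str, fragment: str) -> Optional[Tuple[int, int]]:
    """Return (line_number, column) of the first line of `original` that
    contains `fragment`, or None if no line contains it.

    Line numbers are 1-based, columns are 0-based. Runs of whitespace are
    collapsed to single spaces on both sides before comparing.
    """
    needle = " ".join(fragment.split())
    for line_no, line in enumerate(original.splitlines(), start=1):
        col = " ".join(line.split()).find(needle)
        if col != -1:
            return line_no, col
    return None


# Placeholder "original" text (an assumption for illustration only).
original = (
    "placeholder line one\n"
    "placeholder line two\n"
    "der Verantwortlichen oder Auftragsverarbeiter\n"
)

# Opening words of the chunk, as quoted in the comment above.
chunk_start = "lichen oder Auftragsverarbeiter"

print(locate_fragment(original, chunk_start))
```

If the reported position is not the start of a line or sentence in the original, the chunk begins in the middle of a word or sentence, which matches the symptom in the screenshot.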