infiniflow / ragflow

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.
https://ragflow.io
Apache License 2.0

[Bug]: For text PDF files, OCR reduces the recognition rate. If this cannot be fixed, it is recommended to add an option to enable OCR in the 【Chunk Method】. #1399

Open chinamerp opened 2 weeks ago

chinamerp commented 2 weeks ago

Is there an existing issue for the same bug?

Branch name

main

Commit ID

de61009

Other environment information

No response

Actual behavior

'23年\u30002月铀\u3000\u3000矿\u3000\u3000\u3000\u3000\u3000地\u3000\u3000质Vol.23No.220073Urani umGeol ogy Mar . 2007鄂尔多斯盆地直罗组赋铀沉积相与油气蚀变带的时空配置①漆富成,秦明宽,刘武生,肖树青,王志明,邹顺根,黄净白(核工业北京地质研究院,北京\u3000100029)法[论]证了赋存于直罗组地层中的C烃1/气∑来C源、于C1三/C叠2 系上统湖相生油岩系(Z 中与油型、Φ气同源而伴生的凝析)气,藏论述了直罗组下段的褪色蚀变漂白是渗逸到该层位中的酸解烃类气体对岩石发生蚀变作用的结,果提出沉积相组合与油气阻滞储(集域)的空间配置是油气蚀变带与铀矿化层位空间定位的制约因素,-关键。词[文章编]号\u3000\u3000中图分类号-\u3000\u3000文献标[]1000-0658(2007)02-0065-06[]P593[]A鄂镇尔砂多岩斯是盆该地盆中地侏的罗主世要直铀罗矿组化下层段位中的该砂粉岩砂)岩,统夹称泥为岩七;上里岩镇段砂由岩灰,上绿部色为砂灰岩绿和色灰泥绿质色七层里位在空间上包括了褪色蚀变带漂白。带泥岩组成下岩段发育辫状河流相河道亚相与部分原生带铀矿化范围受褪色(带发育范)砂体其下。有延安组湖沼相泥岩和煤层作为隔围和上部绿色,蚀变带的控制本文依据直罗水层,其上以河间泛洪平原相河漫滩相和滨组中酸解烃的组成来源成。因类型和聚储浅湖,相泥质粉砂岩泥岩稳定、发育为特征既规律从沉积相组合、与油、气阻滞储集域配置可作为层间氧化带、发育的隔水层又是后,期的角,度探讨了烃气蚀变作用与-铀矿化的空油气渗逸的隔挡层这种泥砂泥,煤结构,的岩性岩相组合。使夹持于-其-间的(直)罗组下间定位特征。亚段砂-体十分有利,于形成层间氧化带砂岩型铀矿化也正是其顶部稳定发育的河漫滩相1\u3000直直罗罗组组下下段段不岩整性合于岩延相安组组合之粉砂质,泥岩和滨浅湖相泥岩对渗逸油气的隔特征可以划分为上下两个岩段,下岩段挡和阻滞作用构成了直罗组下段岩性岩相-岩的相下部为灰色砂岩中部、为灰白色砂。岩漂白组合与油气阻,滞储集域的空间配置-,(-。①果“”(H340-1)收。[作者简介]2漆00富6-0成4-10事[铀矿床地球]化学研究(1962-),,(),2003,,从。(C)1994-2022 China Academic Journal Electronic Publishing House. All rights reserved. hltp://www.cn'

debug

Expected behavior

'第23卷2007年\u3000第2期3月铀\u3000\u3000矿\u3000\u3000地\u3000\u3000质Urani um\u3000\u3000\u3000Geol ogy Vol.23Mar . No.22007鄂尔多斯盆地直罗组赋铀沉积相与油气蚀变带的时空配置①漆富成,秦明宽,刘武生,肖树青,王志明,邹顺根,黄净白(核工业北京地质研究院,北京\u3000100029)[摘要]本文依据酸解烃特征参数C1/∑C、C1/C+2 及酸解烃判别分析(Z 值法、Φ值法及判别因子法),论证了赋存于直罗组地层中的烃气来源于三叠系上统湖相生油岩系中与油型气同源而伴生的凝析气藏,论述了直罗组下段的褪色蚀变(漂白)是渗逸到该层位中的酸解烃类气体对岩石发生蚀变作用的结果,提出沉积相组合与油气阻滞-储集域的空间配置是油气蚀变带与铀矿化层位空间定位的制约因素。[关键词]鄂尔多斯盆地;油气蚀变作用;沉积相组合;油气阻滞-储集域配置[文章编号]1000-0658(2007)02-0065-06\u3000\u3000[中图分类号]P593\u3000\u3000[文献标识码]A①国防科工委核能开发项目“鄂尔多斯盆地地浸砂岩型铀矿成矿环境及综合预测评价”(地H340-1)部分成果。[收稿日期]2006-04-10[作者简介]漆富成(1962-),男,高级工程师(研究员级),2003年毕业于日本大学,理学博士,主要从事铀矿床地球化学研究。鄂尔多斯盆地中侏罗世直罗组下段中的七里镇砂岩是该盆地的主要铀矿化层位。该层位在空间上包括了褪色蚀变带(漂白带)与部分原生带,铀矿化范围受褪色带发育范围和上部绿色蚀变带的控制。本文依据直罗组中酸解烃的组成、来源、成因类型和聚储规律,从沉积相组合与油气阻滞-储集域配置的角度,探讨了烃气蚀变作用与铀矿化的空间定位特征。1\u3000直罗组下段岩性岩相组合直罗组下段不整合于延安组之上,按岩性-岩相特征可以划分为上、下两个岩段。下岩段的下部为灰色砂岩,中部为灰白色砂岩(漂白砂岩),统称为七里镇砂岩,上部为灰绿色泥质粉砂岩夹泥岩;上岩段由灰绿色砂岩和灰绿色泥岩组成。下岩段发育辫状河流相河道亚相砂体,其下有延安组湖沼相泥岩和煤层作为隔水层,其上以河间泛洪平原相、河漫滩相和滨浅湖相泥质粉砂岩、泥岩稳定发育为特征,既可作为层间氧化带发育的隔水层,又是后期油气渗逸的隔挡层。这种泥-砂-泥(煤)结构的岩性-岩相组合,使夹持于其间的直罗组下亚段砂体十分有利于形成层间氧化带砂岩型铀矿化,也正是其顶部稳定发育的河漫滩相粉砂质泥岩和滨浅湖相泥岩对渗逸油气的隔挡和阻滞作用,构成了直罗组下段岩性-岩相组合与油气阻滞-储集域的空间配置。'

pdf

Steps to reproduce

Upload a text PDF file, then parse the file.

Additional information

No response

octoberweb69 commented 2 weeks ago

If the text in the PDF is readable, why use OCR?

KevinHuSh commented 2 weeks ago

Actually, it is not the result of OCR. If text can be read from the PDF, OCR is not used. The problem is that sometimes the text read from the PDF is not good enough, as in your case. And there is no method to judge whether it is good enough.
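For illustration only, here is a minimal sketch of one way embedded PDF text could be scored. It is not RAGFlow code: the function name `suspect_ratio` and the choice of Unicode categories are assumptions. It only flags encoding damage (replacement characters, private-use or control code points), so it would not catch a reading-order problem like the one reported in this issue.

```python
import unicodedata

# Unicode general categories that rarely appear in clean extracted text.
# This set is an assumption chosen for illustration, not a RAGFlow setting.
SUSPECT_CATEGORIES = {"Co", "Cn", "Cc", "Cs"}
ALLOWED_CONTROLS = {"\n", "\r", "\t"}


def suspect_ratio(text: str) -> float:
    """Return the share of characters that look like extraction debris.

    Hypothetical helper: counts U+FFFD plus private-use, unassigned,
    surrogate and control characters (except common whitespace controls).
    """
    if not text:
        # Nothing was extracted, so treat the page as fully suspect.
        return 1.0
    bad = 0
    for ch in text:
        if ch in ALLOWED_CONTROLS:
            continue
        if ch == "\ufffd" or unicodedata.category(ch) in SUSPECT_CATEGORIES:
            bad += 1
    return bad / len(text)
```

A caller might fall back to OCR only when the ratio exceeds some threshold; any such threshold would have to be tuned on real documents.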

chinamerp commented 2 weeks ago

> And there is no method to judge whether it is good enough.

In this case, you can provide an option and let users judge it.
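A rough sketch of what such an option could look like, assuming a hypothetical `ParseOptions` setting and a hypothetical `page_text` helper (neither exists in RAGFlow as far as this thread shows). The `layout_merge` argument stands in for whatever layout-based processing the parser normally applies.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ParseOptions:
    # Hypothetical user-facing flag, not an existing RAGFlow setting.
    trust_embedded_text: bool = False


def page_text(
    chars: List[Dict[str, str]],
    layout_merge: Callable[[List[Dict[str, str]]], str],
    options: ParseOptions,
) -> str:
    """Return the page text according to the user's choice."""
    if options.trust_embedded_text:
        # Keep the characters exactly in the order the PDF provides them.
        return "".join(c["text"] for c in chars)
    # Otherwise defer to the layout-based merge supplied by the caller.
    return layout_merge(chars)
```

With the flag off, behaviour stays as it is today; with it on, the user accepts the embedded text as-is and takes responsibility for judging its quality.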

chinamerp commented 2 weeks ago

> Actually, it is not the result of OCR. If text can be read from the PDF, OCR is not used.

_**The `chars` parameter is the text from the PDF, and it is good enough. The merge operation makes the results worse. You could provide an option to disable the OCR merge, so that users can judge for themselves.**_
def __ocr(self, pagenum, img, chars, ZM=3):
    bxs = self.ocr.detect(np.array(img))
    if not bxs:
        self.boxes.append([])
        return
    bxs = [(line[0], line[1][0]) for line in bxs]
    bxs = Recognizer.sort_Y_firstly(
        [{"x0": b[0][0] / ZM, "x1": b[1][0] / ZM,
          "top": b[0][1] / ZM, "text": "", "txt": t,
          "bottom": b[-1][1] / ZM,
          "page_number": pagenum} for b, t in bxs if b[0][0] <= b[1][0] and b[0][1] <= b[-1][1]],
        self.mean_height[-1] / 3
    )

    # merge chars in the same rect
    for c in Recognizer.sort_X_firstly(