infiniflow / ragflow

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.
https://ragflow.io
Apache License 2.0
18.18k stars 1.84k forks

[Bug]: Text location parsed from PDF isn't matched with the location parsed by OCR #2622

Open dhking opened 20 hours ago

dhking commented 20 hours ago

Is there an existing issue for the same bug?

Branch name

master

Commit ID

Other environment information

No response

Actual behavior

The text coordinates extracted from the PDF do not agree with the text coordinates extracted by OCR, so the two cannot be matched.

Expected behavior

No response

Steps to reproduce

The text coordinates extracted from the PDF do not agree with the text coordinates extracted by OCR, so the two cannot be matched.

        for c in Recognizer.sort_X_firstly(
                chars, self.mean_width[pagenum - 1] // 4):
            ii = Recognizer.find_overlapped(c, bxs)
            if ii == 24:
                print(c)
            if ii is None:
                self.lefted_chars.append(c)
                continue
            ch = c["bottom"] - c["top"]
            if c["text"] != " " and c["text"] != "":
                bxs[ii]["font_height"] = c["height"]
            bh = bxs[ii]["bottom"] - bxs[ii]["top"]
            if abs(ch - bh) / max(ch, bh) >= 0.7 and c["text"] != ' ':
                self.lefted_chars.append(c)
                continue
            if c["text"] == " " and bxs[ii]["text"]:
                if re.match(r"[0-9a-zA-Z,.?;:!%%]", bxs[ii]["text"][-1]):
                    bxs[ii]["text"] += " "
            else:
                #bxs[ii]["text"] += c["text"]
                box_result = get_char_item(c,ii)
                box_results.append(box_result)
        # First, sort the data by "ii"
        sorted_data = sorted(box_results, key=lambda x: x["ii"])
        # Group the sorted items by "ii" with groupby
        grouped_data = {k: list(v) for k, v in groupby(sorted_data, key=lambda x: x["ii"])}
        # Write the combined text of each group back to its box
        for ii, items in grouped_data.items():
            sort_item = Recognizer.sort_Y_firstly(
                items, self.mean_width[pagenum - 1] // 3)
            texts = [item["text"] for item in sort_item]
            combined_text = ''.join(texts)  # Join the texts without a separator
            bxs[ii]["text"] = combined_text
        for b in bxs:
            if not b["text"]:
                left, right, top, bott = b["x0"] * ZM, b["x1"] * \
                                         ZM, b["top"] * ZM, b["bottom"] * ZM
                b["text"] = self.ocr.recognize(np.array(img),
                                               np.array([[left, top], [right, top], [right, bott], [left, bott]],
                                                        dtype=np.float32))
            del b["txt"]
        bxs = [b for b in bxs if b["text"]]
        if self.mean_height[-1] == 0:
            self.mean_height[-1] = np.median([b["bottom"] - b["top"]
                                              for b in bxs])
        self.boxes.append(bxs)
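
To make the mismatch easier to check, here is a minimal standalone sketch of the two tests the loop above relies on: whether a PDF character overlaps an OCR box, and whether their heights differ by 70% or more. It assumes one possible cause, namely that the OCR box is still in image pixels (scaled by a zoom factor such as `ZM`) while the character is in PDF points; the report does not establish that this is the actual cause. The helper names `to_pdf_space`, `overlap_ratio`, and `height_mismatch` and all sample values are hypothetical and are not part of RAGFlow.

```python
def to_pdf_space(box, zoom):
    """Scale a box from image pixels back to PDF points (hypothetical helper)."""
    return {k: box[k] / zoom for k in ("x0", "x1", "top", "bottom")}


def overlap_ratio(char, box):
    """Fraction of the character's area that lies inside the box."""
    w = min(char["x1"], box["x1"]) - max(char["x0"], box["x0"])
    h = min(char["bottom"], box["bottom"]) - max(char["top"], box["top"])
    if w <= 0 or h <= 0:
        return 0.0
    char_area = (char["x1"] - char["x0"]) * (char["bottom"] - char["top"])
    return (w * h) / char_area if char_area > 0 else 0.0


def height_mismatch(char, box):
    """Relative height difference, the same quantity the loop compares to 0.7."""
    ch = char["bottom"] - char["top"]
    bh = box["bottom"] - box["top"]
    return abs(ch - bh) / max(ch, bh)


# Made-up sample data: a character from the PDF text layer (in points)
# and an OCR box detected on a page image rendered at 3x zoom (in pixels).
char = {"x0": 100.0, "x1": 106.0, "top": 200.0, "bottom": 212.0, "text": "A"}
ocr_box_px = {"x0": 294.0, "x1": 600.0, "top": 597.0, "bottom": 639.0}
zoom = 3

raw = overlap_ratio(char, ocr_box_px)                        # units differ
scaled = overlap_ratio(char, to_pdf_space(ocr_box_px, zoom))  # same units
mismatch = height_mismatch(char, to_pdf_space(ocr_box_px, zoom))

print(raw, scaled, round(mismatch, 3))
```

With these sample values the character does not overlap the box at all until the box is converted to PDF points, after which it lies fully inside and the height difference stays well below the 0.7 threshold. Printing the same three numbers for a real character and its nearest OCR box would show whether the failure comes from a scale or offset difference or from the height check.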

Additional information

No response