jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.31k stars 647 forks

New Problem #745

Closed Godlikemandyy closed 1 year ago

Godlikemandyy commented 1 year ago

@jsvine I have a new situation. I solved the previous problem by adjusting y_tolerance, but I encountered the following problem with the new document. The document format is as follows:

image

Documents like this are laid out in two side-by-side columns. When I call extract_text() with the default y_tolerance, text ends up on the wrong lines. Some of the output looks like this:

` 第一章 财务管理概论   使用 会计云课堂 或 微信 扫码快速做题 对答案 看解析 “ ” App “ ” 、 、 、 掌握解题思路 开启轻松过关之旅 , 。 有利于企业资源的合理配置 一、 单项选择题 C. 反映创造利润与投入资本之间的关系 D.

  1. 下列各项财务管理环节中 与奖惩紧密 , 5. 下列各项措施中 不能协调股东和经营 联系 是贯彻责任制原则的要求 也是 , , , 者的利益冲突的是 构建激励与约束机制的关键环节的是 (    )。 通过市场约束经营者 A. (    )。 通过债权人约束经营者 财务决策 财务控制 B. A. B. 给予经营者一定的股票期权 财务分析 财务评价 C. C. D. 解聘经营者 D. `

When I set y_tolerance=7, the extracted text is as follows:

image

In other words, increasing y_tolerance moderately fixes some cases where text that belongs on one line is split across two, but it also merges text from different lines into a single line, which scrambles the content. I have tried several y_tolerance values, but none of them resolves the problem. How can I extract the text in roughly the same layout as the document? I really hope you can reply. Thank you!
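The trade-off described above can be reproduced with a toy model. The sketch below is a simplified stand-in, not pdfplumber's actual line-grouping code: `group_into_lines` is a hypothetical helper, and the `WORDS` coordinates are invented to mimic a left column with one low-sitting glyph next to a right column whose rows are vertically offset.

```python
def group_into_lines(words, y_tolerance):
    """Greedy toy grouping: a word joins the current line when its `top`
    is within `y_tolerance` of the first word placed on that line."""
    ordered = sorted(words, key=lambda w: (w["top"], w["x0"]))
    lines = []
    for word in ordered:
        if lines and word["top"] - lines[-1][0]["top"] <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, read the words left to right.
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        for line in lines
    ]


# Invented coordinates (assumption): left column rows at top=100 and 112,
# right column rows at top=106 and 118, plus a comma in left row 1 that
# sits 4 units lower than the rest of its row.
WORDS = [
    {"text": "L1", "x0": 10, "top": 100},
    {"text": ",", "x0": 40, "top": 104},
    {"text": "L2", "x0": 10, "top": 112},
    {"text": "R1", "x0": 200, "top": 106},
    {"text": "R2", "x0": 200, "top": 118},
]

# Small tolerance: the comma is split from its own row and lands with R1.
print(group_into_lines(WORDS, 3))  # ['L1', ', R1', 'L2', 'R2']

# Large tolerance: the comma rejoins L1, but the offset right-column rows
# are pulled into the left-column rows as well.
print(group_into_lines(WORDS, 7))  # ['L1 , R1', 'L2 R2']
```

In this model no single tolerance keeps the comma with its row while also keeping the two columns apart, which matches the behavior reported above.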

Godlikemandyy commented 1 year ago

Personally, I think the left-right column layout is what makes the spacing between rows so small, so increasing y_tolerance distorts the layout of the extracted content, but I don't know how to fix it.
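One way to test this hypothesis might be to extract each column separately, so that rows from the left and right columns never compete for the same line. The sketch below assumes the columns split at a fixed fraction of the page width. `extract_two_columns` and `split_ratio` are hypothetical names introduced here, and the function relies only on the page's `bbox`, `crop()`, and `extract_text()`.

```python
def extract_two_columns(page, split_ratio=0.5, **extract_kwargs):
    """Crop the page into a left and a right half and extract each in turn.

    `split_ratio` is an assumption: the fraction of the page width at which
    the gutter between the two columns is expected to sit.
    """
    x0, top, x1, bottom = page.bbox
    split_x = x0 + (x1 - x0) * split_ratio

    left = page.crop((x0, top, split_x, bottom))
    right = page.crop((split_x, top, x1, bottom))

    # extract_text() can return an empty result for a blank region.
    parts = [
        left.extract_text(**extract_kwargs) or "",
        right.extract_text(**extract_kwargs) or "",
    ]
    return "\n".join(part for part in parts if part)


# Usage sketch (the file name is a placeholder):
#
#   import pdfplumber
#   with pdfplumber.open("document.pdf") as pdf:
#       print(extract_two_columns(pdf.pages[0], y_tolerance=3))
```

If the gutter is not at the page midpoint, or if headings span both columns, `split_ratio` would need adjusting or the full-width regions would need to be cropped out and handled on their own.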