jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Difference in word coordinate information #560

Closed yavuzKomecoglu closed 2 years ago

yavuzKomecoglu commented 2 years ago

Describe the bug

Hello. In some Turkish newspaper PDFs, the extracted word coordinates are offset vertically. For example, the headlines of the news items in 2_eylul_15_11_2018.pdf / page-3 and 2_eylul_30_11_2018.pdf / page-7 are placed too low, starting from the bottom, whereas the coordinates come out correctly for the news items in gaziantep_dogus_10_08_2021.pdf / page-4. What is the difference between these PDFs, and why do they return coordinate information differently? Thanks.

Code to reproduce the problem

First, the words are extracted.

data_words = crop.dedupe_chars().extract_words(
    x_tolerance=1, y_tolerance=1, extra_attrs=["fontname", "size"]
)

Then the title region is determined.

# title: presumably the subset of data_words that makes up the headline
h_left_x0_all = min(h["x0"] for h in title)
h_left_top_all = min(h["top"] for h in title)
h_right_x1_all = max(h["x1"] for h in title)
h_right_bottom_all = max(h["bottom"] for h in title)

# (top-left corner), (bottom-right corner)
title_area = (h_left_x0_all, h_left_top_all), (h_right_x1_all, h_right_bottom_all)
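
The min/max lines above amount to taking the bounding box of the title words. Here is a minimal sketch of that step, assuming `title` is a list of word dicts like those returned by `extract_words` (each with `x0`, `top`, `x1`, `bottom`). The helper name `bounding_box` and the sample values are hypothetical and not taken from the issue.

```python
def bounding_box(words):
    """Return ((x0, top), (x1, bottom)) enclosing all word dicts in `words`."""
    if not words:
        raise ValueError("need at least one word to compute a bounding box")
    x0 = min(w["x0"] for w in words)
    top = min(w["top"] for w in words)
    x1 = max(w["x1"] for w in words)
    bottom = max(w["bottom"] for w in words)
    return (x0, top), (x1, bottom)


# Hypothetical sample: two words on the same headline line.
sample_title = [
    {"text": "Son", "x0": 50.0, "top": 100.0, "x1": 80.0, "bottom": 120.0},
    {"text": "Dakika", "x0": 85.0, "top": 98.0, "x1": 150.0, "bottom": 121.0},
]
sample_area = bounding_box(sample_title)
```

Raising on an empty list makes a failed title detection visible immediately instead of surfacing later as an opaque `min()` error.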

Note: title_area holds absolute positions; relative positions are used when drawing the title area with OpenCV.
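
The issue does not show how absolute positions are turned into relative ones, so the following is only a sketch under stated assumptions. It assumes that the word coordinates are in page space (PDF points, 72 per inch), that the crop's bounding box is known as `(x0, top, x1, bottom)` in the same space, and that the image being drawn on is the cropped region rendered at a known resolution in DPI. The function name `to_relative_pixels` and all sample values are hypothetical.

```python
def to_relative_pixels(area, crop_bbox, resolution=72):
    """Convert an absolute ((x0, top), (x1, bottom)) area in PDF points
    into integer pixel corners relative to the crop's top-left corner.

    Assumption: the image covers exactly `crop_bbox` and was rendered at
    `resolution` DPI, so one PDF point maps to resolution / 72 pixels.
    """
    (x0, top), (x1, bottom) = area
    crop_x0, crop_top = crop_bbox[0], crop_bbox[1]
    scale = resolution / 72
    top_left = (round((x0 - crop_x0) * scale), round((top - crop_top) * scale))
    bottom_right = (round((x1 - crop_x0) * scale), round((bottom - crop_top) * scale))
    return top_left, bottom_right


# Hypothetical values: a title area inside a crop rendered at 144 DPI.
example_area = ((110, 220), (210, 260))
example_crop_bbox = (100, 200, 300, 400)
pt1, pt2 = to_relative_pixels(example_area, example_crop_bbox, resolution=144)
```

The two returned corners are integer pairs, which is the form that `cv2.rectangle(image, pt1, pt2, color, thickness)` expects. If the offset or the scale used here does not match how the image was actually produced, the drawn rectangle will be shifted, which is one thing worth ruling out when coordinates look wrong for only some PDFs.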

PDF file

Test Newspapers

Environment

jsvine commented 2 years ago

Thanks for filing! Given that this does not seem to be a bug with pdfplumber specifically, I'm closing this issue in favor of the discussion you opened in #552. If this does turn out to be a bug, we can reopen this issue here.

yavuzKomecoglu commented 2 years ago

Thank you for your interest. I hope to get help in the discussion section to understand the problem.