Closed yavuzKomecoglu closed 2 years ago
Thanks for filing! Given that this does not seem to be a bug with pdfplumber specifically, I'm closing this issue in favor of the discussion you opened in #552. If this does turn out to be a bug, we can reopen this issue here.
Thank you for your interest. I hope to get help in the discussion section to understand the problem.
Describe the bug
Hello. In some Turkish newspaper PDFs, the word coordinates returned fall below a certain pixel position. For example, the headlines of the news items in 2_eylul_15_11_2018.pdf (page 3) and 2_eylul_30_11_2018.pdf (page 7) start from the bottom, whereas the coordinates of the news items in gaziantep_dogus_10_08_2021.pdf (page 4) are obtained correctly. What exactly is the difference between these PDFs, and why is the coordinate information returned differently for them? Thanks.
Code to reproduce the problem
First, the words are extracted.
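The original snippet is not preserved here. As a rough sketch of this step, assuming the standard `pdfplumber.open`, `page.extract_words()` and `page.bbox` calls, the words can be pulled out together with the page's bounding box, so that the vertical range of the words can be compared against the page itself. The helper names `extract_page_words` and `vertical_extent` are hypothetical and are not taken from the original report.

```python
def extract_page_words(pdf_path, page_number):
    """Return (words, page_bbox) for a 1-based page number.

    Hypothetical helper: not the code from the original report.
    """
    # Imported lazily; requires `pip install pdfplumber`.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number - 1]
        # Each word is a dict with "text", "x0", "x1", "top", "bottom", ...
        # page.bbox is (x0, top, x1, bottom) in the same coordinate space.
        return page.extract_words(), page.bbox


def vertical_extent(words):
    """Smallest `top` and largest `bottom` among the extracted words."""
    if not words:
        return None
    return (min(w["top"] for w in words), max(w["bottom"] for w in words))
```

Printing `vertical_extent(words)` next to `page_bbox` for a page that behaves correctly and one that does not may show whether the word coordinates are shifted relative to the page box, for instance when the page box does not start at zero.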
Then the title region is determined.
Note: title_area holds absolute positions, whereas relative positions are used when drawing the title area with OpenCV.
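To illustrate the distinction in the note, here is a minimal sketch of one way to turn an absolute box into pixel coordinates relative to the top-left corner of a rendered page image. It assumes that "relative" means relative to the page's own bounding box, and that the image was rendered at a given resolution in DPI, with PDF units being 1/72 inch. The function name `to_pixel_rect` and its parameters are hypothetical.

```python
def to_pixel_rect(area, page_bbox, resolution=72):
    """Convert an absolute (x0, top, x1, bottom) box to image pixels.

    area       -- absolute box, e.g. the title area (assumed layout)
    page_bbox  -- the page's own (x0, top, x1, bottom) box
    resolution -- DPI of the rendered page image (assumption)
    """
    scale = resolution / 72.0  # PDF units are 1/72 inch
    page_x0, page_top = page_bbox[0], page_bbox[1]
    x0, top, x1, bottom = area
    # Subtract the page origin first, then scale to pixels.
    return (
        int(round((x0 - page_x0) * scale)),
        int(round((top - page_top) * scale)),
        int(round((x1 - page_x0) * scale)),
        int(round((bottom - page_top) * scale)),
    )
```

The first two and last two values of the result can then be used as the two corner points of a rectangle drawn on the image. If the page box starts at (0, 0), the subtraction changes nothing and only the scaling applies.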
PDF file
Test Newspapers
Environment