Closed fdq09eca closed 4 years ago
Hi @fdq09eca, PDFs are allowed to place characters outside of their mediabox (or cropbox). If you want to automatically remove them, you could run page_inside = page.within_bbox(page.bbox).
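As a rough illustration of what such a filter does, here is a pure-Python sketch that does not use pdfplumber itself. The helper chars_within_bbox and the sample characters are hypothetical; the sketch assumes each character is a dictionary with x0, top, x1 and bottom keys and that only characters lying entirely inside the box are kept.

```python
from typing import Any, Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def chars_within_bbox(chars: List[Dict[str, Any]], bbox: BBox) -> List[Dict[str, Any]]:
    """Keep only the characters that lie entirely inside bbox (hypothetical helper)."""
    x0, top, x1, bottom = bbox
    return [
        c
        for c in chars
        if c["x0"] >= x0 and c["top"] >= top and c["x1"] <= x1 and c["bottom"] <= bottom
    ]


# Hypothetical data: a 100 x 100 page with one character past the right edge.
page_bbox = (0, 0, 100, 100)
chars = [
    {"text": "a", "x0": 10, "top": 10, "x1": 16, "bottom": 22},
    {"text": "b", "x0": 98, "top": 10, "x1": 104, "bottom": 22},  # sticks out of the page
]

inside = chars_within_bbox(chars, page_bbox)
print([c["text"] for c in inside])  # ['a']
```

With a real page, the one-liner suggested above would play the role of chars_within_bbox.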
@jsvine will the mediabox adjust accordingly when the page is cropped?
Yes, the .crop(...) and .within_bbox(...) methods automatically adjust the bbox property: https://github.com/jsvine/pdfplumber/blob/3afd08620f345adbf60d5a21c1e201535745239f/pdfplumber/page.py#L315-L322
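To make the consequence concrete, here is a toy model, not pdfplumber's actual class. ToyPage and the numbers are hypothetical. It assumes that width is derived from the bbox and that characters keep their original-page coordinates after cropping; both are assumptions worth checking against the linked source.

```python
from dataclasses import dataclass
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


@dataclass(frozen=True)
class ToyPage:
    """Toy stand-in for a page; not pdfplumber's implementation."""

    bbox: BBox

    @property
    def width(self) -> float:
        # Assumption: width is computed from the current bbox.
        return self.bbox[2] - self.bbox[0]

    @property
    def height(self) -> float:
        return self.bbox[3] - self.bbox[1]

    def crop(self, bbox: BBox) -> "ToyPage":
        # Cropping returns a new page whose bbox is the crop box.
        return ToyPage(bbox)


page = ToyPage((0, 0, 600, 800))
cropped = page.crop((50, 100, 550, 700))
print(page.width, cropped.width)  # 600 500

# Assumption: a character keeps its original-page coordinates after cropping.
char_x1 = 540
print(char_x1 <= cropped.width)    # False: compared against the shrunken width
print(char_x1 <= cropped.bbox[2])  # True: compared against the crop box's right edge
```

Under these assumptions, a check of character positions against the width of a cropped page can flag characters that are in fact inside the crop box; comparing against the bbox edges avoids that.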
Excellent, thank you. That explains the cause of the bug.
I have this colab. It produces the following result
But I would like to have
which is achievable if I change the definition of df_char. You may see that within_bbx is changed from within_bbx = normal_bbx_coord & normal_x1 to within_bbx = normal_bbx_coord.
The aim is to trim off the non-textual area. I cannot understand why there are chars outside the page width and, peculiarly, why page.page.width decreases whenever the remove_noise() method is used. I think it is a bug that I introduced myself: when I only crop the page with pdfplumber, all text is still within page.width, but when I use my class, this bug happens. I have been struggling with it for a long time. Any suggestions will be appreciated.
=== UPDATE ===
it produces
You may see that some of them are normal, but some of them are not. @jsvine, is there something wrong with df_char? If so, why?