jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.57k stars 659 forks source link

_itemgetter function removed from utils.py without deprecationWarning #691

Closed jfuruness closed 2 years ago

jfuruness commented 2 years ago

Describe the bug

A clear and concise description of what the bug is. I had code that called the cluster_objects function in utils.py. I passed in a string to this function for the key_fn parameter, and I guess previously there was a function in here called _itemgetter that converted that string into an itemgetter function. During commit 58b1ab1b0aa04431c49b25e1751024ae2444218c, this was removed, breaking my code. While I have fixed this in my own code, it might be nice to add that function back and have a deprecation warning for a bit so that others can update their code without it breaking on them

Code to reproduce the problem

Paste it here, or attach a Python file. It's part of a much larger package, but hopefully the description describes it clearly

PDF file

Please attach any PDFs necessary to reproduce the problem.

If you need to redact text in a sensitive PDF, you can run it through JoshData/pdf-redactor.

N/A

Expected behavior

What did you expect the result should have been?

I expected passing a string for the key_fn parameter in the cluster_objects function would work normally, instead it errored out (and I had to change the key_fn parameter from a string to an itemgetter function)

Actual behavior

What actually happened, instead? The string caused a crash

Screenshots

If applicable, add screenshots to help explain your problem. N/A

Environment

Additional context

Add any other context/notes about the problem here. N/A

jsvine commented 2 years ago

Hi @jfuruness, and sorry to hear about the inconvenience that changed caused you. In the convention of PEP 8, I was using the leading underscore to indicate that _itemgetter was an internal-use method:

In addition, the following special forms using leading or trailing underscores are recognized (these can generally be combined with any case convention):

  • _single_leading_underscore: weak “internal use” indicator. E.g. from M import * does not import objects whose names start with an underscore.

[...]

That said, I could/should have been clearer about removing it — perhaps by giving the removal its own commit and note in the changelog. I've now updated the changelog now to reflect that 👍

Generally speaking, I'd like to preserve the ability to add/remove internal-use/leading-underscore methods with minimal friction, so I probably won't reinstate the method here. But thank you for the note, which has served as a useful reminder to flag such changes more prominently.

jfuruness commented 2 years ago

All sounds great.

Just to clarify, I wasn't personally using the _itemgetter function, the public cluster_objects function was. Previously the public cluster_objects function had a parameter called key_fn, and you could pass a string for this parameter. After the commit 58b1ab1, if you passed a string into this parameter (key_fn for the public function cluster_objects), it would error out.

Not that it matters, closing the issue is totally fine with me, I just figured other people calling the cluster_objects function with strings as parameters might face a similar issue, but I guess you would have heard by now :)

Thank you so much for everything you do for this library. Is there any way we can support you for all the work you do on this repo?

jsvine commented 2 years ago

Ah, my apologies for misunderstanding the initial issue. And yes, you're right: That's a breaking change in a public method, and I'll consider ways to reinstate that functionality. (I vaguely recall trying to simplify things for type-consistency's sake, but I'll reassess.) Thanks for clarifying.

Thank you so much for everything you do for this library. Is there any way we can support you for all the work you do on this repo?

Thank you for the kind words, and for the offer! Depending on the kind of support you have in mind:

But no pressure on those at all. Opening thoughtful issues like these is a great form of support.

jfuruness commented 2 years ago

No problem at all, I think I did not explain it well the first time around. I agree I think the type consistency is indeed better after the changes.

As far as fixing the memory leakage problem, I hate to dissapoint but I am working a job and a PhD, I won't have much free time until about a month from now. I'll try to take a cursory look but unfortunately I don't think I'll have the time required to solve this issue promptly.

If I know of anyone attempting to parse PDFs though I will definitely recommend you! I've tried several other PDF libraries, this was the best I could find by far

jsvine commented 2 years ago

Thanks again for flagging this, @jfuruness. After thinking it through a bit more, I've just gone and reinstated the ability to use any hashable value (including but not limited to str) as the key_fn parameter to .cluster_objects(...). Now available on stable, develop, and v0.7.5. Cheers!

jfuruness commented 2 years ago

No thank you! I'm honored to be a part of the changelog on such a great repo that is so well maintained. Best wishes!