py-pdf / pypdf_table_extraction

A Python library to extract tabular data from PDFs
https://camelot-py.readthedocs.io
MIT License
15 stars 7 forks source link

[MRG] Utils: optimise get_page_layout #5

Closed karlowich closed 4 months ago

karlowich commented 4 months ago

Since the existing code overwrites layout and dim in each iteration, it is much more efficient to simply return the layout and dim of the first page.

I have tested the difference with a 455 page pdf and the optimisation reduces the time spent from 50 to 5 seconds.

foarsitter commented 4 months ago

The current behavior indeed does not make any sense.

If it only returns the layout of the first page then I would suggest to rename it to get_layout_from_first_page or something similar.

The function is called in the _save_page method from PDFHandler. _save_page is called for each page. So I suggest adding cache for PDFHandler to.

Example:

from functools import lru_cache

class PDFHandler:
   ...
    @lru_cache
    def get_layout_from_first_page(self):
        return get_page_layout(self.filepath)

What do you think?

An extra bonus would be to add type hinting for the method. But that is something we didn't discus yet with the team.

karlowich commented 4 months ago

@foarsitter I looked at this code 2 years ago. To be honest I can't recall much about the code base, feel free to modify this PR in any way you see fitting :)

MartinThoma commented 4 months ago

@foarsitter @bosd How do you want to handle merging?

I would suggest that one of you takes the lead and merges (in general, not only for this PR). People need to see that this repo is active and changes get merged.

From my experience with taking over the maintenance of pypdf, my priorities would be:

  1. Ensure the test suite covers the important parts + CI runs / blocks merging if it fails
  2. Merge many things rather quickly in the beginning to show people that this repo is active again.
  3. Once the first batch of changes get lower, apply higher standards and potentially take more time discussing PRs.

One thing that you might want to use is the fact that pypdf-table-extraction is still in 0.x version. So a bit of instability might be ok.

Having said that: That's your project :-) I'm here to support, but it's fine for me if you develop/run pypdf-table-extraction in a different way :-)

bosd commented 4 months ago

How do you want to handle merging? Good question. Ideally I would like to see multiple approvals on a PR. Yet, even in some bigger projects who enforce a 2 approval system, I sadly see PR's go stale. So, my take on this would be rule of at least 1 approving review and passing tests.

I'm with you on all 3 points.

In regards to this PR. I have looked at the code changes, LGTM. Did'nt press approve of merge button due to no test mechanisms in place. As you mentiond, it is 0 version So preferring speed of implementation..

Implementing the (cookiecutter) / repo setup is a priority to me. Ironically, the earlier mentioned awesome-python cookiecutter template is in need of some maintainer :heart: as well. (Support for recent python versions have not been added.)

bosd commented 4 months ago

@karlowich Thanks for this PR. Speed results are impressive.

Would this change, alter the functionality? E.g. A PDF with an A4 size cover page and with follow up pages on A3 size which contain the tables. Will the pages after the cover page still be parsed at the A3 size?

karlowich commented 4 months ago

@bosd Yes this will change the functionality slightly. Before, the code would overwrite the dimensions for every page and ultimately return the dimensions of the last page. My code returns the dimensions of the first page.

Neither can correctly handle PDFs with different size pages.

foarsitter commented 4 months ago

@bosd this PR uses the layout of the first page instead of the last page. Either way it will be a problem if a pdf contains various sizes.

To be backwards compatible we need to use the last item of the collection I guess.

bosd commented 4 months ago

Neither can correctly handle PDFs with different size pages.

Thanks for the clarification. The support for handling multiple pages sizes might be something for a future PR. This one is ok for me now.

To be backwards compatible we need to use the last item of the collection I guess.

I don't see why this would cause backward compatability options. LGTM now.

foarsitter commented 4 months ago

I don't see why this would cause backward compatibility options. LGTM now.

Because we are moving from last page to first page, but that IMHO is a minor detail.

I will merge it :)

Thanks @karlowich for submitting your PR and your quick responses!