jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.57k stars 659 forks source link

Update GitHub Actions workflow #253

Closed samkit-jain closed 4 years ago

samkit-jain commented 4 years ago

Issues still present:

  1. pandas >= 1.0 not available on Python 3.5. Is this to be considered a blocker and support for Python 3.5 be dropped or downgrade pandas version for Python 3.5?
  2. Since a matrix run, the code coverage report is published 4 times for the 4 tests that are run with the same codecov message being updated again and again. The code coverage would still remain the same for all 4 versions but still has a drawback that it becomes difficult for the reader to know that the codecov message they are seeing is from which run.
codecov[bot] commented 4 years ago

Codecov Report

Merging #253 into develop will increase coverage by 0.04%. The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #253      +/-   ##
===========================================
+ Coverage    97.36%   97.41%   +0.04%     
===========================================
  Files           10       10              
  Lines         1139     1160      +21     
===========================================
+ Hits          1109     1130      +21     
  Misses          30       30              
Impacted Files Coverage Δ
pdfplumber/display.py 81.36% <100.00%> (ø)
pdfplumber/table.py 100.00% <100.00%> (ø)
pdfplumber/utils.py 100.00% <100.00%> (ø)
pdfplumber/pdf.py 100.00% <0.00%> (ø)
pdfplumber/page.py 100.00% <0.00%> (ø)
pdfplumber/container.py 100.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 798a152...8d00356. Read the comment docs.

jsvine commented 4 years ago

Many thanks, @samkit-jain! This looks like a solid improvement. Re. the still-present issues:

pandas >= 1.0 not available on Python 3.5. Is this to be considered a blocker and support for Python 3.5 be dropped or downgrade pandas version for Python 3.5?

Thanks for raising this question. pandas is only a dev requirement, because it is used in a few tests. But, examining where it is used, pandas is not really essential to those tests. I think I can remove pandas entirely from the requirements and still maintain the current level of code coverage by adding just a few more small tests. I'll start a branch with those changes.

Since a matrix run, the code coverage report is published 4 times for the 4 tests that are run with the same codecov message being updated again and again. The code coverage would still remain the same for all 4 versions but still has a drawback that it becomes difficult for the reader to know that the codecov message they are seeing is from which run.

Ah, good catch. I am not very accustomed to using GitHub Actions, but it looks like could solve the problem: https://github.com/codecov/codecov-action/issues/40

Instead of using Makefile, specifying the command to run directly in the workflow step. @jsvine Why was Makefile used and what extra benefit is it providing?

I am using the Makefile to lint, format, and run tests locally. I figured it'd keep things simple to reflect that in the GitHub Action, but I am fine with the changes you've made to specify the commands more directly. Thanks for checking!

jsvine commented 4 years ago

Following up on the pandas question, here is a commit, on the new tests/remove-pandas branch, that makes these changes: https://github.com/jsvine/pdfplumber/commit/5abac62324fa1dacfcf893580b5d7f813a905c6f

Feel free to merge tests/remove-pandas into your PR branch if it suits you.

samkit-jain commented 4 years ago

@jsvine I have merged the pandas branch in this PR. The PR is now ready for review and the pandas branch can be deleted if this PR is merged in.

Additional changes made:

In future, when Python 3.5 support would be dropped (as of https://devguide.python.org/#status-of-python-branches it is reaching end of life very soon), all the string formatters can be updated to use f-strings.

Thoughts @jsvine ?

Also, in https://github.com/jsvine/pdfplumber/blob/develop/Makefile#L16-L25 pdfpumber is referring to the non-root folder pdfplumber or the root folder pdfplumber?

samkit-jain commented 4 years ago

@jsvine Python 3.5 tests still failing :(

  1. test_loading_pathobj is failing because pathlib works correctly with open only on Python 3.6+. See "Changed in version 3.6:" section in https://docs.python.org/3/library/functions.html#open
  2. test_cli is failing because as per https://docs.python.org/3/library/json.html#json.loads, json.loads() started accepting bytes only in Python 3.6.
  3. Unsure on why test_extract_words is failing.
jsvine commented 4 years ago

Ahh, thank you for catching this, Samkit, and my apologies for not examining 3.5 support earlier. I've now looked into the extract_words failure, and found that it relates to this line: https://github.com/jsvine/pdfplumber/blob/798a152156d400f39efbfec8b4a27603c9309f94/pdfplumber/utils.py#L274

Specifically, it's due to the new dict implementation introduced in Python 3.6, which resulted in order-preserving dictionaries: https://docs.python.org/3/whatsnew/3.6.html#new-dict-implementation

Changing chars_by_upright = {1: [], 0: []} to chars_by_upright = OrderedDict([ (1, []), (0, []) ]) makes the test pass in 3.5. (Without that change, 3.5 returns vertically-oriented words first, instead of horizontally oriented ones.)

Given the following:

... I think pdfplumber ought to drop support for 3.5. What do you think?

If you agree, can you roll back 89678ed and remove 3.5 from the test matrix, README, and setup.py? (Again, apologies for not catching this earlier 😬 .) I think we can, however, keep the change that remove pandas from the tests and dev requirements.

samkit-jain commented 4 years ago

No problem @jsvine I am glad that the PR could help in identifying the issues with Python 3.5. I am 👍🏻 with dropping support for Python 3.5 and will make the necessary changes. Just to confirm, when you say "roll back https://github.com/jsvine/pdfplumber/commit/89678ed3fe8d9271741b6bba0a397cd259f58024" are you referring to adding another commit reverting the change or dropping the commit altogether from the commit history?

jsvine commented 4 years ago

Great, and thanks! Reviewing the commit history again, it seems a8f15c1 is no longer necessary, either. So I'd suggest running git reset --hard c4b2417 and then force-pushing to this branch. Open to other proposals, however, if you'd rather not force-push.

samkit-jain commented 4 years ago

@jsvine Done with the changes. Request your review again.