jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Remove decimalizing (but let CLI adjust precision) #520

Closed: jsvine closed this 2 years ago

jsvine commented 3 years ago

This PR updates the slightly outdated work on the un-decimalize branch, making it compatible with the develop branch's changes since then. The main commit message from that branch:

Per discussion at https://github.com/jsvine/pdfplumber/discussions/346
and input from @ramcdona, this commit changes pdfplumber's behavior
regarding floating point numbers. Specifically, it removes all
conversion of floats to Decimal objects. This brings several advantages:

- Increased precision (where applicable)
- Decreased code complexity
- Increased performance (~10% speedup on test suite)
- Increased fidelity to `pdfminer.six` output

These seem to outweigh the disadvantages:

- Some tests break (but have been easily fixed) due to increased
  precision and/or floating point arithmetic artifacts
- Some users' scripts may also break, if they depend on strict equality
  testing, though these *should* also be easily fixable

Because some form of automatic rounding may still be desirable for the
pdfplumber CLI utility, the conversion methods (.to_csv, .to_json) have
been adjusted to accept a `precision` argument.
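To illustrate the idea of rounding only at output time, here is a minimal sketch. It is not pdfplumber's implementation: `round_floats` and `dumps_rounded` are hypothetical names invented for this example, and the sample `char` dictionary is made up. The point is that values stay as plain floats internally and an optional `precision` is applied only when serializing.

```python
import json
from typing import Any, Optional


def round_floats(obj: Any, precision: Optional[int]) -> Any:
    """Hypothetical helper: recursively round floats in nested data.

    With precision=None the object is returned untouched, so full float
    precision is preserved unless the caller asks for rounding.
    """
    if precision is None:
        return obj
    if isinstance(obj, float):
        return round(obj, precision)
    if isinstance(obj, dict):
        return {key: round_floats(value, precision) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [round_floats(value, precision) for value in obj]
    return obj  # strings, ints, None, etc. pass through unchanged


def dumps_rounded(obj: Any, precision: Optional[int] = None) -> str:
    """Hypothetical serializer: round at output time, not at parse time."""
    return json.dumps(round_floats(obj, precision))


# Made-up object resembling a character record with float coordinates.
char = {"text": "A", "x0": 72.12345678, "top": 0.1 + 0.2}

print(dumps_rounded(char))               # full float precision
print(dumps_rounded(char, precision=3))  # rounded only in the output
```

Keeping rounding at the serialization boundary means library users see unmodified floats, while command-line output can still be trimmed to a readable number of digits.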

The tests pass with only minor revisions, but edge cases may warrant further scrutiny: this will be a breaking change for some use cases, especially those that test the equivalence of two attributes via `==`.
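For scripts that relied on strict equality, one possible fix is a tolerance-based comparison. The sketch below uses only the standard library. `coords_match` and its default tolerance are assumptions made for illustration and are not part of pdfplumber.

```python
import math


def coords_match(a: float, b: float, tolerance: float = 1e-6) -> bool:
    """Hypothetical helper: treat two coordinates as equal within an
    absolute tolerance, instead of requiring bit-for-bit equality."""
    return math.isclose(a, b, rel_tol=0.0, abs_tol=tolerance)


# Floating point arithmetic can leave tiny artifacts:
top = 0.1 + 0.2
bottom = 0.3

print(top == bottom)              # False: strict equality is fragile
print(coords_match(top, bottom))  # True: equal within tolerance
```

The right tolerance depends on the document and the use case. An absolute tolerance is used here because page coordinates have a bounded range, but that is a design choice rather than a requirement.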

codecov[bot] commented 3 years ago

Codecov Report

Merging #520 (bfc9e13) into develop (e1d851a) will increase coverage by 0.46%. The diff coverage is 100.00%.

:exclamation: Current head bfc9e13 differs from pull request most recent head 87b947f. Consider uploading reports for the commit 87b947f to get more accurate results.

@@             Coverage Diff             @@
##           develop     #520      +/-   ##
===========================================
+ Coverage    98.30%   98.76%   +0.46%     
===========================================
  Files           10       10              
  Lines         1236     1217      -19     
===========================================
- Hits          1215     1202      -13     
+ Misses          21       15       -6     

| Impacted Files | Coverage Δ | |
|---|---|---|
| pdfplumber/pdf.py | `94.36% <ø> (-0.08%)` | :arrow_down: |
| pdfplumber/table.py | `100.00% <ø> (ø)` | |
| pdfplumber/cli.py | `100.00% <100.00%> (ø)` | |
| pdfplumber/convert.py | `100.00% <100.00%> (ø)` | |
| pdfplumber/display.py | `92.94% <100.00%> (+3.70%)` | :arrow_up: |
| pdfplumber/page.py | `100.00% <100.00%> (ø)` | |
| pdfplumber/utils.py | `100.00% <100.00%> (ø)` | |


Legend: `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`. Last update e1d851a...87b947f.