py-pdf / pypdf_table_extraction

A Python library to extract tabular data from PDFs
https://pypdf-table-extraction.readthedocs.io
MIT License
36 stars 13 forks source link

Bugfix for Stream._group_rows #19

Open ollynowell opened 3 months ago

ollynowell commented 3 months ago

Original PR in camelot-dev here

PDFMiner text objects should be sorted before the row grouping algorithm, otherwise items that belong on the same row will not be correctly grouped together.

In _generate_columns_and_rows this is done correctly the first time: Line 328: t_bbox["horizontal"].sort(key=lambda x: (-x.y0, x.x0))

But inner_text is not sorted at any point after being extended with outer_text, which means that the _group_rows algorithm does not always work correctly - motivating example below:

Motivating Example My table looks like this: image

Initial column detection identified just three columns (because those columns being longer pushed the mode to three) image

inner_text was then populated by this column: image

and subsequently extended with the outer_text from here: image

Because inner_text isn't sorted after being extended, _group_rows first finds rows of length 1 from the one inner_text columns, and then finds longer rows from the outer_text column. As a result the inner_text column isn't added.

The sort in this PR fixes the problem and seems a reasonable place to apply it to me, but I am not familiar with this codebase - I've only debugged this one case.

bosd commented 1 month ago

@ollynowell Thanks for your pr, can you rebase with master?

Is it possible to share the example file?

ollynowell commented 1 month ago

@ollynowell Thanks for your pr, can you rebase with master?

Is it possible to share the example file?

No unfortunately it's not possible, it is a client's file that I can't share.

bosd commented 1 month ago

No unfortunately it's not possible, it is a client's file that I can't share.

Would it be possible to anonymize the data in the file? Or recreate a file with a similar structure?

ollynowell commented 1 month ago

No unfortunately it's not possible, it is a client's file that I can't share.

Would it be possible to anonymize the data in the file? Or recreate a file with a similar structure?

I can look into it if there is a good reason to - what do you need it for?

bosd commented 1 month ago

I can look into it if there is a good reason to - what do you need it for?

It is better to not merge improvements blindly. If there is a file to test again. We can make sure that in the future there are no changes to the code which might break your fix / use case.

The result will be that we will gradually increase the accuracy of this tool.

ollynowell commented 1 month ago

I can look into it if there is a good reason to - what do you need it for?

It is better to not merge improvements blindly. If there is a file to test again. We can make sure that in the future there are no changes to the code which might break your fix / use case.

The result will be that we will gradually increase the accuracy of this tool.

Sounds reasonable, I'll look into it

ollynowell commented 2 weeks ago

Bug Minimal Example.pdf

image

@bosd Here is a minimal example!

Without the fix

Without the fix in this PR, this is the behaviour:

tables = camelot.read_pdf("Bug Minimal Example.pdf", pages="all", flavor="stream")

>>> print(tables[0].df)
         0        1       2       3               4
0  Outer 1  Outer 2  Long 1  Long 2  Inner \nLong 3
1       A1       B1      C1      D1         E1 \nF1
2                        C2      D2              F2
3                        C3      D3              F3
4                        C4      D4              F4

The column Inner is combined with the column Long 3.

Explanation

The reason for this is the following:

  1. In stream._generate_columns_and_rows the first call to _group_rows finds 5 rows.
  2. The elements variable is populated with the row lengths: [5, 5, 3, 3, 3]
  3. Since 3 is the modal row length, ncols = max(set(elements), key=elements.count) gives 3 columns
  4. These 3 columns are Long 1, Long 2, Long 3
  5. inner_text is populated with the two rows from the column Inner
  6. outer_text is populated with the two rows from the columns 'Outer 1andOuter 2`
  7. inner_text + outer_text is then passed into _add_columns and subsequently _group_rows again, without being sorted
  8. Since it isn't sorted, the first two rows are each just one element long, consisting of just the Inner column
  9. Then two more rows are found each of length two, consisting of the two Outer columns
  10. _add_columns then selects just just the columns from the rows of length two, and therefore only adds the Outer columns and not the Inner column

With the fix

With the fix the file is read correctly:

tables = camelot.read_pdf("Bug Minimal Example.pdf", pages="all", flavor="stream")

>>> print(tables[0].df)
         0        1       2       3      4       5
0  Outer 1  Outer 2  Long 1  Long 2  Inner  Long 3
1       A1       B1      C1      D1     E1      F1
2                        C2      D2             F2
3                        C3      D3             F3
4                        C4      D4             F4

The difference is that at step 8-9 above, having sorted inner_text + outer_text, two rows are found of length 3, consisting of the columns Inner, Outer 1, Outer 2

Further Work

This does seem like it is a genuine bugfix (so please let's merge it soon!!) - I can't see any reason why the code would be set up to either add inner columns or outer columns but not a combination.

However it is still possible to construct examples that don't work, although it is somewhat unlikely since it requires the headers to not be aligned in addition to the column values.

e.g. Further Work Example.pdf

image
bosd commented 1 week ago

@ollynowell Many Thanks for your detailed write-up and examples. The bugfix seems valid!

Ideally, the files would be used in unit tests, to keep track on the performance of this lib to prevent futer regressions. But Benchmarking this lib is a huge topic, to be implemented. Would you be able to include a test in this pr? :blush:

FWIW, I'm investigating / working on the network/hybrid parser #90 . I used the file Further Working Example, got a reasonable result with network parser. Still not perfect.. image

ollynowell commented 1 week ago

@bosd thanks for your prompt reply - perhaps there is still hope for this library after all!

Would you be able to include a test in this pr? 😊

Done :)

FWIW, I'm investigating / working on the network/hybrid parser https://github.com/py-pdf/pypdf_table_extraction/issues/90 . I used the file Further Working Example, got a reasonable result with network parser. Still not perfect..

So it's missing the E5 value at the bottom? Is that a correct interpretation of those diagrams?

To be honest I don't understand why the stream logic limits the columns that are added in this way, rather than just adding all of them. But I haven't tested and any further changes in that direction would need a lot of checking against files of varying complexity, and would have a lot more risk of breaking things... as you say benchmarking this library is a huge topic!

bosd commented 1 week ago

So it's missing the E5 value at the bottom? Is that a correct interpretation of those diagrams? Yes, that is the correct interpretation.

Those diagrams are made on a 4year old fork. Which did'nt inculde some of the fixes on the main dev branch. In the meantime, the histroy accross all those forks have diverged a lot. I'm currently working on porting the changes over. Which involves a lot of manual editing. But the new parsers look very promising.

Will have to check later on, when my porting is done if E5 is correctly included.

bosd commented 1 week ago

The last commits introduced some merge conflicts. Will look into it at a later time.