Open ollynowell opened 3 months ago
@ollynowell Thanks for your pr, can you rebase with master?
Is it possible to share the example file?
@ollynowell Thanks for your pr, can you rebase with master?
Is it possible to share the example file?
No unfortunately it's not possible, it is a client's file that I can't share.
No unfortunately it's not possible, it is a client's file that I can't share.
Would it be possible to anonymize the data in the file? Or recreate a file with a similar structure?
No unfortunately it's not possible, it is a client's file that I can't share.
Would it be possible to anonymize the data in the file? Or recreate a file with a similar structure?
I can look into it if there is a good reason to - what do you need it for?
I can look into it if there is a good reason to - what do you need it for?
It is better to not merge improvements blindly. If there is a file to test again. We can make sure that in the future there are no changes to the code which might break your fix / use case.
The result will be that we will gradually increase the accuracy of this tool.
I can look into it if there is a good reason to - what do you need it for?
It is better to not merge improvements blindly. If there is a file to test again. We can make sure that in the future there are no changes to the code which might break your fix / use case.
The result will be that we will gradually increase the accuracy of this tool.
Sounds reasonable, I'll look into it
@bosd Here is a minimal example!
Without the fix in this PR, this is the behaviour:
tables = camelot.read_pdf("Bug Minimal Example.pdf", pages="all", flavor="stream")
>>> print(tables[0].df)
0 1 2 3 4
0 Outer 1 Outer 2 Long 1 Long 2 Inner \nLong 3
1 A1 B1 C1 D1 E1 \nF1
2 C2 D2 F2
3 C3 D3 F3
4 C4 D4 F4
The column Inner
is combined with the column Long 3
.
The reason for this is the following:
stream._generate_columns_and_rows
the first call to _group_rows
finds 5 rows.elements
variable is populated with the row lengths: [5, 5, 3, 3, 3]
ncols = max(set(elements), key=elements.count)
gives 3 columnsInner
and
Outer 2`inner_text + outer_text
is then passed into _add_columns
and subsequently _group_rows
again, without being sortedInner
columnOuter
columns_add_columns
then selects just just the columns from the rows of length two, and therefore only adds the Outer
columns and not the Inner
columnWith the fix the file is read correctly:
tables = camelot.read_pdf("Bug Minimal Example.pdf", pages="all", flavor="stream")
>>> print(tables[0].df)
0 1 2 3 4 5
0 Outer 1 Outer 2 Long 1 Long 2 Inner Long 3
1 A1 B1 C1 D1 E1 F1
2 C2 D2 F2
3 C3 D3 F3
4 C4 D4 F4
The difference is that at step 8-9 above, having sorted inner_text + outer_text
, two rows are found of length 3, consisting of the columns Inner, Outer 1, Outer 2
This does seem like it is a genuine bugfix (so please let's merge it soon!!) - I can't see any reason why the code would be set up to either add inner columns or outer columns but not a combination.
However it is still possible to construct examples that don't work, although it is somewhat unlikely since it requires the headers to not be aligned in addition to the column values.
@ollynowell Many Thanks for your detailed write-up and examples. The bugfix seems valid!
Ideally, the files would be used in unit tests, to keep track on the performance of this lib to prevent futer regressions. But Benchmarking this lib is a huge topic, to be implemented. Would you be able to include a test in this pr? :blush:
FWIW, I'm investigating / working on the network/hybrid parser #90 .
I used the file Further Working Example
, got a reasonable result with network parser.
Still not perfect..
@bosd thanks for your prompt reply - perhaps there is still hope for this library after all!
Would you be able to include a test in this pr? 😊
Done :)
FWIW, I'm investigating / working on the network/hybrid parser https://github.com/py-pdf/pypdf_table_extraction/issues/90 . I used the file Further Working Example, got a reasonable result with network parser. Still not perfect..
So it's missing the E5
value at the bottom? Is that a correct interpretation of those diagrams?
To be honest I don't understand why the stream logic limits the columns that are added in this way, rather than just adding all of them. But I haven't tested and any further changes in that direction would need a lot of checking against files of varying complexity, and would have a lot more risk of breaking things... as you say benchmarking this library is a huge topic!
So it's missing the E5 value at the bottom? Is that a correct interpretation of those diagrams? Yes, that is the correct interpretation.
Those diagrams are made on a 4year old fork. Which did'nt inculde some of the fixes on the main dev branch. In the meantime, the histroy accross all those forks have diverged a lot. I'm currently working on porting the changes over. Which involves a lot of manual editing. But the new parsers look very promising.
Will have to check later on, when my porting is done if E5 is correctly included.
The last commits introduced some merge conflicts. Will look into it at a later time.
Original PR in camelot-dev here
PDFMiner text objects should be sorted before the row grouping algorithm, otherwise items that belong on the same row will not be correctly grouped together.
In
_generate_columns_and_rows
this is done correctly the first time: Line 328:t_bbox["horizontal"].sort(key=lambda x: (-x.y0, x.x0))
But
inner_text
is not sorted at any point after being extended withouter_text
, which means that the_group_rows
algorithm does not always work correctly - motivating example below:Motivating Example My table looks like this:
Initial column detection identified just three columns (because those columns being longer pushed the mode to three)
inner_text
was then populated by this column:and subsequently extended with the
outer_text
from here:Because
inner_text
isn't sorted after being extended,_group_rows
first finds rows of length 1 from the oneinner_text
columns, and then finds longer rows from theouter_text
column. As a result theinner_text
column isn't added.The sort in this PR fixes the problem and seems a reasonable place to apply it to me, but I am not familiar with this codebase - I've only debugged this one case.