tabulapdf / tabula-java

Extract tables from PDF files
MIT License

Regression from 0.9.2: empty response #171

Closed by jazzido 7 years ago

jazzido commented 7 years ago

Reported by @jeremybmerrill in tabulapdf/tabula#707:

I'm getting empty responses on a PDF that should work (and do work off of master) from this. For several different selections that work on master, I get this response for each, regardless of the extraction method I choose: [{"extraction_method":"","top":0.0,"left":0.0,"width":0.0,"height":0.0,"data":[[{"top":0.0,"left":0.0,"width":0.0,"height":0.0,"text":""}]], "spec_index": 0}]. The params to the request look right (and do work right, as I said, on master).

Test document: http://www1.nyc.gov/assets/nypd/downloads/pdf/crime_statistics/cs-en-us-pbms.pdf

0.9.2

java -jar ~/Downloads/tabula-0.9.2-jar-with-dependencies.jar -a 193.163,34.425,333.158,365.67 cs-en-us-pbms.pdf
"",,,Crime Complaints
"",Week to Date,,28 Day Ye
"",2017 2016 % Chg,2017,2016 % Chg 2017
Murder,0 0 ***.*,0,1 -100.0 6
Rape,1 3 -66.7,14,13 7.7 75
Robbery,16 25 -36.0,85,100 -15.0 540
Fel. Assault,29 26 11.5,102,116 -12.1 773
Burglary,20 29 -31.0,78,111 -29.7 621
Gr. Larceny,201 201 0.0,831,"888 -6.4 5,261"
G.L.A.,7 13 -46.2,23,41 -43.9 118

Debug output

java -cp ~/Downloads/tabula-0.9.2-jar-with-dependencies.jar technology.tabula.debug.Debug -e cs-en-us-pbms.pdf

[image: debug output for 0.9.2]

1.0.0

java -jar ~/Downloads/tabula-1.0.0-jar-with-dependencies.jar -a 193.163,34.425,333.158,365.67 cs-en-us-pbms.pdf
""

Debug output

java -cp ~/Downloads/tabula-1.0.0-jar-with-dependencies.jar technology.tabula.debug.Debug -e cs-en-us-pbms.pdf

[image: debug output for 1.0.0]

jazzido commented 7 years ago

Started work on branch fix/171. It seems that the y coordinates of the extracted TextElements are off (they fall outside the Page boundaries).
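To illustrate why out-of-bounds y coordinates could plausibly lead to an empty result, here is a minimal sketch. It is not tabula-java's code: the class `AreaSelectionSketch`, its methods, the 612 x 792 page size, and the element rectangles are all hypothetical. The only detail taken from this thread is the `-a` value, read as top,left,bottom,right.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not tabula-java's implementation.
class AreaSelectionSketch {

    // Parse an area given as "top,left,bottom,right" into a rectangle.
    static Rectangle2D.Double parseArea(String spec) {
        String[] parts = spec.split(",");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected top,left,bottom,right");
        }
        double top = Double.parseDouble(parts[0].trim());
        double left = Double.parseDouble(parts[1].trim());
        double bottom = Double.parseDouble(parts[2].trim());
        double right = Double.parseDouble(parts[3].trim());
        return new Rectangle2D.Double(left, top, right - left, bottom - top);
    }

    // Keep only the element bounds that intersect the selected area.
    static List<Rectangle2D> selectInside(Rectangle2D area, List<Rectangle2D> elements) {
        List<Rectangle2D> hits = new ArrayList<>();
        for (Rectangle2D element : elements) {
            if (area.intersects(element)) {
                hits.add(element);
            }
        }
        return hits;
    }

    // True when the element is not fully contained in the page rectangle.
    static boolean fallsOutsidePage(Rectangle2D page, Rectangle2D element) {
        return !page.contains(element);
    }

    public static void main(String[] args) {
        // Assumed page size (US Letter in points); the real document may differ.
        Rectangle2D page = new Rectangle2D.Double(0, 0, 612, 792);
        Rectangle2D area = parseArea("193.163,34.425,333.158,365.67");

        // One element placed inside the selection, and the same element
        // with its y coordinate shifted by a full page height.
        Rectangle2D good = new Rectangle2D.Double(40, 200, 30, 10);
        Rectangle2D shifted = new Rectangle2D.Double(40, 200 + 792, 30, 10);

        List<Rectangle2D> selected = selectInside(area, Arrays.asList(good, shifted));
        System.out.println("elements inside selection: " + selected.size());
        System.out.println("shifted element outside page: " + fallsOutsidePage(page, shifted));
    }
}
```

If every element were shifted like the second rectangle, `selectInside` would return an empty list for any selection on the page, which would match the empty output seen with 1.0.0. That link is an inference from the symptoms, not something confirmed in this thread.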

jazzido commented 7 years ago

Fixed for the reference document (acfc2ef5cdc6a9b509ff7d45cd4b14ffe5958113). All tests pass except for an RTL case (@jeremybmerrill, can you take a look? My Arabic is nil).

[image: cs-en-us-pbms-1]

jeremybmerrill commented 7 years ago

It's a good test failure. There was a problem in previous versions with a misplaced diacritic; with this new version on this branch, the diacritic is in the right place.

jazzido commented 7 years ago

Sweet. Should I just update the expectation in the test to the actual value?

jeremybmerrill commented 7 years ago

I already did but forgot to push because I'm juggling a million things. When my computer is done updating, I'll push.


jazzido commented 7 years ago

Never mind, just did it here.

jeremybmerrill commented 7 years ago

Ha, okeedoke. Works for me.
