tabulapdf / tabula-java

Extract tables from PDF files
MIT License

StringIndexOutOfBoundsException while processing documents #37

Open schoch opened 9 years ago

schoch commented 9 years ago

Today I hit a couple of StringIndexOutOfBoundsExceptions in ObjectExtractor.isPrintable(...). The cause was a String of length 0. I spent some time in the debugger, but could not determine whether this is a bug in tabula or in my PDF documents.

So, before I go deeper inside the code... does anybody have any ideas about that topic? If it is just a bug in the isPrintable check, it would be easy to fix.

jeremybmerrill commented 9 years ago

Manuel's the expert, but it looks to me like an easy-enough-to-fix bug in ObjectExtractor.java: https://github.com/tabulapdf/tabula-java/blob/master/src/main/java/technology/tabula/ObjectExtractor.java#L426-L431

jazzido commented 9 years ago

isPrintable should not receive a 0-length string as its argument; that should probably be caught in or before [ObjectExtractor.processTextPosition](https://github.com/tabulapdf/tabula-java/blob/master/src/main/java/technology/tabula/ObjectExtractor.java#L322).

Can you reproduce the bug and write a failing test case?
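
For illustration only, here is a minimal sketch of the failure and of a guard placed before the check. It does not use tabula's code: the class `EmptyTextGuard` and the methods `isPrintableUnguarded`, `isPrintable`, and `keepPrintable` are hypothetical stand-ins, and the body of the unguarded check is only an assumption about what a first-character test like this might look like. The one thing it relies on for certain is that `String.charAt(0)` throws `StringIndexOutOfBoundsException` on an empty string.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class EmptyTextGuard {

    // Hypothetical stand-in for a first-character check (NOT tabula's actual code).
    // It reads s.charAt(0) without checking the length first.
    static boolean isPrintableUnguarded(String s) {
        char c = s.charAt(0); // throws StringIndexOutOfBoundsException when s is ""
        Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
        return !Character.isISOControl(c)
                && block != null
                && block != Character.UnicodeBlock.SPECIALS;
    }

    // Guarded variant: null or empty text is treated as not printable.
    static boolean isPrintable(String s) {
        if (s == null || s.isEmpty()) {
            return false;
        }
        return isPrintableUnguarded(s);
    }

    // Caller-side filtering, analogous to skipping empty text before
    // it ever reaches the printable check.
    static List<String> keepPrintable(List<String> glyphs) {
        List<String> kept = new ArrayList<>();
        for (String g : glyphs) {
            if (isPrintable(g)) {
                kept.add(g);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        try {
            isPrintableUnguarded("");
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("unguarded check threw " + e.getClass().getSimpleName());
        }
        // The empty string and the tab (an ISO control character) are dropped.
        System.out.println(keepPrintable(Arrays.asList("a", "", "\t", "b")));
    }
}
```

Whether empty text should be dropped silently or treated as a sign of a malformed document is a separate decision; the sketch only shows where a length check would prevent the exception.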

schoch commented 9 years ago

Yes, I can do that. Hope that I find the time in the next few days.

jazzido commented 9 years ago

Thanks!