tabulapdf / tabula-java

Extract tables from PDF files
MIT License

Write tests for the ICDAR 2013 groundtruth dataset #51

Open jazzido opened 8 years ago

jazzido commented 8 years ago

In 2013, there was a table extraction competition at the International Conference on Document Analysis and Recognition (ICDAR). Its organizers released a comprehensive dataset that contains a bunch of PDFs along with XML files describing the position, size, and cell structure of the tables appearing in those documents.

Some time ago, I started a branch in tabula-extractor with code that reads those descriptions and runs Tabula's extractors against the documents: https://github.com/tabulapdf/tabula-extractor/blob/icdar-groundtruth-tests/test/icdar-test.rb

It would be great to revive that effort.

melisabok commented 8 years ago

@jazzido Do you mean writing tests against these documents in tabula-java, reading the table descriptions from the XML files?

I'll start translating icdar-test.rb to Java.

melisabok commented 8 years ago

@jazzido I committed a test for a single file (eu-001.pdf) just to check with you that I'm on the right track.
Should I read the bbox data from the region XML file, extract the table(s) using that bbox, and then check that the cells match those in the structure XML file? I added some comments in the test, and I'm open to suggestions. Thanks!

jazzido commented 8 years ago

Yes, the idea is to read the page areas from the XML files and run our extractor there. As you've seen, the ICDAR dataset has separate files for the page areas and for the structure of the ground-truthed tables. We need to use both of them: read the area from the *-reg.xml and the structure from *-str.xml.
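
To make the first step concrete, here is a minimal sketch of reading page areas from a region file. It assumes each `*-reg.xml` holds `<region page="...">` elements that each contain a `<bounding-box x1 y1 x2 y2>` child; that layout, the class name `IcdarRegionReader`, and the sample coordinates are assumptions made for illustration, so check them against the actual dataset files before relying on this.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of tabula-java.
class IcdarRegionReader {

    /** One ground-truth table region: page number plus bounding box corners. */
    static final class Region {
        final int page;
        final double x1, y1, x2, y2;

        Region(int page, double x1, double y1, double x2, double y2) {
            this.page = page;
            this.x1 = x1;
            this.y1 = y1;
            this.x2 = x2;
            this.y2 = y2;
        }
    }

    /** Parses every <region> element found in the given XML (assumed layout). */
    static List<Region> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<Region> regions = new ArrayList<>();
            NodeList nodes = doc.getElementsByTagName("region");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element region = (Element) nodes.item(i);
                // Assumption: the first <bounding-box> child describes the whole region.
                Element box = (Element) region.getElementsByTagName("bounding-box").item(0);
                regions.add(new Region(
                        Integer.parseInt(region.getAttribute("page")),
                        Double.parseDouble(box.getAttribute("x1")),
                        Double.parseDouble(box.getAttribute("y1")),
                        Double.parseDouble(box.getAttribute("x2")),
                        Double.parseDouble(box.getAttribute("y2"))));
            }
            return regions;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse region XML", e);
        }
    }

    public static void main(String[] args) {
        // Invented sample, shaped like the assumed region file layout.
        String sample = "<document filename=\"eu-001.pdf\">"
                + "<table id=\"1\"><region id=\"1\" page=\"1\">"
                + "<bounding-box x1=\"50\" y1=\"100\" x2=\"500\" y2=\"400\"/>"
                + "</region></table></document>";
        for (Region r : parse(sample)) {
            System.out.println("page " + r.page + ": " + r.x1 + "," + r.y1 + " to " + r.x2 + "," + r.y2);
        }
    }
}
```

The structure files could be read the same way, collecting cell elements instead of regions. Before passing a box to the extractor, it is worth checking whether the dataset and Tabula measure the vertical axis from the same edge of the page.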

After that's done, we should implement a comparison method between the ground-truth and our detection. This paper, by the organizers of the ICDAR competition, describes one.
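
As a rough illustration of what such a comparison could look like, the sketch below scores a detected table against the ground truth by comparing adjacency relations between neighbouring cells and reporting precision and recall. This is an assumption about the general approach, not a transcription of the paper's algorithm. The class `AdjacencyComparison`, the plain `String[][]` grid representation, and the relation encoding are all hypothetical, and spanning cells and text normalization are ignored.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of tabula-java or the ICDAR tooling.
class AdjacencyComparison {

    static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }

    /**
     * Builds one relation per non-blank cell and direction: the cell paired with
     * its nearest non-blank neighbour to the right (H) and below (V).
     * A Set is used, so repeated identical relations collapse into one.
     */
    static Set<String> relations(String[][] grid) {
        Set<String> out = new HashSet<>();
        for (int r = 0; r < grid.length; r++) {
            for (int c = 0; c < grid[r].length; c++) {
                if (isBlank(grid[r][c])) {
                    continue;
                }
                for (int k = c + 1; k < grid[r].length; k++) {
                    if (!isBlank(grid[r][k])) {
                        out.add("H|" + grid[r][c] + "|" + grid[r][k]);
                        break;
                    }
                }
                for (int k = r + 1; k < grid.length; k++) {
                    if (c < grid[k].length && !isBlank(grid[k][c])) {
                        out.add("V|" + grid[r][c] + "|" + grid[k][c]);
                        break;
                    }
                }
            }
        }
        return out;
    }

    /** Returns {precision, recall} of the detected relations against the truth. */
    static double[] precisionRecall(String[][] truth, String[][] detected) {
        Set<String> t = relations(truth);
        Set<String> d = relations(detected);
        Set<String> common = new HashSet<>(t);
        common.retainAll(d);
        double precision = d.isEmpty() ? 0.0 : (double) common.size() / d.size();
        double recall = t.isEmpty() ? 0.0 : (double) common.size() / t.size();
        return new double[] {precision, recall};
    }
}
```

Comparing relations rather than absolute row and column indices means that an extra empty row or column in the detected table does not by itself count as an error.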

Thanks so much for this, Meli. Having a comprehensive, independent test suite is a great step ahead for Tabula :)

melisabok commented 8 years ago

Thanks @jazzido. So, the comparison is not as simple as I thought. I'll start with the first part: reading the XML files and extracting the tables. In the meantime I'll read the paper to understand how the comparison should work. Looking forward to implementing the algorithm!

jazzido commented 8 years ago

I think that the implementation of the ICDAR test suite can be a separate project, so you're more than welcome to use Scala, JRuby, or whatever JVM language you're comfortable with.

melisabok commented 8 years ago

Awesome! Scala is a very good choice :+1: Should we create a new repository, or should we commit the code here?

jazzido commented 8 years ago

I'll create a new repo under tabulapdf and give you access.

jazzido commented 8 years ago

Here: https://github.com/tabulapdf/icdar-testsuite

You have committer access, no need to PR.

melisabok commented 8 years ago

Great, thanks!

mcharters commented 8 years ago

@melisabok I was going to try to incorporate some of the ICDAR data to set up tests for #49. My plan was to have some tests for table region detection so that I could then test various table detection schemes.

Have you got anything going yet? I don't want to duplicate work if I don't have to. If not, maybe I'll start writing table detection tests against tabula-java on my own and you can integrate them into this larger test suite later on?
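
One way a region detection test could be scored is by the overlap between a detected region and the ground-truth bounding box. The sketch below uses intersection over union with a pass threshold; the metric, the class name `RegionOverlap`, and the `{x1, y1, x2, y2}` box representation are assumptions for illustration, not something this thread or the dataset prescribes.

```java
// Hypothetical helper, not part of tabula-java.
class RegionOverlap {

    /** Area of a box given as {x1, y1, x2, y2} with x1 <= x2 and y1 <= y2. */
    static double area(double[] box) {
        return (box[2] - box[0]) * (box[3] - box[1]);
    }

    /** Intersection over union of two boxes; 0.0 when they do not overlap. */
    static double iou(double[] a, double[] b) {
        double w = Math.min(a[2], b[2]) - Math.max(a[0], b[0]);
        double h = Math.min(a[3], b[3]) - Math.max(a[1], b[1]);
        double intersection = (w > 0 && h > 0) ? w * h : 0.0;
        double union = area(a) + area(b) - intersection;
        return union > 0 ? intersection / union : 0.0;
    }

    /** A detection counts as a match when its overlap reaches the threshold. */
    static boolean matches(double[] truth, double[] detected, double threshold) {
        return iou(truth, detected) >= threshold;
    }
}
```

A test could then assert, for each ground-truth table, that some detected region passes `matches` at a chosen threshold, which makes it easy to compare detection schemes by how many tables they find.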

melisabok commented 8 years ago

@mcharters: Writing the table detection tests against tabula-java on your own, so that I can integrate them into this larger test suite later, is a good option.

I'll commit my code to the new repo ASAP.

mcharters commented 8 years ago

@melisabok Sounds good, thanks! That's what I'll do.

jazzido commented 8 years ago

It's so exciting that you guys are tackling this. I can't stress enough how important it is to have an extensive test suite for Tabula. It'll allow us to track regressions in the quality of the extractor and stay focused on what needs to be improved.

I probably won't be able to contribute much code, but please don't hesitate to ping me if you need help understanding the codebase.

Thanks again!

mcharters commented 8 years ago

@melisabok I added my initial table detection tests based on the ICDAR data in pull request #53 - feel free to steal any code you need! Maybe one day I'll learn some Scala and be able to help move everything over. :)