Open jazzido opened 8 years ago
@jazzido do you mean to write tests against these documents in tabula-java, reading the table descriptions from the XML files?
I'll start translating icdar-test.rb to java
@jazzido I committed a test for one single file (eu-001.pdf) just to check with you if I'm correct.
Should I read the bbox data from the region XML file, extract the table(s) using that bbox, and then check that the cells match those in the structure XML file? I added some comments in the test, and I'm open to suggestions. Thanks!
Yes, the idea is to read the page areas from the XML files and run our extractor there. As you've seen, the ICDAR dataset has separate files for the page areas and for the structure of the ground-truthed tables. We need to use both of them: read the area from the *-reg.xml and the structure from the *-str.xml.
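As a rough illustration of the first step, here is a minimal Python sketch that reads table regions out of a region file. The element and attribute names (`table`, `region` with a `page` attribute, `bounding-box` with `x1`/`y1`/`x2`/`y2`) are assumptions about the *-reg.xml layout, and the sample values are made up, so check both against the real dataset files.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: element names, attribute names and values are
# assumptions about the *-reg.xml layout, not copied from the dataset.
SAMPLE_REG_XML = """<document filename="eu-001.pdf">
  <table id="1">
    <region id="1" page="1">
      <bounding-box x1="70" y1="470" x2="529" y2="608"/>
    </region>
  </table>
</document>"""


def read_regions(xml_text):
    """Return one dict per region: table id, page number and bbox (x1, y1, x2, y2)."""
    root = ET.fromstring(xml_text)
    regions = []
    for table in root.iter("table"):
        for region in table.iter("region"):
            bbox = region.find("bounding-box")
            if bbox is None:
                # Skip regions that carry no bounding box.
                continue
            regions.append({
                "table": table.get("id"),
                "page": int(region.get("page")),
                "bbox": tuple(float(bbox.get(k)) for k in ("x1", "y1", "x2", "y2")),
            })
    return regions


regions = read_regions(SAMPLE_REG_XML)
for r in regions:
    print(r["table"], r["page"], r["bbox"])
```

Each returned region could then be handed to the extractor as the area to process on that page. The coordinate origin of these boxes may differ from the one the extractor expects, so a conversion step may be needed.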
After that's done, we should implement a comparison method between the ground-truth and our detection. This paper, by the organizers of the ICDAR competition, describes one.
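One possible shape for such a comparison, which is not necessarily the method the paper describes: reduce each table to a set of neighbouring-cell pairs and score the detection with precision and recall against the ground truth. The helper names and the sample tables below are hypothetical.

```python
def horizontal_adjacencies(rows):
    """Set of (left text, right text) pairs for neighbouring cells in each row."""
    return {(row[i], row[i + 1]) for row in rows for i in range(len(row) - 1)}


def precision_recall(ground_truth, detected):
    """Precision and recall of the detected items against the ground-truth items."""
    gt, det = set(ground_truth), set(detected)
    correct = len(gt & det)
    precision = correct / len(det) if det else 0.0
    recall = correct / len(gt) if gt else 0.0
    return precision, recall


# Hypothetical tables: the detection splits the last cell's text incorrectly.
truth_rows = [["Year", "Value"], ["2012", "10"], ["2013", "12"]]
found_rows = [["Year", "Value"], ["2012", "10"], ["2013", "1 2"]]

precision, recall = precision_recall(
    horizontal_adjacencies(truth_rows),
    horizontal_adjacencies(found_rows),
)
print(precision, recall)
```

Comparing relations between cells instead of raw cell positions means a table can still score well when its absolute row or column indices are shifted. A fuller version would probably also consider vertical neighbours.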
Thanks so much for this, Meli. Having a comprehensive, independent test suite is a great step ahead for Tabula :)
Thanks @jazzido. So, the comparison is not as simple as I thought. I'll start with the first part: reading the XML files and extracting the tables. In the meantime I'll read the paper to understand how the comparison should work. Looking forward to implementing the algorithm!
I think that the implementation of the ICDAR test suite can be a separate project, so you're more than welcome to use Scala, JRuby, or whatever JVM language you're comfortable with.
Awesome! Scala is a very good choice :+1: Should we create a new repository, or commit the code here?
I'll create a new repo under tabulapdf and give you access.
Here: https://github.com/tabulapdf/icdar-testsuite
You have committer access, no need to PR.
Great, thanks!
@melisabok I was going to try to incorporate some of the ICDAR data to set up tests for #49. My plan was to have some tests for table region detection so that I could then test various table detection schemes.
Have you got anything going yet? I don't want to duplicate work if I don't have to. If not maybe I'll start writing table detection tests against tabula-java on my own and you can integrate them into this larger test suite later on?
@mcharters: writing the table detection tests against tabula-java on your own, so that they can be integrated into the larger test suite later, is a good option.
I'll commit my code to the new repo asap.
@melisabok Sounds good, thanks! That's what I'll do.
It's so exciting that you guys are tackling this. I can't stress enough how important it is to have an extensive test suite for Tabula. It'll allow us to track regressions in the quality of the extractor and stay focused on what needs to be improved.
I probably won't be able to contribute much code, but please don't hesitate to ping me if you need help understanding the codebase.
Thanks again!
@melisabok I added my initial table detection tests based on the ICDAR data in pull request #53 - feel free to steal any code you need! Maybe one day I'll learn some Scala and be able to help move everything over. :)
In 2013, there was a table extraction competition at the International Conference on Document Analysis and Recognition. Its organizers released a comprehensive dataset that contains a set of PDFs along with XML files describing the position, size, and cell structure of the tables appearing in those documents.
Some time ago, I started a branch in tabula-extractor with code that reads those descriptions and runs Tabula's extractors against the documents: https://github.com/tabulapdf/tabula-extractor/blob/icdar-groundtruth-tests/test/icdar-test.rb
It would be great to revive that effort.