tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

[Feature Request] Table structure extraction at the API #1714

Open troplin opened 6 years ago

troplin commented 6 years ago

There is already some table detection mechanism in tesseract, but unfortunately there seems to be no way to access the table structure through the API.

This could be done with only minimal changes to the API, just by expanding the PageIteratorLevel enum with two additional members, RIL_TABLEROW and RIL_TABLECELL or similar. Those would only be relevant inside PT_TABLE blocks, just like RIL_PARA is only meaningful for text blocks.
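
To make the proposal concrete, here is a minimal, self-contained sketch of what the extended level enum could look like. Nothing in it is Tesseract code: `IteratorLevel`, `BlockKind` and `LevelAppliesTo` are hypothetical stand-ins, and the two table levels do not exist in the library. The rule for which level applies to which block type is an assumption based on the description above.

```cpp
// Mirrors the order of the existing tesseract::PageIteratorLevel values and
// appends the two proposed levels at the end (hypothetical, not in Tesseract).
enum class IteratorLevel {
  kBlock,     // RIL_BLOCK
  kPara,      // RIL_PARA
  kTextline,  // RIL_TEXTLINE
  kWord,      // RIL_WORD
  kSymbol,    // RIL_SYMBOL
  kTableRow,  // proposed RIL_TABLEROW
  kTableCell  // proposed RIL_TABLECELL
};

// Simplified stand-in for the block types; only the distinction matters here.
enum class BlockKind { kText, kTable, kImage };

// Appending the new members keeps the numeric values of the existing ones.
static_assert(static_cast<int>(IteratorLevel::kSymbol) == 4,
              "existing levels keep their values");

// Returns true if iterating at `level` is meaningful inside a block of `kind`.
constexpr bool LevelAppliesTo(BlockKind kind, IteratorLevel level) {
  switch (level) {
    case IteratorLevel::kBlock:
      return true;
    case IteratorLevel::kTableRow:
    case IteratorLevel::kTableCell:
      return kind == BlockKind::kTable;  // table levels only inside tables
    case IteratorLevel::kPara:
      return kind == BlockKind::kText;   // paragraphs only inside text blocks
    case IteratorLevel::kTextline:
    case IteratorLevel::kWord:
    case IteratorLevel::kSymbol:
      return kind != BlockKind::kImage;  // table contents are still text
  }
  return false;
}
```

Adding the new members at the end rather than in the middle is a deliberate choice in this sketch: callers that compare or store the existing values would keep working.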

zdenop commented 6 years ago

Are you able to send a PR for this, including a simple test case (similar to #1614)?

troplin commented 6 years ago

@zdenop I didn't mean to imply that it was easy to implement, just that the interface changes are small. I have honestly no idea what it takes but if I find the time, I'll give it a try.

amitdo commented 6 years ago

I assume tesseract handles tables in one of these two ways:

1) Table columns are held in tesseract blocks and cells are held as lines within blocks.
2) Table rows are held in tesseract blocks and cells are held as lines within blocks.

I bet on option (1).

troplin commented 6 years ago

@amitdo I'm pretty sure that this is not the case. I fear that the information is lost completely. IMO it's also not a very good representation. Cells can be multi-line, so they should be comparable to paragraphs, not lines.

The table I'm testing with seems to be recognized as a single block (which makes sense IMO). But then the table is split into two paragraphs (one for the first row and one for the rest), which does not make much sense. The lines span the whole table. For multiline cells, the lines of each cell are combined into nonsensical long lines.

If this reflects the internal table structure, that would mean that the table detection is really bad and I can just disregard it. If not, the results could be presented much better. The fact that the table separators are actually recognized as horizontal and vertical lines makes me think that the information might be there.

I'm going to investigate a bit more, once I've successfully set up the debug viewer.

amitdo commented 6 years ago

Tesseract considers any table it can recognize as a block, so it's neither of the cases.

amitdo commented 6 years ago

The table detection code is here: https://github.com/tesseract-ocr/tesseract/blob/master/src/textord/tablefind.cpp

amitdo commented 6 years ago

Play with the variables: https://github.com/tesseract-ocr/tesseract/blob/509a6f0ce0e636a9ed92553439f1ed6a56b346c5/src/textord/tablefind.cpp#L143

amitdo commented 6 years ago

They published a paper about the table detection module. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.638.7400&rep=rep1&type=pdf

Shreeshrii commented 6 years ago

https://github.com/tabulapdf/tabula/issues/409#issuecomment-327050906

from someone who has scripted the process of splitting a PDF into cells and OCR'ing them separately:

chain pdf-table-extract, ghostscript and tesseract

Shreeshrii commented 6 years ago

Related issue - How to detect table region after the update in Tablefind.cpp? #825

Shreeshrii commented 6 years ago

Also see https://github.com/DanBloomberg/leptonica/blob/5dca24f9674c7fd057ab55bbfc71efa87a83a520/version-notes.html#L180

 Improved table detection on scanned page images (tests: pageseg_reg.c)

https://github.com/DanBloomberg/leptonica/commit/18342b4c3fe64958804d8b6f042e81d264c2d4d3

troplin commented 6 years ago

Thanks for all the pointers. I don't want to change the table detection though, or even implement it myself. I just want the results accessible at the public API, if available.

Is there a high-level description of the internal processing pipeline of tesseract somewhere?

troplin commented 6 years ago

Ok, I've got the debug viewer running.

It seems that the table detection works perfectly: table_structure

But then, the contents of the table are just processed as any other text, which doesn't make sense to me: final_table_partitions

So this means that the data is there, but it's not actually used. Is this because the whole table is a simple block? Would it be better to treat every cell as a single block and represent the table structure at a higher level?

Sintun commented 6 years ago

Hi, I'm thinking about writing the API / structure part necessary to hold and access the otherwise lost information (as described by @troplin). Is the approach described in @troplin's first comment feasible? Would a commit from a tesseract-team outsider be acceptable? Is there a Guideline for your c++ code-styling? And: Is this the right place to ask these questions?

amitdo commented 6 years ago

Hi @Sintun!

I'm thinking about writing the API / structure part necessary to hold and access the otherwise lost information (as described by @troplin). Is the approach described in @troplin's first comment feasible?

I don't know.

Would a commit from a tesseract-team outsider be acceptable?

Yes, of course. https://github.com/tesseract-ocr/tesseract/graphs/contributors AFAIK, only 2 people in this list are from Google.

Whether a specific PR will be accepted or rejected will depend on its code quality.

Don't break an existing API.

Is there a Guideline for your c++ code-styling?

Not officially, but since most of the code comes from Google, it's a good idea to use Google C++ Style Guide.

Is this the right place to ask these questions?

I think so.

Sintun commented 6 years ago

Nice, I will start this and a traceback and possible fix of https://github.com/tesseract-ocr/tesseract/issues/1712 as a weekend-project.

amitdo commented 6 years ago

I suggest that you provide an example that demonstrates the use of the new API.

Good luck!

troplin commented 6 years ago

@Sintun It might actually be better to invert my API suggestion and do something like:

instead of my earlier suggestion, which was:

I don't know which one is better, it depends on what a block means to the engine. I already tried to figure out how the recognition process works on a high level, but I'm a bit lost.

Maybe someone with a deeper understanding of the internals could give a hint?

amitdo commented 6 years ago

Maybe someone with a deeper understanding of the internals could give a hint?

I think only @theraysmith can help here.

Sintun commented 6 years ago

@troplin At first sight your new suggestion looks more logical. I hope that the existing structures will give me a clear path to hold the table information; otherwise I will stick with the most logical (at least for me) non-API-breaking approach. Thanks for the hint, I will take it into account when looking at the code.

@amitdo I think I have to invest some time and dive through the surrounding code before I can understand helpful hints from people who are familiar with the code base.

Shreeshrii commented 6 years ago

@zdenop Please label as Feature Request.

zdenop commented 6 years ago

@Sintun: any progress on this issue? If the API needs to be changed, I would like to get that into the 4.0 release...

Sintun commented 6 years ago

@Sintun: any progress on this issue? If the API needs to be changed, I would like to get that into the 4.0 release...

Unfortunately not yet; I'm still working on #1192 / #1712, because it seems to be a more pressing matter.

krishna11888 commented 5 years ago

What happened to the table extraction API feature?

Sintun commented 5 years ago

Hi, I wanted to make the information accessible through the API. Unfortunately I wasn't able to find enough free time to do it. Fortunately my employer needs this feature, and next week I'm going to propose using this tesseract code and updating the tesseract API. I'm pretty confident that this will go through, so I could focus on it, which should enable me to finish it in the near future.

zdenop commented 5 years ago

Just side note: there is python project for extracting table data from pdf: Camelot and there is also web interface for it: excalibur

Sintun commented 5 years ago

Hi, today I started looking into this. I'm working on 5.0.0-alpha.

The found table rectangles are already exposed by the API, and I am not entirely sure yet that the table structure isn't.

The table boundaries can be extracted by setting:

api->SetVariable("textord_tabfind_find_tables", "true");
api->SetVariable("textord_tablefind_recognize_tables", "true");

Both variables need to be set before the recognition. Then a loop over the blocks gives the boundaries:

tesseract::ResultIterator* ri = api->GetIterator();
tesseract::PageIteratorLevel level = tesseract::RIL_BLOCK;

if (ri != nullptr) {
  do {
    if (ri->BlockType() == PT_TABLE) {
      printf("found a table\n");
      int x1, y1, x2, y2;
      ri->BoundingBox(level, &x1, &y1, &x2, &y2);
      printf("table BoundingBox: %d,%d,%d,%d;\n", x1, y1, x2, y2);
    }
  } while (ri->Next(level));
}

amitdo commented 5 years ago

The found table rectangles are already exposed by the API, and I am not entirely sure yet that the table structure isn't.

It isn't. That's the whole point of this feature request.

Sintun commented 5 years ago

Yep, I just wanted to say that I'm still trying to understand the code and haven't reached the point where I understand all the side effects. Introducing a structure like troplin proposed seems to be difficult, because a paragraph or a textline (even words) can go over table cell boundaries. At least at a word level, it is easy to fix them to a single table cell.

In order to use iterators, many things would need to be introduced to properly iterate over tables and table parts and to make functions like GetUTF8Text available. Without lists in the table cells that hold pointers to the contained words, this could also become inefficient.

So a question: should tesseract offer iterators over tables (like RIL_WORD, RIL_TEXTLINE, ...), table columns and table rows? Or would it be enough to just add a function like

int table_num, table_row, table_col;
tesseract::ResultIterator* ri = api->GetIterator();
ri->TablePosition(&table_num, &table_row, &table_col);  // proposed method, does not exist yet

The API user can then decide what to do with that. It would be simple to do the second one. Then I could just add these three fields to the WERD_RES class as

int8_t table;
int8_t table_row;
int8_t table_col;

and fill them after the table recognition.
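
As a rough illustration of the second option, the sketch below assigns a word to a table, row and column by testing where the center of its bounding box falls. It is self-contained and does not use Tesseract types: `Box`, `Table`, `TablePosition` and `LocateWord` are hypothetical names, boxes are assumed to use a top-left origin, and plain `int` is used for the indices because `int8_t` would cap them at 127.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Axis-aligned box (left, top, right, bottom); y grows downward.
struct Box {
  int left, top, right, bottom;
};

// Hypothetical table description: outer box plus row and column boxes.
struct Table {
  Box box;
  std::vector<Box> rows;
  std::vector<Box> cols;
};

// Hypothetical per-word result; -1 means "not inside any table/row/column".
struct TablePosition {
  int table = -1;
  int row = -1;
  int col = -1;
};

static bool Contains(const Box& b, int x, int y) {
  return x >= b.left && x < b.right && y >= b.top && y < b.bottom;
}

// Index of the first box containing (x, y), or -1 if there is none.
static int IndexOfBoxContaining(const std::vector<Box>& boxes, int x, int y) {
  for (std::size_t i = 0; i < boxes.size(); ++i) {
    if (Contains(boxes[i], x, y)) return static_cast<int>(i);
  }
  return -1;
}

// Locates a word by the center of its bounding box.
TablePosition LocateWord(const Box& word, const std::vector<Table>& tables) {
  assert(word.left <= word.right && word.top <= word.bottom);
  const int cx = (word.left + word.right) / 2;
  const int cy = (word.top + word.bottom) / 2;
  TablePosition pos;
  for (std::size_t t = 0; t < tables.size(); ++t) {
    if (!Contains(tables[t].box, cx, cy)) continue;
    pos.table = static_cast<int>(t);
    pos.row = IndexOfBoxContaining(tables[t].rows, cx, cy);
    pos.col = IndexOfBoxContaining(tables[t].cols, cx, cy);
    break;
  }
  return pos;
}
```

Using the center point sidesteps the case where a word box slightly overlaps a cell border, at the cost of misplacing words that genuinely straddle two cells.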

troplin commented 5 years ago

Introducing a structure like troplin proposed seems to be difficult, because a paragraph or a textline (even words) can go over table cell boundaries. At least at a word level, it is easy to fix them to a single table cell.

IMO lines and paragraphs should only be detected within a table cell. It does not make sense for a line or paragraph to cross cell boundaries. Just look at the pictures that I posted above, the first one (table structure) looks fine, but in the second one (final table partitions) the content of different cells is combined to a single structure, which just doesn't make any sense to me.

I don't know exactly how the engine actually works, but I imagine that there's a layout analysis that comes first, dividing the image into blocks of different types (text, image, table, etc). After that, each block is processed separately depending on the type. The problem is IMO that a table is a single block and processed as a whole. I think it would be better to create a block for each cell in the first place. That way all the problems with structures that cross cell boundaries can be avoided.

amitdo commented 5 years ago

I think that implementing what @troplin asked for will be a very hard task.

It would be simple to do the second one. Then I could just add these three fields to the WERD_RES class as

int8_t table;
int8_t table_row;
int8_t table_col;

Can you save the tables info in a vector of structs/classes of tables inside WERD_RES?

struct MyTable {
  TBOX box;
  std::vector<TBOX> row;
  std::vector<TBOX> col;
};
std::vector<MyTable> table;

Sintun commented 5 years ago

WERD_RES seems to be responsible for word-level information, so adding table information at this level would result in every word holding the information about every table. Nonetheless, next weekend I will look for a hierarchy level where this information can be stored and accessed. Creating a vector of table structs really warms my C++ heart.

amitdo commented 5 years ago

You're right about WERD_RES.

It seems that with my suggestion you need to place the vector of tables in PAGE_RES.

balachandarsv commented 5 years ago

Any update on this API? Is anyone working on this?

CarlosVilla00896 commented 4 years ago

I have the same question as @balachandarsv. Any update on this?

Sintun commented 4 years ago

I'm sorry, I wasn't able to find the time necessary to implement the necessary changes. So no, not from me.

Sintun commented 4 years ago

I made experimental changes that are sufficient to get the table information through the API. My understanding of the internal iterators and lists is currently not good enough, so I used a singleton approach to "warp" the information from the page segmentation to the API, and hopefully I caught the reskewing and coordinate inversion steps. This approach probably fails on right-to-left languages, because I haven't taken them into account yet. I forked the repo and my changes can be found in https://github.com/Sintun/tesseract.
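
For readers unfamiliar with the pattern, here is a minimal sketch of what such a singleton hand-off could look like, including the vertical coordinate flip. It is not taken from the fork: `TableRegistry`, `Box` and `FlipVertically` are hypothetical names, reskewing is left out, and the only fact assumed about Tesseract is that its internal boxes use a bottom-left origin while the public API reports a top-left origin.

```cpp
#include <cassert>
#include <vector>

// Box as (left, top, right, bottom). In bottom-left-origin coordinates the
// top value is numerically larger than the bottom value.
struct Box {
  int left, top, right, bottom;
};

// Converts a bottom-left-origin box into a top-left-origin box.
Box FlipVertically(const Box& b, int image_height) {
  return Box{b.left, image_height - b.top, b.right, image_height - b.bottom};
}

// Process-wide store that the segmentation code fills and the API reads.
class TableRegistry {
 public:
  static TableRegistry& Instance() {
    static TableRegistry instance;  // constructed once, on first use
    return instance;
  }
  TableRegistry(const TableRegistry&) = delete;
  TableRegistry& operator=(const TableRegistry&) = delete;

  // Must be called before each new page, otherwise old tables linger.
  void Clear() { tables_.clear(); }

  // Called from the (hypothetical) segmentation side with an internal box.
  void Add(const Box& internal_box, int image_height) {
    assert(image_height > 0);
    tables_.push_back(FlipVertically(internal_box, image_height));
  }

  // Called from the (hypothetical) API side; boxes use a top-left origin.
  const std::vector<Box>& Tables() const { return tables_; }

 private:
  TableRegistry() = default;
  std::vector<Box> tables_;
};
```

The sketch also shows why the approach is uncomfortable: the state is shared by every API instance in the process and is not synchronized, so two engines running at once would overwrite each other's tables.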

A minimal example to get it to work:

#include <cstdio>
#include <tuple>
#include <vector>
...
tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
for(unsigned i = 0; i < api->GetNumberOfTables(); ++i)
{
  int x1, y1, x2, y2;
  std::tie(x1, y1, x2, y2) = api->GetTableBoundingBox(i);
  printf("table BoundingBox: %d, %d, %d, %d;\n", x1, y1, x2, y2);
  std::vector<std::tuple<int,int,int,int>> rows = api->GetTableRows(i);
  std::vector<std::tuple<int,int,int,int>> cols = api->GetTableCols(i);

  for(const std::tuple<int, int, int, int>& t: rows)
  {
    std::tie(x1, y1, x2, y2) = t;
    printf("row: %d, %d, %d, %d;\n", x1, y1, x2, y2);
  }
  printf("\n");
  for(const std::tuple<int, int, int, int>& t: cols)
  {
    std::tie(x1, y1, x2, y2) = t;
    printf("col: %d, %d, %d, %d;\n", x1, y1, x2, y2);
  }
  printf("\n");
}

Sintun commented 4 years ago

My small (but not minimal) program for testing: https://github.com/Sintun/PersonalHelperPrograms/blob/master/Tesseract/tess.cpp [outdated] new: tableExtractionDemo
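
The row and column boxes from the example above can be combined into per-cell boxes by intersecting each row with each column. The sketch below is standalone and only shares the tuple layout (x1, y1, x2, y2) with the fork's example; `BoxTuple` and `CellGrid` are hypothetical names, and it assumes that row boxes span the table width and column boxes span the table height, which I have not verified against the fork.

```cpp
#include <algorithm>
#include <cassert>
#include <tuple>
#include <utility>
#include <vector>

using BoxTuple = std::tuple<int, int, int, int>;  // x1, y1, x2, y2

// Returns grid[r][c], the intersection of row r with column c.
std::vector<std::vector<BoxTuple>> CellGrid(const std::vector<BoxTuple>& rows,
                                            const std::vector<BoxTuple>& cols) {
  std::vector<std::vector<BoxTuple>> grid;
  grid.reserve(rows.size());
  for (const BoxTuple& r : rows) {
    // Each input box is expected to be well-formed (x1 <= x2, y1 <= y2).
    assert(std::get<0>(r) <= std::get<2>(r) && std::get<1>(r) <= std::get<3>(r));
    std::vector<BoxTuple> line;
    line.reserve(cols.size());
    for (const BoxTuple& c : cols) {
      line.emplace_back(std::max(std::get<0>(r), std::get<0>(c)),
                        std::max(std::get<1>(r), std::get<1>(c)),
                        std::min(std::get<2>(r), std::get<2>(c)),
                        std::min(std::get<3>(r), std::get<3>(c)));
    }
    grid.push_back(std::move(line));
  }
  return grid;
}
```

Each cell box could then be passed to `SetRectangle` on the API to run recognition on one cell at a time, although merged cells would need extra handling that this sketch does not attempt.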

foaadnami commented 4 years ago

+1

saiprasadjnv commented 3 years ago

Ok, I've got the debug viewer running.

It seems that the table detection works perfectly: table_structure

But then, the contents of the table are just processed as any other text, which doesn't make sense to me: final_table_partitions

So this means that the data is there, but it's not actually used. Is this because the whole table is a simple block? Would it be better to treat every cell as a single block and represent the table structure at a higher level?

Hi troplin, can you help me understand how I can get the debug viewer running? Also, do we have any sample code for table detection using tesseract?

zdenop commented 3 years ago

https://github.com/tesseract-ocr/tessdoc/blob/master/ViewerDebugging.md

troplin commented 3 years ago

@saiprasadjnv TBH I don't know exactly what I've done anymore. I guess I just followed the instructions linked by @zdenop .

saiprasadjnv commented 3 years ago

Thank you so much. I got it running.

saiprasadjnv commented 3 years ago

Is this feature request still open? I am interested in working on it. If someone is already working on it, we can collaborate and speed up the process. Please feel free to contact me at saiprasad.jnv@gmail.com.

zdenop commented 3 years ago

@saiprasadjnv you can work on this. @Sintun started to work on this, but never sent a PR here, so IMO this is an unfinished task.

amitdo commented 3 years ago

I suggest starting by testing @Sintun's patches in his fork.

amitdo commented 3 years ago

do we have any sample code for table detection using tesseract?

https://github.com/tesseract-ocr/tesseract/issues/1714#issuecomment-588183356

balachandarsv commented 3 years ago

I tried with some sample tables, @Sintun's solution works well. Any idea when this would be merged into master?

Sintun commented 3 years ago

I could update it and create a pull request around the next weekend, if someone gives me an absolution for the usage of a singleton approach (an object that shares properties with global variables :( ).

balachandarsv commented 3 years ago

@Sintun any update on the pull request?