mindee / doctr

docTR (Document Text Recognition) - a seamless, high-performing & accessible library for OCR-related tasks powered by Deep Learning.
https://mindee.github.io/doctr/
Apache License 2.0

Adding TableNet model to extract tabular data #524

Open felixdittrich92 opened 3 years ago

felixdittrich92 commented 3 years ago

Add a TableNet model to extract tabular data as a dataframe from images. I have a ready-to-use model (.pt) trained on the Marmot dataset and need a bit of guidance on where to add it, preferably as ONNX. For self-training I can also add a reference script, and the same for the dataset, but only in PyTorch (Lightning).

To be done after the restructuring and the hOCR / PDF/A export. @fg-mindee @charlesmindee

charlesmindee commented 2 years ago

Hi @felixdittrich92,

Thanks for bringing this to the table, it is a very interesting and useful feature. It would be great to integrate such a model in doctr, but we need to think about the global architecture: should it be a separate model (no shared features) from our detection + recognition pipeline (which would certainly slow down the end-to-end prediction), or should it be integrated into the detection predictor to maximize feature sharing?

To answer this question we can look at the speed of your model: can you benchmark it on your side?

If it is fast enough, we can start by implementing it separately in a new module, so it runs independently from the main pipeline. We can first implement the model in PyTorch as you suggested, provide a pretrained version (.pt) in the config, and tackle the dataset/training-script integration later on!
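A standalone integration as described above could be a thin wrapper module that keeps the table model fully decoupled from the detection + recognition pipeline. A minimal sketch, assuming an injectable model callable (the `TableExtractor` name and interface are hypothetical, not doctr API):

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (xmin, ymin, xmax, ymax) in pixels


class TableExtractor:
    """Standalone table-structure predictor, decoupled from the OCR pipeline.

    `model` is any callable mapping an image to a list of cell boxes, e.g. an
    ONNX or TorchScript session wrapped in a function. Keeping the model
    injectable lets this module run independently of the main predictors.
    """

    def __init__(self, model: Callable[[object], List[Box]]):
        self.model = model

    def __call__(self, image) -> List[Box]:
        boxes = self.model(image)
        # Sort cells top-to-bottom, then left-to-right, so downstream code
        # can reconstruct rows in reading order.
        return sorted(boxes, key=lambda b: (b[1], b[0]))
```

With this shape, swapping the backend (pure .pt vs. exported ONNX) only changes the callable passed in, not the pipeline code.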

Have a nice day ! :smile:

felixdittrich92 commented 2 years ago

@charlesmindee Yes, I will do that, I think later today :) I wish you the same. I have attached the TensorBoard logs if you want to take a look: version_0.zip

felixdittrich92 commented 2 years ago

@charlesmindee On an 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz, the ONNX model takes ~3-3.5 s without (Tesseract) OCR. (Tomorrow I can also test the pure .pt model if you want.) I think optimizations are still possible, such as smaller input sizes or model pruning. Sample output:

```
                                                0     1      2      3     4     5     6
0   Protein-ligand Complex #rotable bonds stoDock  Dock  FlexX    ICM  GOLD   T10   120
1                                     3pib 3 0.80  0.59     Mu    0.4   109  0.56   054
2                                      ing 2 0.62  0.86    108    O71   189  0.70  0.69
3                                      Lin) 3 121   156    173   2.17   190   142  1.50
4                                      ink 4 1.69   187   1.70   2.53   308  1.16    14
5                                      ini 5 2.61  5.26   2.73   3.40   493  2.22  2.22
6                                     Lipp 7 1.80  3.25    195     un   233  2.43   253
7                                    Ipph "1 5.14  3.91   3.27    144    43  4.00  0.53
8                                     Ipht 1 2.09  2.39   4.68    123    42   120  1.20
9                                     Iphg 5 3.52   537    487   0.46   420   107   108
10                                    2epp 3 3.40  2.48    04d   2.53   349  3.26  3.27
11                                    Inse 2 1.40  4.86   6.00    180   102   147  1.40
12                                    Insd n 1.20   451    156   1.04   096   18s   18s
13                                   Innb nl 0.92   451   0.92  1.08,   034  1.67  3.97
14                                    lebx 5 1.33  3.13    132   0.82   187  0.62  0.62
15                                    Bepa 8 2.22  6.48    151     on    87  2.22  2.22
16                                    Gepa 16 830   830   9.83   1.60   496  4.00  4.00
17                                    labe 4 0.16   187    OSS    036   ois  0.56  0.56
18                                    labf 5 0.48  3.25   0.76   0.61   030  0.68  0.70
19                                    Sabp 6 0.48  3.89   4.68    oss   030  0.48   O51
20                                    letr 15 461  6.66   7.26   0.87   $90  1.09  1.09
21                                    lets B 5.06  3.93      2   6.22   230   197   197
22                                     lett n 812   133   6.24  0.98,   130  0.82  0.82
23                                    3tmn 10 4si  7.09    530    136   396  3.65  3.65
24                                     Stln 4 534   139    633    142   160   421   421
25                                    ima 20 8.72   778    451   2.60   ssa   221   224
26                                    apt 30 1.89  8.06   5.95   0.88   882  5.72  4.79
27                                   lapu 29 9.10   758    843   2.02  1070   132   132
28                                   2itb 1s 3.09   143   8.94   1.04    26  2.09  5.19
29                                     teil 6 581  2.78   3.52   2.00    04  1.86  1.86
30                                      lok 5 854  5.65    422   3.03   385  2.84  2.84
31                                    Lenx B 10.9   735    683   2.09   632  6.20  6.20
32
```

What do you think ?
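To make figures like the ~3 s measurement above comparable across machines, it helps to time the raw model call separately from pre/post-processing, with warmup runs and several repeats. A minimal stdlib timing harness (the `run_model` callable is a placeholder for the actual ONNX session call, not part of doctr):

```python
import statistics
import time


def benchmark(run_model, n_warmup=2, n_runs=10):
    """Time a model callable: a few warmup calls first (lazy init, caches),
    then repeated timed runs. Returns (mean seconds, stdev seconds)."""
    for _ in range(n_warmup):
        run_model()
    samples = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        run_model()
        samples.append(time.perf_counter() - t0)
    return statistics.mean(samples), statistics.stdev(samples)
```

Wrapping only the session call (e.g. `lambda: session.run(None, inputs)`) isolates the table-segmentation cost from image loading and dataframe assembly.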

charlesmindee commented 2 years ago

Hi @felixdittrich92,

Thanks for the benchmark! Does the ONNX model that takes ~3 s include any OCR work as well? (I understand it doesn't include Tesseract, but is there any other module apart from the raw TableNet?) If so, we should benchmark the table detection part alone; if it really is only the table detection module, it seems quite slow (we are aiming at ~1 s inference per page for our end-to-end pipeline on CPU, maybe more if the document is large), and we should see how we can optimize that.

Have a nice day! :smile:

felixdittrich92 commented 2 years ago

@charlesmindee Yes, currently the pure table segmentation needs ~3 s; that is why I wrote that model pruning, a smaller input size, or a teacher/student experiment could help optimize it. I currently have an internal problem to take care of, so I probably won't get to it in the near future (just like the restructuring issue #512). However, if you want, I can send you the dataset and the training scripts!?
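On the pruning idea mentioned above: the usual starting point is unstructured magnitude pruning, zeroing the smallest-magnitude weights so tensors keep their shape while becoming sparse. A framework-free sketch of the rule (in practice one would use `torch.nn.utils.prune` on the real model; this toy version only illustrates the selection criterion):

```python
def magnitude_prune(weights, sparsity):
    """Zero the fraction `sparsity` of entries with the smallest |w|.

    `weights` is a flat list of floats; returns a new list of the same
    length with the smallest-magnitude entries replaced by 0.0.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    k = int(len(weights) * sparsity)  # number of entries to zero out
    if k == 0:
        return list(weights)
    # Threshold = magnitude of the k-th smallest entry (ties may zero more).
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]
```

Pruning alone does not speed up dense inference; the gain comes from a sparse runtime or from combining it with a smaller input resolution as suggested.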

I wish you the same

charlesmindee commented 2 years ago

Hi @felixdittrich92,

It is absolutely not a problem if we don't take care of this in the near future. It would indeed be great for us if you could share the dataset/training scripts, but don't get too wrapped up in it!

Best!

felixdittrich92 commented 2 years ago

@charlesmindee You can download it (also my pretrained model) at Dataset_Model_Trained; tell me if you got it :) One thing: if you train this on a multi-GPU system, you have to set the world rank to zero before saving the model, or save from a checkpoint after training :)
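The multi-GPU caveat above boils down to: only one process should write the checkpoint, otherwise several ranks race on the same file. A minimal guard pattern, assuming the `RANK` environment variable that distributed launchers (e.g. torchrun) export per process; `save_fn` is a placeholder for `torch.save` or a Lightning checkpoint call:

```python
import os


def save_on_rank_zero(save_fn, path):
    """Call `save_fn(path)` only in the rank-0 process.

    Distributed launchers set RANK per worker; single-process runs have
    no RANK variable and default to rank 0, so they always save.
    Returns True if the checkpoint was written by this process.
    """
    rank = int(os.environ.get("RANK", "0"))
    if rank == 0:
        save_fn(path)
        return True
    return False
```

PyTorch Lightning handles this internally via its `rank_zero_only` utility; the sketch just shows why saving without such a guard misbehaves on multi-GPU setups.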

felixdittrich92 commented 4 months ago

Topic for contrib module