Open felixdittrich92 opened 3 years ago
Hi @felixdittrich92,
Thanks for bringing this to the table; it is a very interesting and useful feature. It would be interesting to integrate such a model in doctr, but we need to think about the global architecture: should it be a separate model (no shared features) from our detection + recognition pipeline, which would certainly slow down the end-to-end prediction, or should it be integrated into the detection predictor to maximize feature sharing?
To answer this question we can look at the speed of your model. Can you benchmark it on your side?
If it is fast enough, we can start by implementing it separately in a new module, and it will run independently from the main pipeline. We can first implement the model in PyTorch as you suggested, provide a pretrained version (.pt) in the config, and tackle the dataset/training script integration later on!
Have a nice day! :smile:
@charlesmindee yes, I will do it, I think later today :) I wish you the same. I have attached the TensorBoard logs if you want to take a look: version_0.zip
@charlesmindee on an 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz, the ONNX model takes ~3-3.5 s without the (Tesseract) OCR step. (Tomorrow I can also test the pure .pt model if you want.) I think optimizations are still possible, such as smaller input sizes or model pruning. Sample output:
0 1 2 3 4 5 6
0 Protein-ligand Complex #rotable bonds stoDock Dock FlexX ICM GOLD T10 120
1 3pib 3 0.80 0.59 Mu 0.4 109 0.56 054
2 ing 2 0.62 0.86 108 O71 189 0.70 0.69
3 Lin) 3 121 156 173 2.17 190 142 1.50
4 ink 4 1.69 187 1.70 2.53 308 1.16 14
5 ini 5 2.61 5.26 2.73 3.40 493 2.22 2.22
6 Lipp 7 1.80 3.25 195 un 233 2.43 253
7 Ipph "1 5.14 3.91 3.27 144 43 4.00 0.53
8 Ipht 1 2.09 2.39 4.68 123 42 120 1.20
9 Iphg 5 3.52 537 487 0.46 420 107 108
10 2epp 3 3.40 2.48 04d 2.53 349 3.26 3.27
11 Inse 2 1.40 4.86 6.00 180 102 147 1.40
12 Insd n 1.20 451 156 1.04 096 18s 18s
13 Innb nl 0.92 451 0.92 1.08, 034 1.67 3.97
14 lebx 5 1.33 3.13 132 0.82 187 0.62 0.62
15 Bepa 8 2.22 6.48 151 on 87 2.22 2.22
16 Gepa 16 830 830 9.83 1.60 496 4.00 4.00
17 labe 4 0.16 187 OSS 036 ois 0.56 0.56
18 labf 5 0.48 3.25 0.76 0.61 030 0.68 0.70
19 Sabp 6 0.48 3.89 4.68 oss 030 0.48 O51
20 letr 15 461 6.66 7.26 0.87 $90 1.09 1.09
21 lets B 5.06 3.93 2 6.22 230 197 197
22 lett n 812 133 6.24 0.98, 130 0.82 0.82
23 3tmn 10 4si 7.09 530 136 396 3.65 3.65
24 Stln 4 534 139 633 142 160 421 421
25 ima 20 8.72 778 451 2.60 ssa 221 224
26 apt 30 1.89 8.06 5.95 0.88 882 5.72 4.79
27 lapu 29 9.10 758 843 2.02 1070 132 132
28 2itb 1s 3.09 143 8.94 1.04 26 2.09 5.19
29 teil 6 581 2.78 3.52 2.00 04 1.86 1.86
30 lok 5 854 5.65 422 3.03 385 2.84 2.84
31 Lenx B 10.9 735 683 2.09 632 6.20 6.20
32
What do you think?
Hi @felixdittrich92,
Thanks for the benchmark. Does the ONNX model that takes 3 s to run include the OCR task as well? (I understand that it doesn't include Tesseract, but is there any other module apart from the raw TableNet?) If so, we should benchmark the table detection part alone. If it is only the table detection module, it seems quite slow (we are aiming at ~1 s inference per page for our end-to-end pipeline on CPU, maybe more if the document is large), and we should see how we can optimize that.
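To make the comparison against the ~1 s target concrete, here is a minimal timing sketch. It uses only the standard library; `fake_table_segmentation` is a hypothetical stand-in for the real forward pass (ONNX or .pt), and the warm-up and run counts are arbitrary assumptions, not anything prescribed by doctr.

```python
import statistics
import time
from typing import Callable, Dict, List

# Target taken from the discussion: roughly 1 s per page on CPU.
TARGET_SECONDS = 1.0


def benchmark(fn: Callable[[], object], warmup: int = 2, runs: int = 10) -> Dict[str, float]:
    """Time a zero-argument callable and return latency statistics in seconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are executed but not timed
    timings: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(timings),
        "median": statistics.median(timings),
        "min": min(timings),
        "max": max(timings),
    }


def fake_table_segmentation() -> None:
    # Hypothetical placeholder: replace with the actual model call on one page.
    time.sleep(0.01)


if __name__ == "__main__":
    stats = benchmark(fake_table_segmentation, warmup=1, runs=5)
    verdict = "within" if stats["median"] <= TARGET_SECONDS else "above"
    print(f"median {stats['median']:.3f} s ({verdict} the {TARGET_SECONDS} s target)")
```

Wrapping the segmentation call and the OCR call in separate callables and timing each one would show which stage dominates the ~3 s.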
Have a nice day! :smile:
@charlesmindee yes, currently the pure table segmentation needs ~3 s. That is why I mentioned model pruning and a smaller input size; a teacher/student experiment or something similar could also help to optimize it. I currently have an internal problem to take care of, so I probably won't get to it in the near future (same as with the reorganization issue #512). However, if you want, I can send you the dataset and the training scripts.
I wish you the same
Hi @felixdittrich92,
It is absolutely not a problem if we don't take care of this in the near future. It would indeed be great for us if you could share the dataset/training scripts, but don't get too wrapped up in it!
Best!
@charlesmindee you can download it (along with my pretrained model) at Dataset_Model_Trained. Tell me if you got it :) One thing: if you train this on a multi-GPU system, before saving the model you have to set the world rank to zero, or save it after training from a checkpoint :)
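One possible reading of the multi-GPU note above is that only the rank-zero process should write the model file. A minimal sketch of such a guard follows; it assumes the launcher exports a `RANK` environment variable (as `torchrun` does), and `save_on_rank_zero` and `save_fn` are hypothetical names used for illustration, not part of the shared training scripts.

```python
import os
from typing import Callable


def is_rank_zero() -> bool:
    """Return True on the main process.

    Assumption: the launcher exports RANK; a missing variable is treated
    as a single-process run, i.e. rank 0.
    """
    return int(os.environ.get("RANK", "0")) == 0


def save_on_rank_zero(save_fn: Callable[[str], None], path: str) -> bool:
    """Call save_fn(path) only on rank 0; return whether a save happened."""
    if not is_rank_zero():
        return False
    save_fn(path)
    return True


if __name__ == "__main__":
    # Hypothetical save function; in practice this would wrap the real model export.
    def dummy_save(path: str) -> None:
        print(f"saving model to {path}")

    save_on_rank_zero(dummy_save, "tablenet.pt")
```

The alternative mentioned in the comment, exporting from a checkpoint after training has finished, avoids the guard entirely because the export then runs in a single process.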
Topic for contrib
module
Add a TableNet model to extract tabular data as a dataframe from images. I have a ready-to-use model (.pt) trained on the Marmot dataset and need a bit of guidance on where to add it, preferably as ONNX. For self-training I can also add it to the references, and the same for the dataset, but only in PyTorch (Lightning).
To be done after the restructuring / hOCR PDF/A export. @fg-mindee @charlesmindee