mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

Help with VGSL syntax #527

Closed lamaeldo closed 1 year ago

lamaeldo commented 1 year ago

Hello, I am trying to train a segmenter model on a large dataset of labelled tables. The best models I have managed to train are very good at identifying the text segments that make up the cells (with a good baseline and bounding box), but they sometimes merge two horizontally adjacent cells into one text line. This happens because the text can get relatively close to the vertical column separator, and that separator is sometimes quite dim on my scans. However, the width of the text lines is very similar from cell to cell across my dataset. I was thus wondering whether there is any way to define a VGSL architecture that could incorporate some "geometric" knowledge of the usual dimensions of the text lines and their relative locations (again, my dataset consists of essentially one type of highly standardized document). I have tried pretty much all the segmenter architectures I was able to find online, and the following worked best:

[1,1200,0,3 Cr7,7,64,2,2 Gn32 Cr3,3,128,2,2 Gn32 Cr3,3,128 Gn32 Cr3,3,256 Gn32 Cr3,3,256 Gn32 Lbx32 Lby32 Cr1,1,32 Gn32 Lby32 Lbx32 Cr3,3,32 Gn32 Cr3,3,64 Gn32]

Here are two examples of the tables I am working with: here

Thanks in advance!

mittagessen commented 1 year ago

You probably won't get large changes in precision by changing the architecture. There's always the possibility of increasing the input size or the final size of the feature map, which tends to increase separation a bit. So if you've got the GPU memory, either change the input block to something like 1,1800,0,3 or remove the dilation from the downsampling convolutional blocks (Cr...,2,2 to Cr...).
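To make the two suggestions above concrete, here is a minimal sketch in plain Python (it does not use kraken itself) that estimates how far a spec shrinks the input. It assumes that each trailing `,2,2` on a `C` block divides the feature map height and width by that factor, which is how the thread describes it. The thread calls these values dilation, while kraken's VGSL documentation appears to describe the optional trailing pair as strides. The helper names `downsampling` and `approx_feature_height` are hypothetical, and the ceiling division is only an approximation of the real output size.

```python
import math
import re

# Matches a convolution block such as "Cr7,7,64" or "Cr3,3,128,2,2".
# Groups 4 and 5 are the optional trailing y/x factors.
CONV = re.compile(r"^C[a-z]?(\d+),(\d+),(\d+)(?:,(\d+),(\d+))?$")

# Spec taken from the original question in this thread.
SPEC = ("[1,1200,0,3 Cr7,7,64,2,2 Gn32 Cr3,3,128,2,2 Gn32 Cr3,3,128 Gn32 "
        "Cr3,3,256 Gn32 Cr3,3,256 Gn32 Lbx32 Lby32 Cr1,1,32 Gn32 "
        "Lby32 Lbx32 Cr3,3,32 Gn32 Cr3,3,64 Gn32]")


def downsampling(spec):
    """Return the combined (y, x) reduction factor of all C blocks (hypothetical helper)."""
    fy = fx = 1
    for block in spec.strip("[]").split():
        match = CONV.match(block)
        # Only blocks with an explicit trailing pair contribute a factor.
        if match and match.group(4):
            fy *= int(match.group(4))
            fx *= int(match.group(5))
    return fy, fx


def approx_feature_height(input_height, spec):
    """Approximate feature map height; assumes padded convolutions (an assumption)."""
    fy, _ = downsampling(spec)
    return math.ceil(input_height / fy)


print(downsampling(SPEC))                  # (4, 4)
print(approx_feature_height(1200, SPEC))   # 300
print(approx_feature_height(1800, SPEC))   # 450

# Dropping the trailing pair from the first block halves the reduction.
no_first = SPEC.replace("Cr7,7,64,2,2", "Cr7,7,64")
print(approx_feature_height(1200, no_first))  # 600
```

Under these assumptions, a taller input and a smaller reduction factor both yield a larger final feature map, at the cost of more GPU memory.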

lamaeldo commented 1 year ago

Thanks for the advice! Tweaking the dilation improved the precision a lot in my case.

rohanchn commented 1 year ago

Hi @lamaeldo, did you remove the dilation or change it to some other value? I'm asking because this may be helpful to me in one of my projects.

lamaeldo commented 1 year ago

So within the constraint of my 12 GB of VRAM, I found that removing the first dilation altogether worked best! If my understanding is correct, changing the dilation factor from 2 to 1 (in my case) would have been equivalent to removing the dilation from the convolution completely. I ended up with:

[1,1200,0,3 Cr7,7,64 Gn32 Cr3,3,128,2,2 Gn32 Cr3,3,128 Gn32 Cr3,3,256 Gn32 Cr3,3,256 Gn32 Lbx32 Lby32 Cr1,1,32 Gn32 Lby32 Lbx32 Cr3,3,32 Gn32 Cr3,3,64 Gn32]

mittagessen commented 1 year ago

"If my understanding is correct, changing the dilation factor from 2 to 1 (in my case) would have been equivalent to removing the dilation out of the convolution completely"

That is indeed correct.
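As a small illustration of the exchange above, the following plain-Python sketch compares the two specs from this thread block by block and treats an explicit trailing `,1,1` as identical to omitting the pair, as confirmed above. It only manipulates strings and does not call kraken. The helper names `normalise` and `changed_blocks` are hypothetical.

```python
import re

# Spec from the original question.
BEFORE = ("[1,1200,0,3 Cr7,7,64,2,2 Gn32 Cr3,3,128,2,2 Gn32 Cr3,3,128 Gn32 "
          "Cr3,3,256 Gn32 Cr3,3,256 Gn32 Lbx32 Lby32 Cr1,1,32 Gn32 "
          "Lby32 Lbx32 Cr3,3,32 Gn32 Cr3,3,64 Gn32]")

# Spec lamaeldo ended up with.
AFTER = ("[1,1200,0,3 Cr7,7,64 Gn32 Cr3,3,128,2,2 Gn32 Cr3,3,128 Gn32 "
         "Cr3,3,256 Gn32 Cr3,3,256 Gn32 Lbx32 Lby32 Cr1,1,32 Gn32 "
         "Lby32 Lbx32 Cr3,3,32 Gn32 Cr3,3,64 Gn32]")


def normalise(block):
    """Drop an explicit trailing ',1,1' from a C block (hypothetical helper)."""
    return re.sub(r"^(C[a-z]?\d+,\d+,\d+),1,1$", r"\1", block)


def changed_blocks(before, after):
    """List (index, old, new) for blocks that differ; assumes equal block counts."""
    old = [normalise(b) for b in before.strip("[]").split()]
    new = [normalise(b) for b in after.strip("[]").split()]
    return [(i, o, n) for i, (o, n) in enumerate(zip(old, new)) if o != n]


# Only the first convolution differs between the two specs.
print(changed_blocks(BEFORE, AFTER))  # [(1, 'Cr7,7,64,2,2', 'Cr7,7,64')]

# Writing the factor as 1 instead of removing it normalises to the same spec.
explicit = AFTER.replace("Cr7,7,64 ", "Cr7,7,64,1,1 ")
print(changed_blocks(explicit, AFTER))  # []
```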

rohanchn commented 1 year ago

Thanks, this is very helpful! I will try this tomorrow.