mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

ketos linegen CLI -d is ambiguous #306

Closed. bertsky closed this issue 2 years ago.

bertsky commented 2 years ago

In ketos linegen, you currently have:

  -d, --disable-degradation       Dont degrade output lines.

  -d, --distort FLOAT             Mean of folded normal distribution to take
                                  distortion values from

You might want to rename one, e.g. -D.
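
For illustration, a minimal Click sketch of how the disambiguation could look (hypothetical code, not kraken's actual source; option names and defaults are just placeholders):

    import click

    @click.command()
    # hypothetical fix: give the degradation switch its own short flag (-D)
    # so it no longer collides with -d/--distort
    @click.option('-D', '--disable-degradation', is_flag=True,
                  help="Don't degrade output lines.")
    @click.option('-d', '--distort', type=float, default=1.0,
                  help='Mean of folded normal distribution to take distortion values from')
    def linegen(disable_degradation, distort):
        """Stub command showing the disambiguated flags."""
        click.echo(f'degradation disabled: {disable_degradation}, distort: {distort}')

    if __name__ == '__main__':
        linegen()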

mittagessen commented 2 years ago

The module hasn't been touched in a long time and should definitely be revisited. At least with the older shallow network architecture synthetic data didn't actually work in improving or even bootstrapping a rough working model.

bertsky commented 2 years ago

The module hasn't been touched in a long time and should definitely be revisited. At least with the older shallow network architecture synthetic data didn't actually work in improving or even bootstrapping a rough working model.

Ah, good to know. And that applies to handwriting, or print, or both?

Also, what exactly is shallow for you here (or what is deep)? For example, Tesseract's default 1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx192 seems much less wide and deep compared with other systems' defaults (IIUC):

Assuming I got that right, where would Kraken's old and new default fit in?

mittagessen commented 2 years ago

That was only for print and with the non-pytorch single BiLSTM layer model.

The current default is [1,48,0,1 Cr4,2,32,4,2 Gn32 Cr4,2,64,1,1 Gn32 Mp4,2,4,2 Cr3,3,128,1,1 Gn32 Mp1,2,1,2 S1(1x0)1,3 Lbx256 Do0.5 Lbx256 Do0.5 Lbx256 Do0.5] so somewhere between Burghardt and Puigcerver. The one we use for most handwriting is [1,120,0,1 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 S1(1x0)1,3 Lbx200 Do0.1,2 Lbx200 Do.1,2 Lbx200 Do] but that one has the drawback that it doesn't converge for small datasets (which is the reason we haven't made it the default yet despite reducing CER by ~75% for handwriting).
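
If anyone wants to poke at these specs directly, here is a small sketch (assuming kraken's kraken.lib.vgsl.TorchVGSLModel still builds a network straight from a VGSL spec string, which is what ketos does internally) that instantiates the current default:

    # sketch: building the default spec quoted above with kraken's VGSL builder
    from kraken.lib import vgsl

    default_spec = ('[1,48,0,1 Cr4,2,32,4,2 Gn32 Cr4,2,64,1,1 Gn32 Mp4,2,4,2 '
                    'Cr3,3,128,1,1 Gn32 Mp1,2,1,2 S1(1x0)1,3 '
                    'Lbx256 Do0.5 Lbx256 Do0.5 Lbx256 Do0.5]')

    model = vgsl.TorchVGSLModel(default_spec)
    print(model.spec)  # the spec string kraken stores with the model
    # model.nn is the underlying torch module, so e.g. the parameter count is
    print(sum(p.numel() for p in model.nn.parameters()))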

Tesseract is a bit weird. After writing the initial VGSL implementation I tried to use Tesseract's specs, replicating their hyperparameters as much as possible, but I never got anything with LfysXX layers to even remotely reproduce their numbers. It has to be said though that Tesseract's training procedure is decidedly non-standard: there's backtracking on plateaus, a per-layer LR heuristic in addition to what Adam does, a weird CTC implementation that mirrors the Breuelian formulation a bit, and all kinds of custom bits and bobs. Even just the summarizing LSTM layers by themselves are esoteric enough that I haven't seen anyone else using them.

EDIT: On a test set (print, polytonic Greek, single font, 2.5k lines, binary) I get 99.4% character accuracy with a summarizing layer and 99.7% with the large configuration.

bertsky commented 2 years ago

Interesting, thanks!

I have not looked much into Tesseract's training procedure yet (good to know). Its many other performance optimizations already make it impossible to compare and reproduce precisely, I'm afraid.

but I never got anything with LfysXX layers to even remotely reproduce their numbers [...] Even just the summarizing LSTM layers by themselves are esoteric enough that I haven't seen anyone else using them.

Do you mean the "implicit baseline normalization" (described here p.21)? Perhaps other systems either rely on explicit dewarping, or use 2DLSTMs, or simply try to compensate with larger input height? But your last edit suggests you did apply this successfully – so how does it compare to the same config without LfysXX?

mittagessen commented 2 years ago

I have not looked much into Tesseract's training procedure yet (good to know). Its many other performance optimizations already make it impossible to compare and reproduce precisely, I'm afraid.

Yeah, I was only talking about the training procedure itself. From there it keeps being weird with their CTC decoder and this thing that's similar to the many-to-many codecs kraken has, but all mixed into one code blob.

Do you mean the "implicit baseline normalization" (described here p.21)? Perhaps other systems either rely on explicit dewarping, or use 2DLSTMs, or simply try to compensate with larger input height? But your last edit suggests you did apply this successfully – so how does it compare to the same config without LfysXX?

They are a bit independent, although spatial normalization is probably one of the ideas behind the summarizing layers. Any architecture with sufficient power will be able to generalize across baseline deviations (input height has nothing to do with it). They probably put that in the presentation because Thomas Breuel was at Google at the time and the old ocropus had this heuristic CenterLineNormalizer. For Tesseract it is a bit of a moot point I guess as their line extractor is so old it can't find anything but the straightest of lines anyway.

OCR systems using the baseline paradigm for segmentation get auto-normalized lines for recognition, as you can just map the baseline into the plane with a piecewise affine transform, which works well even for extreme curvatures, while the implicit (or network-internal, such as STNs) approaches have limits.
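
To make the piecewise-affine idea concrete, here is a rough sketch (my own illustration with scikit-image and made-up parameters, not kraken's code) that straightens a line image from sampled baseline points; the control columns are taken vertically rather than normal to the baseline, which is good enough for moderate curvature:

    import numpy as np
    from skimage.transform import PiecewiseAffineTransform, warp

    def straighten_line(img, baseline, above=48, below=16):
        """img: HxW grayscale array; baseline: (x, y) points ordered left to right."""
        baseline = np.asarray(baseline, dtype=float)
        # cumulative arc length along the baseline becomes the output x axis
        seg = np.hypot(np.diff(baseline[:, 0]), np.diff(baseline[:, 1]))
        xs = np.concatenate(([0.0], np.cumsum(seg)))
        # control points: a vertical column around each baseline sample in the
        # source maps to a straight column in the output, baseline at row `above`
        offsets = np.linspace(-above, below, 8)
        src, dst = [], []
        for (bx, by), x in zip(baseline, xs):
            for o in offsets:
                src.append((x, above + o))   # output (col, row) coordinates
                dst.append((bx, by + o))     # corresponding input coordinates
        tform = PiecewiseAffineTransform()
        tform.estimate(np.array(src), np.array(dst))  # maps output coords -> input coords
        return warp(img, tform, output_shape=(above + below + 1, int(np.ceil(xs[-1]))))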

For comparison, I get (character accuracy on the Greek print set):

and the second one converges a lot slower (ca. epoch 50, in contrast to 30 for the other architecture).

bertsky commented 2 years ago

input height has nothing to do with it

I disagree: if you do normalize ("deslope"/dewarp) the baseline in advance, then the same height contains more information. And if you rely on vertical summarization to do the job implicitly, then you obviously need larger height in the input.

They probably put that in the presentation because Thomas Breuel was at Google at the time and the old ocropus had this heuristic CenterLineNormalizer.

It does not contain that kind of code, though.

For Tesseract it is a bit of a moot point I guess as their line extractor is so old it can't find anything but the straightest of lines anyway.

Right, but that probably does not matter much, because you can do line detection externally (and during training you can still augment by warping).

OCR systems using the baseline paradigm for segmentation get auto-normalized lines for recognition, as you can just map the baseline into the plane with a piecewise affine transform, which works well even for extreme curvatures, while the implicit (or network-internal, such as STNs) approaches have limits.

I agree, external/explicit dewarping is probably more robust (but let's see how the new transformer / multi-head self-attention architectures fare).

For comparison, I get (character accuracy on the Greek print set): [...] and the second one converges a lot slower (ca. epoch 50, in contrast to 30 for the other architecture).

I see – thanks! (Perhaps the vertical summary could be trained/regulated specially to converge faster?)

mittagessen commented 2 years ago

I disagree: if you do normalize ("deslope"/dewarp) the baseline in advance, then the same height contains more information. And if you rely on vertical summarization to do the job implicitly, then you obviously need larger height in the input.

OK, I formulated that badly. For some material larger input heights do give better results (and we've seen that for many Hebrew manuscripts), but I don't believe this to be related to any improved capability to compensate for baseline position. I'm pulling this out of my ass, but naïvely I'd expect implicit baseline compensation to improve with additional contextual information and not necessarily just by having the same information at a higher resolution (in fact it could be detrimental, as the receptive field of the convolutional stack is limited).
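
As a quick sanity check of that last point, a back-of-the-envelope sketch (my own arithmetic; the kernel/stride pairs are read off the default spec quoted earlier, assuming I'm parsing the VGSL conv/pool parameters correctly):

    def receptive_field(layers):
        """layers: (kernel, stride) pairs along one axis, in network order."""
        rf, jump = 1, 1
        for kernel, stride in layers:
            rf += (kernel - 1) * jump
            jump *= stride
        return rf

    # vertical axis of Cr4,2,32,4,2  Cr4,2,64,1,1  Mp4,2,4,2  Cr3,3,128,1,1  Mp1,2,1,2
    print(receptive_field([(4, 4), (4, 1), (4, 4), (3, 1), (1, 1)]))  # -> 60

so at 48 px input height the features going into the LSTM stack already cover the full line; scaling the input up without changing the convolutional part would shrink that relative coverage.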

It does not contain that kind of code, though.

Yes, as I said a lot of the ocropus-y features in that presentation never ended up in Tesseract.

(but let's see how the new transformer / multi-head self-attention architectures fare).

For now they mostly seem to require more training data for the same results, with slower inference. At least that's what the literature (and some quick experiments on my side) suggests.

I see – thanks! (Perhaps the vertical summary could be trained/regulated specially to converge faster?)

Yeah, I didn't fiddle around with the hyperparameters much. Doing hyperparameter search with kraken is a bit of a pain right now because the datasets load so slowly. It is entirely possible that Tesseract's explicit per-layer learning rates beyond what Adam does were added for those layers. But IDK, in the end you can probably get the exact same result with a stack of 1xX convolutional layers when using a fixed input height.
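
For what it's worth, the "explicit per-layer learning rates on top of Adam" bit maps onto plain PyTorch parameter groups; a minimal sketch (toy layers and made-up rates, nothing to do with Tesseract's or kraken's actual setup):

    import torch
    import torch.nn as nn

    # toy stand-in for a recognizer: conv stack plus a recurrent head
    convs = nn.Sequential(
        nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    )
    rnn = nn.LSTM(input_size=64, hidden_size=256, bidirectional=True)

    # per-layer learning rates via Adam's parameter groups: the recurrent part
    # gets a smaller rate than the convolutional stack
    optimizer = torch.optim.Adam([
        {'params': convs.parameters(), 'lr': 1e-3},
        {'params': rnn.parameters(), 'lr': 3e-4},
    ])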