mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

Ability to deploy UFCN configuration on Kraken for segmentation ? #439

Closed PonteIneptique closed 7 months ago

PonteIneptique commented 1 year ago

Hey there :) I have been looking at the paper from Teklia, which uses Doc-UFCN (https://gitlab.com/teklia/dla/doc-ufcn/-/blob/main/doc_ufcn/model.py) and mAP for evaluating line segmentation. I was wondering whether, in the current state of Kraken, Teklia's configuration is implementable. I am unsure VGSL supports all the operations used in the Doc-UFCN model and I'd like your two cents on this. As another question: would you be open to me proposing a mAP metric (which would mean post-processing at dev time) for line segmentation?

PonteIneptique commented 1 year ago

Paper I'm referring to: https://teklia.com/research/publications/boillet2022/

mittagessen commented 1 year ago

On 23/02/15 11:07AM, Thibault Clérice wrote:

I was wondering whether, in the current state of Kraken, Teklia's configuration is implementable. I am unsure VGSL supports all the operations used in the Doc-UFCN model and I'd like your two cents on this.

A simple U-Net like theirs should be implementable in VGSL if we add a transposed conv layer, but it won't be pretty (lots of nested blocks). In fact the first BLLA implementation was using a U-Net, but the ReNet-style network had a much lower memory footprint. While U-Nets can be patched to trade memory for computational time, we (and other implementations, like the one around OCR-D, did so) found them to be much more sensitive to scale when used that way.

As another question: would you be open to me proposing a mAP metric (which would mean post-processing at dev time) for line segmentation?

Sure. I had a plan to implement the loss metric from the transformer segmenter instead but it wasn't particularly high priority. If you only run the vectorizer it shouldn't be inordinately slow.

PonteIneptique commented 1 year ago

A simple U-Net like theirs should be implementable in VGSL if we add a ...

OK. I'll benchmark their tool first before looking into how to do this for Kraken.

Sure. I had a plan to implement the loss metric from the transformer segmenter instead but it wasn't particularly high priority. If you only run the vectorizer it shouldn't be inordinately slow.

Will do in the coming weeks then

mittagessen commented 1 year ago

I just skimmed over the paper right now and you wouldn't really be able to compute mAP out of the box, as what they're doing in the article is computing the metric on a line mask. We have those, but they don't necessarily correspond to actual segmentation quality, as these buffered baseline blobs go through the postprocessing to convert them into area-less baselines. You can of course always just threshold the network output and run their objectness algorithm...
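
For context, a rough sketch of what such a mask-based evaluation amounts to (purely illustrative, not kraken or Doc-UFCN code; the pred_prob and gt_label_map inputs are hypothetical): threshold the probability map, take connected components as predicted objects, and match them against ground-truth line blobs by IoU. A full mAP would additionally rank predictions by confidence; this only shows the matching step.

import numpy as np
from scipy import ndimage

def blob_iou_matrix(pred_prob, gt_label_map, threshold=0.5):
    """IoU between thresholded prediction blobs and ground-truth line blobs.

    pred_prob: HxW probability map for the line class (hypothetical input).
    gt_label_map: HxW int array, 0 = background, 1..N = ground-truth lines.
    """
    pred_labels, n_pred = ndimage.label(pred_prob > threshold)
    n_gt = int(gt_label_map.max())
    ious = np.zeros((n_pred, n_gt))
    for i in range(1, n_pred + 1):
        p = pred_labels == i
        for j in range(1, n_gt + 1):
            g = gt_label_map == j
            union = np.logical_or(p, g).sum()
            ious[i - 1, j - 1] = np.logical_and(p, g).sum() / union if union else 0.0
    return ious

def matched_fraction(ious, iou_thresholds=np.arange(0.5, 1.0, 0.05)):
    """Fraction of predicted blobs matching a distinct GT blob, averaged over
    IoU thresholds (greedy one-to-one matching); a crude stand-in for mAP."""
    if ious.size == 0:
        return 0.0
    scores = []
    for t in iou_thresholds:
        taken, hits = set(), 0
        for row in ious:
            j = int(row.argmax())
            if row[j] >= t and j not in taken:
                taken.add(j)
                hits += 1
        scores.append(hits / max(len(ious), 1))
    return float(np.mean(scores))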

What I had originally in mind was to run the normal vectorizer, convert the lines (and ground truth) to the cubic Bézier curve format of curt, run the Hungarian matcher, and then compute whatever normalized distance metric with that. That should correspond more closely to actual segmentation quality than pixel metrics.
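
A minimal sketch of that matching step, using plain point polylines instead of the cubic Bézier representation (the pred_lines/gt_lines inputs and the distance function are stand-ins, not kraken code):

import numpy as np
from scipy.optimize import linear_sum_assignment

def match_baselines(pred_lines, gt_lines):
    """Hungarian matching between predicted and ground-truth baselines.

    pred_lines / gt_lines: lists of (N, 2) float arrays of baseline points.
    Returns the matched index pairs and their distances.
    """
    def line_distance(a, b):
        # symmetric mean closest-point distance; a real metric would operate
        # on the Bézier control points instead
        d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
        return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())

    cost = np.array([[line_distance(p, g) for g in gt_lines] for p in pred_lines])
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist())), cost[rows, cols]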

notiho commented 1 year ago

If anyone is interested in trying out U-net in kraken, I just made #499

PonteIneptique commented 1 year ago

@notiho Looks cool :) Looking forward to the merge :)

PonteIneptique commented 1 year ago

If I look at doc_ufcn, it seems that some behaviours are not "available":

- For Concat, I see how to append it
- For BatchNorm, we could have Cnl (normalized ReLU)
- For ForcedInputSize... I don't know

And finally, I think we'll need to have a naming option for groups...

What do you think @mittagessen ?

mittagessen commented 1 year ago

There's concatenation with parallel layers (the (...) blocks). The order of the BatchNorm doesn't change anything, and in general you won't see a large difference between the different normalization layers (GroupNorm, InstanceNorm, LayerNorm, BatchNorm) for this task.
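
For illustration (not kraken code), the four normalization layers are drop-in replacements for each other on an NCHW feature map, which is part of why the choice matters little here:

import torch
import torch.nn as nn

x = torch.randn(1, 32, 64, 64)  # N, C, H, W feature map
norms = {
    "batch": nn.BatchNorm2d(32),
    "instance": nn.InstanceNorm2d(32),
    "group": nn.GroupNorm(num_groups=8, num_channels=32),
    "layer": nn.GroupNorm(num_groups=1, num_channels=32),  # LayerNorm over C, H, W
}
for name, layer in norms.items():
    print(name, layer(x).shape)  # all preserve the (1, 32, 64, 64) shape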

I don't know what you mean with forced input size but you can just set the VGSL input definition to whatever image/patch size you want.

PonteIneptique commented 1 year ago

I don't know what you mean with forced input size but you can just set the VGSL input definition to whatever image/patch size you want.

I am stupid, I mistook the number of channels for the image size while reading your code and the UFCN implementation.

PonteIneptique commented 1 year ago

My bad...

notiho commented 1 year ago

Here is my vgsl spec for a UFCN style net (with the limitation regarding normalization you observed): https://gist.github.com/notiho/53acb1db6fa885e206ee3caa728ccfe3

notiho commented 1 year ago

(Spaces replaced by newlines for better legibility)

PonteIneptique commented 1 year ago

Ok, I might be wrong, but I think Doc-UFCN looks like this (architecture diagram attachment omitted):

[
    1,768,0,3 

    Cr{Dil1Pad1}3,3,32,1,1,1,1 Gn32 Do.4 
    Cr{Dil1Pad2}3,3,32,1,1,2,2 Gn32 Do.4
    Cr{Dil1Pad4}3,3,32,1,1,4,4 Gn32 Do.4
    Cr{Dil1Pad8}3,3,32,1,1,8,8 Gn32 Do.4
    Cr{Dil1Pad16}3,3,32,1,1,16,16 Gn32 Do.4

    (
        I 

        [
            Mp{Dil2Start}2,2
            Cr{Dil2Pad1}3,3,64,1,1,1,1 Gn64 Do.4 
            Cr{Dil2Pad2}3,3,64,1,1,2,2 Gn64 Do.4
            Cr{Dil2Pad4}3,3,64,1,1,4,4 Gn64 Do.4
            Cr{Dil2Pad8}3,3,64,1,1,8,8 Gn64 Do.4
            Cr{Dil2Pad16}3,3,64,1,1,16,16 Gn64 Do.4

            (

                I

                [

                    Mp{Dil3Start}2,2
                    Cr{Dil3Pad1}3,3,128,1,1,1,1 Gn128 Do.4 
                    Cr{Dil3Pad2}3,3,128,1,1,2,2 Gn128 Do.4
                    Cr{Dil3Pad4}3,3,128,1,1,4,4 Gn128 Do.4
                    Cr{Dil3Pad8}3,3,128,1,1,8,8 Gn128 Do.4
                    Cr{Dil3Pad16}3,3,128,1,1,16,16 Gn128 Do.4

                    (
                        I

                        [
                            Mp{Dil4Start}2,2
                            Cr{Dil4Pad1}3,3,256,1,1,1,1 Gn256 Do.4 
                            Cr{Dil4Pad2}3,3,256,1,1,2,2 Gn256 Do.4
                            Cr{Dil4Pad4}3,3,256,1,1,4,4 Gn256 Do.4
                            Cr{Dil4Pad8}3,3,256,1,1,8,8 Gn256 Do.4
                            Cr{Dil4Pad16}3,3,256,1,1,16,16 Gn256 Do.4

                            Cr{Conv1Out128}3,3,128,1,1,1,1 Gn128 Do.4 CTr2,2,128,2,2,1,1 Gn128 Do.4 
                        ]

                    )

                    Cr{Conv2Out64}3,3,64,1,1,1,1 Gn64 Do.4 CTr2,2,64,2,2,1,1 Gn64 Do.4 
                ]

            )

            Cr{Conv3Out32}3,3,32,1,1,1,1 Gn32 Do.4 CTr2,2,32,2,2,1,1 Gn32 Do.4
        ]

    )
    Cr{LastConv}3,3,$ClassCount,3,1,1,1,1
]

Note the $ClassCount in the model spec for the last conv, which I did not know how to deal with.
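
As an aside, and assuming I read the kraken API correctly (the TorchVGSLModel usage below is an assumption, not something stated in this thread), $ClassCount is not VGSL syntax, so it would have to be substituted before instantiating the spec, e.g.:

# Assumption: kraken.lib.vgsl.TorchVGSLModel accepts a plain VGSL spec string.
from kraken.lib import vgsl

# Deliberately tiny spec just to show the substitution; not the full net above.
spec_template = '[1,768,0,3 Cr3,3,32,1,1 Gn32 Cr{LastConv}3,3,$ClassCount,1,1]'
spec = spec_template.replace('$ClassCount', str(21))
net = vgsl.TorchVGSLModel(spec)
print(net.spec)  # assumed attribute echoing the parsed spec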

PonteIneptique commented 1 year ago

I have a weird error with the previous spec (adapted to a situation where I have 21 classes):

[1,768,0,3
Cr{Dil1Pad1}3,3,32,1,1,1,1
Gn32
Do.4
Cr{Dil1Pad2}3,3,32,1,1,2,2
Gn32
Do.4
Cr{Dil1Pad4}3,3,32,1,1,4,4
Gn32
Do.4
Cr{Dil1Pad8}3,3,32,1,1,8,8
Gn32
Do.4
Cr{Dil1Pad16}3,3,32,1,1,16,16
Gn32
Do.4
(I
[Mp{Dil2Start}2,2
Cr{Dil2Pad1}3,3,64,1,1,1,1
Gn64
Do.4
Cr{Dil2Pad2}3,3,64,1,1,2,2
Gn64
Do.4
Cr{Dil2Pad4}3,3,64,1,1,4,4
Gn64
Do.4
Cr{Dil2Pad8}3,3,64,1,1,8,8
Gn64
Do.4
Cr{Dil2Pad16}3,3,64,1,1,16,16
Gn64
Do.4
(I
[Mp{Dil3Start}2,2
Cr{Dil3Pad1}3,3,128,1,1,1,1
Gn128
Do.4
Cr{Dil3Pad2}3,3,128,1,1,2,2
Gn128
Do.4
Cr{Dil3Pad4}3,3,128,1,1,4,4
Gn128
Do.4
Cr{Dil3Pad8}3,3,128,1,1,8,8
Gn128
Do.4
Cr{Dil3Pad16}3,3,128,1,1,16,16
Gn128
Do.4
(I
[Mp{Dil4Start}2,2
Cr{Dil4Pad1}3,3,256,1,1,1,1
Gn256
Do.4
Cr{Dil4Pad2}3,3,256,1,1,2,2
Gn256
Do.4
Cr{Dil4Pad4}3,3,256,1,1,4,4
Gn256
Do.4
Cr{Dil4Pad8}3,3,256,1,1,8,8
Gn256
Do.4
Cr{Dil4Pad16}3,3,256,1,1,16,16
Gn256
Do.4
Cr{Conv1Out128}3,3,128,1,1,1,1
Gn128
Do.4
CTr2,2,128,2,2,1,1
Gn128
Do.4])
Cr{Conv2Out64}3,3,64,1,1,1,1
Gn64
Do.4
CTr2,2,64,2,2,1,1
Gn64
Do.4])
Cr{Conv3Out32}3,3,32,1,1,1,1
Gn32
Do.4
CTr2,2,32,2,2,1,1
Gn32
Do.4])
Cr{LastConv}3,3,21,3,1,1,1,1]

Kraken throws the following error:

│                          /env/lib/python3.10/site-packages/kraken/lib/layers.py:29 in forward    │
│                                                                                                  │
│    26 │   │   i = 0                                                                              │
│    27 │   │   for module in modules:                                                             │
│    28 │   │   │   if type(inputs) == tuple:                                                      │
│ ❱  29 │   │   │   │   inputs = module(*inputs, output_shape=output_shape if i == len(modules)    │
│    30 │   │   │   else:                                                                          │
│    31 │   │   │   │   inputs = module(inputs, output_shape=output_shape if i == len(modules) -   │
│    32 │   │   │   i += 1                                                                         │
│                                                                                                  │
│                          /env/lib/python3.10/site-packages/torch/nn/modules/module.py:1212 in    │
│ _call_impl                                                                                       │
│                                                                                                  │
│   1209 │   │   │   bw_hook = hooks.BackwardHook(self, full_backward_hooks)                       │
│   1210 │   │   │   input = bw_hook.setup_input_hook(input)                                       │
│   1211 │   │                                                                                     │
│ ❱ 1212 │   │   result = forward_call(*input, **kwargs)                                           │
│   1213 │   │   if _global_forward_hooks or self._forward_hooks:                                  │
│   1214 │   │   │   for hook in (*_global_forward_hooks.values(), *self._forward_hooks.values())  │
│   1215 │   │   │   │   hook_result = hook(self, input, result)                                   │
│                                                                                                  │
│                          /env/lib/python3.10/site-packages/kraken/lib/layers.py:51 in forward    │
│                                                                                                  │
│    48 │   │   │   │   outputs.append(module(inputs, output_shape=output_shape))                  │
│    49 │   │   │   if output_shape is None:                                                       │
│    50 │   │   │   │   output_shape = outputs[-1].shape[2:]                                       │
│ ❱  51 │   │   return torch.cat(outputs, dim=1), seq_lens                                         │
│    52                                                                                            │
│    53                                                                                            │
│    54 def PeepholeLSTMCell(input: torch.Tensor,                                                  │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 75 but got size 74 for tensor number 1 in the list.

PonteIneptique commented 1 year ago

It was tied to the maximum size, which is only applied to a single dimension. I moved the input to 768,768 and it works.

mittagessen commented 1 year ago

It looks like there's a rounding/padding calculation error somewhere in the code. The shapes are calculated according to the pytorch docs, but it is entirely possible that some implementations work slightly differently.
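
A minimal PyTorch repro of what such a mismatch looks like (illustrative only; the sizes are chosen to reproduce the 75 vs 74 error above):

import torch
import torch.nn as nn

# Width 150 follows from an aspect-ratio-preserving resize; 768 is fixed.
x = torch.randn(1, 32, 768, 150)
pool = nn.MaxPool2d(2, 2)
up = nn.ConvTranspose2d(32, 32, kernel_size=2, stride=2)

skip = pool(x)     # (1, 32, 384, 75)
down = pool(skip)  # (1, 32, 192, 37) <- floor division drops half a pixel
back = up(down)    # (1, 32, 384, 74) <- one short of the skip connection
torch.cat([skip, back], dim=1)  # RuntimeError: expected size 75 but got 74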

PonteIneptique commented 1 year ago

So, I think this is most probably due to how the input image is processed, but I could be wrong. Their implementation uses a common max dimension (https://gitlab.com/teklia/dla/doc-ufcn/-/blob/main/doc_ufcn/image.py#L8). From what I understand (I could be wrong?), Kraken forces ignoring the ratio through _fixed_resize, which might lead to very bad learning (because the shape is basically squared).

Current results with this architecture are awful btw (990 training files; line segmentation output image omitted).

As for VGSL, could you confirm that O{l_85}2l21] is not only a softmax but also a linear layer? (The original UFCN has a softmax directly after the CNN.)

mittagessen commented 1 year ago

If you leave one dimension free the ratio is preserved but as we can have arbitrarily sized filters it is possible that there's a rounding error in the size determination of some layers resulting in unexpected shapes in the concatenation layers. Their code gets around this by knowing the filter shapes in the network in advance and resizing to multiples of 8 to ensure no weird shapes.

Yes, all output layers are either a 1x1 convolution (2D) or a linear projection (1D) + activation.
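
In plain PyTorch terms (a sketch, not kraken's actual output layer code), the 2D case amounts to:

import torch
import torch.nn as nn

head = nn.Sequential(
    nn.Conv2d(32, 21, kernel_size=1),  # 1x1 projection to 21 class maps
    nn.Softmax(dim=1),                 # per-pixel class distribution
)
print(head(torch.randn(1, 32, 768, 768)).shape)  # torch.Size([1, 21, 768, 768])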

PonteIneptique commented 1 year ago

Their code gets around this by knowing the filter shapes in the network in advance and resizing to multiples of 8 to ensure no weird shapes.

I might be wrong, again feel free to correct me, but it feels like what they do is resize on maximum value and then pad the rest, no ?

In Kraken, there is currently no way to force a maximum dimension and pad (which I could probably add if necessary).

mittagessen commented 1 year ago

I might be wrong, again feel free to correct me, but it feels like what they do is resize on maximum value and then pad the rest, no ?

Yes, that's exactly what they're doing.

In Kraken, there is currently no way to force a maximum dimension and pad (which I could probably add if necessary).

VGSL doesn't provide good semantics for that out of the box. The best way would probably be to add a dummy padding layer type.
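
For reference, a sketch of the resize-then-pad preprocessing discussed above (illustrative only; this is neither kraken nor Teklia code, and the function and parameter names are made up):

import numpy as np
from PIL import Image

def resize_and_pad(img, max_size=768, multiple=8, fill=0):
    """Resize so the longest side equals max_size (keeping the ratio), then
    pad the bottom/right so both dimensions are multiples of `multiple`."""
    w, h = img.size
    scale = max_size / max(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    arr = np.asarray(img)           # assumes an RGB image, i.e. (H, W, 3)
    pad_h = (-arr.shape[0]) % multiple
    pad_w = (-arr.shape[1]) % multiple
    return np.pad(arr, ((0, pad_h), (0, pad_w), (0, 0)), constant_values=fill)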

PonteIneptique commented 1 year ago

VGSL doesn't provide good semantics for that out of the box. The best way would probably be to add a dummy padding layer type.

I feel like this (a Resize or Padding layer) would be quite inefficient, as the image would still be loaded at full size. But yes, we could have a second InputShape layer P<h>,<w>,<p> where <h> and <w> are the maximum values (0 scales automatically), and <p> the padding value (probably 0). Which would lead to having: [1,768,0,3 P0,768,0 ...] or [1,0,768,3 P768,0,0 ...].

But wouldn't [1,768,768,3,0], where 0 is a padding value, be more efficient and readable (although given the length of the network, readability has been abandoned there)? The use of a padding value implying that h and w are max values instead of 'bending' values? (Other option: [1,0,0,3,768,0] where 768 is the max dimension and 0 the padding value.)

Yes, all output layers are either a 1x1 convolution (2D) or a linear projection (1D) + activation.

In the context of Kraken, and to replicate Doc-UFCN, did I miss something to address that? Their output layer is clearly Cm3,3,21,3,1,1,1,1] (typo in the original post, where I wrote Cr[...]). If I did not miss it, how could I address this in the context of Kraken and VGSL?

I mainly want to test, see if the results are somehow better (to give up on YOLOv5 to some extent), and if they are, I would do a PR :)

*Updated comment since <...> was treated as HTML

notiho commented 1 year ago

Hi, I just had a look at your VGSL spec again and noticed that you added normalization after the up-convolution layers. While this is the way it is done in the reference, in my implementation of transposed convolution this will unfortunately not work at the moment, hence the error regarding concatenation of different sized inputs you got. The reason is that the output size of transposed convolution is underspecified, so when using it in a parallel block, you need to add a hint to specify the output dimensions. In my implementation, this hint is only given to the very last layer of a serial block in a parallel block, so the transposed convolution will need to be the very last layer before the end of the serial. Although this is certainly not ideal, you could try to just remove the normalization there.

notiho commented 1 year ago

An alternative could be to pass that hint to all layers in a serial block, however, this would break if you have more than one transposed convolution in a series.

PonteIneptique commented 1 year ago

Thanks for the comment.

In the context here, given that it did not fail as soon as I bent the input into a fixed size without padding, I think it has to do with no max size existing. But I could be wrong (and as a result, maybe this absence of max size breaks on CTr?)

mittagessen commented 1 year ago

Arrgh, forgot to press send on this response.

I feel like this (a Resize or Padding layer) would be quite inefficient, as the image would still be loaded at full size. But yes, we could have a second InputShape layer P<h>,<w>,<p> where <h> and <w> are the maximum values (0 scales automatically), and <p> the padding value (probably 0). Which would lead to having: [1,768,0,3 P0,768,0 ...] or [1,0,768,3 P768,0,0 ...].

My idea was more to have the current ratio-preserving input definition as usual and then a padding layer that has a syntax to pad to multiples of n.
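
A sketch of what such a layer could look like (hypothetical, not an existing kraken layer type):

import torch
import torch.nn as nn
import torch.nn.functional as F

class PadToMultiple(nn.Module):
    """Dummy padding layer: pad H and W up to the next multiple of n so the
    pooling/up-convolution pairs further down keep matching shapes."""

    def __init__(self, n=8, value=0.0):
        super().__init__()
        self.n, self.value = n, value

    def forward(self, x):
        h, w = x.shape[-2:]
        # F.pad pads (left, right, top, bottom) for the last two dimensions
        return F.pad(x, (0, (-w) % self.n, 0, (-h) % self.n), value=self.value)

print(PadToMultiple(8)(torch.randn(1, 3, 768, 150)).shape)  # (1, 3, 768, 152)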

The use of a padding value implying that h and w are max values instead of 'bending' values ?

The bug is caused by the padding and not the scaling.

(although given the length of the network, readability has been abandoned there)

Once torch.compile is stable and coreml is able to serialize variable input shape models reliably, my plan was to phase out VGSL over time. It is a fairly limiting definition language and just bolting on new layer types is starting to feel wrong.

PonteIneptique commented 1 year ago

My idea was more to have the current ratio-preserving input definition as usual and then a padding layer that has a syntax to pad to multiples of n.

So P<max-h>,<max-w>,<value> ? Because the idea here is to have a "squared" image with maximum size N with padding. (Sorry if I struggle to understand your answer :) )

PonteIneptique commented 1 year ago

Note: wouldn't treating this as a PaddingLayer create issues with annotation scaling?

mittagessen commented 1 year ago

So P<max-h>,<max-w>,<value> ? Because the idea here is to have a "squared" image with maximum size N with padding. (Sorry if I struggle to understand your answer :) )

You can do that already by just defining fixed dimensions in the input spec and the --pad option in ketos segtrain. I believe the crash would be solved by scaling the image in all dimensions to some multiple of the reduction factor so you don't get random shape changes somewhere in the network. The important part in the UFCN resize function is this sentence:

Pad the image such that it is divisible by 8.

notiho commented 1 year ago

So in principle the Pytorch implementation of transposed convolution is able to handle inputs of any size without requiring any rescaling to multiples of the scale factor, as long as you tell it what output size is required (See the note at https://pytorch.org/docs/stable/generated/torch.nn.ConvTranspose2d.html). In my experiment, I did not add anything to control the size of the x dimension and it worked without problems. Of course, if you control both dimensions of the input images, you can get away with not passing the output hint to the transposed convolution, but this is not strictly necessary.
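
To make that concrete (illustrative only, reusing the toy shapes from the failing example earlier in the thread), passing output_size to the transposed convolution's forward call pins the result to the skip connection's size:

import torch
import torch.nn as nn

x = torch.randn(1, 32, 768, 150)
pool = nn.MaxPool2d(2, 2)
up = nn.ConvTranspose2d(32, 32, kernel_size=2, stride=2)

skip = pool(x)                               # (1, 32, 384, 75)
down = pool(skip)                            # (1, 32, 192, 37)
back = up(down, output_size=skip.shape[2:])  # forced back to (1, 32, 384, 75)
print(torch.cat([skip, back], dim=1).shape)  # concatenation now succeeds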

PonteIneptique commented 1 year ago

So, digging into --pad as I missed it.

https://github.com/mittagessen/kraken/blob/daf39d8023a8f6013caaef21937cf89f25c4ed9e/kraken/lib/dataset/utils.py#L97-L100

It seems that 2 fixed dimensions disable padding. Moreover, the padding is a constant and not dynamic: it means that if I put --pad 50 50, I always get a padding of 50 on both sides, so I don't feel like it fixes my issue of having max-sized images with padding.
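
For comparison, what a dynamic "pad each image up to a fixed target" step could look like (purely illustrative; this is not an existing kraken option):

import torch
import torch.nn.functional as F

def pad_to_target(x, target_h=768, target_w=768, value=0.0):
    """Pad a (N, C, H, W) batch up to a fixed target size, computing the
    amount per image instead of adding a constant border like --pad."""
    h, w = x.shape[-2:]
    if h > target_h or w > target_w:
        raise ValueError("image larger than target, resize first")
    return F.pad(x, (0, target_w - w, 0, target_h - h), value=value)

print(pad_to_target(torch.randn(1, 3, 700, 520)).shape)  # (1, 3, 768, 768)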