mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

recognition training: one_channel_mode not always promoted #522

Closed: bertsky closed this issue 3 months ago

bertsky commented 1 year ago

In OCR-D, we rely on the model metadata correctly specifying whether images should be binarized or not – via the one_channel_mode property.

However, there seems to be a gap in the chain of feature promotion:

  1. https://github.com/mittagessen/kraken/blob/b48145dad2de326af0c1d3c837d0f823d2331412/kraken/lib/train.py#L169-L178 sets the model metadata from the mode of the training dataset
  2. https://github.com/mittagessen/kraken/blob/b48145dad2de326af0c1d3c837d0f823d2331412/kraken/lib/dataset/recognition.py#L321 uses 1 by default
  3. https://github.com/mittagessen/kraken/blob/b48145dad2de326af0c1d3c837d0f823d2331412/kraken/lib/dataset/recognition.py#L394-L418 promotes the im_mode of the whole dataset if it encounters an input sample with "more" channels

However, in (1), the (maximal) case of RGB does not get promoted.
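The promotion logic described in (3) can be sketched as follows. This is a hypothetical re-implementation for illustration, not kraken's actual code; it only captures the ordering of PIL image modes by "richness" ("1" binary < "L" grayscale < "RGB" color) and the promote-on-richer-sample rule:

```python
# Sketch of the image-mode promotion described above (hypothetical,
# not kraken's actual implementation). PIL modes ordered by richness:
# "1" (binary) < "L" (grayscale) < "RGB" (color).
MODE_RANK = {"1": 0, "L": 1, "RGB": 2}

def promote(dataset_mode: str, sample_mode: str) -> str:
    """Return the richer of the current dataset mode and a new sample's mode."""
    if MODE_RANK.get(sample_mode, 0) > MODE_RANK.get(dataset_mode, 0):
        return sample_mode
    return dataset_mode

# Starting from the default "1", grayscale and RGB samples promote the mode:
mode = "1"
for sample_mode in ("1", "L", "1", "RGB"):
    mode = promote(mode, sample_mode)
# mode ends up as "RGB"
```

Under this rule the dataset's mode ends up as the maximum over all samples, which is why an RGB sample should leave the dataset (and hence the model metadata) at "RGB" rather than the default "1".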

mittagessen commented 1 year ago

Yes, one_channel_mode is used to distinguish between grayscale and b/w inputs because this information can't be inferred from the VGSL input spec alone. For an RGB model its value is undefined.

bertsky commented 1 year ago

What I am saying is: if you did have RGB samples, then the model should at least be advertised as L, not 1.

mittagessen commented 1 year ago

What I mean is one should look at the VGSL input spec of the model to see if it is a model with one-channel inputs and only then check what kind of one-channel inputs it expects. Looking at a field called one_channel_mode for a three-channel input doesn't really make sense, no?

bertsky commented 1 year ago

Let me put it this way: as a user, I have no way of knowing that; there's only one_channel_mode in the metadata. And for your default --spec https://github.com/mittagessen/kraken/blob/b48145dad2de326af0c1d3c837d0f823d2331412/kraken/lib/default_specs.py#L20 … which uses 1 for channels, i.e. binary or grayscale, IIUC any RGB images will be converted to grayscale. Thus, keeping one_channel_mode at 1 is wrong if there were in fact RGB samples.

mittagessen commented 1 year ago

> … which uses 1 for channels, i.e. binary or grayscale, IIUC any RGB images will be converted to grayscale. Thus, keeping one_channel_mode at 1 is wrong if there were in fact RGB samples.

It will be set to L because the im_mode on the dataset object is set to the highest image mode after applying the image transforms (which would produce grayscale images in this case). Anything else is a bug.

The canonical way to determine what kind of input a model expects is to use the VGSL input spec. The additional one_channel_mode attribute is a necessary kludge to be able to disambiguate inputs that are functionally different but map to the same VGSL input block, namely grayscale and binarized images. It doesn't make sense to disambiguate inputs that are already unambiguous.
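The lookup order described here (VGSL input spec is authoritative; one_channel_mode only disambiguates one-channel inputs) can be sketched as a small helper. This is a hypothetical illustration of the rule, not a kraken API; the function name and signature are made up:

```python
# Hypothetical helper illustrating the lookup order described above:
# the VGSL input spec (channel count) is authoritative, and
# one_channel_mode only disambiguates 1-channel inputs
# ("1" = binarized vs. "L" = grayscale).
from typing import Optional

def expected_input_mode(channels: int, one_channel_mode: Optional[str]) -> str:
    if channels == 3:
        # Unambiguous: RGB model, one_channel_mode is undefined here.
        return "RGB"
    if channels == 1:
        # Ambiguous in VGSL alone; one_channel_mode resolves it.
        return one_channel_mode or "L"
    raise ValueError(f"unsupported channel count: {channels}")
```

So a consumer like ocrd_kraken would first read the channel count and consult one_channel_mode only in the one-channel case.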

bertsky commented 1 year ago

> It will be set to L because the im_mode on the dataset object is set to the highest image mode after applying the image transforms (which would produce grayscale images in this case). Anything else is a bug.

Ah, now I can see it, thanks for clarifying.

But then why do @stweil's Kraken models mostly have mode==1, despite being trained on grayscale?

> The canonical way to determine what kind of input a model expects is to use the VGSL input spec. The additional one_channel_mode attribute is a kludge necessary because VGSL doesn't have a way to distinguish between grayscale and binary inputs (which makes a difference in recognition quality, unfortunately).

I understand, but then the VGSL spec itself (or just the channels parameter) should also be added to the model metadata IMHO.

mittagessen commented 1 year ago

> But then why do @stweil's Kraken models (https://github.com/OCR-D/ocrd_kraken/pull/38#issuecomment-1607471614) mostly have mode==1, despite being trained on grayscale?

I don't know when he trained those models but in the past there was a bug where the mode wouldn't be upgraded because of the dataset running in separate processes. I think I had fixed that a while ago but I can check again.

> then the VGSL spec itself (or just the channels parameter) should also be added to the model metadata IMHO

It is available in the input attribute of the model as a tuple (batch, channels, height, width).
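For reference, the VGSL spec string itself also encodes the channel count: its input block is written as `[batch,height,width,depth ...]`, with depth being the channel count, while the model's input attribute reorders this to NCHW. A small illustrative parser (the spec string below is a made-up example, not kraken's actual default spec):

```python
# Illustrative parser for the channel count in a VGSL spec string
# (the example spec is made up, not kraken's default). The input
# block is "[batch,height,width,depth ...]"; depth is the channel
# count, which model.input reports reordered as (N, C, H, W).
def input_channels(vgsl_spec: str) -> int:
    input_block = vgsl_spec.lstrip("[").split()[0]   # e.g. "1,48,0,1"
    batch, height, width, depth = (int(x) for x in input_block.split(","))
    return depth

# A 1-channel (binary/grayscale) and a 3-channel (RGB) example spec:
one_channel = input_channels("[1,48,0,1 Cr3,3,32 O1c10]")
three_channel = input_channels("[1,120,0,3 Cr3,13,32 O1c10]")
```

Together with one_channel_mode for the 1-channel case, this gives a consumer everything needed to pick the right preprocessing.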

bertsky commented 1 year ago

> I don't know when he trained those models but in the past there was a bug where the mode wouldn't be upgraded because of the dataset running in separate processes.

I wouldn't know. He did use eScriptorium AFAIK.

> It is available in the input attribute of the model as a tuple (batch, channels, height, width).

Oh, that does help! Should be documented IMO.

stweil commented 1 year ago

> I wouldn't know. He did use eScriptorium AFAIK.

~~Most~~ All model trainings were done on the command line with the latest kraken code (at the time of the training start) from GitHub. It is not possible to train models with a given network specification from several GT sources in eScriptorium.

~~Only the typewriter model was trained in eScriptorium.~~

mittagessen commented 1 year ago

> Oh, that does help! Should be documented IMO.

I can do that. Most of that stuff is stable but hasn't been documented as the correct image transforms get run automatically inside the respective methods.

bertsky commented 1 year ago

> I can do that. Most of that stuff is stable but hasn't been documented as the correct image transforms get run automatically inside the respective methods.

@mittagessen It's your decision. For me it took some time to dig through your rpred and blla before I understood how to use the API for OCR-D (esp. around multi-model / tag mechanism and segmentation formats). So the documentation would need more than just that detail improved. But as long as it does not change, I am fine.

@stweil Do you still have some logs? Perhaps we could pinpoint the version, or at least see the "Upgrading" messages... I would like to fix the model metadata; from my testing I am confident the models benefit from grayscale, but I want to be absolutely sure.

mittagessen commented 1 year ago

> So the documentation would need more than just that detail improved.

There's a large refactoring happening right now that replaces all the opaque dicts with proper container classes which should clear up a lot of the implicitness in the arguments (and gives me an opportunity to have to rewrite all of the docs).

mittagessen commented 3 months ago

Apparently, I hadn't corrected the image mode promotion after all. There's a fix in fix/8bit_mode_dataloader which I need to verify and merge.

bertsky commented 3 months ago

> Apparently, I hadn't corrected the image mode promotion after all. There's a fix in fix/8bit_mode_dataloader which I need to verify and merge.

I don't understand. So the documentation issue (which IIUC it turned out to be) is fixed, hence you closed, but then there's also https://github.com/mittagessen/kraken/compare/main...fix/8bit_mode_dataloader, which is what exactly – a fix to the original problem, or to some new problem (perhaps a race when using multiprocessing)?

Also, more broadly, regarding your refactoring, since you have not released anything yet, could you please give me an early warning as to any API changes I need ocrd_kraken to adapt towards?

mittagessen commented 3 months ago

> which is what exactly – a fix to the original problem, or to some new problem (perhaps a race when using multiprocessing)?

The im_mode variable wasn't shared between processes so it only got promoted inside the DataLoader worker processes and not the kraken main process that would actually set the flag in the model metadata.
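The failure mode described here can be reproduced in a few lines: state mutated inside a worker process is invisible to the parent, unless it lives in explicitly shared memory. A minimal illustration (not kraken's code; the names are made up), assuming a POSIX system where the "fork" start method is available:

```python
import multiprocessing as mp

# Minimal illustration (not kraken's code) of the bug described above:
# a plain object mutated inside a worker process is not visible to the
# parent, while a multiprocessing.Value lives in shared memory and is.

def worker_plain(state: dict) -> None:
    state["im_mode"] = "L"            # mutates the worker's copy only

def worker_shared(shared_rank) -> None:
    with shared_rank.get_lock():
        # Promote the mode rank: 0 = "1", 1 = "L", 2 = "RGB".
        shared_rank.value = max(shared_rank.value, 1)

ctx = mp.get_context("fork")          # fork keeps the example guard-free

state = {"im_mode": "1"}
p = ctx.Process(target=worker_plain, args=(state,))
p.start(); p.join()
# state["im_mode"] is still "1" in the parent: the promotion was lost.

shared_rank = ctx.Value("i", 0)
p = ctx.Process(target=worker_shared, args=(shared_rank,))
p.start(); p.join()
# shared_rank.value is now 1 in the parent: the promotion survived.
```

This mirrors the described bug: with DataLoader workers in separate processes, a plain im_mode attribute only got promoted inside the workers, never in the main process that writes the model metadata.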

> Also, more broadly, regarding your refactoring, since you have not released anything yet

5.0 has been released on pypi but I haven't done the github release yet as the conda build broke for some unfathomable reason. The github release will include an upgrade guide to the new container classes.