mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

Improve polygon extraction #555

Closed Evarin closed 3 months ago

Evarin commented 7 months ago

Hello,

I am currently working for the FONDUe project at the Université de Genève, on their eScriptorium instance.

While investigating their performance issues, I found that a lot of time is spent warping the extracted lines. More specifically, the scikit-image functions draw.polygon and warp are really inefficient, and the kraken functions could be much faster if they used Pillow's Image.transform -- or at least scipy.ndimage.affine_transform.
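To illustrate the kind of substitution I mean (a minimal sketch, not the actual patch; the helper name and the simple rotated-box case are mine), Pillow can do a whole extraction with a single inverse affine map:

```python
import numpy as np
from PIL import Image

def extract_rotated_strip(im: Image.Image, cx: float, cy: float,
                          angle: float, w: int, h: int) -> Image.Image:
    """Cut a w*h strip centred on (cx, cy) and rotated by `angle` radians.

    Image.transform expects the *inverse* map: for every output pixel
    (x, y) it samples the source at (a*x + b*y + c, d*x + e*y + f).
    """
    cos, sin = np.cos(angle), np.sin(angle)
    a, b, d, e = cos, -sin, sin, cos
    c = cx - (a * w / 2 + b * h / 2)  # shift so the strip centre maps to (cx, cy)
    f = cy - (d * w / 2 + e * h / 2)
    return im.transform((w, h), Image.Transform.AFFINE, (a, b, c, d, e, f),
                        resample=Image.Resampling.BILINEAR)
```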

While working on those improvements, I also noticed that the current kraken extract_polygon function caused strange warping artifacts on complex baselines, due to its use of skimage's PiecewiseAffineTransform. To improve the quality of the warping, and to make it compatible with the Pillow transform syntax, I wrote a smoother warping function (_bevelled_warping_envelope) that deforms the characters of the target line much less.
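The underlying idea, very roughly (a simplified, hypothetical stand-in for _bevelled_warping_envelope, not the PR's code): slice the curved line into quadrilaterals along the baseline envelope and hand them to Pillow's mesh transform, which applies one cheap affine map per quad instead of a global piecewise-affine solve:

```python
from PIL import Image

def straighten_line(im: Image.Image, envelope, height: int) -> Image.Image:
    """Map a curved line to a straight strip via Image.transform(MESH).

    `envelope` is a list of (top_xy, bottom_xy) pairs sampled along the
    baseline from left to right; each consecutive pair of samples defines
    one source quadrilateral rendered into one upright target rectangle.
    """
    # Approximate each segment's output width from the top-edge distance.
    widths = []
    for (t0, _), (t1, _) in zip(envelope, envelope[1:]):
        widths.append(max(1, round(((t1[0] - t0[0]) ** 2 +
                                    (t1[1] - t0[1]) ** 2) ** 0.5)))
    mesh, x = [], 0
    for ((tl, bl), (tr, br)), w in zip(zip(envelope, envelope[1:]), widths):
        # Target box in the output image, source quad as NW, SW, SE, NE.
        mesh.append(((x, 0, x + w, height), (*tl, *bl, *br, *tr)))
        x += w
    return im.transform((x, height), Image.Transform.MESH, mesh,
                        resample=Image.Resampling.BILINEAR)
```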

Here is an example of the results (older version above, new proposition below; the plot shows the computed envelope transformation): [image: kraken_faster]

You can see more of them in this colab: my modified functions run 30 to 60 times faster than the old ones, while also enabling antialiasing (which was disabled in kraken 4.3.10).

I integrated them into an older version of kraken, and it works properly (and greatly improves transcription performance). But I see you are currently working on changing the input formats, and the tests are currently broken, so I could not check that everything works well with this PR. I still propose it now for the record, and would be very happy to work on it further when the tests are ready :)

mittagessen commented 7 months ago

Thanks, this is great! The anti-aliasing was disabled because of B/W line images, which obviously aren't B/W anymore with it enabled. Disabling it didn't seem to have any accuracy impact on grayscale models, so just turning it off was easiest.

The important question is whether it impacts the accuracy of existing models. Have you verified that? In general, recognition models are quite sensitive to changes in polygonization/preprocessing, so it is possible they might not be "compatible" with your new, faster alternative. I'd still merge it, but we'd need to add a flag to the model metadata so the recognizer knows which extractor to use.

Evarin commented 7 months ago

Hi, thanks for your answer!

Indeed, that is important, as CNN models can be very sensitive to this kind of change. Since the new _rotate now uses the same method as bounds.type='boxes' (and I checked the difference with the old function; it's very small), and the new bevelled warping yields results visually closer to this _rotate function, it should be fine. Moreover, the images are later resized when fed to the neural network, which should smooth out aliasing differences.

I've run ketos test with Manu_McFrench on the original and the new version (at v4.3.13, since the current upstream is broken), on a 57-page printed (Balzac) dataset, and obtained very similar performance: 77.25% accuracy with the old version, 77.15% with the new. (And the new method was 6x faster.)

I'll test it on https://github.com/HTR-United/CREMMA-MSS-19 later; if you have another reference model & dataset I'd be happy to check!

mittagessen commented 7 months ago

On 23/12/04 09:42AM, Robin Champenois wrote:

> Indeed, that is important, as CNN models can be very sensitive to this kind of change. Since the new _rotate now uses the same method as bounds.type='boxes' (and I checked the difference with the old function; it's very small), and the new bevelled warping yields results visually closer to this _rotate function, it should be fine. Moreover, the images are later resized when fed to the neural network, which should smooth out aliasing differences.

Box-format models and baseline models aren't compatible with each other, in the sense that one will recognize something with the other's input format but the recognition quality is fairly abysmal. So comparing the output to the box format doesn't mean much.

> I've run ketos test with Manu_McFrench on the original and the new version (at v4.3.13, since the current upstream is broken), on a 57-page printed (Balzac) dataset, and obtained very similar performance: 77.25% accuracy with the old version, 77.15% with the new. (And the new method was 6x faster.)

Those error rates are close to random so a slight difference there doesn't mean much.

> I'll test it on https://github.com/HTR-United/CREMMA-MSS-19 later; if you have another reference model & dataset I'd be happy to check!

The BiblIA dataset is decent and there's a corresponding model here.

Evarin commented 7 months ago

Ok, thanks, I'll run it with BiblIA! (but it's a pretty big one, so it will take a bit of time)

So far, with CREMMA and ManuMcFrench, I get 91.39% with the old method and 90.35% with the new one (note that it takes the "slow path" in 25% of cases, and the main recognition errors in both cases seem to be a format mismatch between dataset and model: 786 { é } - { e } / 702 { } - { COMBINING ACUTE ACCENT })
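(For context, that confusion pattern is the classic NFC/NFD mismatch: the dataset and the model disagree on whether é is a single precomposed code point or e followed by a combining accent, which an exact character comparison counts as errors. A minimal illustration:)

```python
import unicodedata

nfc = unicodedata.normalize("NFC", "e\u0301")  # precomposed: U+00E9
nfd = unicodedata.normalize("NFD", "\u00e9")   # decomposed: 'e' + U+0301

print(len(nfc), len(nfd))  # 1 2
print(nfc == nfd)          # False -- identical to a reader, an "error" to the test
```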

mittagessen commented 7 months ago

On 23/12/05 01:16AM, Robin Champenois wrote:

> Ok, thanks, I'll run it with BiblIA! (but it's a pretty big one, so it will take a bit of time)

A subset is fine. The model has been trained on most of that data so running everything will just get you the accuracy on the training set anyway.

> So far, with CREMMA and ManuMcFrench, I get 91.39% with the old method and 90.35% with the new one (note that it takes the "slow path" in 25% of cases, and the main recognition errors in both cases seem to be a format mismatch between dataset and model: 786 { é } - { e } / 702 { } - { COMBINING ACUTE ACCENT })

Thanks, I'll run some tests myself. A 1 percentage point drop isn't good and would most likely require the aforementioned switch, but I'll fiddle around with the smoothing/anti-aliasing to see if that changes things, and also validate against training. Just because it looks better to a human doesn't necessarily mean it improves recognition quality (for models trained with your implementation, that is).

BTW, I've fixed inference on main, so if you'd prefer working on that it would be possible.

Evarin commented 7 months ago

Ok, thanks. I tried to merge main into my branch, but ketos test still throws some errors. If you want to run some tests, I have this working branch based on 4.3.13: https://github.com/Evarin/kraken/tree/f-improve

Evarin commented 7 months ago

On an 1100-line BiblIA subset, I get 98.36% with the old method and 97.91% with the new one (97.89% with the new one and antialiasing disabled). So, a 0.45% difference. Most of the processing was done on the multi-line path.

Evarin commented 5 months ago

Hello! I'm coming back to finish the work on this pull request; sorry for the long absence.

Extensive testing

I've been doing more extensive qualitative and quantitative analysis of the changes I've made. Basically, there are three main ones:

I was expecting the new mask detouring and polygon math to have little to no effect, as it really doesn't change the images much, but I discovered it does change the result, dropping accuracy by 0.1-0.5%. Diving deeper, I realized it changes the statistics of the resulting image, making it very slightly (and randomly) brighter or darker, which the neural network (at least the standard HTR-United-ManuMcFrench) appears to be sensitive to. I don't know if that's included in the ketos training pipeline, but it suggests that data augmentation on brightness might be quite useful for robustness :)
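Something as simple as this would do (a minimal sketch of the kind of augmentation I mean, not code from ketos; the helper and the ±10% range are made up for illustration):

```python
import random
from PIL import Image, ImageEnhance

def jitter_brightness(line: Image.Image, max_delta: float = 0.1) -> Image.Image:
    """Randomly brighten or darken a line image by up to +/- max_delta,
    mimicking the small global shifts the new extractor introduces."""
    factor = 1.0 + random.uniform(-max_delta, max_delta)
    return ImageEnhance.Brightness(line).enhance(factor)
```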

Those perturbations are actually comparable in magnitude to the ones caused by the new warping method. I was thinking of proposing two versions of my improvements, one with the current warping method in a faster implementation and one with the new warping method, but that no longer seems relevant, as the models appear to be overfit to this specific implementation.

Integration options

So, how can we integrate those changes into the project? If we want to give users the option of using the old or the new method, I think the best way is to

I'd advocate for the fast-polygon option to be enabled by default all the time, as I think the speedup outweighs the errors (which are statistically quite small, given that the models already make mistakes and require proofreading). But if we really want to be conservative, we could enable it by default only for training/finetuning, and for inference with models trained with it.

What do you think? I would be happy to update my code to work with the current codebase, and implement the most relevant flag system.

mittagessen commented 5 months ago

Sorry from my side for not following up on this either. I ran a couple of experiments myself in December to see which component actually causes the loss of accuracy, but couldn't identify one particular thing.

My plan was to implement the second option. On good models, the 0.5 percentage point decrease in accuracy is an increase in overall errors of between 33% and 50%, which would be fairly unacceptable, especially as most people do not really care whether a line takes 0.5 or 1 second (I know large digitization projects are different). Add a flag to the model (and binary dataset) metadata when it was trained with the new line normalizer, make it the default on the training side (I wouldn't even give an option to fall back to the legacy implementation there), and fall back to the old implementation at inference when the flag is absent. Plus print a big fat warning on the command line that the user might consider retraining with the new method because it is so much faster.
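In sketch form, the inference-side fallback would look something like this (the flag name, warning text, and extractor functions are placeholders, not the final implementation):

```python
import logging

logger = logging.getLogger(__name__)

def extract_polygons_legacy(im, bounds):
    ...  # stand-in for the old, slow code path

def extract_polygons_fast(im, bounds):
    ...  # stand-in for the new, fast code path

def select_extractor(model_metadata: dict):
    """Choose the line extractor from a (hypothetical) metadata flag."""
    if model_metadata.get('legacy_polygons', True):
        logger.warning('Model was trained with the legacy polygon extractor; '
                       'falling back to it. Consider retraining, as the new '
                       'extractor is considerably faster.')
        return extract_polygons_legacy
    return extract_polygons_fast
```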

Evarin commented 5 months ago

Ok, let's do that then. How do you want to proceed? I can update my PR to be up to date with main and implement this hybrid behavior, is that fine with you?

mittagessen commented 5 months ago

Yes, if it isn't too much trouble for you, I'd appreciate you updating the pull request against current main. If not, I can do it by mid-March.

It is actually the last thing I wanted to get into the main branch before tagging a new stable release.

Evarin commented 5 months ago

Ok I'll work on it then!

Evarin commented 4 months ago

Hello,

I think I'm mostly done with the implementation of what we discussed. Basically:

A more detailed description of the expected behavior is in test_newpolygons.py. The code emits warnings to suggest that people retrain, or when inconsistent usage is detected.

I've tried to make my code as clear as possible, but I'm open to comments and suggestions :). Tests pass, except all the testtrain ones (but those look broken on main too) and my `pretrain` tests (though that does not seem to be due to my changes; I've skipped them for now).

mittagessen commented 4 months ago

Thanks. There's one small thing. The binary datasets (lib/arrow_dataset.py and ArrowIPCRecognitionDataset in lib/dataset/recognition.py) also need the legacy polygon switch.

I'll have a look at the pretrain tests.

Evarin commented 4 months ago

Oh, right, I forgot those.

I've added support, but as they are precompiled, it's less convenient to track their status :/. For now I've added the flag to the Arrow IPC file metadata, and I only throw a warning if we try to train/test expecting new polygons but use an old-fashioned dataset. See the tests, which check for the presence of the warning.

It would be more convenient if we could toggle it based on the actual dataset info, but that seems quite complicated to do and might be unreliable. Or we could make the warnings errors. Or keep it like that. What do you think?

mittagessen commented 4 months ago

The test could be just a new flag in the metadata (as it is right now), but it needs to check internally that the values are congruent when loading multiple arrow files. So I'd just print a warning in case of a flag mismatch, and another warning when the flag implies the legacy extractor was used to compile the file.
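Roughly like this (a sketch only; the metadata key is hypothetical, and I'm assuming plain pyarrow IPC access rather than kraken's own loaders):

```python
import logging
import pyarrow as pa

logger = logging.getLogger(__name__)

def check_polygon_flags(paths):
    """Collect a (hypothetical) 'legacy_polygons' flag from each Arrow IPC
    file's schema metadata and warn on mismatches or legacy-compiled files."""
    flags = {}
    for path in paths:
        with pa.memory_map(path) as source:
            meta = pa.ipc.open_file(source).schema.metadata or {}
        flags[path] = meta.get(b'legacy_polygons', b'true') == b'true'
    if len(set(flags.values())) > 1:
        logger.warning('Mixed legacy/new polygon extractors across datasets: %s',
                       flags)
    elif any(flags.values()):
        logger.warning('Dataset(s) were compiled with the legacy polygon '
                       'extractor; consider recompiling with the new one.')
    return flags
```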

I would not add an argument to the dataset class (or the trainer even). Thinking about it, I'd like to strongly discourage producing new models with the old polygon extractor.

Evarin commented 4 months ago

Thanks for your comment. I had not seen the edited version of it (I had just read my mail...), so I started implementing something slightly too complicated :/

My motivation for keeping the parameter in the Dataset class is that people may want to evaluate the difference in accuracy between the old method and the new one. So maybe: keep the optional flag in test and in the Dataset class, but remove it elsewhere, and force-upgrade when training (except in the very special case of old arrow datasets), with warnings when arrow datasets are old or incongruent?

mittagessen commented 4 months ago

My apologies. Yes, that makes sense. I'll look through everything and merge in the second half of the week.

Evarin commented 4 months ago

Okay, I can make those changes, or wait for you to look through the code and comment precisely on the changes you'd like to see -- and avoid other misunderstandings ^^". What do you prefer?

mittagessen commented 4 months ago

On 24/03/06 02:48AM, Robin Champenois wrote:

> Okay, I can make those changes, or wait for you to look through the code and comment precisely on the changes you'd like to see -- and avoid other misunderstandings ^^". What do you prefer?

I think I'd prefer to look through the code tomorrow. If there are just a few small changes, I might just merge as is and do them myself, as people are complaining that I should tag a new release ;)

mittagessen commented 3 months ago

Sorry for the delay. I've merged everything and it seems to work as intended. I'll write a release note and then tag a new release.

Evarin commented 3 months ago

Great, thanks!

anutkk commented 3 months ago

This new mechanism indeed speeds things up about 10x, but in my case (BiblIA on Geneve 146) it significantly changes the output. For example, take this line: [line image]

which is transcribed using the legacy method as:

יש נהכי השהא שנ ושב יעקב ושרר ושאו ו הא והרשעי

is transcribed as follows by the new method:

'יש נהכי השהא שנ ושב יעקב ושר ושאנ : ו הא והרשעי

Admittedly it's not a good result to begin with; I just want to illustrate that it does affect the model.

I have coded a much lighter optimization (basically just changing np.array to np.asarray, replacing skimage.draw.polygon with PIL.ImageDraw.Draw, and using a faster, vectorized version of skimage.transform.PiecewiseAffineTransform). It speeds things up by "only" 3x, but it does not change the output at all.
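For the mask part, the substitution is essentially a one-liner (a minimal sketch; the helper is mine, and PIL and skimage rasterize polygon edges slightly differently, hence the few-pixel differences mentioned below):

```python
import numpy as np
from PIL import Image, ImageDraw

def polygon_mask(shape, polygon):
    """Rasterize a polygon into a boolean (rows, cols) mask with PIL
    instead of skimage.draw.polygon."""
    mask = Image.new('1', (shape[1], shape[0]), 0)
    ImageDraw.Draw(mask).polygon([(float(x), float(y)) for x, y in polygon],
                                 fill=1, outline=1)
    return np.asarray(mask, dtype=bool)
```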

@mittagessen If we prove that it is really equivalent (PIL.ImageDraw.Draw differs by only a few pixels, and the new PiecewiseAffineTransform has been shown to change the output by less than 1e-15), maybe this could be a transparent improvement to the legacy extraction?

@Evarin could you share the code you used to measure the error rate on the BiblIA dataset?

mittagessen commented 3 months ago

> If we prove that it is really equivalent (PIL.ImageDraw.Draw differs by only a few pixels, and the new PiecewiseAffineTransform has been shown to change the output by https://github.com/scikit-image/scikit-image/pull/6963#issuecomment-1872077806), maybe this could be a transparent improvement to the legacy extraction?

Yes, if there's no change in recognition accuracy I'd say we can modify the legacy code path to at least get a partial speedup.

Evarin commented 3 months ago

I tested the old and new code from my branch f-improve using ketos test. Basically, you can toggle the different modifications I made via https://github.com/Evarin/kraken/blob/dbeec21c8f8b91d6a80d7f717e310a8817d23ba6/kraken/lib/segmentation.py#L49 (with warp=legacy|fast_legacy|new, and legacy|new for the others).

The "fast_legacy" is already an attempt to improve the PiecewiseAffineTransform, but using PIL, and it was already leading to significant changes, so we did not keep it.

(Also, in my evaluations I don't know what was in the training set of the model I used, so I can't measure how much of the difference comes from the model being overfit to the exact pixels; I trust your judgment on whether it's a problem or not...)

anutkk commented 3 months ago

Thanks! Running this now.

By the way, if polygon extraction is now so much faster, it would make sense to extract all lines before applying the recognition network, so that inference can be batched and take advantage of GPU acceleration. This was pointless until now, since most of the time was spent on polygon extraction, but now it could provide an additional speedup, and for large-scale recognition that matters a lot. Admittedly, it would require a substantial rewrite of rpred, especially if we want to keep the iterator/generator API.
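In sketch form (a hypothetical helper, assuming lines are already extracted as same-height (C, H, W) tensors as the pipeline normalizes them; the padding scheme is mine):

```python
import torch

def batched_predict(net, lines, batch_size=32):
    """Run the recognition network over pre-extracted line tensors in
    batches, padding each batch to a common width."""
    results = []
    for i in range(0, len(lines), batch_size):
        chunk = lines[i:i + batch_size]
        width = max(t.shape[-1] for t in chunk)
        batch = torch.stack([torch.nn.functional.pad(t, (0, width - t.shape[-1]))
                             for t in chunk])
        with torch.no_grad():
            results.extend(net(batch))  # assumed batched forward pass
    return results
```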

anutkk commented 3 months ago

> If we prove that it is really equivalent (PIL.ImageDraw.Draw differs by only a few pixels, and the new PiecewiseAffineTransform has been shown to change the output by scikit-image/scikit-image#6963 (comment)), maybe this could be a transparent improvement to the legacy extraction?

> Yes, if there's no change in recognition accuracy I'd say we can modify the legacy code path to at least get a partial speedup.

@mittagessen I tested on the Italian subset (3748 lines). Results:

| Method | Accuracy | Runtime (minutes:seconds) |
| --- | --- | --- |
| Current legacy extraction | 97.29% | 56:27 |
| Slightly optimized legacy extraction | 97.33% | 29:26 |

The speedup is more pronounced for straight baselines (closer to 4x).

The difference in accuracy is only 0.04% (and it increased, for whatever reason). Close enough?

mittagessen commented 3 months ago

0.04 is below a 5% change for a good model, so I'd deem that acceptable given the speedup. I'm a bit busy for the next couple of days; if you could create a pull request I'll merge it, but otherwise I won't get around to it until the end of the month.

anutkk commented 3 months ago

@mittagessen Sure thing, no problem. On the main branch or the v5.0 branch?

mittagessen commented 3 months ago

On 24/04/04 12:46PM, anutkk wrote:

> @mittagessen Sure thing, no problem. On the main branch or the v5.0 branch?

main is better. I'm squishing small bugs I've missed in the 5.0 release but otherwise it will be stable.

anutkk commented 3 months ago

> On 24/04/04 12:46PM, anutkk wrote:
> @mittagessen Sure thing, no problem. On the main branch or the v5.0 branch?
> main is better. I'm squishing small bugs I've missed in the 5.0 release but otherwise it will be stable.

@mittagessen Done: https://github.com/mittagessen/kraken/pull/586