mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

Include in model metadata the Unicode normalization mode #547

Closed PonteIneptique closed 7 months ago

PonteIneptique commented 9 months ago

During discussions around eScriptorium, some of our colleagues were surprised to learn that fine-tuning does not reuse the Unicode normalization of the original model, as it is not stored within the model.

I'd like to make sure that such a mode is provided in the model metadata, so that fine-tuning within eScriptorium can reuse this information without changing eScriptorium's UI. While adding such an option to eScriptorium would not be that expensive, it would still add complexity, and most users would not know which Unicode normalization to use.

If you give me the go-ahead, I'll add this to the relevant code here (fine-tuning and metadata serialization).

PonteIneptique commented 9 months ago

An additional issue with eScriptorium: someone correcting transcriptions would end up with mixed Unicode normalization forms, particularly when working with predictions produced in NF(K)D, since their corrections will most likely contain precomposed characters.
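
A minimal illustration of that mixing, using only Python's standard `unicodedata` module (the example string is made up):

```python
import unicodedata

# Prediction from a model trained on NFD ground truth: decomposed form.
prediction = "Espan\u0303a"   # 'n' followed by U+0303 COMBINING TILDE
# The same word typed by a corrector on a regular keyboard: precomposed form.
correction = "Espa\u00f1a"    # U+00F1 LATIN SMALL LETTER N WITH TILDE

print(prediction == correction)                                # False
print(unicodedata.normalize("NFC", prediction) == correction)  # True
print([hex(ord(c)) for c in prediction])  # [..., '0x6e', '0x303', '0x61']
print([hex(ord(c)) for c in correction])  # [..., '0xf1', '0x61']
```

The two strings render identically but differ at the code point level, so corrected and uncorrected lines of the same page can end up in different normalization forms.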

mittagessen commented 9 months ago

Hu? Text transformations are in the model metadata (keys normalization and normalize_whitespace) and get reinitialized when fine-tuning with the --load-hyperparameters option.
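
If it helps, here is a rough sketch of how those stored values can be inspected from Python. I'm going from memory on `TorchVGSLModel.load_model()` and the `hyper_params` attribute, so the exact names may differ between kraken versions:

```python
# Rough sketch: inspect the text normalization persisted in a trained model.
# Assumes kraken's TorchVGSLModel API; attribute names may vary by version.
from kraken.lib.vgsl import TorchVGSLModel

nn = TorchVGSLModel.load_model("base_model.mlmodel")
hp = nn.hyper_params  # hyperparameter dict stored in the model metadata

print(hp.get("normalization"))         # e.g. 'NFD', 'NFC', 'NFKD', 'NFKC', or None
print(hp.get("normalize_whitespace"))  # True/False
```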

PonteIneptique commented 9 months ago

This is my bad, I did not know that load-hyperparams did that.

When training large models, these kinds of params can be very specific. I am not sure I want everyone to have the same params on eScriptorium in terms of GPU/CPU costs.

However, having models automatically reuse the normalization parameters of the model they are fine-tuned on makes sense. Could we have a two-tier setup for this?

Sorry my English is garbage, I am jet-lagged and I am not sure French would be better...

rohanchn commented 9 months ago

Since I am never sure that kraken reuses the correct Unicode normalization when training within eScriptorium, I almost always train (even fine-tune) outside eScriptorium. This is obviously not ideal for everyone.

Perhaps it makes sense to provide a direct way to control a few hyperparameters like -u or --lrate in the training form within eScriptorium. But that would introduce complexity, and I cannot comment on whether it would be easy to handle or something everyone would welcome.

mittagessen commented 9 months ago

> When training large models, these kinds of params can be very specific.

Yes, my opinion during the discussions was actually that we shouldn't load hyperparameters, as whatever hyper-specialized (sorry for the pun) settings people used to train a base model are probably not terribly useful for fine-tuning said models. I'm talking about stuff like learning rate schedules and so on.

> However, having models automatically reuse the normalization parameters of the model they are fine-tuned on makes sense.

I'd argue retaining normalization silently might be the opposite of what users would expect. Let's say you take a base model trained with any of the canonical normalizations and add new labels to get closer to a diplomatic transcription; these would often be normalized away without apparent reason.

> Perhaps it makes sense to provide a direct way to control a few hyperparameters like -u or --lrate in the training form within eScriptorium.

Until now I was mostly against this, but I just came back from a conference where multiple people requested being able to set basic hyperparams. In addition, it would probably help advanced users to be able to inspect the hyperparameters stored in the model somewhere in eScriptorium's model view.

PonteIneptique commented 9 months ago

> I'm talking about stuff like learning rate schedules and so on.

Agreed.

> I'd argue retaining normalization silently might be the opposite of what users would expect.

From our discussion with @dstoekl at least, and with others who understand Unicode normalization, it seems that they expect the same normalization to be used when fine-tuning.

> Let's say you take a base model trained with any of the canonical normalizations and add new labels to get closer to a diplomatic transcription; these would often be normalized away without apparent reason.

I think I get your example, but I am not sure I agree with it.

For sure, I see an issue with not having normalization when NFD was used during training: if I correct n into ñ but half of the ñ were already correctly predicted as n + ◌̃, fine-tuning will then need to learn two different classes (or series of classes) for the same grapheme.

I do understand the issue with K, though. But again, unless the user replaced and re-corrected everything in the text (and they don't), I think the original model's setting should be reused, to avoid duplicating classes for the same phenomenon.
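
To make the class-duplication point concrete, here is a small sketch (plain `unicodedata` again, nothing kraken-specific): without reapplying the base model's normalization the codec sees two different label sequences for the same grapheme; with it they collapse into one.

```python
import unicodedata

gt_lines = [
    "pen\u0303a",  # correction kept the NFD prediction: 'n' + U+0303
    "pe\u00f1a",   # correction typed precomposed: U+00F1
]

# Without normalization there are two distinct code point sequences for 'ñ'.
print({tuple(line) for line in gt_lines})                                # 2 entries

# Reusing the base model's normalization (here NFD) unifies them again.
print({tuple(unicodedata.normalize("NFD", line)) for line in gt_lines})  # 1 entry
```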

mittagessen commented 9 months ago

> I do understand the issue with K, though. But again, unless the user replaced and re-corrected everything in the text (and they don't), I think the original model's setting should be reused, to avoid duplicating classes for the same phenomenon.

The sole issue in my opinion are the compatibility normalizations (NFKC/NFKD) in the case of fine-tuning to add some new characters to a base model. I think there's a fairly high chance that the compatibility decompositions will result in those exact code points being normalized away. The best example would be a user wanting to add sub-/superscript numerals to a base model (or teach it to encode them as the sub-/superscript code points instead of simple numerals, if the dataset for the base model contained sub-/superscripts). So they prepare a new fine-tuning dataset with the requisite changes to the transcription guidelines. If we silently apply the normalization of the base model when fine-tuning on this dataset, all the sub-/superscripts will disappear.
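
For illustration, a couple of lines with the standard `unicodedata` module show the flattening (the example string is made up):

```python
import unicodedata

new_gt = "a\u00b2 + b\u00b3"   # SUPERSCRIPT TWO and SUPERSCRIPT THREE

# Canonical normalization (NFC/NFD) leaves the superscripts untouched ...
print(unicodedata.normalize("NFD", new_gt))   # 'a² + b³'
# ... but compatibility normalization (NFKC/NFKD) silently flattens them to plain digits.
print(unicodedata.normalize("NFKD", new_gt))  # 'a2 + b3'
```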

I'd say this is fairly confusing to anyone who hasn't read through UnicodeData.txt on a particular boring Sunday afternoon.