ocropus-archive / DUP-ocropy

Python-based tools for document analysis and OCR
Apache License 2.0

Add NFC and NFD normalization options (keeping NFKC as the default) #313

Open nickjwhite opened 6 years ago

nickjwhite commented 6 years ago

I'm proposing this because NFC/NFD normalization is definitely useful for some models, and this patch lets users load models which use one of these normalizations. #257 (the proposal to use NFC by default) didn't get enough traction to merge, but this at least allows those of us who benefit from alternative normalization to distribute our models to users without having to ask them to apply a patch.

NFKC is kept as the default; this patch adds NFC and NFD as normalization options. These options can't be used directly, but they allow a model that has been trained with an alternative normalization to be loaded and used. Without this patch, unpickling such a model throws an error.
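For readers wondering why loading would fail at all, here is a minimal sketch of one plausible mechanism, assuming the pickled model refers to its normalization function by name. The module name `fake_ocr_lib` and the function `normalize_nfc` are hypothetical stand-ins, not ocropy's actual layout.

```python
import pickle
import sys
import types
import unicodedata

# Hypothetical stand-in for a library module; not ocropy's real layout.
fake_lib = types.ModuleType("fake_ocr_lib")
sys.modules["fake_ocr_lib"] = fake_lib


def normalize_nfc(s):
    """Hypothetical normalizer that a model might have been trained with."""
    return unicodedata.normalize("NFC", s)


# Make the function look as if it were defined inside fake_ocr_lib.
normalize_nfc.__module__ = "fake_ocr_lib"
normalize_nfc.__qualname__ = "normalize_nfc"
fake_lib.normalize_nfc = normalize_nfc

# pickle records only the name "fake_ocr_lib.normalize_nfc", not the code.
payload = pickle.dumps({"normalize": normalize_nfc})

# Simulate an installation without the function: the lookup by name fails.
del fake_lib.normalize_nfc
try:
    pickle.loads(payload)
    load_error = None
except AttributeError as exc:
    load_error = exc

# Simulate an installation that provides the function: loading succeeds.
fake_lib.normalize_nfc = normalize_nfc
model = pickle.loads(payload)
```

In this sketch the first `pickle.loads` raises `AttributeError` because the named function is missing, and the second succeeds once the name is defined again.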

Such a model can be built using different normalization= parameters, as for example in #257.
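To illustrate why the choice of normalization can matter for a model, here is a small sketch using only Python's standard `unicodedata` module. The helper `normalize_text` and the sample characters are illustrative assumptions, not code from this patch.

```python
import unicodedata

# The three forms discussed in this thread.
SUPPORTED_FORMS = ("NFC", "NFD", "NFKC")


def normalize_text(text, form="NFKC"):
    """Illustrative helper: normalize text with one of the discussed forms."""
    if form not in SUPPORTED_FORMS:
        raise ValueError("unsupported normalization form: %r" % (form,))
    return unicodedata.normalize(form, text)


long_s = "\u017f"      # LATIN SMALL LETTER LONG S
fi_ligature = "\ufb01"  # LATIN SMALL LIGATURE FI
e_acute = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE (precomposed)

# NFKC applies compatibility mappings, so these distinctions are lost.
assert normalize_text(long_s, "NFKC") == "s"
assert normalize_text(fi_ligature, "NFKC") == "fi"

# NFC keeps compatibility characters such as long s and ligatures.
assert normalize_text(long_s, "NFC") == long_s
assert normalize_text(fi_ligature, "NFC") == fi_ligature

# NFD decomposes precomposed letters into base letter plus combining mark.
assert normalize_text(e_acute, "NFD") == "e\u0301"
```

A model whose ground truth distinguishes characters such as the long s or ligatures cannot keep that distinction under NFKC, which is one case where NFC or NFD would be preferable.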

kba commented 6 years ago

Since this doesn't change the behavior, I see no reason not to merge this.

Sorry about #257 being stalled. Maybe @tmbdev and/or @zuphilip can chime in to make sure we don't break anything.