mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

Word error rate added in your release 4.3.10 #559

Closed: soniasol closed this issue 5 months ago

soniasol commented 9 months ago

Hello,

In the notes for release 4.3.10, you mention that word error rate (WER) has been added as a validation metric in recognition training. Is it possible to get the WER score in the test report as well?

https://github.com/mittagessen/kraken/releases/tag/4.3.10

Thank you and have a nice day, Sonia

mittagessen commented 9 months ago

Ah sorry, it is only calculated for validation during training. I can add it to the test report as well, but the method is rather simplistic, as it just treats anything separated by whitespace as a separate word.
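To illustrate what whitespace-based word counting implies, here is a minimal sketch, assuming WER is the word-level edit distance divided by the number of reference words and CER is the same ratio over characters. It is not kraken's code; the function names `edit_distance`, `error_rate`, `wer`, and `cer` are hypothetical and exist only for this example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from the reference
                            curr[j - 1] + 1,      # insert h into the reference
                            prev[j - 1] + cost))  # substitute, or match at no cost
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalised by the reference length."""
    if not ref_units:
        # Assumption for this sketch: an empty reference scores 0 only
        # against an empty hypothesis.
        return 0.0 if not hyp_units else float("inf")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def wer(reference, hypothesis):
    """Word error rate: a word is anything separated by whitespace."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate over the raw strings, spaces included."""
    return error_rate(list(reference), list(hypothesis))


if __name__ == "__main__":
    ref = "the quick brown fox"
    hyp = "the quick brovvn fox"
    print(wer(ref, hyp))  # one of four words is wrong
    print(cer(ref, hyp))  # two character edits over 19 reference characters
    # Punctuation stays attached to its word, so "hello," != "hello".
    print(wer("hello, world", "hello world"))
```

Because the split is purely on whitespace, a missing comma or a merged pair of words counts as a full word error, which is one reason the figure can look harsh next to the CER.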

soniasol commented 9 months ago

@mittagessen thank you so much for the swift reply!

I see your point, but I think it could be useful to have both CER and WER, in addition to the accuracy score.

For instance, in my case (but I am sure this applies to many people using Kraken!), we are going to use the OCR outputs as input to tokenization, lemmatization, normalization (e.g., old French to contemporary French) and so on. Therefore, having a metric to measure the errors in terms of words would be very helpful!

Do you think you could add the WER to the test report?

Thank you very much again for your work on Kraken 🙃 Have a nice day, Sonia.

mittagessen commented 5 months ago

It's implemented now at a global level, i.e. not per script. I'll add it to the next minor release, 5.2.2.