qurator-spk / dinglehopper

An OCR evaluation tool
Apache License 2.0

Review error rate definitions etc. #45

Open mikegerber opened 3 years ago

bertsky commented 3 years ago

I suggest using the alignment path length as the denominator instead of the GT length (dividing by the GT length can yield an error rate >1):

https://github.com/qurator-spk/dinglehopper/blob/249787686f554ceee4a14c2610772095320d912a/qurator/dinglehopper/character_error_rate.py#L24-L28

(Ideally, you implement all 3 length options: alignment path, maximum sequence, GT sequence.)
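
To make the three options concrete, here is a minimal sketch that normalizes an already computed edit distance by each length. The function name `error_rates` and its parameters are hypothetical and not part of dinglehopper; the alignment path length is assumed to be supplied by the caller.

```python
def error_rates(distance, gt_len, ocr_len, path_len):
    """Normalize an edit distance by three different lengths.

    All names are illustrative. `path_len` is the number of steps
    (matches, substitutions, insertions, deletions) in the alignment.
    """
    def ratio(denominator):
        # Guard against empty inputs: no errors over nothing is 0,
        # errors over nothing is unbounded.
        if denominator == 0:
            return 0.0 if distance == 0 else float("inf")
        return distance / denominator

    return {
        "gt": ratio(gt_len),                 # GT sequence length
        "max": ratio(max(gt_len, ocr_len)),  # maximum sequence length
        "path": ratio(path_len),             # alignment path length
    }


# GT "ab" vs. OCR "abcde": three insertions, alignment path of length 5.
rates = error_rates(distance=3, gt_len=2, ocr_len=5, path_len=5)
print(rates)
```

In this example the GT-based rate is 3/2 = 1.5, while both the maximum-length and the path-length variants give 3/5 and therefore stay within [0, 1].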

The problem for dinglehopper is that your levenshtein_matrix does not give you the alignment path; you only have the resulting minimum distance.
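
One possible way to obtain a path length, assuming the full dynamic-programming matrix is kept in memory, is to backtrace from the bottom-right cell. The sketch below is a plain-Python illustration, not dinglehopper's code; `edit_matrix` and `alignment_path_length` are hypothetical names. Note that minimum-distance alignments are not unique, so the resulting length depends on the tie-breaking rule (here: prefer the diagonal step).

```python
def edit_matrix(a, b):
    """Full Levenshtein matrix with unit costs (illustrative helper)."""
    m, n = len(a), len(b)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i  # delete all of a[:i]
    for j in range(n + 1):
        D[0][j] = j  # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j] + 1,         # deletion
                D[i][j - 1] + 1,         # insertion
                D[i - 1][j - 1] + cost,  # match or substitution
            )
    return D


def alignment_path_length(D, a, b):
    """Count the steps of one optimal alignment by backtracing D."""
    i, j = len(a), len(b)
    steps = 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            i, j = i - 1, j - 1  # match or substitution
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            i -= 1               # deletion
        else:
            j -= 1               # insertion
        steps += 1
    return steps


gt, ocr = "abc", "bcd"
D = edit_matrix(gt, ocr)
distance = D[-1][-1]
path_len = alignment_path_length(D, gt, ocr)
print(distance, path_len, distance / path_len)
```

For "abc" vs. "bcd" the minimum distance is 2 (delete "a", insert "d") and the alignment has 4 steps, so the path-based rate is 2/4, whereas dividing by the GT length gives 2/3.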

bertsky commented 5 months ago

Update: I recommend using rapidfuzz's normalized_distance instead of simply dividing the distance by the GT length. Internally (in the C++ backend), the denominator is calculated as the actual path length (= maximum distance).