newlines not removed in plain_extract

qurator-spk / dinglehopper

An OCR evaluation tool

Apache License 2.0

newlines not removed in plain_extract #107

Closed tallemeersch closed 3 months ago

tallemeersch commented 4 months ago

In ocr_files.py, line 170, readlines() is called. This method keeps the trailing newline on each line, which leads to an incorrect CER score. Below is the current report, given the ground truth as TXT and the OCR as XML, vs. the report when strip() is added at line 170, i.e. make_segment(no, line.strip())
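The effect described above can be reproduced in isolation. The following is only a minimal sketch, not dinglehopper's actual code: levenshtein() and toy_cer() are hypothetical helpers written for this illustration, and the real tool computes its CER differently. It only shows that readlines() keeps the trailing "\n" and that each kept newline counts as an extra character error against OCR text that has none.

```python
import io


def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance (hypothetical helper)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,                # deletion
                cur[j - 1] + 1,             # insertion
                prev[j - 1] + (ca != cb),   # substitution (0 if equal)
            ))
        prev = cur
    return prev[-1]


def toy_cer(gt_lines, ocr_lines) -> float:
    """Toy line-by-line CER: total edits / total ground-truth characters."""
    errors = sum(levenshtein(g, o) for g, o in zip(gt_lines, ocr_lines))
    chars = sum(len(g) for g in gt_lines)
    return errors / chars


# Stand-in for a ground-truth text file; the content is made up.
gt_file = io.StringIO("foo\nbar\n")

raw_lines = gt_file.readlines()                      # newlines are kept
stripped_lines = [line.strip() for line in raw_lines]

# Assumed OCR result that matches the ground truth exactly.
ocr_lines = ["foo", "bar"]

print(raw_lines)
print(toy_cer(raw_lines, ocr_lines))       # non-zero despite a perfect match
print(toy_cer(stripped_lines, ocr_lines))  # zero once newlines are stripped
```

Note that strip() also removes leading and trailing spaces; rstrip("\n") would remove only the newline, if that distinction matters for the ground truth.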

mikegerber commented 4 months ago

Thanks @tallemeersch! Could you upload the two files? I suspect there is an issue with the line endings (and I'm always interested in real user data).

mikegerber commented 3 months ago

Never mind, I believe this always happens. I'll look into it.

tallemeersch commented 3 months ago

The files are attached. The command to produce the report was:

dinglehopper --textequiv-level line 02_GT.txt 02.xml

02.zip

mikegerber commented 3 months ago

Fix is in git master and will be in the next release!

tallemeersch commented 3 months ago

Thanks!