qurator-spk / dinglehopper

An OCR evaluation tool
Apache License 2.0

Ignore BOM #80

Closed: mikegerber closed this issue 1 year ago

mikegerber commented 1 year ago

If I create an empty file gt.txt (0 bytes) and a file ocr-just-bom.txt that contains only a UTF-8 byte order mark (BOM, 3 bytes), dinglehopper computes a CER of Infinity. It should ignore the BOM.

❯ ./reproduce
+ echo -ne ''
+ echo -ne '\xEF\xBB\xBF'
+ ls -l gt.txt ocr-just-bom.txt
-rw-r--r-- 1 b-mg106 b-mg106 0 Apr 20 19:58 gt.txt
-rw-r--r-- 1 b-mg106 b-mg106 3 Apr 20 19:58 ocr-just-bom.txt
+ dinglehopper gt.txt ocr-just-bom.txt
+ grep cer report.json
    "cer": Infinity,

See also #79.
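
A minimal sketch of why the result is plausibly Infinity, assuming CER is computed as edit distance divided by the length of the ground truth. The functions `levenshtein` and `naive_cer` are hypothetical illustrations written for this note, not dinglehopper's actual code. Decoding the three BOM bytes with the plain `utf-8` codec yields one character (U+FEFF), so the OCR text is one insertion away from an empty ground truth of length 0.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance over code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def naive_cer(gt: str, ocr: str) -> float:
    """Hypothetical CER: edits / len(gt); assumed to be infinite for an
    empty ground truth with at least one edit."""
    d = levenshtein(gt, ocr)
    if len(gt) == 0:
        return math.inf if d > 0 else 0.0
    return d / len(gt)


gt = b"".decode("utf-8")
ocr_raw = b"\xef\xbb\xbf".decode("utf-8")           # BOM kept as U+FEFF
ocr_stripped = b"\xef\xbb\xbf".decode("utf-8-sig")  # leading BOM dropped

cer_with_bom = naive_cer(gt, ocr_raw)
cer_without_bom = naive_cer(gt, ocr_stripped)
print(repr(ocr_raw), cer_with_bom, cer_without_bom)
```

Under these assumptions, the only difference between the two results is whether the leading U+FEFF survives decoding.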

mikegerber commented 1 year ago

Tested on Python 3.11.3, Windows WSL Debian

mikegerber commented 1 year ago

Reproducer:

❯ cat reproduce
#!/bin/bash

# Must be run in bash for "echo -ne" to work. (not sh!)

set -x
echo -ne "This is a test." > gt.txt
echo -ne "\xEF\xBB\xBFThis is a test." > ocr-just-bom.txt
ls -l *.txt
dinglehopper gt.txt ocr-just-bom.txt
grep cer report.json   # should be 0
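
One possible way to get the expected result for the files in this reproducer is to read text with Python's `utf-8-sig` codec, which drops a single leading BOM and otherwise behaves like `utf-8`. The sketch below only illustrates that idea; `read_text_ignoring_bom` is a hypothetical helper name, and nothing here is taken from dinglehopper itself.

```python
import tempfile
from pathlib import Path


def read_text_ignoring_bom(path) -> str:
    """Read a UTF-8 text file, dropping one leading BOM if present."""
    return Path(path).read_text(encoding="utf-8-sig")


with tempfile.TemporaryDirectory() as tmp:
    gt_path = Path(tmp) / "gt.txt"
    ocr_path = Path(tmp) / "ocr-just-bom.txt"

    # Same bytes as the two "echo -ne" lines in the reproducer.
    gt_path.write_bytes(b"This is a test.")
    ocr_path.write_bytes(b"\xef\xbb\xbfThis is a test.")

    # Plain utf-8 keeps the BOM as U+FEFF, so the texts differ.
    ocr_plain = ocr_path.read_text(encoding="utf-8")

    gt_text = read_text_ignoring_bom(gt_path)
    ocr_text = read_text_ignoring_bom(ocr_path)

print(gt_text == ocr_plain, gt_text == ocr_text)
```

Note that `utf-8-sig` removes only a BOM at the very start of the data; a U+FEFF appearing later in the text is left in place.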