impactcentre / ocrevalUAtion

OCR evaluation brought to you by University of Alicante
Apache License 2.0

Failure to evaluate PAGE-XML with namespace prefix #26

Open kba opened 2 years ago

kba commented 2 years ago

When evaluating PAGE-XML whose elements carry a namespace prefix, as is the case for OCR-D output, evaluation fails with:

Exception in thread "main" eu.digitisation.utils.input.WarningException: Unsupported file format (UNKNOWN format) for file OCR-D-OCR-TESS-ONLY_0001.xml
        at eu.digitisation.text.Text.<init>(Text.java:121)
        at eu.digitisation.text.Text.<init>(Text.java:153)
        at eu.digitisation.output.Report.<init>(Report.java:117)
        at eu.digitisation.Main.main(Main.java:99)

If I change

<pc:PcGts xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" ...

to

<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

using

sed -i 's,pc:,,g' "$f"
sed -i 's,xmlns:pc,xmlns,' "$f"

the evaluation works as expected.
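
Note that the first sed command above removes every occurrence of "pc:" in the file, including any that happen to appear inside the recognized text. A slightly narrower variant of the same workaround could look like the sketch below. The function name strip_pc_prefix is hypothetical, and the sketch assumes GNU sed (for -i), a prefix that is literally "pc", and no CDATA sections containing "<pc:". It is only a workaround on the input side, not a fix for the format detection in ocrevalUAtion itself.

```shell
# Hypothetical helper: strip the "pc" namespace prefix from a PAGE-XML file in place.
# Assumes GNU sed and that the prefix is literally "pc".
strip_pc_prefix() {
    # 1) opening tags:  <pc:Foo   -> <Foo
    # 2) closing tags:  </pc:Foo  -> </Foo
    # 3) declaration:   xmlns:pc= -> xmlns=  (makes the PAGE namespace the default)
    sed -i \
        -e 's,<pc:,<,g' \
        -e 's,</pc:,</,g' \
        -e 's,xmlns:pc=,xmlns=,' \
        "$1"
}
```

It can be applied to a batch of files with, for example, `for f in *.xml; do strip_pc_prefix "$f"; done`. Because only tag names and the namespace declaration are rewritten, text such as "epc: x" inside a TextEquiv is left untouched.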