knaw-huc / loghi


Detecting language on a page - does it influence the correctness of HTR recognition? #23

Closed fattynoparents closed 5 months ago

fattynoparents commented 5 months ago

Is it worth running inference-pipeline.sh with the DETECTLANGUAGE flag set to 1?

How accurate is the MinionDetectLanguageOfPageXml minion in detecting the language?

From what I've seen so far, there is a lot of incorrect detection when I have several language files (the languages I used were Swedish, German, English, French and Italian).

rvankoert commented 5 months ago

Hi,

Please look at https://github.com/knaw-huc/loghi-tooling, specifically the section about "MinionDetectLanguageOfPageXml".

My experience is that it works well at the page level, but is a bit shaky at the line level. It is really advised to use your own "training" files for the language detection. The language files used by default are just an example; we created them within 10 minutes without too much effort, so they should be easy to improve on. To get valid PageXML, the language training files should be named according to the PageXML specification, which uses ISO 639.x for the 2019 format: https://www.primaresearch.org/schema/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd
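To illustrate what the minion writes, here is a hedged sketch of where the detected values land in the PageXML, assuming the primaryLanguage attributes of the 2019 schema linked above; the ids and values are invented for the example:

```xml
<!-- sketch only: attribute placement per the 2019 PAGE schema; ids and values are made up -->
<Page imageFilename="scan_0001.jpg" primaryLanguage="Swedish">
  <TextRegion id="r1">
    <!-- page-level detection tends to be solid; line-level values can wobble -->
    <TextLine id="r1l1" primaryLanguage="Swedish"/>
    <TextLine id="r1l2" primaryLanguage="German"/>
  </TextRegion>
</Page>
```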

So in your case, create the following files containing text data of that language:

```
/languagefiles/Swedish
/languagefiles/German
/languagefiles/English
/languagefiles/French
/languagefiles/Italian
```
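A minimal shell sketch of setting that up; the source corpus file names (swedish_sample.txt etc.) are hypothetical placeholders for your own text data:

```bash
# hypothetical source corpora; a few pages of representative text per language is a good start
mkdir -p /languagefiles
cp swedish_sample.txt /languagefiles/Swedish
cp german_sample.txt  /languagefiles/German
cp english_sample.txt /languagefiles/English
cp french_sample.txt  /languagefiles/French
cp italian_sample.txt /languagefiles/Italian
```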

You might need to insert `-lang_train_data /languagefiles/ \` in the inference script at line ~276 so it becomes:

```bash
$DOCKERLOGHITOOLING /src/loghi-tooling/minions/target/appassembler/bin/MinionDetectLanguageOfPageXml \
    -lang_train_data /languagefiles/ \
    -page $IMAGES_PATH/page/ \
```

and a few lines above that, add an extra docker mapping:

```bash
-v /languagefiles/:/languagefiles/ \
```
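Put together, a sketch of how that region of inference-pipeline.sh could end up looking; the docker run wrapper and the $IMAGES_PATH mount are assumptions for illustration, and only the two added lines come from the steps above:

```bash
# sketch only: the docker run invocation and other mounts are assumed, not copied from the script
docker run --rm \
    -v $IMAGES_PATH:$IMAGES_PATH \
    -v /languagefiles/:/languagefiles/ \
    $DOCKERLOGHITOOLING /src/loghi-tooling/minions/target/appassembler/bin/MinionDetectLanguageOfPageXml \
        -lang_train_data /languagefiles/ \
        -page $IMAGES_PATH/page/
```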

fattynoparents commented 5 months ago

Thanks for such a detailed answer. Yes, I have experimented with those files as you describe, but it seems that, as you write, the language recognition is a bit shaky at the line level. For example, even when the language of the whole page is correctly detected as Swedish, individual lines can be detected as German, English and whatnot. So does this incorrect detection influence the transcription? Or should one create really good training files to get the most accurate language detection?

rvankoert commented 5 months ago

It is a postprocessing step and runs after the HTR. It has no influence on the HTR.
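For reference, assuming the DETECTLANGUAGE flag from the original question is a plain on/off switch in inference-pipeline.sh, skipping the step entirely would be:

```bash
# assumed semantics: 1 runs MinionDetectLanguageOfPageXml as postprocessing, 0 skips it
DETECTLANGUAGE=0
```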

fattynoparents commented 5 months ago

> It is a postprocessing step and runs after the HTR. It has no influence on the HTR.

Right! Stupid me. Then it might not be needed after all in our case. Thank you for the fast replies!