Closed fattynoparents closed 5 months ago
Hi,
Please look at: https://github.com/knaw-huc/loghi-tooling and then the section about "MinionDetectLanguageOfPageXml"
My experience is that it works well at the page level, but is a bit shaky at the line level. It is strongly advised to use your own training files for the language detection. The language files that are used by default are just an example; we created them within 10 minutes without much effort, so they should be easy to improve. To get correct PageXML, the language training files should be named according to the PageXML specification, which requires ISO 639.x codes for the 2019 format: https://www.primaresearch.org/schema/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd
So in your case, create the following files, each containing text data in that language:
- /languagefiles/Swedish
- /languagefiles/German
- /languagefiles/English
- /languagefiles/French
- /languagefiles/Italian
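As a minimal sketch of this setup step (the sample sentences below are hypothetical stand-ins; real training files should contain much larger text samples of each language), the files could be created like this:

```python
# Sketch: create one training file per language for MinionDetectLanguageOfPageXml.
# The one-line samples are placeholders; use large representative corpora instead.
# Writing to "languagefiles" in the current directory here; adjust the path
# to match the directory you mount into the docker container.
from pathlib import Path

samples = {
    "Swedish": "Detta är ett exempel på svensk text.",
    "German": "Dies ist ein Beispiel für deutschen Text.",
    "English": "This is an example of English text.",
    "French": "Ceci est un exemple de texte français.",
    "Italian": "Questo è un esempio di testo italiano.",
}

outdir = Path("languagefiles")
outdir.mkdir(exist_ok=True)
for language, text in samples.items():
    # File name doubles as the language label written into the PageXML.
    (outdir / language).write_text(text, encoding="utf-8")
```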
You might need to insert `-lang_train_data /languagefiles/ \` in the inference script at line ~276 so it becomes:

```
$DOCKERLOGHITOOLING /src/loghi-tooling/minions/target/appassembler/bin/MinionDetectLanguageOfPageXml \
    -lang_train_data /languagefiles/ \
    -page $IMAGES_PATH/page/ \
```
and a few lines above that, add an extra docker mapping:

```
-v /languagefiles/:/languagefiles/ \
```
Thanks for such a detailed answer. Yes, I have experimented with those files as you describe, but it seems that, as you write, the language recognition is a bit shaky at the baseline level. For example, even if the language of the whole page is correctly detected as Swedish, separate baselines can be detected as German, English, and so on. Does this incorrect detection influence the transcription? Or should one create really good training files to get the most accurate language detection?
It is a postprocessing step and runs after the HTR. It has no influence on the HTR.
Right! Stupid me. Then it might not be needed after all in our case. Thank you for fast replies!
Is it worth running the inference-pipeline.sh code with the DETECTLANGUAGE flag set to 1?
How accurate is the MinionDetectLanguageOfPageXml minion in detecting the language?
From what I've seen so far, there is a lot of incorrect detection when I use several language files (the languages I used were Swedish, German, English, French, Italian).
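As an illustration of why per-line detection is inherently noisier than per-page detection (this is a generic character-trigram approach, not necessarily the algorithm the minion actually uses): a short line simply contains too few n-grams to reliably separate closely related languages, while a whole page gives the statistics room to converge.

```python
# Generic character-trigram language guesser, for illustration only.
# MinionDetectLanguageOfPageXml may use a different method internally.
from collections import Counter

def profile(text, n=3):
    """Relative frequencies of character n-grams in a training text."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def score(text, prof, n=3):
    """Sum of profile frequencies over the text's n-grams (higher = better match)."""
    return sum(prof.get(text[i:i + n], 0.0) for i in range(len(text) - n + 1))

def detect(text, profiles):
    """Return the language whose profile best matches the text."""
    return max(profiles, key=lambda lang: score(text, profiles[lang]))

# Tiny stand-in training data; real training files should be large corpora.
profiles = {
    "English": profile("the quick brown fox jumps over the lazy dog and then the cat"),
    "German": profile("der schnelle braune fuchs springt über den faulen hund und die katze"),
}

print(detect("the fox jumps over the dog", profiles))
```

With one short line and small training samples, a few shared trigrams between related languages can tip the score either way, which matches the shaky per-baseline results described above; longer input and bigger training files both stabilize the decision.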