Support for gazetteers in *->EN translation directions.
The "gazetteer" parameter selects which gazetteer items are used:
gazetteer=0 ... no gazetteer
gazetteer=all ... all items in the gazetteer
gazetteer=vlc ... only items extracted from VLC (their identifier starts with "vlc")
gazetteer=vlc,libreoffice,kde,batch1a ... items specified by these prefixes
gazetteer=wiki ... items extracted from Wiki by Rosa, all values of 'deep'
gazetteer=wiki_deep1,wiki_deep2 ... items extracted from Wiki by Rosa, only depths 1 and 2
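
The prefix-based selection described above can be sketched as follows. This is only an illustration, not the actual implementation: the function name select_items, the (identifier, source, target) tuple layout and the sample identifiers are all assumptions. The only behaviour taken from the description is that "0" selects nothing, "all" selects everything, and any other value is a comma-separated list of identifier prefixes.

```python
# Hypothetical sample items: (identifier, source phrase, target phrase).
# The identifiers and phrases are made up for illustration only.
SAMPLE_ITEMS = [
    ("vlc_1", "source-a", "target-a"),
    ("libreoffice_1", "source-b", "target-b"),
    ("kde_1", "source-c", "target-c"),
    ("wiki_deep1_1", "source-d", "target-d"),
    ("wiki_deep3_1", "source-e", "target-e"),
]


def select_items(items, gazetteer):
    """Return the items selected by a value of the "gazetteer" parameter."""
    if gazetteer == "0":
        return []                      # no gazetteer
    if gazetteer == "all":
        return list(items)             # every item
    # Otherwise: comma-separated identifier prefixes, e.g. "vlc,kde" or "wiki".
    prefixes = tuple(p.strip() for p in gazetteer.split(",") if p.strip())
    # str.startswith accepts a tuple and matches if any prefix matches.
    return [item for item in items if item[0].startswith(prefixes)]


if __name__ == "__main__":
    for value in ("0", "vlc", "wiki", "wiki_deep1,wiki_deep2", "all"):
        print(value, [item[0] for item in select_items(SAMPLE_ITEMS, value)])
```

Under this reading, "wiki" covers every depth simply because all wiki identifiers share that prefix, while "wiki_deep1,wiki_deep2" narrows the selection to the two listed depths.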
Results for CS<->EN, ES<->EN and NL<->EN using various gazetteer subsets are as follows:
The results for translation from English are very similar for all three languages.
The combination of vlc, libreoffice and kde forms the basis of the improvement, and wiki improves a bit on top of it.
Note that the ordering of items in the gazetteers matters, since only the first occurrence of each expression in the source list
is taken into consideration. The gazetteers used follow this ordering:
vlc, libreoffice, kde, batch1a, wiki_deep1, wiki_deep2, wiki_deep3, wiki_deep4, wiki_deep5, wiki_deep6, wiki_deep7
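
The effect of this ordering can be sketched as below. It is a minimal illustration under stated assumptions: the names GAZETTEER_ORDER, rank and build_lookup, the tuple layout and the sample entries are hypothetical, and "source list" is read here as the list of source-language expressions. When the same expression appears in several gazetteers, the sketch keeps only the entry from the gazetteer that comes first in the ordering.

```python
# Ordering of gazetteers as listed above.
GAZETTEER_ORDER = [
    "vlc", "libreoffice", "kde", "batch1a",
    "wiki_deep1", "wiki_deep2", "wiki_deep3", "wiki_deep4",
    "wiki_deep5", "wiki_deep6", "wiki_deep7",
]


def rank(identifier):
    """Position of the gazetteer an identifier belongs to (by prefix)."""
    for position, prefix in enumerate(GAZETTEER_ORDER):
        if identifier.startswith(prefix):
            return position
    return len(GAZETTEER_ORDER)        # unknown prefixes go last


def build_lookup(items):
    """Map each source expression to its first entry in gazetteer order."""
    # sorted() is stable, so items within one gazetteer keep their order.
    ordered = sorted(items, key=lambda item: rank(item[0]))
    lookup = {}
    for identifier, source, target in ordered:
        # setdefault keeps the first occurrence and ignores later ones.
        lookup.setdefault(source, (identifier, target))
    return lookup


if __name__ == "__main__":
    # Hypothetical entries: the same expression occurs in two gazetteers.
    items = [
        ("wiki_deep1_7", "Open file", "target-from-wiki"),
        ("vlc_3", "Open file", "target-from-vlc"),
        ("kde_9", "Settings", "target-from-kde"),
    ]
    print(build_lookup(items)["Open file"])
```

In this sketch a wiki entry is shadowed whenever vlc, libreoffice or kde already contains the same expression, so what the wiki gazetteer contributes depends on which gazetteers precede it.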
Including the wiki gazetteer helps more when it follows the vlc, libreoffice, kde combination than when it is used
alone, which is probably related to how the gazetteers are ordered.
Unlike in translation from English, the combination of vlc, libreoffice and kde worsened
the translation quality in the opposite direction for all source languages. This might result from
a different distribution of named entities in user questions. Software texts, covered by localization
files, are expected to appear rarely, while names of software, covered by wiki titles, should be more frequent.
Nevertheless, this does not hold for translation from Dutch, where gazetteers brought no improvement.