ufal / treex

Treex NLP framework

Gazetteer #16

Closed michnov closed 8 years ago

michnov commented 9 years ago

Support for gazetteers in *->EN translation directions. It is possible to specify what kind of gazetteer items should be used by changing the "gazetteer" parameter.

gazetteer=0 ... no gazetteer
gazetteer=all ... all items in the gazetteer
gazetteer=vlc ... only items extracted from VLC (their identifier starts with "vlc")
gazetteer=vlc,libreoffice,kde,batch1a ... items specified by these prefixes
gazetteer=wiki ... items extracted from Wiki by Rosa, all values of 'deep'
gazetteer=wiki_deep1,wiki_deep2 ... items extracted from Wiki by Rosa, only the depth of 1 and 2
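The prefix-based selection described above can be sketched roughly as follows. This is not the Treex implementation (which is written in Perl); the function name `select_items`, the `(identifier, phrase)` item layout and the sample identifiers are all hypothetical, chosen only to illustrate how a value such as `wiki` could match every `wiki_deepN` item.

```python
def select_items(items, gazetteer):
    """Filter gazetteer items by the value of the "gazetteer" parameter.

    items     -- list of (identifier, phrase) pairs (hypothetical layout)
    gazetteer -- "0", "all", or a comma-separated list of identifier prefixes
    """
    if gazetteer == "0":
        return []                      # gazetteer switched off
    if gazetteer == "all":
        return list(items)             # no filtering at all
    # str.startswith accepts a tuple, so one call checks every prefix
    prefixes = tuple(gazetteer.split(","))
    return [(ident, phrase) for ident, phrase in items
            if ident.startswith(prefixes)]


# Hypothetical sample items; real identifiers may look different.
sample = [
    ("vlc_1", "Play"),
    ("kde_7", "Dolphin"),
    ("wiki_deep1_3", "Firefox"),
    ("wiki_deep2_9", "GIMP"),
]

wiki_only = select_items(sample, "wiki")
mixed = select_items(sample, "vlc,wiki_deep1")
```

Under these assumptions, `wiki` selects both `wiki_deep1_3` and `wiki_deep2_9`, while `vlc,wiki_deep1` keeps only the VLC item and the depth-1 Wiki item.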

Results for CS<->EN, ES<->EN and NL<->EN using various gazetteer subsets are as follows:

EN->CS:

28.55   6.8371  Scen::EN2CS domain=IT resegment=1 gazetteer=0
31.31   7.2574  Scen::EN2CS domain=IT resegment=1 gazetteer=vlc,libreoffice,kde
31.35   7.2609  Scen::EN2CS domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,batch1a
28.58   6.8391  Scen::EN2CS domain=IT resegment=1 gazetteer=wiki_deep1
28.62   6.8441  Scen::EN2CS domain=IT resegment=1 gazetteer=wiki_deep1,wiki_deep2
28.58   6.8417  Scen::EN2CS domain=IT resegment=1 gazetteer=wiki_deep1,wiki_deep2,wiki_deep3
28.59   6.8405  Scen::EN2CS domain=IT resegment=1 gazetteer=wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4
28.68   6.8426  Scen::EN2CS domain=IT resegment=1 gazetteer=wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4,wiki_deep5
28.85   6.8468  Scen::EN2CS domain=IT resegment=1 gazetteer=wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4,wiki_deep5,wiki_deep6
28.86   6.8431  Scen::EN2CS domain=IT resegment=1 gazetteer=wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4,wiki_deep5,wiki_deep6,wiki_deep7
28.86   6.8431  Scen::EN2CS domain=IT resegment=1 gazetteer=wiki
31.38   7.2665  Scen::EN2CS domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,batch1a,wiki_deep1
31.4    7.2688  Scen::EN2CS domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,batch1a,wiki_deep1,wiki_deep2
31.4    7.2709  Scen::EN2CS domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,batch1a,wiki_deep1,wiki_deep2,wiki_deep3
31.65   7.3002  Scen::EN2CS domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,batch1a,wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4
31.72   7.3038  Scen::EN2CS domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,batch1a,wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4,wiki_deep5
31.77   7.3129  Scen::EN2CS domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,batch1a,wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4,wiki_deep5,wiki_deep6
31.79   7.3158  Scen::EN2CS domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,batch1a,wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4,wiki_deep5,wiki_deep6,wiki_deep7
31.79   7.3158  Scen::EN2CS domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,batch1a,wiki
31.79   7.3158  Scen::EN2CS domain=IT resegment=1 gazetteer=all

EN->ES:

26.52   6.8011  Scen::EN2ES domain=IT resegment=1 gazetteer=0
28.15   7.0629  Scen::EN2ES domain=IT resegment=1 gazetteer=vlc,libreoffice,kde
26.52   6.8000  Scen::EN2ES domain=IT resegment=1 gazetteer=wiki_deep1
26.55   6.7786  Scen::EN2ES domain=IT resegment=1 gazetteer=wiki_deep1,wiki_deep2
26.67   6.7838  Scen::EN2ES domain=IT resegment=1 gazetteer=wiki_deep1,wiki_deep2,wiki_deep3
26.7    6.7635  Scen::EN2ES domain=IT resegment=1 gazetteer=wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4
26.83   6.7660  Scen::EN2ES domain=IT resegment=1 gazetteer=wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4,wiki_deep5
26.88   6.7531  Scen::EN2ES domain=IT resegment=1 gazetteer=wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4,wiki_deep5,wiki_deep6
26.88   6.7531  Scen::EN2ES domain=IT resegment=1 gazetteer=wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4,wiki_deep5,wiki_deep6,wiki_deep7
26.88   6.7531  Scen::EN2ES domain=IT resegment=1 gazetteer=wiki
28.15   7.0629  Scen::EN2ES domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,wiki_deep1
28.22   7.0678  Scen::EN2ES domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,wiki_deep1,wiki_deep2
28.32   7.0750  Scen::EN2ES domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,wiki_deep1,wiki_deep2,wiki_deep3
28.33   7.0773  Scen::EN2ES domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4
28.52   7.0904  Scen::EN2ES domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4,wiki_deep5
28.62   7.0969  Scen::EN2ES domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4,wiki_deep5,wiki_deep6
28.62   7.0969  Scen::EN2ES domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4,wiki_deep5,wiki_deep6,wiki_deep7
28.62   7.0969  Scen::EN2ES domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,wiki
28.62   7.0969  Scen::EN2ES domain=IT resegment=1 gazetteer=all

EN->NL:

24.22   6.3541  Scen::EN2NL domain=IT resegment=1 gazetteer=0
25.2    6.4676  Scen::EN2NL domain=IT resegment=1 gazetteer=vlc,libreoffice,kde
24.22   6.3548  Scen::EN2NL domain=IT resegment=1 gazetteer=wiki_deep1
24.03   6.3192  Scen::EN2NL domain=IT resegment=1 gazetteer=wiki_deep1,wiki_deep2
24.22   6.3245  Scen::EN2NL domain=IT resegment=1 gazetteer=wiki_deep1,wiki_deep2,wiki_deep3
24.31   6.3297  Scen::EN2NL domain=IT resegment=1 gazetteer=wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4
24.38   6.3444  Scen::EN2NL domain=IT resegment=1 gazetteer=wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4,wiki_deep5
24.4    6.3312  Scen::EN2NL domain=IT resegment=1 gazetteer=wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4,wiki_deep5,wiki_deep6
24.34   6.3259  Scen::EN2NL domain=IT resegment=1 gazetteer=wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4,wiki_deep5,wiki_deep6,wiki_deep7
24.34   6.3269  Scen::EN2NL domain=IT resegment=1 gazetteer=wiki
25.64   6.5217  Scen::EN2NL domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,wiki
25.2    6.4676  Scen::EN2NL domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,wiki_deep1
25.2    6.4701  Scen::EN2NL domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,wiki_deep1,wiki_deep2
25.49   6.4969  Scen::EN2NL domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,wiki_deep1,wiki_deep2,wiki_deep3
25.57   6.5103  Scen::EN2NL domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4
25.66   6.5266  Scen::EN2NL domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4,wiki_deep5
25.65   6.5225  Scen::EN2NL domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4,wiki_deep5,wiki_deep6
25.64   6.5217  Scen::EN2NL domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4,wiki_deep5,wiki_deep6,wiki_deep7
25.63   6.5232  Scen::EN2NL domain=IT resegment=1 gazetteer=all

The results for translation from English are very similar for all three languages. Most of the improvement comes from the combination of vlc, libreoffice and kde; wiki adds a little on top of it. Note that the ordering of items in the gazetteers matters, since only the first occurrence of each expression in the source list is taken into consideration. The gazetteers used follow this ordering: vlc, libreoffice, kde, batch1a, wiki_deep1, wiki_deep2, wiki_deep3, wiki_deep4, wiki_deep5, wiki_deep6, wiki_deep7. The fact that including the wiki gazetteer helps more when it follows the vlc, libreoffice, kde combination than when it is used alone is probably related to how the gazetteers are ordered.
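The effect of ordering can be illustrated with a small sketch. It assumes that "only the first occurrence is taken into consideration" means an expression already seen in an earlier gazetteer shadows the same expression in any later one; the function name `build_lookup`, the data layout and the sample entries are hypothetical and not taken from Treex.

```python
def build_lookup(sources):
    """Merge gazetteers in order; the first occurrence of a phrase wins.

    sources -- ordered list of (gazetteer_name, [(phrase, translation), ...])
    Returns a dict mapping phrase -> (gazetteer_name, translation).
    """
    lookup = {}
    for name, entries in sources:
        for phrase, translation in entries:
            # setdefault keeps an existing entry, so later duplicates are ignored
            lookup.setdefault(phrase, (name, translation))
    return lookup


# Hypothetical entries: "Open" appears in both lists.
sources = [
    ("kde", [("Open", "Otevřít")]),
    ("wiki_deep1", [("Open", "Open"), ("Firefox", "Firefox")]),
]

lookup = build_lookup(sources)
```

Here `lookup["Open"]` comes from `kde` because it is listed first, while `Firefox` still comes from `wiki_deep1`. Reversing the order of `sources` would let the Wiki entry win instead, which is one way the ordering could influence the scores.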


CS->EN:

28.59   6.9509  Scen::CS2EN domain=IT resegment=1 gazetteer=0
28.48   6.9349  Scen::CS2EN domain=IT resegment=1 gazetteer=vlc,libreoffice,kde
28.48   6.9349  Scen::CS2EN domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,batch1a
28.61   6.9530  Scen::CS2EN domain=IT resegment=1 gazetteer=wiki_deep1
28.51   6.9355  Scen::CS2EN domain=IT resegment=1 gazetteer=wiki_deep1,wiki_deep2
28.52   6.9455  Scen::CS2EN domain=IT resegment=1 gazetteer=wiki_deep1,wiki_deep2,wiki_deep3
28.8    6.9505  Scen::CS2EN domain=IT resegment=1 gazetteer=wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4
29.14   6.9759  Scen::CS2EN domain=IT resegment=1 gazetteer=wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4,wiki_deep5
29.14   6.9824  Scen::CS2EN domain=IT resegment=1 gazetteer=wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4,wiki_deep5,wiki_deep6
29.05   6.9671  Scen::CS2EN domain=IT resegment=1 gazetteer=wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4,wiki_deep5,wiki_deep6,wiki_deep7
29.05   6.9671  Scen::CS2EN domain=IT resegment=1 gazetteer=wiki
28.49   6.9353  Scen::CS2EN domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,batch1a,wiki_deep1
28.48   6.9275  Scen::CS2EN domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,batch1a,wiki_deep1,wiki_deep2
28.49   6.9377  Scen::CS2EN domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,batch1a,wiki_deep1,wiki_deep2,wiki_deep3
28.58   6.9454  Scen::CS2EN domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,batch1a,wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4
28.85   6.9633  Scen::CS2EN domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,batch1a,wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4,wiki_deep5
29.06   6.9773  Scen::CS2EN domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,batch1a,wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4,wiki_deep5,wiki_deep6
28.97   6.9625  Scen::CS2EN domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,batch1a,wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4,wiki_deep5,wiki_deep6,wiki_deep7
28.98   6.9637  Scen::CS2EN domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,batch1a,wiki
28.98   6.9637  Scen::CS2EN domain=IT resegment=1 gazetteer=all

ES->EN:

20.92   5.9837  Scen::ES2EN domain=IT resegment=1 gazetteer=0
19.24   5.8512  Scen::ES2EN domain=IT resegment=1 gazetteer=vlc,libreoffice,kde
20.92   5.9837  Scen::ES2EN domain=IT resegment=1 gazetteer=wiki_deep1
21.05   5.9730  Scen::ES2EN domain=IT resegment=1 gazetteer=wiki_deep1,wiki_deep2
21.2    5.9928  Scen::ES2EN domain=IT resegment=1 gazetteer=wiki_deep1,wiki_deep2,wiki_deep3
21.2    5.9869  Scen::ES2EN domain=IT resegment=1 gazetteer=wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4
21.35   5.9840  Scen::ES2EN domain=IT resegment=1 gazetteer=wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4,wiki_deep5
21.42   5.9929  Scen::ES2EN domain=IT resegment=1 gazetteer=wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4,wiki_deep5,wiki_deep6
21.42   5.9929  Scen::ES2EN domain=IT resegment=1 gazetteer=wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4,wiki_deep5,wiki_deep6,wiki_deep7
21.42   5.9929  Scen::ES2EN domain=IT resegment=1 gazetteer=wiki
19.24   5.8512  Scen::ES2EN domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,wiki_deep1
20.09   5.9001  Scen::ES2EN domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,wiki_deep1,wiki_deep2
20.21   5.9123  Scen::ES2EN domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,wiki_deep1,wiki_deep2,wiki_deep3
20.19   5.9042  Scen::ES2EN domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4
20.46   5.9257  Scen::ES2EN domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4,wiki_deep5
20.55   5.9409  Scen::ES2EN domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4,wiki_deep5,wiki_deep6
20.55   5.9409  Scen::ES2EN domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4,wiki_deep5,wiki_deep6,wiki_deep7
20.55   5.9409  Scen::ES2EN domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,wiki
20.55   5.9409  Scen::ES2EN domain=IT resegment=1 gazetteer=all

NL->EN:

39.45   7.5888  Scen::NL2EN domain=IT resegment=1 gazetteer=0
38.83   7.5226  Scen::NL2EN domain=IT resegment=1 gazetteer=vlc,libreoffice,kde
39.45   7.5888  Scen::NL2EN domain=IT resegment=1 gazetteer=wiki_deep1
39.46   7.5901  Scen::NL2EN domain=IT resegment=1 gazetteer=wiki_deep1,wiki_deep2
39.19   7.5599  Scen::NL2EN domain=IT resegment=1 gazetteer=wiki_deep1,wiki_deep2,wiki_deep3
38.83   7.5301  Scen::NL2EN domain=IT resegment=1 gazetteer=wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4
38.86   7.5313  Scen::NL2EN domain=IT resegment=1 gazetteer=wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4,wiki_deep5
38.94   7.5409  Scen::NL2EN domain=IT resegment=1 gazetteer=wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4,wiki_deep5,wiki_deep6
38.94   7.5394  Scen::NL2EN domain=IT resegment=1 gazetteer=wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4,wiki_deep5,wiki_deep6,wiki_deep7
38.94   7.5397  Scen::NL2EN domain=IT resegment=1 gazetteer=wiki
38.83   7.5226  Scen::NL2EN domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,wiki_deep1
38.83   7.5226  Scen::NL2EN domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,wiki_deep1,wiki_deep2
38.82   7.5178  Scen::NL2EN domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,wiki_deep1,wiki_deep2,wiki_deep3
38.67   7.5043  Scen::NL2EN domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4
38.64   7.5041  Scen::NL2EN domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4,wiki_deep5
38.76   7.5149  Scen::NL2EN domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4,wiki_deep5,wiki_deep6
38.76   7.5149  Scen::NL2EN domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,wiki_deep1,wiki_deep2,wiki_deep3,wiki_deep4,wiki_deep5,wiki_deep6,wiki_deep7
38.75   7.5143  Scen::NL2EN domain=IT resegment=1 gazetteer=vlc,libreoffice,kde,wiki
38.75   7.5143  Scen::NL2EN domain=IT resegment=1 gazetteer=all

Unlike in translation from English, the combination of vlc, libreoffice and kde worsened translation quality in the opposite direction for all source languages. This might result from a different distribution of named entities in user questions. Software texts, covered by localization files, are expected to appear rarely, while names of software, covered by wiki titles, should be more frequent. Nevertheless, this does not hold for translation from Dutch, where the gazetteers brought no improvement.
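To make the contrast between the two directions concrete, the difference between the gazetteer=0 row and the gazetteer=vlc,libreoffice,kde row can be computed per direction. The values below are copied from the first score column of the tables above; the variable names are only illustrative.

```python
# First score column from the result tables, per direction:
# (gazetteer=0, gazetteer=vlc,libreoffice,kde)
scores = {
    "EN2CS": (28.55, 31.31),
    "EN2ES": (26.52, 28.15),
    "EN2NL": (24.22, 25.2),
    "CS2EN": (28.59, 28.48),
    "ES2EN": (20.92, 19.24),
    "NL2EN": (39.45, 38.83),
}

# Difference against the no-gazetteer baseline, rounded to two decimals
deltas = {direction: round(with_gaz - baseline, 2)
          for direction, (baseline, with_gaz) in scores.items()}

for direction, delta in deltas.items():
    print(f"{direction}: {delta:+.2f}")
```

The differences are positive for all three EN->* directions and negative for all three *->EN directions, matching the observation above.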