Can the performance of the guesser be improved?

dedekj commented 8 years ago

I noticed that the tagger with the guesser enabled is sometimes very slow. For example, the 100-token sentence below took the Czech tagger about 3 s on my laptop, but only 55 ms without the guesser.

Could the performance of the guesser be improved?

(Sorry, I know it's actually Slovak, but sometimes the data is not clean...)

Imunogény pozostávajú z obalových polypeptidov E vírusov s mol. hmot. cca 57 000 hm. j., s nasledujúcim sledom aminokyselín (KE): SRCTHLENRD FVTGTQGTTR VTL VLELGGC VTITAEGKPS MDVWLDATYQ ENPAKTREYC LHAKLSDTKV AARCPT MGPA TLAEEHQGGT VKVEPHTGDY VAANETHSGR KTASFTISSE KTTLTMGEYG DVSL LCRVAS GVDLAQTVIL ELDKTVEHLP TAWQVHRDWF NDLALPWHKE GAQNWNNA ER LVEFGAPHAV KMDVYNLGDQ TGVLLKALAG VPVAHIEGTK YHLKSGHVTC EVGLEKLKMK GLTYTMCDKT KFTWKRAPTD SGHDTVVMEV TFSGTKPCRI PVRA VAHGSP DVNVAMLITP NPTIENNGGG FIEMQLPPGD NIIYVGELSH QWFOKGSS TG RVFQKTKKGI ERLTVIGEHA WDFGSAGGFL SSIGKAVHTV LGGAFNSIFG GVGFLPKLLL GVALAWLGLN MRNPTMSMSF LLAGGLVLAM GLGVGA, ktoré sú komplexne viazané svojími hydrofóbnymi C-koncami s bakteriálnymi proteozónami.

foxik commented 8 years ago

Yes, you are right. The current tagging algorithm is the following:

- the results of morphological analysis are disambiguated by an averaged perceptron using Viterbi decoding, usually of order 3
- therefore, the complexity of tagging is roughly O(number_words * average_analyses^3)
- the guesser returns quite a lot of (usually bad) matches (for uppercase words, even more than 50)
- therefore, if there are three such consecutive words, the time complexity gets to ~50^3 for a single word

That is why your example takes so long -- it contains a lot of consecutive words unknown to the morphological dictionary.
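To make the cubic growth concrete, here is a minimal sketch of that cost estimate. It is only a simplified counting model under the assumptions stated above (order-3 decoding, cost proportional to the number of tag combinations in each window), not MorphoDiTa's actual implementation; the function name `viterbi_work` and the per-word analysis counts are made up for illustration.

```python
from math import prod


def viterbi_work(analyses_per_word, order=3):
    """Rough count of tag combinations scored by an order-`order` decoder.

    For each word, count every combination of candidate analyses for that
    word and its (order - 1) predecessors. This is a cost model only.
    """
    total = 0
    for i in range(len(analyses_per_word)):
        # Window of the current word plus up to (order - 1) previous words.
        window = analyses_per_word[max(0, i - order + 1): i + 1]
        total += prod(window)
    return total


# Hypothetical 100-word sentences: 2 analyses per known word,
# 50 guessed analyses per unknown word.
known_work = viterbi_work([2] * 100)
unknown_work = viterbi_work([50] * 100)
print(known_work, unknown_work)  # 790 12252550
```

In this model the sentence of unknown words costs roughly 15,000 times more than the sentence of known words, which is the same kind of blow-up as in the reported example, although the real ratio depends on the model and the data.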

As for how to improve the situation:

1. we could incorporate some pruning into the tagging algorithm [to make the complexity O(min(200, analyses^3)) for one word]
2. we could make the guesser return fewer analyses [to make the complexity O(min(10, analyses)^3) for one word]
3. we could change the tagging algorithm to another one that is not so sensitive to the number of guessed analyses [for example, with only linear dependence on the number of analyses]

We are planning option 3 in the near future (using recurrent neural networks), which is why I do not want to implement option 1. However, for the time being, we could do option 2: the current guesser is quite old, and we could train another one with a limit on the number of returned analyses.

I will think about it some more. Comments welcome.
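As a rough illustration of the second option, the sketch below caps the number of analyses kept per word and compares the per-word cost bound before and after. Everything here is hypothetical: the `cap_analyses` helper, the `(score, tag)` pairs and the scores themselves are assumptions for the example, and the actual fix would be to train a new guesser model rather than filter its output afterwards.

```python
import heapq


def cap_analyses(scored_analyses, limit=10):
    """Keep only the `limit` highest-scoring analyses for one word.

    `scored_analyses` is a list of (score, tag) pairs. The score is a
    hypothetical confidence value, not something the guesser is known to expose.
    """
    return heapq.nlargest(limit, scored_analyses, key=lambda pair: pair[0])


# Hypothetical guesser output: 50 candidate tags with made-up scores.
guessed = [(1.0 / (rank + 1), f"TAG{rank}") for rank in range(50)]
capped = cap_analyses(guessed)

# Per-word cost bound for three consecutive unknown words (order-3 decoding).
cost_before = len(guessed) ** 3
cost_after = len(capped) ** 3
print(len(capped), cost_before, cost_after)  # 10 125000 1000
```

With a cap of 10, the worst-case per-word factor drops from 50^3 = 125000 to 10^3 = 1000, which matches the O(min(10, analyses)^3) bound above.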

foxik commented 8 years ago

We are currently planning to release new MorphoDiTa models (using the current algorithms) which should improve the situation. By improving the situation I mean that texts with lots of unknown words will still be slower than texts with known words, but not by such a large margin as today (imagine 300-500 ms for the sentence you describe, instead of 3 s). The timeframe is about a month.

Then, in the longer term (~6-12 months), we plan to publish a new major release with a completely different algorithm, which should be nearly insensitive to the text being processed.

dedekj commented 8 years ago

Seems good. We are not struggling with the issue much. I mainly wanted to let you know and get some insight into what is happening... Looking forward to the new improvements :-)

foxik commented 7 years ago

We have just released new MorphoDiTa models, which improve the performance of the guesser. For the example in this issue:

| Model | Guesser | Time |
|---|---|---|
| `czech-morfflex-pdt-160310.tagger` | no | 1 ms |
| `czech-morfflex-pdt-160310.tagger` | yes | 3167 ms |
| `czech-morfflex-pdt-161115.tagger` | no | 1 ms |
| `czech-morfflex-pdt-161115.tagger` | yes | 44 ms |

The text with unknown words is still slower when using the guesser, but it is nearly two orders of magnitude faster than before. In the future we hope to improve the situation by changing the tagging algorithm, but that is planned for Q2 or Q3 of 2017.
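For reference, the speedup can be checked directly from the two guesser timings reported for this issue. The variable names below are only for illustration.

```python
import math

# Timings with the guesser enabled, as reported for the example sentence.
time_old_ms = 3167  # czech-morfflex-pdt-160310.tagger
time_new_ms = 44    # czech-morfflex-pdt-161115.tagger

speedup = time_old_ms / time_new_ms
orders_of_magnitude = math.log10(speedup)
print(f"speedup: {speedup:.0f}x, {orders_of_magnitude:.2f} orders of magnitude")
```

That is a speedup of roughly 72x, or a bit under two orders of magnitude.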

ufal / morphodita

Can the performance of the guesser be improved? #9