Closed. dedekj closed this issue 7 years ago.
Yes, you are right. The current tagging algorithm is the following:
That is why your example takes so long: it contains many consecutive words unknown to the morphological dictionary.
As for how to improve the situation:
We are planning option 3 in the near future (using recurrent neural networks), which is why I do not want to implement option 1. However, for the time being, we could do option 2: the current guesser is quite old, and we could train another one with a limit on the number of returned analyses.
I will think about it some more. Comments welcome.
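To illustrate why a limit on the number of returned analyses could help, here is a rough sketch. It assumes (this is not stated in this thread) that decoding cost grows with the product of candidate counts over a short window of consecutive words. The window size of 3, the counts, and the names `window_cost` and `cap` are all hypothetical; this is not MorphoDiTa code.

```python
from math import prod

def window_cost(analyses_per_word, order=3):
    """Sum, over every window of `order` consecutive words,
    the product of the per-word candidate counts."""
    counts = list(analyses_per_word)
    if len(counts) < order:
        return prod(counts)
    return sum(prod(counts[i:i + order])
               for i in range(len(counts) - order + 1))

def cap(analyses_per_word, limit):
    """Keep at most `limit` candidate analyses per word."""
    return [min(n, limit) for n in analyses_per_word]

# Hypothetical counts: known words get few analyses,
# guessed (unknown) words get many.
known = [2, 3, 1, 2]
unknown_run = [40] * 6
sentence = known + unknown_run

uncapped = window_cost(sentence)
capped = window_cost(cap(sentence, 5))
print(f"uncapped cost: {uncapped}, capped cost: {capped}")
```

Under these assumptions a run of unknown words dominates the total, and capping the guesser output shrinks it sharply. Real timings depend on the actual model and decoder.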
We are currently planning to release new MorphoDiTa models (using the current algorithms), which should improve the situation. By that I mean that texts with many unknown words will still be slower than texts with known words, but not by as large a margin as today (imagine 300-500ms for the sentence you describe, instead of 3s). The timeframe is about a month.
Then, in the longer term (~6-12 months), we plan to publish a new major release with a completely different algorithm, which should be nearly insensitive to the text being processed.
Seems good. We are not struggling with the issue much. I mainly wanted to let you know and get some insight into what is happening... Looking forward to the new improvements :-)
We have just released new MorphoDiTa models, which improve the performance of the guesser. For the example in this issue:
| Model | Guesser | Time |
|---|---|---|
| czech-morfflex-pdt-160310.tagger | no | 1ms |
| czech-morfflex-pdt-160310.tagger | yes | 3167ms |
| czech-morfflex-pdt-161115.tagger | no | 1ms |
| czech-morfflex-pdt-161115.tagger | yes | 44ms |
The text with unknown words is still slower when using the guesser, but it is nearly two orders of magnitude faster than before. In the future we hope to improve the situation further by changing the tagging algorithm, but that is planned for Q2 or Q3 of 2017.
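As a quick check of the "nearly two orders" claim, the guesser-enabled timings from the table above can be compared directly. The variable names are only illustrative.

```python
import math

# Guesser-enabled timings in milliseconds, copied from the table above.
old_ms = 3167  # czech-morfflex-pdt-160310.tagger
new_ms = 44    # czech-morfflex-pdt-161115.tagger

speedup = old_ms / new_ms       # how many times faster the new model is
orders = math.log10(speedup)    # the same ratio in orders of magnitude

print(f"speedup: {speedup:.1f}x, orders of magnitude: {orders:.2f}")
```

The ratio comes out at roughly 72x, which is a bit under two orders of magnitude.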
I noticed that the tagger with the guesser enabled is sometimes very slow. For example, the 100-token-long sentence below took the Czech tagger about 3s on my laptop, and it took only 55 ms without the guesser.
Could the performance of the guesser be improved?
(Sorry, I know it's actually Slovak, but sometimes the data is not clean...)
Imunogény pozostávajú z obalových polypeptidov E vírusov s mol. hmot. cca 57 000 hm. j., s nasledujúcim sledom aminokyselín (KE): SRCTHLENRD FVTGTQGTTR VTL VLELGGC VTITAEGKPS MDVWLDATYQ ENPAKTREYC LHAKLSDTKV AARCPT MGPA TLAEEHQGGT VKVEPHTGDY VAANETHSGR KTASFTISSE KTTLTMGEYG DVSL LCRVAS GVDLAQTVIL ELDKTVEHLP TAWQVHRDWF NDLALPWHKE GAQNWNNA ER LVEFGAPHAV KMDVYNLGDQ TGVLLKALAG VPVAHIEGTK YHLKSGHVTC EVGLEKLKMK GLTYTMCDKT KFTWKRAPTD SGHDTVVMEV TFSGTKPCRI PVRA VAHGSP DVNVAMLITP NPTIENNGGG FIEMQLPPGD NIIYVGELSH QWFOKGSS TG RVFQKTKKGI ERLTVIGEHA WDFGSAGGFL SSIGKAVHTV LGGAFNSIFG GVGFLPKLLL GVALAWLGLN MRNPTMSMSF LLAGGLVLAM GLGVGA, ktoré sú komplexne viazané svojími hydrofóbnymi C-koncami s bakteriálnymi proteozónami.
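For anyone who wants to reproduce this kind of comparison, here is a minimal timing sketch. The `fake_tag` function is a hypothetical stand-in, not the MorphoDiTa API; replace it with a real tagger call, with and without the guesser, to get meaningful numbers.

```python
import statistics
import time

def time_ms(fn, *args, repeats=5):
    """Return the median wall-clock time of fn(*args) in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

def fake_tag(text, use_guesser):
    # Hypothetical stand-in for a tagger call: it only splits on
    # whitespace and attaches a placeholder label to each token.
    label = "GUESSED" if use_guesser else "UNKNOWN"
    return [(token, label) for token in text.split()]

# A short repeated excerpt of the sentence above, used as sample input.
text = "SRCTHLENRD FVTGTQGTTR VTL VLELGGC " * 25

with_guesser = time_ms(fake_tag, text, True)
without_guesser = time_ms(fake_tag, text, False)
print(f"with guesser: {with_guesser:.3f} ms, without: {without_guesser:.3f} ms")
```

Taking the median of several runs reduces the effect of one-off delays such as model loading or cache warm-up.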