weird output for Arabic sentences

ufal / udpipe

UDPipe: Trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files

Mozilla Public License 2.0


weird output for Arabic sentences #20

Closed. JingL1014 closed this issue 7 years ago

JingL1014 commented 7 years ago

Hi, thanks for developing this useful tool. I have recently been trying to use UDPipe on Arabic documents. I tried the pre-trained model for UD 1.2, and I also trained a model on the UD 1.2 treebank myself. But the POS tags returned by the tool are always "X", and as a result the dependency parser returns wrong output as well. Could you help me with this problem? Thanks!

JingL1014 commented 7 years ago

I have also tried running only the dependency parser, with the POS tags given, and the result is correct. So I think there might be something wrong with the POS tagger.
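One way to quantify the symptom is to count the UPOS tags in the CoNLL-U output. The following is a minimal sketch, not part of UDPipe: the helper names `upos_counts` and `share_of_tag` are hypothetical, and the sample is made-up data with placeholder word forms rather than real tagger output.

```python
from collections import Counter


def upos_counts(conllu_text):
    """Count UPOS tags (4th CoNLL-U column) over regular word lines."""
    counts = Counter()
    for line in conllu_text.splitlines():
        # Skip blank sentence separators and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        # Skip multiword token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        counts[cols[3]] += 1
    return counts


def share_of_tag(counts, tag="X"):
    """Fraction of counted words that carry the given tag."""
    total = sum(counts.values())
    return counts[tag] / total if total else 0.0


# Made-up sample with placeholder forms (w1, w2, w3), not real UDPipe output.
sample = "\n".join([
    "# sent_id = 1",
    "1\tw1\t_\tX\t_\t_\t0\troot\t_\t_",
    "2\tw2\t_\tX\t_\t_\t1\tdep\t_\t_",
    "3\tw3\t_\tPUNCT\t_\t_\t1\tpunct\t_\t_",
    "",
])
counts = upos_counts(sample)
print(counts, share_of_tag(counts))
```

A share of "X" close to 1.0 on ordinary text suggests that the tagger treats nearly every word as unknown.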

dan-zeman commented 7 years ago

This resembles an issue that has been discussed in an e-mail thread. Avoid models trained on UD 1.2; try UD 1.4 instead. The older releases of the treebank had word forms with vowel diacritics. A tagger trained on such data cannot tag normal Arabic text, because without the diacritics all words are unknown to it.
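The mismatch can be illustrated with a plain string comparison. This is only a sketch of the idea, not of how the UDPipe tagger stores its vocabulary: `oov_rate` is a hypothetical helper, and the one-word "training vocabulary" is an illustrative assumption.

```python
def oov_rate(tokens, vocabulary):
    """Fraction of tokens that do not occur in the vocabulary."""
    if not tokens:
        return 0.0
    unknown = sum(1 for token in tokens if token not in vocabulary)
    return unknown / len(tokens)


# The same word, once with fatha marks (U+064E) and once without.
vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"
plain = "\u0643\u062A\u0628"

# Illustrative vocabulary that contains only the vocalized form.
train_vocab = {vocalized}

print(oov_rate([vocalized], train_vocab))  # 0.0
print(oov_rate([plain], train_vocab))      # 1.0
```

Because the diacritics are separate code points, the unvocalized form is a different string and is never found in a vocabulary built from fully vocalized text.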

JingL1014 commented 7 years ago

Thanks for your quick response. I will train the model on UD 1.4.

foxik commented 7 years ago

As @dan-zeman wrote, the issue is with vocalization -- UD 1.2 training data was fully vocalized, so the UD 1.2 models require full diacritics.

The UD 1.4 Arabic data are no longer vocalized, so training on UD 1.4 improves the situation (i.e., it works for unvocalized input).

We are also implementing an Arabic normalization algorithm in UDPipe and will release UD 1.4 models using that algorithm, which will allow parsing of vocalized, unvocalized, or semi-vocalized input.

foxik commented 7 years ago

Leaving open until we release a working Arabic model.

foxik commented 7 years ago

The UDPipe 1.1 and the CoNLL2017 Arabic model should work better -- they remove vocalization marks before tagging and parsing.
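For older models, or for preprocessing outside UDPipe, removing vocalization marks can be approximated as below. This is a rough sketch and not UDPipe's actual normalization algorithm: the function name `strip_vocalization` is hypothetical, and the set of code points (U+064B to U+0652 plus the superscript alef U+0670) is an assumption about which marks to drop.

```python
import re

# Assumed set of marks: tanwin, fatha, damma, kasra, shadda, sukun
# (U+064B to U+0652) and superscript alef (U+0670).
ARABIC_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_vocalization(text):
    """Remove Arabic vocalization marks, leaving the base letters intact."""
    return ARABIC_DIACRITICS.sub("", text)


# Fully vocalized, semi-vocalized and unvocalized spellings of one word.
full = "\u0643\u064E\u062A\u064E\u0628\u064E"
partial = "\u0643\u064E\u062A\u0628"
plain = "\u0643\u062A\u0628"

forms = {strip_vocalization(s) for s in (full, partial, plain)}
print(len(forms))  # 1
```

After stripping, all three spellings collapse to the same string, which is why a model working on normalized forms can handle any degree of vocalization in the input.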