Open khaledJabr opened 6 years ago
Do you mean what label it could be? @khaledJabr
https://github.com/oudalab/Arabic-NER/issues/10 @khaledJabr check this issue; all the NER classes are listed there.
It sounds like the thing to do is to rerun one of the simple models with some simple changes to the Arabic text:
Khaled's code above does all that, so I think we should run that over all the `orth`s, retrain the model, and see how it goes. (We should get much better word embedding coverage after doing that.)
Well, fixing only the `orth` tokens will throw an exception, since the algorithm looks at the positions of those tokens: if we strip the extra space (or whatever other stray stuff) from a token but the original text does not change with it, we run into this error. I looked into the training data; it does not store start and end indices, but it still uses them somehow, so when we delete the "useless" stuff it stops working. @khaledJabr @ahalterman, wonder if you would like to jump in and clean the raw text. Khaled, since I can't read Arabic I won't be able to do this, so I will point you to the data. The data you need to clean is on hanover:
training:
/home/yan/arabicNER/nerdata/cleaned_combined_removed.json
eval data:
/home/yan/arabicNER/nerdata/ar_eval_all_cleaned_removed.json
Here is the code I have for cleaning up the tokens, if you want to take a look @khaledJabr: https://github.com/oudalab/Arabic-NER/blob/master/explore_traingdata.ipynb
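To illustrate the alignment hazard mentioned above: if token spans are derived from the stored `orth` strings, cleaning the tokens without cleaning the raw text shifts every subsequent span. This is only a minimal sketch; the record layout here is a guess at the JSON shape, not the actual file format:

```python
# Hypothetical record shape: raw text plus tokens stored as "orth" strings
# that include their trailing whitespace (so they concatenate to the text).
record = {
    "text": "ذهب محمد ",
    "tokens": [{"orth": "ذهب "}, {"orth": "محمد "}],
}

def offsets_from_tokens(tokens):
    """Recompute character spans by walking the orth strings in order."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok["orth"])))
        pos += len(tok["orth"])
    return spans

# Cleaning only the tokens (stripping trailing spaces) shifts every
# subsequent span, so the spans no longer match the uncleaned raw text.
cleaned = [{"orth": t["orth"].strip()} for t in record["tokens"]]
print(offsets_from_tokens(record["tokens"]))  # [(0, 4), (4, 9)] — matches text
print(offsets_from_tokens(cleaned))           # [(0, 3), (3, 7)] — misaligned
```

Whatever normalization we settle on has to be applied to the raw text and the tokens together, which is why cleaning only the `orth` values breaks things.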
I had the chance to look at the training data we are using for this, and there are two main issues with the data:
The training data includes diacritics. Diacritics are short vowel marks added to Arabic words to help with pronunciation and with differentiating the meanings of two or more otherwise identical words, and usually this is only needed at the lemma level. Diacritics are not used in modern Arabic writing, which includes our news sources and the data we collected from the coders through the Prodigy interface. I suspect this might be one of the things hurting the NER model. One really important thing to check here is whether the word embeddings we are using were trained on data with diacritics or not. I don't have a clear answer for how this has or could have affected our training, but my main intuition is that normalizing/standardizing our data as much as we can is always a good thing.
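Stripping diacritics is straightforward with a Unicode range; a minimal sketch (the exact set of marks to remove is a judgment call, e.g. whether to also drop tatweel):

```python
import re

# Arabic short-vowel marks (tashkeel) occupy U+064B..U+0652; U+0670 is the
# superscript alef and U+0640 is tatweel (kashida), both also stripped here.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")

def strip_diacritics(text):
    """Remove tashkeel, superscript alef, and tatweel from Arabic text."""
    return DIACRITICS.sub("", text)

print(strip_diacritics("مُحَمَّد"))  # -> محمد
print(strip_diacritics("كِتَاب"))   # -> كتاب
```

If the pretrained embeddings turn out to have been trained on undiacritized text, running this over both the training data and any inference-time input should improve embedding coverage.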
Aside from the diacritics, I have noticed that most (if not all) of the tokens (the actual tokens, the ones stored as `orth`) have an extra space at the end, and a lot of them have weird extra characters. Here are some examples:

Although many of these have a ner label of `o`, I still think they are worth fixing. Here is how I would go about fixing both issues (there are other ways, but this is the first thing that comes to mind):

One last thing: do we have a key or a table somewhere that lists the labels we are using in our big NER dataset (the combined one)?
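One way the two fixes could be combined is to apply the same normalization to every token and then rebuild the text from the cleaned tokens, so text and token positions stay in agreement. This is a sketch only; the field names (`orth`, `ner`) follow the thread, but the record layout is an assumption:

```python
import re

# Tashkeel (U+064B..U+0652), superscript alef, and tatweel.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")

def normalize(s):
    """Strip diacritics, then trailing/leading whitespace."""
    return DIACRITICS.sub("", s).strip()

def clean_record(record):
    """Normalize every token the same way, keeping ner labels attached,
    and rebuild the raw text from the cleaned tokens so offsets recomputed
    from the tokens match the text."""
    tokens = [
        {"orth": normalize(t["orth"]), "ner": t.get("ner", "O")}
        for t in record["tokens"]
        if normalize(t["orth"])  # drop tokens that become empty
    ]
    text = " ".join(t["orth"] for t in tokens)
    return {"text": text, "tokens": tokens}
```

Rebuilding the text sidesteps the position problem yan described: since offsets are recomputed from the cleaned tokens, nothing goes stale.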