Open khaledJabr opened 6 years ago
Do you mean what label it could be? @khaledJabr
https://github.com/oudalab/Arabic-NER/issues/10 @khaledJabr check this issue; all the NER classes are listed there.
It sounds like the thing to do is to rerun one of the simple models with some simple changes to the Arabic text:
Khaled's code above does all that, so I think we should run that over all the `orth`s, retrain the model, and see how it goes. (We should get much better word embedding coverage after doing that.)
Well, fixing only the `orth` tokens will throw an exception, since the algorithm looks at the positions of those tokens: if we strip the extra space (or whatever other stray stuff) from a token but the original text does not change with it, we run into this error. I looked into the training data; it does not store start and end indices, but it still uses them somehow, so when we delete the "useless" stuff it stops working. @khaledJabr @ahalterman, wonder if you would like to jump in and clean the raw text. Khaled, since I can't read Arabic I won't be able to do this, so I will point you to the data. The data you need to clean is on hanover:
training:
/home/yan/arabicNER/nerdata/cleaned_combined_removed.json
eval data:
/home/yan/arabicNER/nerdata/ar_eval_all_cleaned_removed.json
Here is the code I have for cleaning up the tokens, if you want to take a look @khaledJabr: https://github.com/oudalab/Arabic-NER/blob/master/explore_traingdata.ipynb
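To illustrate the alignment hazard mentioned above: if token spans are derived from the stored `orth` strings, cleaning the tokens without cleaning the raw text shifts every subsequent span. This is only a minimal sketch; the record layout here is a guess at the JSON shape, not the actual file format:

```python
# Hypothetical record shape: raw text plus tokens stored as "orth" strings
# that include their trailing whitespace (so they concatenate to the text).
record = {
    "text": "ذهب محمد ",
    "tokens": [{"orth": "ذهب "}, {"orth": "محمد "}],
}

def offsets_from_tokens(tokens):
    """Recompute character spans by walking the orth strings in order."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok["orth"])))
        pos += len(tok["orth"])
    return spans

# Cleaning only the tokens (stripping trailing spaces) shifts every
# subsequent span, so the spans no longer match the uncleaned raw text.
cleaned = [{"orth": t["orth"].strip()} for t in record["tokens"]]
print(offsets_from_tokens(record["tokens"]))  # [(0, 4), (4, 9)] — matches text
print(offsets_from_tokens(cleaned))           # [(0, 3), (3, 7)] — misaligned
```

Whatever normalization we settle on has to be applied to the raw text and the tokens together, which is why cleaning only the `orth` values breaks things.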
I had the chance to look at the training data we are using for this, and there are two main issues with the data:
The training data includes diacritics. Diacritics are short vowel marks added to Arabic words to help with pronunciation and with differentiating the meanings of two or more otherwise identical words, and usually this is only needed at the lemma level. Diacritics are not used in modern Arabic writing, which includes our news sources and the data we collected from the coders through the Prodigy interface. I suspect this might be one of the things hurting the NER model. One really important thing to check here is whether the word embeddings we are using were trained on data with diacritics or not. I don't have a clear answer for how this has or could have affected our training, but my main intuition is that normalizing/standardizing our data as much as we can is always a good thing.
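Stripping diacritics is straightforward with a Unicode range; a minimal sketch (the exact set of marks to remove is a judgment call, e.g. whether to also drop tatweel):

```python
import re

# Arabic short-vowel marks (tashkeel) occupy U+064B..U+0652; U+0670 is the
# superscript alef and U+0640 is tatweel (kashida), both also stripped here.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")

def strip_diacritics(text):
    """Remove tashkeel, superscript alef, and tatweel from Arabic text."""
    return DIACRITICS.sub("", text)

print(strip_diacritics("مُحَمَّد"))  # -> محمد
print(strip_diacritics("كِتَاب"))   # -> كتاب
```

If the pretrained embeddings turn out to have been trained on undiacritized text, running this over both the training data and any inference-time input should improve embedding coverage.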
Aside from the diacritics, I have noticed that most (if not all) of the tokens (the actual tokens, the ones stored as `orth`) have an extra space at the end, and a lot of them have weird extra characters. Here are some examples:

Although many of these have a ner label of `o`, I still think they are worth fixing. Here is how I would go about fixing both issues (there are other ways, but this is the first thing that comes to mind):

One last thing: do we have a key or a table somewhere that lists the labels we are using in our big NER dataset (the combined one)?
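One way the two fixes could be combined is to apply the same normalization to every token and then rebuild the text from the cleaned tokens, so text and token positions stay in agreement. This is a sketch only; the field names (`orth`, `ner`) follow the thread, but the record layout is an assumption:

```python
import re

# Tashkeel (U+064B..U+0652), superscript alef, and tatweel.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")

def normalize(s):
    """Strip diacritics, then trailing/leading whitespace."""
    return DIACRITICS.sub("", s).strip()

def clean_record(record):
    """Normalize every token the same way, keeping ner labels attached,
    and rebuild the raw text from the cleaned tokens so offsets recomputed
    from the tokens match the text."""
    tokens = [
        {"orth": normalize(t["orth"]), "ner": t.get("ner", "O")}
        for t in record["tokens"]
        if normalize(t["orth"])  # drop tokens that become empty
    ]
    text = " ".join(t["orth"] for t in tokens)
    return {"text": text, "tokens": tokens}
```

Rebuilding the text sidesteps the position problem yan described: since offsets are recomputed from the cleaned tokens, nothing goes stale.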