Open ahalterman opened 6 years ago
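def cleanupToken(data):
    # Clean the text of every token in place and return the number of tokens seen.
    token_count = 0
    for d in data:
        # Only the first paragraph of each document is processed.
        para = d['paragraphs'][0]
        for sen in para['sentences']:
            for tok in sen['tokens']:
                token_count += 1
                # cleantext is defined elsewhere and is not shown here.
                tok['orth'] = cleantext(tok['orth'])
    return token_count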
The code is at the end of this notebook: https://github.com/oudalab/Arabic-NER/blob/master/explore_traingdata.ipynb @khaledJabr @ahalterman
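For anyone reading without the notebook open, here is a minimal sketch of what a `cleantext` helper could look like. It is an assumption that cleaning means stripping Arabic diacritics and tatweel (the thread only mentions a "removed diacritics" version of the data); the character ranges are my choice and may not match the notebook.

```python
import re

# Assumption: "cleaning" means removing Arabic diacritics and tatweel.
# U+064B..U+0652 covers tanwin, short vowels, shadda and sukun,
# U+0670 is the superscript alef, U+0640 is the tatweel (kashida).
ARABIC_DIACRITICS = re.compile("[\u064B-\u0652\u0670\u0640]")


def cleantext(text):
    """Return text with Arabic diacritics and tatweel removed."""
    return ARABIC_DIACRITICS.sub("", text)


if __name__ == "__main__":
    # A fully vocalized word and its bare form should compare equal after cleaning.
    vocalized = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F"
    bare = "\u0645\u062D\u0645\u062F"
    print(cleantext(vocalized) == bare)
```

Letters and non-Arabic characters pass through unchanged, so the function can be applied to every `orth` value without checking the script first.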
Here is the command for converting the OntoNotes format to BILOU format:
python onto_to_spacy_json.py -i "ontonotes-release-5.0/data/arabic/annotations/nw/ann/00" -t "ar_train.json" -e "ar_eval.json" -v 0.1
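To make the target format concrete, here is a small sketch of what BILOU tagging looks like, shown as a conversion from plain BIO tags. This is not the logic of `onto_to_spacy_json.py`; the function name `bio_to_bilou` and the BIO input are assumptions for illustration only.

```python
def bio_to_bilou(tags):
    """Convert a list of BIO tags (e.g. 'B-PER', 'I-PER', 'O') to BILOU tags."""
    bilou = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bilou.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is an I- tag with the same label.
        continues = next_tag == "I-" + label
        if prefix == "B":
            # A B- tag with no continuation is a single-token entity (U = unit).
            bilou.append(("B-" if continues else "U-") + label)
        else:
            # An I- tag with no continuation closes the entity (L = last).
            bilou.append(("I-" if continues else "L-") + label)
    return bilou


if __name__ == "__main__":
    print(bio_to_bilou(["B-PER", "I-PER", "O", "B-LOC"]))
```

The only difference from BIO is that the last token of a multi-token entity becomes `L-` and a single-token entity becomes `U-`.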
@khaled I will send you the OntoNotes raw data tomorrow; it is on my lab computer.
@ahalterman Hi Andy, do you still have the LDC raw data? I could not find it on my local machine and don't remember where I put it. We can give it to Khaled so he can take a look.
Just sent you and Khaled a message.
@khaled @ahalterman I used onto_spacy_json.py to convert the CoNLL format to BILOU. For ANERcorp, I put all the tagged text into a single document and appended it to the LDC data. LDC has 401 docs and ANERcorp has just one. ANERcorp does not have many tokens, so we can ignore it for now. If you want to check whether the LDC data is right, just use the first 401 docs. Let me know if you need more info on this.
After that, I merged the tags into common ones, using the tag labels shared by both ANERcorp and LDC. The data is here: /home/yan/nerdata on Manchester
ar_eval_all.json, ar_train_all.json (these two have no merged tags and no diacritics removed)
ar_eval_all_cleaned.json, combined.json (these two have the merged tags; ignore the last doc in combined.json and just look at the first 401 docs, since the last one is ANERcorp)
cleaned_combined_removed.json (this is the version with merged tags and diacritics removed)
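A rough sketch of the tag-merging step and of dropping the trailing ANERcorp doc, for anyone who wants to reproduce or check the files. The label mapping in `TAG_MAP` is hypothetical (the real mapping is not given in this thread), and the `ner` token key is assumed from the spaCy v2 JSON training format.

```python
# Hypothetical mapping from source labels to merged labels; the real one may differ.
TAG_MAP = {"PERSON": "PER", "PERS": "PER", "GPE": "LOC"}


def merge_tag(ner, tag_map=TAG_MAP):
    """Map the label part of a BILOU tag (e.g. 'U-PERSON') to its merged label."""
    if ner == "O" or "-" not in ner:
        return ner
    prefix, label = ner.split("-", 1)
    return prefix + "-" + tag_map.get(label, label)


def merge_tags(docs, tag_map=TAG_MAP):
    """Rewrite every token's 'ner' value in place and return the docs."""
    for doc in docs:
        for para in doc["paragraphs"]:
            for sent in para["sentences"]:
                for tok in sent["tokens"]:
                    tok["ner"] = merge_tag(tok["ner"], tag_map)
    return docs


def ldc_only(docs, n_ldc_docs=401):
    """Keep only the leading LDC docs, dropping the appended ANERcorp doc."""
    return docs[:n_ldc_docs]


if __name__ == "__main__":
    sample = [{"paragraphs": [{"sentences": [{"tokens": [
        {"orth": "x", "ner": "U-PERSON"},
        {"orth": "y", "ner": "O"},
    ]}]}]}]
    merged = merge_tags(sample)
    print([t["ner"] for t in merged[0]["paragraphs"][0]["sentences"][0]["tokens"]])
```

Labels missing from the mapping are left unchanged, so the function is safe to run on data that has already been merged.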
@YanLiang1102, FYI you're mentioning the wrong khaled - I have no connection with this project :-)
@khaledJabr Hey Khaled, I hope you saw the stuff above. I mentioned the wrong Khaled :P
@YanLiang1102, can you post the code that produces combined_cleaned_removed (from exp 5)? Then @khaledJabr can take a look and we can make sure all the data's in the right/same format.