oudalab / Arabic-NER


Cleaning code #19

Open ahalterman opened 6 years ago

ahalterman commented 6 years ago

@YanLiang1102, can you post the code that produces combined_cleaned_removed (from exp 5)? Then @khaledJabr can take a look and we can make sure all the data's in the right/same format.

YanLiang1102 commented 6 years ago
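def cleanupToken(data):
    """Clean the text of every token in place and return the token count."""
    token_count = 0
    for d in data:
        # Only the first paragraph of each document is processed.
        para = d['paragraphs'][0]
        for sen in para['sentences']:
            for tok in sen['tokens']:
                token_count += 1
                # cleantext is not shown here; see the notebook linked below.
                tok['orth'] = cleantext(tok['orth'])
    return token_count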

It is at the end of this notebook: https://github.com/oudalab/Arabic-NER/blob/master/explore_traingdata.ipynb @khaledJabr @ahalterman
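
Since `cleantext` itself is not posted in this thread, here is a minimal sketch of what a diacritics-stripping step could look like. It is an assumption, not the notebook's actual code: `strip_diacritics` is a hypothetical name, and it simply drops every Unicode nonspacing mark (category `Mn`), which covers the Arabic harakat.

```python
import unicodedata


def strip_diacritics(text):
    """Hypothetical stand-in for cleantext: drop Unicode nonspacing marks.

    Arabic short-vowel marks (fatha, damma, kasra, sukun, shadda, tanwin)
    are all in category "Mn", so they are removed; base letters are kept.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# "kataba" written with three fatha marks (U+064E) on kaf, ta, ba.
with_marks = "\u0643\u064E\u062A\u064E\u0628\u064E"
without_marks = strip_diacritics(with_marks)
```

This does not touch other characters sometimes removed during Arabic normalization (for example the tatweel, U+0640), so the real `cleantext` may well do more.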

YanLiang1102 commented 6 years ago

Here is the command for converting the OntoNotes format to BILOU format:

python onto_to_spacy_json.py -i "ontonotes-release-5.0/data/arabic/annotations/nw/ann/00" -t "ar_train.json" -e "ar_eval.json" -v 0.1
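
For anyone checking the converted files by hand, this is a rough sketch of what BILOU tagging means. It is not taken from `onto_to_spacy_json.py`; `spans_to_bilou` is a hypothetical helper that assumes entities are given as token-index spans `(start, end_exclusive, label)` that do not overlap.

```python
def spans_to_bilou(n_tokens, spans):
    """Turn token-level entity spans into a list of BILOU tags.

    B = beginning, I = inside, L = last, O = outside, U = unit (single token).
    """
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            # A one-token entity gets a single U- tag.
            tags[start] = "U-" + label
        else:
            tags[start] = "B-" + label
            for i in range(start + 1, end - 1):
                tags[i] = "I-" + label
            tags[end - 1] = "L-" + label
    return tags


# Six tokens: a one-token PER entity and a three-token ORG entity.
example_tags = spans_to_bilou(6, [(0, 1, "PER"), (2, 5, "ORG")])
```

Here `example_tags` comes out as `U-PER, O, B-ORG, I-ORG, L-ORG, O`, which is the pattern to look for in the `ner` field of each token.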

YanLiang1102 commented 6 years ago

@khaled I will send you the raw OntoNotes data tomorrow; it is on my lab computer.

YanLiang1102 commented 6 years ago

@ahalterman Hi Andy, do you still have the raw LDC data? I could not find it on my local machine and do not remember where I put it. We can give it to Khaled so he can take a look.

ahalterman commented 6 years ago

Just sent you and Khaled a message.

YanLiang1102 commented 6 years ago

@khaled @ahalterman I used onto_to_spacy_json.py to convert the CoNLL format to BILOU. For ANERcorp, I put all the tagged text into a single document and appended it to the LDC data. LDC has 401 docs and ANERcorp just one. ANERcorp does not have many tokens, so we can ignore it for now. If you want to check whether the LDC data is right, just use the first 401 docs. Let me know if you need more info on this.

After that I merged the tags into a common set, keeping the labels that appear in both ANERcorp and LDC. The data are here: /home/yan/nerdata on Manchester
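
To make the tag-merging step concrete, here is a small sketch of remapping the label part of a BILOU tag while keeping its prefix. The mapping shown is purely illustrative and an assumption; the actual label mapping used for these files is not given in this thread, and `merge_tag` and `EXAMPLE_TAG_MAP` are hypothetical names.

```python
# Hypothetical mapping, for illustration only; not the mapping used here.
EXAMPLE_TAG_MAP = {"PERSON": "PER", "PERS": "PER"}


def merge_tag(tag, tag_map):
    """Rename the label of a BILOU tag, leaving the B/I/L/U prefix intact.

    Tags without a label (such as "O") are returned unchanged, and labels
    missing from tag_map are kept as they are.
    """
    if "-" not in tag:
        return tag
    prefix, label = tag.split("-", 1)
    return prefix + "-" + tag_map.get(label, label)


merged = [merge_tag(t, EXAMPLE_TAG_MAP) for t in ["B-PERSON", "L-PERSON", "O", "U-ORG"]]
```

Applying a function like this to every token's `ner` value in both corpora would give them one shared label set.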

ar_eval_all.json, ar_train_all.json (these two have no merged tags and no diacritics removed)

ar_eval_all_cleaned.json, combined.json (these two have the merged tags; ignore the last doc in combined.json, which is ANERcorp, and just look at the first 401 docs)

cleaned_combined_removed.json (this is the version with merged tags and diacritics removed)
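
A quick sketch of pulling out just the LDC docs for checking. It assumes each file is a top-level JSON list of documents, as the `for d in data` loop in `cleanupToken` suggests; `load_docs` and `ldc_only` are hypothetical helper names.

```python
import json

# From this thread: the first 401 docs are LDC, the last one is ANERcorp.
LDC_DOC_COUNT = 401


def load_docs(path):
    """Load a training file, assumed to be a JSON list of documents."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def ldc_only(docs, n_ldc=LDC_DOC_COUNT):
    """Keep only the leading LDC documents and drop anything after them."""
    return docs[:n_ldc]


# In-memory stand-in for combined.json: 401 LDC docs plus one ANERcorp doc.
fake_docs = [{"id": i} for i in range(402)]
ldc_docs = ldc_only(fake_docs)
```

With the real data this would be something like `ldc_only(load_docs("combined.json"))`, run from the data directory.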

khaled commented 6 years ago

@YanLiang1102, FYI you're mentioning the wrong khaled - I have no connection with this project :-)

YanLiang1102 commented 6 years ago

@khaledJabr Hey Khaled, I hope you saw the stuff above. I mentioned the wrong Khaled :P