oudalab / Arabic-NER

Rehearsal.py is using ontoNotes raw format not bilou format #17

Open YanLiang1102 opened 6 years ago

YanLiang1102 commented 6 years ago

It would be better to have rehearsal.py read the BILOU format and convert that to Prodigy format. If it reads the OntoNotes format, it does not take advantage of the merged NER tags and the merged AnerCorp data that we already worked on.

ahalterman commented 6 years ago

It is converting it to Prodigy format before putting it into the DB. See here.

YanLiang1102 commented 6 years ago

@ahalterman Yeah, I got this, but the problem is that it is looking at the annotation format that comes directly from OntoNotes, not the BILOU format. When I passed in BILOU-format data, it reported 0 records transferred, but with OntoNotes-format data everything got transferred.

ahalterman commented 6 years ago

I was confused: the current rehearsal.py uses CoNLL format, not BILOU. Change rehearsal.py to handle BILOU formats, too.

YanLiang1102 commented 6 years ago

@ahalterman Do you get it now, Andy? We need rehearsal to mix BILOU with Prodigy, not CoNLL with Prodigy.

ahalterman commented 6 years ago

I just added some code to do this, along with the code needed to use Arabic. (It was giving me some major git errors when I tried to put this in master.) I realized I'm still confused, though: Prodigy doesn't handle BILOU, only spans. So are you training with spaCy or Prodigy for this step?

YanLiang1102 commented 6 years ago

@ahalterman So the problem is that I am using Prodigy to train, but as you said, Prodigy only looks at the CoNLL format, not BILOU. That means our previous efforts, like merging the tag classes and attaching AnerCorp, were all in vain, since it does not look at the cleaned BILOU format. I was wondering whether there is a quick and dirty way to convert the BILOU format into CoNLL, instead of reading the raw OntoNotes data directly, so that Prodigy can read it. Otherwise we need to redo the preprocessing on the CoNLL format before we can train with Prodigy. Does it make sense this time? :)
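
For reference, a quick tag-level conversion from BILOU to CoNLL-style tags could look roughly like the sketch below. It assumes "CoNLL format" here means IOB2 tags (`B-`, `I-`, `O`) and that tags look like `B-PERSON`. The function name `bilou_to_iob` is made up for illustration and is not part of rehearsal.py.

```python
def bilou_to_iob(tags):
    """Map BILOU tags to IOB2 tags (hypothetical helper, not from rehearsal.py).

    L- (last token of an entity) becomes I-, U- (single-token entity)
    becomes B-. B-, I- and O are left unchanged.
    """
    prefix_map = {"L": "I", "U": "B"}
    converted = []
    for tag in tags:
        if "-" not in tag:
            # "O" or any other tag without an entity label stays as-is.
            converted.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        converted.append(prefix_map.get(prefix, prefix) + "-" + label)
    return converted


example = bilou_to_iob(["B-PERSON", "L-PERSON", "O", "U-GPE"])
```

This only rewrites the tag column. Token text and sentence boundaries would be carried over unchanged.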

ahalterman commented 6 years ago

🤦‍♂️ So we need it to go from BILOU to Prodigy format...got it. Sorry about my confusion!
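
A minimal sketch of going from BILOU to Prodigy-style spans, under some assumptions: each sentence arrives as parallel lists of tokens and BILOU tags, tokens are joined with single spaces to rebuild the text, and the output uses Prodigy's usual `text` / `tokens` / `spans` task layout. The name `bilou_to_prodigy` is hypothetical and this is not the code that was added to rehearsal.py.

```python
def bilou_to_prodigy(tokens, tags):
    """Build a Prodigy-style task dict from tokens and BILOU tags (illustrative only)."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    # Rebuild the text with single spaces and record character offsets per token.
    text = " ".join(tokens)
    token_dicts = []
    offset = 0
    for i, tok in enumerate(tokens):
        token_dicts.append({"text": tok, "start": offset, "end": offset + len(tok), "id": i})
        offset += len(tok) + 1  # +1 for the joining space

    def make_span(first, last, label):
        return {
            "start": token_dicts[first]["start"],
            "end": token_dicts[last]["end"],
            "token_start": first,
            "token_end": last,
            "label": label,
        }

    spans = []
    open_start = None  # index of the token that opened the current entity
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "U":
            spans.append(make_span(i, i, label))
            open_start = None
        elif prefix == "B":
            open_start = i
        elif prefix == "L" and open_start is not None:
            spans.append(make_span(open_start, i, label))
            open_start = None
        elif prefix != "I":
            # "O" or anything unexpected closes any unfinished entity.
            open_start = None

    return {"text": text, "tokens": token_dicts, "spans": spans}


task = bilou_to_prodigy(
    ["Barack", "Obama", "visited", "Cairo", "."],
    ["B-PERSON", "L-PERSON", "O", "U-GPE", "O"],
)
```

Malformed sequences (for example an `L-` tag with no preceding `B-`) are silently skipped here. A real converter would probably want to log or count those so that a "0 records transferred" result is easier to diagnose.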

YanLiang1102 commented 6 years ago

@ahalterman no problem, :)