YanLiang1102 opened 6 years ago
It is converting it to Prodigy format before putting it into the DB. See here.
@ahalterman yeah, I got this, but the problem is that it reads the annotation format that comes directly from OntoNotes, not the BILOU format. When I passed in BILOU-format data, it reported 0 records transferred, but with OntoNotes-format data everything got transferred.
I was confused: the current rehearsal.py uses CoNLL format, not BILOU. Change rehearsal.py to handle BILOU formats, too.
@ahalterman do you get it now, Andy? We need rehearsal to mix in BILOU with Prodigy, not CoNLL with Prodigy.
I just added some code to do this, along with the code needed to use Arabic. (It was giving me some major git errors when I tried to put this in master.) I realized I'm still confused, though: Prodigy doesn't handle BILOU, only spans. So are you training with spaCy or Prodigy for this step?
@ahalterman so the problem is that I am using Prodigy to train, but as you said, Prodigy only looks at the CoNLL format, not BILOU. That means our previous efforts, like merging the tag classes and attaching AnerCorp, were all in vain, since it never looks at the cleaned BILOU data. So I was wondering: is there any quick and dirty way to convert the BILOU format into CoNLL, instead of reading the raw OntoNotes data directly? That way Prodigy could read it directly. Otherwise we need to redo the preprocessing on the CoNLL format before we can train with Prodigy. Does it make sense this time? :)
🤦‍♂️ So we need it to go from BILOU to Prodigy format... got it. Sorry about my confusion!
@ahalterman no problem, :)
Better to make it read the BILOU format and convert that to Prodigy format, since the OntoNotes format does not take advantage of the merged NER tags and the merged AnerCorp data that we already worked on.
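For reference, a minimal sketch of what going from BILOU to Prodigy's span format could look like. This is not the code in rehearsal.py: the function name `bilou_to_prodigy` is hypothetical, and it assumes each sentence arrives as parallel lists of tokens and BILOU tags and that the text can be rebuilt by joining tokens with single spaces. The output keys (`text`, `spans`, `start`, `end`, `label`, `token_start`, `token_end`) follow Prodigy's span convention of character offsets with an exclusive end.

```python
from typing import Dict, List


def bilou_to_prodigy(tokens: List[str], tags: List[str]) -> Dict:
    """Convert one BILOU-tagged sentence into a Prodigy-style span record.

    Assumption: tokens are joined with single spaces to rebuild the text.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    # Character offsets of each token in the space-joined text.
    offsets = []
    pos = 0
    for tok in tokens:
        offsets.append((pos, pos + len(tok)))
        pos += len(tok) + 1  # +1 for the joining space
    text = " ".join(tokens)

    def make_span(first: int, last: int, label: str) -> Dict:
        return {
            "start": offsets[first][0],
            "end": offsets[last][1],  # exclusive character offset
            "label": label,
            "token_start": first,
            "token_end": last,
        }

    spans = []
    open_start = None  # token index of the current B- tag, if any
    open_label = None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "U":
            spans.append(make_span(i, i, label))
            open_start = None
        elif prefix == "B":
            open_start, open_label = i, label
        elif prefix == "I" and open_start is not None and label == open_label:
            continue  # still inside the current entity
        elif prefix == "L" and open_start is not None and label == open_label:
            spans.append(make_span(open_start, i, label))
            open_start = None
        else:
            # "O" or a malformed sequence: drop any unfinished entity.
            open_start = None

    return {"text": text, "spans": spans}


if __name__ == "__main__":
    example = bilou_to_prodigy(
        ["Barack", "Obama", "visited", "Cairo", "."],
        ["B-PERSON", "L-PERSON", "O", "U-GPE", "O"],
    )
    for span in example["spans"]:
        print(span["label"], example["text"][span["start"]:span["end"]])
```

Malformed sequences (an `I-` or `L-` tag with no matching `B-`) are dropped silently here; raising or logging instead may be preferable when validating the merged data.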