oudalab / Arabic-NER

spaCy training performance with different configurations and setups #7

Open YanLiang1102 opened 6 years ago

YanLiang1102 commented 6 years ago

spaCy training output

'itr', 'dep_loss', 'tag_loss', 'uas', 'tags_acc', 'token_acc', 'ents_p', 'ents_r', 'ents_f', 'cpu_wps', 'gpu_wps'

evaluated on the held-out eval_data:
29, 0.000, 11.698, 0.000, 58.589, 49.871, 53.880, 91.894, 85.899, 15363.7, 0.0

evaluated with eval_data set to the training data (which means the model does work):
29, 0.000, 13.261, 0.000, 81.946, 73.552, 77.523, 91.815, 85.866, 13092.9, 0.0
Based on these numbers I think the model does work; we just don't have enough data. We only have 401 tagged documents in the OntoNotes data, and all of the entity tags come from those 401 documents.
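
For reference, a minimal sketch of that train-vs-eval comparison, assuming spaCy 2.x (where `nlp.evaluate()` returns a `Scorer`) and that `nlp`, `training_data`, and `eval_data` already exist as a trained pipeline and lists of `(text, {"entities": [...]})` pairs:

```python
from spacy.gold import GoldParse

# Hedged sanity check: a model that scores much better on its own training
# data than on held-out data has learned something, and is most likely
# starved for examples rather than broken.
def score(nlp, data):
    """data: list of (text, {"entities": [(start, end, label), ...]}) pairs."""
    pairs = []
    for text, annots in data:
        doc = nlp.make_doc(text)
        pairs.append((doc, GoldParse(doc, entities=annots["entities"])))
    return nlp.evaluate(pairs)

print("eval :", score(nlp, eval_data).ents_f)      # the held-out row above
print("train:", score(nlp, training_data).ents_f)  # the much higher training row
```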

Exception that spaCy throws during training; I changed the code to swallow it:

[E067] Invalid BILUO tag sequence: Got a tag starting with 'I' (inside an entity) without a preceding 'B' (beginning of an entity). Tag sequence:
['O', 'U-GPE', 'O', 'B-EVENT', 'I-EVENT', 'L-EVENT', 'B-EVENT', 'I-EVENT', 'I-EVENT', 'I-EVENT', 'L-EVENT', 'B-EVENT', 'I-EVENT', 'I-EVENT', 'I-EVENT', 'L-EVENT', 'B-ORG', 'L-ORG', 'U-GPE" S_OFF="1', 'U-CARDINAL', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'U-GPE', 'O', 'O', 'O', 'O', 'O', 'O', 'B-EVENT', 'I-EVENT', 'L-EVENT', 'U-ORDINAL', 'B-DATE', 'I-DATE', 'L-DATE', 'O', 'O', 'B-ORG', 'I-ORG', 'L-ORG', 'O', 'O', 'O', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'B-FAC', 'L-FAC', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'L-ORG', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'O', 'O', 'B-PERSON', 'I-PERSON', 'L-PERSON', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'L-ORG', 'B-PERSON', 'L-PERSON', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'U-NORP', 'O', 'O', 'O', 'B-PERSON', 'L-PERSON', 'U-DATE', 'O', 'U-GPE', 'O', 'U-GPE', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'L-ORG', 'O', 'O', 'O', 'O', 'O', 'B-EVENT', 'I-EVENT', 'I-EVENT', 'I-EVENT', 'I-EVENT', 'L-EVENT', 'O', 'B-GPE', 'I-GPE', 'L-GPE', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'U-GPE', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'L-ORG', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-FAC', 'L-FAC', 'O', 'U-GPE', 'O', 'O', 'B-TIME', 'I-TIME', 'L-TIME', 'U-DATE', 'O', 'B-FAC', 'I-FAC', 'I-FAC', 'L-FAC', 'O', 'O', 'O', 'B-EVENT', 'I-EVENT', 'I-EVENT', 'I-EVENT', 'I-EVENT', 'I-EVENT', 'I-EVENT', 'L-EVENT', 'O', 'O', 'B-ORG', 'L-ORG', 'O', 'U-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'U-ORDINAL', 'O', 'O', 'O', 'O', 'U-EVENT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'U-ORDINAL" S_OFF="1', 'O', 'B-ORG', 'L-ORG', 'O', 'B-ORG', 'L-ORG', 'O', 'O', 'U-TIME" S_OFF="1', 'U-DATE', 'U-CARDINAL', 'B-FAC', 'I-FAC', 'I-FAC', 'I-FAC', 'I-FAC', 'L-FAC', 'U-GPE', 'O', 'B-ORG', 'L-ORG', 'U-CARDINAL', 'O', 'U-CARDINAL', 'O', 'B-CARDINAL', 'I-CARDINAL', 'L-CARDINAL', 'O', 'O', 'O', 'B-CARDINAL', 'L-CARDINAL', 'O', 'O', 'B-CARDINAL', 'L-CARDINAL', 'O', 'O', 'B-CARDINAL', 'L-CARDINAL', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'O', 'O', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'B-PERSON', 'I-PERSON', 'L-PERSON', 'O', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'O', 'O', 'U-DATE', 'U-CARDINAL', 'O', 'O', 'B-TIME', 'L-TIME', 'U-ORG', 'O', 'U-ORG', 'B-FAC', 'I-FAC', 'I-FAC', 'L-FAC', 'O', 'U-TIME', 'U-ORG', 'O', 'U-ORG', 'B-FAC', 'I-FAC', 'I-FAC', 'L-FAC', 'O', 'U-TIME', 'U-ORG', 'O', 'U-FAC', 'O', 'O', 'O', 'O', 'U-TIME', 'U-ORG', 'O', 'B-ORG', 'L-ORG', 'O', 'U-FAC', 'O', 'O', 'U-TIME', 'U-ORG', 'O', 'U-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'L-ORG', 'O', 'B-ORG', 'L-ORG', 'O', 'U-GPE" S_OFF="1', 'O', 'O', 'O', 'O', 'O', 'B-TIME', 'I-TIME', 'I-TIME', 'L-TIME', 'B-DATE', 'I-DATE', 'L-DATE', 'O', 'O', 'B-FAC', 'L-FAC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'L-ORG', 'O', 'O', 'O', 'U-CARDINAL', 'O', 'O', 'O', 'O', 'O', 'U-ORDINAL', 'O', 'U-DATE', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'L-ORG', 'O', 'O', 'B-DATE', 'L-DATE', 'O', 
'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'U-CARDINAL', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'U-CARDINAL', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'L-ORG', 'O', 'U-DATE']
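
A minimal sketch of what "eating" that exception can look like, assuming spaCy 2.x and `TRAIN_DATA` as a list of `(text, {"entities": [...]})` pairs (both names are placeholders, not the repo's actual code):

```python
import random

import spacy
from spacy.util import minibatch, compounding

nlp = spacy.blank("ar")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
for _, annots in TRAIN_DATA:
    for _, _, label in annots["entities"]:
        ner.add_label(label)

optimizer = nlp.begin_training()
for itn in range(30):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for batch in minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001)):
        texts, annotations = zip(*batch)
        try:
            nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
        except ValueError as err:
            # E067: invalid BILUO sequence, e.g. the mangled 'U-GPE" S_OFF="1'
            # tags above -- skip the offending batch instead of crashing.
            if "E067" in str(err):
                continue
            raise
    print(itn, losses)
```
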
YanLiang1102 commented 6 years ago

Result on eval_data after 100 iterations: 99 0.000 3.572 0.000 57.933 49.355 53.302 91.894 85.899 1438.3 0.0. The previous result was after 30 iterations, so more iterations are not helping.

YanLiang1102 commented 6 years ago

Need to gather more Arabic NER training data: https://github.com/explosion/spaCy/issues/1966 https://lingpipe-blog.com/2009/07/28/arabic-named-entity-recognition-with-the-aner-corpus/ http://learningbeats.blogspot.com/2017/07/arabic-named-entity-recognition.html

YanLiang1102 commented 6 years ago

After adding in ANERCorp, here is the accuracy: 29 0.000 11.407 0.000 58.158 50.018 53.782 91.894 85.899 6835.0 0.0

This is because, out of 150k entity tokens, 88% of them are the useless 'O' tag, so our performance does not improve. @ahalterman @cegme

Tag accuracy goes down a little bit.

So, LDC+ANERCorp with no merged classes but fastText pretrained embeddings: 29 0.000 10.027 0.000 56.598 51.344 53.843 91.894 85.899 13987.0 0.0

When trained with 10 iterations, it got better performance: 9 0.000 270.721 0.000 56.944 52.855 54.823 91.894 85.899 13959.3 0.0

ahalterman commented 6 years ago

Well, not exactly "useless"..., since we need to be able to distinguish between entities and non-entities.

What are the different numbers in the accuracy output? Does each one represent a tag type? We need to figure out how to handle ANER not having the full range of labels that OntoNotes has. One way would be to go from spaCy format to Prodigy format, where each task is a single entity label rather than all highlighted entities. Then when we use the more limited ANER data, we're not incorrectly telling it there's no entity there when there actually is.

It would also be a cool experiment to know whether "Prodigy-style" training underperforms spaCy training (and by how much).
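
A hedged sketch of the conversion being proposed, with illustrative names (this is roughly the shape of Prodigy's JSONL tasks, not its actual internals): each task carries one candidate span and a binary answer, rather than the fully annotated document spaCy's NER training expects.

```python
def spacy_to_prodigy_tasks(text, entities):
    """entities: list of (start, end, label) offsets in spaCy training format."""
    tasks = []
    for start, end, label in entities:
        tasks.append({
            "text": text,
            "spans": [{"start": start, "end": end, "label": label}],
            "answer": "accept",  # a binary decision about this one span
        })
    return tasks

# One fully annotated sentence becomes two single-span tasks.
tasks = spacy_to_prodigy_tasks(
    "صدام حسين زار بغداد",
    [(0, 9, "PERSON"), (14, 19, "GPE")],
)
```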

YanLiang1102 commented 6 years ago

@ahalterman @cegme So the token accuracy dropped from 58.589 to 58.158, but the entity accuracy improved from 49.871 to 50.018. For the difference between these two, take a look here: http://web.stanford.edu/class/cs224n/assignment3/assignment3.pdf
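
A small sketch of the distinction that handout draws, assuming `gold_tags` and `pred_tags` are BILOU tag lists: token accuracy credits each tag independently, while entity precision/recall/F1 only credit spans whose boundaries and label both match exactly.

```python
def biluo_spans(tags):
    """Collect (start, end, label) spans from a BILOU tag sequence."""
    spans, start = set(), None
    for i, tag in enumerate(tags):
        if tag.startswith("U-"):
            spans.add((i, i, tag[2:]))
        elif tag.startswith("B-"):
            start = i
        elif tag.startswith("L-") and start is not None:
            spans.add((start, i, tag[2:]))
            start = None
    return spans

def entity_prf(gold_tags, pred_tags):
    gold, pred = biluo_spans(gold_tags), biluo_spans(pred_tags)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```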

YanLiang1102 commented 6 years ago

http://users.dsic.upv.es/~ybenajiba/downloads.html I am thinking of using the ANERGazet to filter out the entities in some Arabic docs we have and train the model on that. The ANER data has been converted from ANER format to BILOU format, and yes, each word has a tag on it; what does the Prodigy one look like? All the labeled tags have been merged into one large document; here is what the data looks like: https://raw.githubusercontent.com/oudalab/Arabic-NER/master/data/ANERCorp_conll_new.ner.json

Some related stuff I found, similar to what you are talking about @ahalterman: https://support.prodi.gy/t/remarkable-difference-between-prodigy-and-custom-training-times/467/3
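
For the format question, a hedged sketch of the ANER-to-BILOU step, assuming spaCy 2.x (`iob_to_biluo` lives in `spacy.gold` there) and an ANERCorp-style file with one "token tag" pair per line; the `PERS` to `PERSON` renaming is an assumption to match the label set used elsewhere in this thread:

```python
from spacy.gold import iob_to_biluo

RENAME = {"PERS": "PERSON"}

def conll_to_biluo(lines):
    tokens, iob_tags = [], []
    for line in lines:
        if not line.strip():
            continue
        token, tag = line.split()
        if "-" in tag:
            prefix, label = tag.split("-", 1)
            tag = prefix + "-" + RENAME.get(label, label)
        tokens.append(token)
        iob_tags.append(tag)
    # e.g. ["B-PERSON", "I-PERSON"] -> ["B-PERSON", "L-PERSON"],
    # and a lone "B-PERSON" -> "U-PERSON"
    return tokens, iob_to_biluo(iob_tags)
```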

YanLiang1102 commented 6 years ago

Pretrained embedding stuff? @ahalterman I wonder if this is what you are talking about, or if you have better examples: https://github.com/explosion/spaCy/issues/2084

https://spacy.io/usage/vectors-similarity#custom
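
That second link is the relevant one; a minimal in-Python sketch of loading fastText vectors into a blank Arabic model, assuming a `cc.ar.300.vec`-style text file (one word plus 300 floats per line; spaCy's `init-model` CLI described at that page does the same from the command line):

```python
import numpy
import spacy

nlp = spacy.blank("ar")
with open("cc.ar.300.vec", encoding="utf8") as f:
    f.readline()  # fastText .vec files start with a "<rows> <dims>" header
    for line in f:
        pieces = line.rstrip().split(" ")
        word, vector = pieces[0], numpy.asarray(pieces[1:], dtype="float32")
        nlp.vocab.set_vector(word, vector)
nlp.to_disk("ar_fasttext_model")  # use as the base model when training
```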

YanLiang1102 commented 6 years ago
  1. Add embeddings
  2. Make a baseline (spaCy style on OntoNotes and ANER).
    • Combine GPE and LOC
    • Remove the labels that aren't in ANER...
  3. Prodigy style with OntoNotes, ANER, Prodigy data
  4. Make "distantly supervised" data from wiki/Gigaword and train with Prodigy (edited), using the ANERGazet provided here: http://users.dsic.upv.es/~ybenajiba/downloads.html (see the sketch below)
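
A hedged sketch of step 4, assuming spaCy 2.x's `PhraseMatcher` and a hypothetical `gazetteer` dict built from ANERGazet (label mapped to a list of surface forms); gazetteer hits are projected onto raw text as distantly supervised entity spans:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("ar")
matcher = PhraseMatcher(nlp.vocab)
gazetteer = {"PERSON": ["صدام حسين"], "GPE": ["بغداد", "القاهرة"]}  # illustrative
for label, names in gazetteer.items():
    matcher.add(label, None, *[nlp.make_doc(name) for name in names])

def distant_annotations(text):
    """Turn gazetteer hits into spaCy-style training annotations."""
    doc = nlp.make_doc(text)
    entities = [
        (doc[start:end].start_char, doc[start:end].end_char,
         nlp.vocab.strings[match_id])
        for match_id, start, end in matcher(doc)
    ]
    return (text, {"entities": entities})
```
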
YanLiang1102 commented 6 years ago

To avoid the catastrophic forgetting problem, our plan is to use spaCy with LDC+ANERCorp to get the model; then we convert the LDC+ANER data into Prodigy style and update the model using LDC+ANERCorp plus the Prodigy user-labeled data. We need to figure out how to update the model (probably in Prodigy), or find some CLI code to do that. @ahalterman
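
A minimal sketch of that update step, assuming the `nlp`/`optimizer` from the earlier training run and placeholder lists `OLD_DATA` (LDC+ANERCorp) and `PRODIGY_DATA` (the user-labeled examples): rehearsing on a sample of the old data while folding in the new is the usual way to limit the forgetting.

```python
import random

for itn in range(10):
    # Mix new examples with an equal-sized sample of the old training data
    # (assumes OLD_DATA is at least as large as PRODIGY_DATA).
    mixed = PRODIGY_DATA + random.sample(OLD_DATA, len(PRODIGY_DATA))
    random.shuffle(mixed)
    losses = {}
    for text, annotations in mixed:
        nlp.update([text], [annotations], sgd=optimizer, drop=0.35, losses=losses)
    print(itn, losses)
```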

YanLiang1102 commented 6 years ago

Performance with pretrained embeddings and merged tag classes, the best so far:

'dep_loss'  'tag_loss'  'uas'  'tags_acc'  'token_acc'  'ents_p'  'ents_r'  'ents_f'  'cpu_wps'  'gpu_wps'
0.000       275.748     0.000  58.406      54.254       56.254    91.894    85.899    15447.2   0.0

Token accuracy is 58.406 and entity accuracy is 54.254; our baseline (ANERCorp+LDC, no pretrained embeddings, no merged NER tags) is 58.158 / 50.018. So with ANERCorp+LDC, merged tags, and pretrained embeddings, the token accuracy stays about the same (58.158 vs. 58.406) while the entity accuracy improves a lot, from 50.018 to 54.254. @ahalterman

ahalterman commented 6 years ago

Can you add the header to indicate what the 11 numbers mean?

YanLiang1102 commented 6 years ago

Yeah, it is at the top of this issue, and also here: 'itr', 'dep_loss', 'tag_loss', 'uas', 'tags_acc', 'token_acc', 'ents_p', 'ents_r', 'ents_f', 'cpu_wps', 'gpu_wps' (with eval_data=eval_data).

YanLiang1102 commented 6 years ago

Training data tag distribution:
{'B-LOC': 1889, 'B-MISC': 3521, 'B-ORG': 3912, 'B-PERSON': 6014, 'I-LOC': 744, 'I-MISC': 3812, 'I-ORG': 3630, 'I-PERSON': 1665, 'L-LOC': 1875, 'L-MISC': 3518, 'L-ORG': 3891, 'L-PERSON': 5982, 'O': 358866, 'U-LOC': 7323, 'U-MISC': 7264, 'U-ORG': 2576, 'U-PERSON': 3149}

Test data tag distribution:
{'B-LOC': 168, 'B-MISC': 307, 'B-ORG': 341, 'B-PERSON': 383, 'I-LOC': 46, 'I-MISC': 327, 'I-ORG': 465, 'I-PERSON': 93, 'L-LOC': 169, 'L-MISC': 307, 'L-ORG': 337, 'L-PERSON': 376, 'O': 27441, 'U-LOC': 384, 'U-MISC': 765, 'U-ORG': 173, 'U-PERSON': 204}
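
A distribution like this is straightforward to recompute; a small sketch assuming `docs` is a list of BILOU tag sequences (one list of tags per document):

```python
from collections import Counter

tag_counts = Counter(tag for tags in docs for tag in tags)
print(dict(sorted(tag_counts.items())))
```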

https://github.com/izarov/cs224n/blob/master/assignment3/handouts/assignment3-soln.pdf The confusion matrix for NER analysis is a good way to check which tags really get confused!
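
A hedged sketch of that confusion matrix, assuming `gold_tags` and `pred_tags` are flat, aligned lists of BILOU tags; stripping the B/I/L/U prefixes makes the matrix easier to read (PERSON vs. ORG confusions rather than B-PERSON vs. L-PERSON).

```python
from sklearn.metrics import confusion_matrix

def strip_biluo(tag):
    return tag.split("-", 1)[1] if "-" in tag else tag

labels = sorted({strip_biluo(t) for t in gold_tags})
cm = confusion_matrix(
    [strip_biluo(t) for t in gold_tags],
    [strip_biluo(t) for t in pred_tags],
    labels=labels,
)
print(labels)
print(cm)
```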