oudalab / Arabic-NER

Prodigy training cannot handle a big pre-trained vector table, so we need to prune the vectors #18

Closed YanLiang1102 closed 6 years ago

YanLiang1102 commented 6 years ago

Two things to try:

1. Prune the vectors before using spaCy to train, and plug the output vectors into the model (the pruning step outputs a language model; copy the vectors from it into our model). Got an error when using this model to train on the mixed-in data. -- failed
2. Prune the vectors, then use spaCy to train with these vectors to get a model, and use that model to train on the mixed-in data again. -- let's see. (A sketch of the pruning step follows this list.)
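A minimal sketch of the pruning step, assuming spaCy v2's `Vocab.prune_vectors`; the paths and the row count here are illustrative, not the exact ones used in these experiments:

```python
import spacy

# Load the model directory that carries the big fastText vectors
# (path is illustrative).
nlp = spacy.load("/home/yan/arabicner/ar_model")

# Keep only the 20k most frequent vectors; every pruned word is
# remapped to its nearest surviving neighbour, so lookups still work.
nlp.vocab.prune_vectors(20000)

nlp.to_disk("/home/yan/arabicner/ar_model_pruned")
```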
YanLiang1102 commented 6 years ago
```
ine 278, in Tok2Vec
    glove = StaticVectors(pretrained_vectors, width, column=cols.index(ID))
  File "/home/yan/spacyOU/spacy-vir/lib/python3.5/site-packages/thinc/neural/_classes/static_vectors.py", line 41, in __init__
    vectors = self.get_vectors()
  File "/home/yan/spacyOU/spacy-vir/lib/python3.5/site-packages/thinc/neural/_classes/static_vectors.py", line 52, in get_vectors
    return get_vectors(self.ops, self.lang)
  File "/home/yan/spacyOU/spacy-vir/lib/python3.5/site-packages/thinc/extra/load_nlp.py", line 19, in get_vectors
    nlp = get_spacy(lang)
  File "/home/yan/spacyOU/spacy-vir/lib/python3.5/site-packages/thinc/extra/load_nlp.py", line 11, in get_spacy
    SPACY_MODELS[lang] = spacy.load(lang, **kwargs)
  File "/home/yan/spacyOU/spacy-vir/lib/python3.5/site-packages/spacy/__init__.py", line 15, in load
    return util.load_model(name, **overrides)
  File "/home/yan/spacyOU/spacy-vir/lib/python3.5/site-packages/spacy/util.py", line 119, in load_model
    raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model 'ar_model.vectors'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
```
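For reference: this [E050] happens because thinc's `StaticVectors` layer reloads the vectors by the name stored in the model (`ar_model.vectors` here) via `spacy.load`, so that name has to resolve to an installed package, a shortcut link, or a valid path. A hedged workaround sketch (the name and paths are illustrative):

```python
import spacy

# Load the pruned model whose stored vectors name is not resolvable.
nlp = spacy.load("/home/yan/arabicner/ar_model_pruned")

# Store a vectors name that spaCy can actually load, e.g. one matching
# a shortcut link created with: python -m spacy link <model_dir> ar_vectors
nlp.vocab.vectors.name = "ar_vectors"
nlp.to_disk("/home/yan/arabicner/ar_model_pruned")
```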
YanLiang1102 commented 6 years ago

Just to remind myself: all of these experiments are done on Manchester.

After making the pruned vectors work and getting Prodigy to update the NER model, this is what we get:

[screenshot: training results]

Directly using the pruned-vectors language model without a pre-trained NER model, the performance is actually better :( so our pre-trained NER model doesn't seem to help at all!

[screenshot: training results]

YanLiang1102 commented 6 years ago

@ahalterman it looks like the pretrained NER model does not help at all; see the comments above: using a blank language model with no pretrained NER model performs better than the model that has an NER model in it. I also updated the steps in the README. The interesting thing is that the model with NER works much better at first, but the empty language model catches up later....
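For comparison, a blank multi-language model that carries the pruned vectors but no NER pipe (roughly what `xx_raw_fasttext_model` is) can be built like this; a hedged sketch with illustrative paths, similar in spirit to the `spacy init-model` CLI:

```python
import spacy

# Start from a blank multi-language ("xx") pipeline with no NER pipe.
nlp = spacy.blank("xx")

# Copy the pruned vectors over from the existing model.
source = spacy.load("/home/yan/arabicner/ar_model_pruned")
nlp.vocab.vectors = source.vocab.vectors

nlp.to_disk("/home/yan/arabicner/Arabic-NER/xx_raw_fasttext_model_blank")
```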

YanLiang1102 commented 6 years ago
```
python3 -m prodigy ner.batch-train arabicner /home/yan/arabicner/Arabic-NER/xx_raw_fasttext_model_1 --eval-split 0.2
```

[screenshot: training results]

So this used the pruned model trained on the LDC data, then trained directly on the Prodigy-labelled data without the rehearsal step. (I wonder if something is wrong with the rehearsal code, maybe? A sketch of the rehearsal idea is below.) Why does it score higher, @ahalterman?
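For reference, the rehearsal step discussed here is the pseudo-rehearsal trick: label raw text with the existing model and mix those "silver" examples into the new training data so the model doesn't forget what it already knew. A minimal sketch, assuming spaCy v2; names, paths, and the mixing ratio are illustrative:

```python
import random
import spacy

nlp = spacy.load("/home/yan/arabicner/ar_model_pruned")

def make_rehearsal_data(raw_texts):
    # Run the current model over raw text and keep its own predictions
    # as training examples, so later updates don't erase old knowledge.
    examples = []
    for doc in nlp.pipe(raw_texts):
        ents = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
        examples.append((doc.text, {"entities": ents}))
    return examples

def mix(new_annotations, rehearsal, ratio=4):
    # Interleave roughly `ratio` rehearsal examples per new example.
    data = list(new_annotations) + rehearsal[: ratio * len(new_annotations)]
    random.shuffle(data)
    return data
```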

YanLiang1102 commented 6 years ago
```
python3 -m prodigy ner.batch-train arabicner /home/yan/arabicner/Arabic-NER/xx_raw_fasttext_model_1 --eval-split 0.2
```

Trained with an empty NER model with the pre-trained vectors, and with only the Prodigy-labelled data:

[screenshot: training results]

YanLiang1102 commented 6 years ago
```
python3 -m prodigy ner.batch-train augmented_for_training_2 /home/yan/arabicner/Arabic-NER/xx_raw_fasttext_model --eval-split 0.2
```

Empty model trained with only the LDC data:

[screenshot: training results]

YanLiang1102 commented 6 years ago

Training result on all OntoNotes tags (about 40k tokens) with Prodigy; previously we only used 4,000 for rehearsal. Our accuracy gets to 72.1%!!!

[screenshot: training results]

YanLiang1102 commented 6 years ago

The screenshot uploaded today is from the Chinese training data....