oudalab / Arabic-NER

Issue for using customized pretrained vector. #8

Closed YanLiang1102 closed 6 years ago

YanLiang1102 commented 6 years ago

yan@hanover:~/ou-spacy/spaCy$ python3 -m spacy init-model ar /tmp/ar_vectors_wiki_lg --vectors-loc ../cc.ar.300.bin.gz
Reading vectors from ../cc.ar.300.bin.gz
Open loc
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/yan/ou-spacy/spaCy/spacy/__main__.py", line 31, in <module>
    plac.call(commands[command], sys.argv[1:])
  File "/home/yan/.local/lib/python3.5/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/yan/.local/lib/python3.5/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/yan/ou-spacy/spaCy/spacy/cli/init_model.py", line 49, in init_model
    vectors_data, vector_keys = read_vectors(vectors_loc) if vectors_loc else (None, None)
  File "/home/yan/ou-spacy/spaCy/spacy/cli/init_model.py", line 111, in read_vectors
    shape = tuple(int(size) for size in next(f).split())
  File "/home/yan/ou-spacy/spaCy/spacy/cli/init_model.py", line 64, in <genexpr>
    return (line.decode('utf8') for line in gzip.open(str(loc), 'r'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 0: invalid start byte

ahalterman commented 6 years ago

Ahh, I know the problem. You need to use the text vectors, not the bin vectors. Here's the link to the right one: https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ar.300.vec.gz
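
For anyone hitting the same error: a quick way to tell the two files apart before running init-model is to try reading the first line the same way the traceback shows read_vectors doing it (decode as UTF-8, then parse two integers for the shape). This is only a rough sketch; read_vec_header is a made-up helper name, not part of spaCy, and it assumes the text file starts with a "rows dims" header line.

```python
import gzip


def read_vec_header(path):
    """Return (n_rows, n_dims) if `path` looks like a gzipped text .vec file, else None.

    Hypothetical helper for illustration only; it is not a spaCy function.
    """
    try:
        # Text mode forces UTF-8 decoding, which is the step that failed on the .bin file.
        with gzip.open(str(path), "rt", encoding="utf8") as f:
            first_line = f.readline()
        # Assumed header format: "<number of rows> <number of dimensions>".
        shape = tuple(int(size) for size in first_line.split())
    except (UnicodeDecodeError, ValueError, OSError, EOFError):
        # Not UTF-8 text, not integers, or not a readable gzip file.
        return None
    return shape if len(shape) == 2 else None
```

If this returns None for the file you are about to pass to --vectors-loc, it is probably the binary model and not the text vectors.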

YanLiang1102 commented 6 years ago

@ahalterman yes!

The built model is big and is stored in the /tmp dir, so if /tmp gets cleared and the model is gone, we need to run this command again:

python3 -m spacy init-model ar /tmp/ar_vectors_wiki_lg --vectors-loc ../cc.ar.300.vec.gz
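
Since the model under /tmp can disappear, the "rebuild if missing" step could be scripted roughly like this. It is a sketch under assumptions: ensure_vectors_model is a hypothetical name, the check only looks for a non-empty directory (it does not validate the model), and the command is the same init-model call as above.

```python
import subprocess
import sys
from pathlib import Path


def ensure_vectors_model(model_dir, vectors_loc, lang="ar", run=subprocess.check_call):
    """Re-run `spacy init-model` if model_dir is missing or empty.

    Hypothetical helper. Returns True if a rebuild was started, False otherwise.
    `run` is injectable so the logic can be tried without spaCy installed.
    """
    model_dir = Path(model_dir)
    if model_dir.is_dir() and any(model_dir.iterdir()):
        return False  # something is already there; assume it is the built model
    cmd = [
        sys.executable, "-m", "spacy", "init-model", lang, str(model_dir),
        "--vectors-loc", str(vectors_loc),
    ]
    run(cmd)
    return True
```

Usage would be something like ensure_vectors_model("/tmp/ar_vectors_wiki_lg", "../cc.ar.300.vec.gz") at the top of whatever script starts training.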

YanLiang1102 commented 6 years ago

When training from the command line, the pretrained vectors model needs to be passed in this way:

python3 -m spacy train ar /home/yan/arabicNER/Arabic-NER/experiments/exp2/ar_output_all /home/yan/arabicNER/Arabic-NER/data/combined.json /home/yan/arabicNER/Arabic-NER/data/ar_eval_all.json --no-tagger --no-parser --vectors "/tmp/ar_vectors_wiki_lg"
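
Because a training run is long and the vectors dir in /tmp may be gone, it might help to check the inputs before launching. A minimal sketch, assuming the same flags as the command above; build_train_command is a hypothetical helper, not something spaCy provides.

```python
import sys
from pathlib import Path


def build_train_command(output_dir, train_json, dev_json, vectors_dir, lang="ar"):
    """Return the argv list for `spacy train` with NER only and pretrained vectors.

    Hypothetical helper. Raises FileNotFoundError if any input path is missing,
    so a vanished /tmp model is caught before training starts.
    """
    missing = [str(p) for p in (train_json, dev_json, vectors_dir) if not Path(p).exists()]
    if missing:
        raise FileNotFoundError("Missing inputs: " + ", ".join(missing))
    return [
        sys.executable, "-m", "spacy", "train", lang,
        str(output_dir), str(train_json), str(dev_json),
        "--no-tagger", "--no-parser",
        "--vectors", str(vectors_dir),
    ]
```

The returned list can be handed to subprocess.check_call as is.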

YanLiang1102 commented 6 years ago

python3 -m spacy init-model ar /tmp/ar_vectors_wiki_lg --vectors-loc ../cc.ar.300.vec.gz

This should use the vec (text) vectors instead of the bin model.