The pretrained word vectors are read in here: https://github.com/glample/tagger/blob/master/model.py#L169
It's expecting each line to have a word and its vector values separated by spaces, e.g.:
pizza -0.111804 0.056961 0.260559 -0.202473 -0.059456\n
Thanks for the inputs. I have saved the file as shown in the image below, using the Google News model.
But once the pretrained embeddings are loaded from the file, I get the following warning:
Loading pretrained embeddings from .... WARNING: 28394 invalid lines
It looks like the if condition is not satisfied. May I know why this word_dim + 1 logic is maintained? Also, is the format of the pretrained embeddings shown in the image above correct?
import codecs
import numpy as np

# pre_emb is the path to the embeddings file, word_dim the vector size
pretrained = {}
emb_invalid = 0
for i, line in enumerate(codecs.open(pre_emb, 'r', 'utf-8')):
    line = line.rstrip().split()
    # a valid line splits into the word plus word_dim float values
    if len(line) == word_dim + 1:
        pretrained[line[0]] = np.array(
            [float(x) for x in line[1:]]
        ).astype(np.float32)
    else:
        emb_invalid += 1
if emb_invalid > 0:
    print 'WARNING: %i invalid lines' % emb_invalid
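To see why so many lines are rejected, a quick diagnostic (the file path here is a placeholder for your embedding file) is to count how many tokens each line splits into; with word_dim = 300, for example, valid lines must split into exactly 301 tokens:

import codecs
from collections import Counter

# Tally the token count per line; the dominant count should be word_dim + 1
lengths = Counter(len(line.rstrip().split())
                  for line in codecs.open('pretrained.txt', 'r', 'utf-8'))
print(lengths.most_common(5))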
Hi,
As jeradf pointed out, the word vectors file has to contain one word per line, and each line must contain the word followed by the values of its associated embedding.
pizza -0.111804 0.056961 0.260559 -0.202473 -0.059456\n
In this example, the word embedding dimension for the word pizza is 5, and you can read the embeddings just by looking at the file content. The word_dim + 1 means that on each line, if you split the line by spaces, you are supposed to find word_dim + 1 values: 1 for the word itself, and word_dim for the values of the vector.
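Concretely, for the pizza line above:

>>> line = 'pizza -0.111804 0.056961 0.260559 -0.202473 -0.059456'
>>> parts = line.rstrip().split()
>>> len(parts)   # word_dim + 1 = 5 + 1
6
>>> parts[0]
'pizza'
>>> [float(x) for x in parts[1:]]
[-0.111804, 0.056961, 0.260559, -0.202473, -0.059456]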
In your example I don't know what representation is used, but it's clearly not the one the tagger takes as input. It looks like a compressed version of the embeddings or something. Try to decompress it, or to find a version that follows the tagger format (which is the most common one).
Thanks for your elaborate comments. The representation was output by a web API service instead. I have now loaded the pretrained word2vec model as a binary model and read glove.txt into a dict. Many thanks for the response again.
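For anyone reading later: GloVe's plain-text files already match the format the tagger expects (word followed by space-separated values), so reading one into a dict is straightforward. A minimal sketch, with an illustrative file name:

import codecs
import numpy as np

glove = {}
for line in codecs.open('glove.6B.100d.txt', 'r', 'utf-8'):
    parts = line.rstrip().split()
    # each line: the word followed by word_dim float values
    glove[parts[0]] = np.array(parts[1:], dtype=np.float32)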
Hi @raghavchalapathy, I want to use the publicly available word vectors trained on Google News as pre-trained word embeddings, available at https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing
It's a .gz file, but I don't have any idea how to use those word embeddings with my script: python train.py --train dataset/eng.train --dev dataset/eng.testa --test dataset/eng.testb --lr_method=adam --tag_scheme=iob
Can you please guide me? I am stuck...
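If it helps, here is a minimal sketch (assuming gensim >= 4.0; file and output names are illustrative) of converting the binary Google News model into the plain-text format the tagger reads:

import codecs
from gensim.models import KeyedVectors

# Load the binary Google News model (gensim reads .gz directly);
# pass limit=500000 to keep only the most frequent words if memory
# or file size is a concern.
kv = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin.gz', binary=True)

# Write one word per line, followed by its 300 space-separated values
with codecs.open('google_news.txt', 'w', 'utf-8') as f:
    for word in kv.index_to_key:  # kv.index2word in gensim < 4.0
        values = ' '.join('%.6f' % v for v in kv[word])
        f.write('%s %s\n' % (word, values))

You can then point the tagger at the resulting file, e.g. by adding --pre_emb google_news.txt --word_dim 300 to the train.py command above (check train.py for the exact option names).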
@raghavchalapathy Hi, I also ran into this problem. Could you provide a more detailed solution?
Hi
Could you please help clarify my doubts?
I understand that the function below loads the pretrained embeddings; the comment says it augments the dictionary with words that have a pretrained embedding.
My doubt is: I have train, dev, and test sets in CoNLL 2003 format, which is very clear. But how should the pretrained embedding file be saved?
I am planning to use word2vec or GloVe models, which take each word in a sentence as input and give a vector representation of each word.
How am I supposed to input these vectors to the model? Could you please direct me to the code section which reads this vector representation?
What should be the file format of the pretrained embedding file?
How will the word_id pick up the vector representation during training, and which part of the code will handle this? (See the sketch after this message.)
Should the pretrained embedding file look like a word_id followed by the vector representation of the word?
Many thanks for clarifying the doubts in advance.
With regards, Raghav
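For reference, a hedged sketch of how taggers like this one typically connect word ids to pretrained vectors: the pretrained dict is used to initialize the rows of the embedding lookup table, and during training each word_id simply indexes its row. The names (init_word_embeddings, word_to_id) and the lowercase fallback are illustrative, not the tagger's exact code:

import numpy as np

def init_word_embeddings(word_to_id, pretrained, word_dim):
    # Start from small random values, then overwrite the rows of words
    # that have a pretrained vector; the row index is the word_id used
    # by the embedding lookup during training.
    weights = np.random.uniform(-0.1, 0.1,
                                (len(word_to_id), word_dim))
    weights = weights.astype(np.float32)
    for word, word_id in word_to_id.items():
        if word in pretrained:
            weights[word_id] = pretrained[word]
        elif word.lower() in pretrained:
            weights[word_id] = pretrained[word.lower()]
    return weights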