glample / tagger

Named Entity Recognition Tool
Apache License 2.0

File Format for Loading pretrained embeddings #13

Closed. raghavchalapathy closed this issue 8 years ago.

raghavchalapathy commented 8 years ago

Hi,

Could you please help clarify a doubt?

I understand that the function below loads the pretrained embeddings. Its docstring says it augments the dictionary with words that have a pretrained embedding:

def augment_with_pretrained(dictionary, ext_emb_path, words):
    """
    Augment the dictionary with words that have a pretrained embedding.
    If `words` is None, we add every word that has a pretrained embedding
    to the dictionary, otherwise, we only add the words that are given by
    `words` (typically the words in the development and test sets.)
    """
    print 'Loading pretrained embeddings from %s...' % ext_emb_path
    assert os.path.isfile(ext_emb_path)

My doubt is: I have train, dev, and test sets in CoNLL 2003 format, which is clear. How should the pretrained embedding file be saved?

I am planning to use word2vec or GloVe models, which take each word in a sentence as input and give a vector representation for each word.

How am I supposed to input these vectors to the model? Could you please point me to the code section which reads this vector representation?

What should be the format of the pretrained embedding file?

Which part of the code handles picking the vector representation for a word_id during training?

Should each line of the pretrained embedding file be of the form word_id followed by the vector representation of that word?

Many thanks in advance for clarifying my doubt.

With regards,
Raghav

jeradf commented 8 years ago

The pretrained word vectors are read in here: https://github.com/glample/tagger/blob/master/model.py#L169

It expects each line to contain a word followed by its vector values, all separated by single spaces, e.g. pizza -0.111804 0.056961 0.260559 -0.202473 -0.059456\n
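If it helps, here is a minimal sketch of producing a file in that format from a plain dict of vectors (the `write_embeddings` helper and the output file name are my own, not part of the tagger):

```python
import codecs

def write_embeddings(vectors, path):
    """Write a {word: list-of-floats} dict in the plain-text format the
    tagger expects: each line is a word followed by its vector values,
    all separated by single spaces."""
    with codecs.open(path, 'w', 'utf-8') as f:
        for word, vec in vectors.items():
            f.write(word + ' ' + ' '.join('%f' % x for x in vec) + '\n')

# Example: a 5-dimensional embedding for "pizza".
write_embeddings(
    {'pizza': [-0.111804, 0.056961, 0.260559, -0.202473, -0.059456]},
    'embeddings.txt',
)
```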

raghavchalapathy commented 8 years ago

Thanks for the inputs. I have saved the file as shown in the image below, using the Google News model. [image]

But once the pretrained embeddings are loaded from the file, I get the following warning:

Loading pretrained embeddings from .... WARNING: 28394 invalid lines

It looks like the if condition is not satisfied. May I know why we maintain this word_dim + 1 logic? Also, is the format of the pretrained embeddings shown in the image above correct?

for i, line in enumerate(codecs.open(pre_emb, 'r', 'utf-8')):
    line = line.rstrip().split()
    if len(line) == word_dim + 1:
        pretrained[line[0]] = np.array(
            [float(x) for x in line[1:]]
        ).astype(np.float32)
    else:
        emb_invalid += 1
if emb_invalid > 0:
    print 'WARNING: %i invalid lines' % emb_invalid

glample commented 8 years ago

Hi,

As jeradf pointed out, the word vectors file has to contain one word per line, and each line must contain the word followed by the values of its associated embedding.

pizza -0.111804 0.056961 0.260559 -0.202473 -0.059456\n

In this example, the embedding dimension for the word pizza is 5, and you can read the embedding just by looking at the file content. The word_dim + 1 check means that on each line, if you split the line by spaces, you are supposed to find word_dim + 1 tokens: 1 for the word itself, and word_dim for the values of the vector.
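To illustrate the check, here is a standalone Python 3 paraphrase of the loading logic quoted above (the function name and file handling are illustrative, not the tagger's actual API):

```python
import codecs
import numpy as np

def load_pretrained(pre_emb, word_dim):
    """Parse a plain-text embedding file, keeping only lines that split
    into exactly word_dim + 1 tokens (the word plus its vector values);
    every other line is counted as invalid."""
    pretrained = {}
    emb_invalid = 0
    for line in codecs.open(pre_emb, 'r', 'utf-8'):
        line = line.rstrip().split()
        if len(line) == word_dim + 1:
            pretrained[line[0]] = np.array(
                [float(x) for x in line[1:]], dtype=np.float32
            )
        else:
            emb_invalid += 1
    if emb_invalid > 0:
        print('WARNING: %i invalid lines' % emb_invalid)
    return pretrained
```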

In your example I don't know what representation is used, but it's clearly not the one the tagger takes as input. It looks like a compressed version of the embeddings or something. Try to decompress it, or to find a version that matches the tagger format (which is the most common one).
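One mismatch worth ruling out: the word2vec *text* format usually starts with a `vocab_size dim` header line, which the word_dim + 1 check would count as one invalid line. A small sketch for dropping such a header (file names are hypothetical):

```python
import codecs

def strip_word2vec_header(src, dst):
    """Copy a word2vec-style text file, dropping the 'vocab_size dim'
    header line if present, so that every remaining line is a word
    followed by its vector values."""
    with codecs.open(src, 'r', 'utf-8') as fin, \
         codecs.open(dst, 'w', 'utf-8') as fout:
        first = fin.readline()
        tokens = first.split()
        # A header line has exactly two tokens, both integers.
        if not (len(tokens) == 2 and all(t.isdigit() for t in tokens)):
            fout.write(first)  # not a header: keep the line
        fout.writelines(fin)
```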

raghavchalapathy commented 8 years ago

Thanks for your elaborate comments. The representation was the output of a web API service. Instead, I have now loaded the pretrained word2vec model as a binary model and read glove.txt into a dict. Many thanks again for the response.

Rabia-Noureen commented 7 years ago

Hi @raghavchalapathy, I want to use the publicly available word vectors trained on Google News as pretrained word embeddings, available at https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing

It is a .gz file, but I don't have any idea how to use those word embeddings with my script: python train.py --train dataset/eng.train --dev dataset/eng.testa --test dataset/eng.testb --lr_method=adam --tag_scheme=iob

Can you please guide me? I am stuck...
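Not an official answer, but one possible route: decompress the file and convert the word2vec binary format into the plain-text format the tagger reads, then (if I recall the options correctly) pass the result via `--pre_emb` together with a matching `--word_dim 300`. The binary layout is a `vocab_size dim` header line followed, for each word, by the word's bytes up to a space and then `dim` little-endian float32 values. A sketch, not tested against the full GoogleNews file:

```python
import gzip
import struct

def word2vec_bin_to_text(src_gz, dst_txt):
    """Convert a gzipped word2vec binary file (e.g. GoogleNews vectors)
    to the 'word v1 v2 ...' text format, one word per line."""
    with gzip.open(src_gz, 'rb') as fin, \
         open(dst_txt, 'w', encoding='utf-8') as fout:
        vocab_size, dim = (int(x) for x in fin.readline().split())
        for _ in range(vocab_size):
            # The word is everything up to the next space byte.
            word = b''
            while True:
                ch = fin.read(1)
                if ch == b' ':
                    break
                word += ch  # may pick up a leading '\n' from the previous entry
            vec = struct.unpack('<%df' % dim, fin.read(4 * dim))
            fout.write(word.decode('utf-8', 'ignore').strip() + ' '
                       + ' '.join('%f' % v for v in vec) + '\n')
```

For the 300-dimensional GoogleNews file this produces a very large text file; filtering it down to the words that actually occur in your datasets first would keep it manageable.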

1049451037 commented 5 years ago

@raghavchalapathy Hi, I also ran into this problem. Could you provide a more detailed solution?