glample / tagger

Named Entity Recognition Tool
Apache License 2.0

Where is pretrained word embeddings? #44

Closed WaveLi123 closed 7 years ago

cosmozhang commented 7 years ago

I am also curious about it. I sent the author an email (he is at Facebook now) but got no response.

glample commented 7 years ago

Hi,

Sorry about this, I probably forgot to reply. Here are the embeddings: https://drive.google.com/open?id=0B23ji47zTOQNUXYzbUVhdDN2ZmM

Best, Guillaume

cosmozhang commented 7 years ago

Thank you! Is this just English?

glample commented 7 years ago

Yes. Here are the others:

Dutch: https://drive.google.com/open?id=0B23ji47zTOQNckpFdDVTX1JRYzQ
German: https://drive.google.com/open?id=0B23ji47zTOQNdGdqTkk5QWRTZkU
Spanish: https://drive.google.com/open?id=0B23ji47zTOQNNzd1SDJibm1BWk0

German and Spanish embeddings are pretty good if I remember correctly, but the Dutch ones are bad; I would not use them. I think the Dutch model could easily be 5 F1 points better if the embeddings were trained on a bigger corpus (I forget which corpus we used, but it was really small).

Rabia-Noureen commented 7 years ago

@glample you have deleted my comment, should I create a new issue? I need your help regarding the error.

Rabia-Noureen commented 7 years ago

@glample I have created a new issue, please have a look: https://github.com/glample/tagger/issues/62.

cosmozhang commented 7 years ago

Thank you! @glample

Rabia-Noureen commented 7 years ago

Hi @cosmozhang, I wanted to confirm that in order to train the model using these word embeddings, the command to run the script is: python train.py --train dataset/eng.train --dev dataset/eng.testa --test dataset/eng.testb --lr_method adam --tag_scheme iob --pre_emb Skip100 --all_emb 100
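
(A minimal sanity check, as a sketch: confirm that the Skip100 file's dimension matches the tagger's default --word_dim of 100, assuming the file has one "word v1 v2 ... vN" entry per line, the format the tagger reads.)

```python
# Sketch: check the embedding dimension of the first line of Skip100.
# Assumption: one "word v1 v2 ... vN" entry per line, no header line.
import codecs

with codecs.open('Skip100', 'r', 'utf-8') as f:
    first = f.readline().rstrip().split()

print('embedding dimension:', len(first) - 1)  # minus 1 for the word itself
```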

cosmozhang commented 7 years ago

@Rabia-Noureen I think so.

Rabia-Noureen commented 7 years ago

@cosmozhang Thanks for the response! One more thing: if you are using Windows to run the code, please suggest a solution for the issue https://github.com/glample/tagger/issues/62.

cosmozhang commented 7 years ago

I am on Linux and Mac OS. I am not using Windows for research. :) @Rabia-Noureen

Rabia-Noureen commented 7 years ago

Oh okay, thanks anyway. :)

cosmozhang commented 7 years ago

@Rabia-Noureen From a quick look, it is not a problem related to the OS. I might also encounter it later. I have not tried to use the embeddings yet.

Rabia-Noureen commented 7 years ago

@cosmozhang I even tried to train the model without word embeddings, but I am still facing that issue. Please let me know if you also face it later on; I am new to Python, so I don't have any idea how to resolve it. I am using the dataset provided with the code: https://github.com/glample/tagger/tree/master/dataset.

cosmozhang commented 7 years ago

@Rabia-Noureen Yes, I also encountered it when using the embeddings. I am planning to have a look later. I am using PyTorch now, so I just want to reproduce the results in PyTorch.

cosmozhang commented 7 years ago

PyTorch is much more friendly than Theano, though I was heavily on Theano before as well. @Rabia-Noureen

Rabia-Noureen commented 7 years ago

Is this code also working well on PyTorch without modification? I guess PyTorch is not available for Windows.

Rabia-Noureen commented 7 years ago

@cosmozhang Please let me know whenever you are able to resolve the error; I have been stuck for the past 2 months. I will wait for your response. Thanks for the help.

glample commented 7 years ago

@Rabia-Noureen yes, I deleted your previous post. Sorry, but it was not related to the topic "Where is pretrained word embeddings?". I would appreciate it if you don't post your issues at the end of issues created by other users on a different topic. I saw your problem, and I really don't know... This is weird, the loss is barely decreasing. I'll be very busy until Friday; then I will have a look at your problem and see if I can help you debug it.

Rabia-Noureen commented 7 years ago

@glample I am sorry for posting it here, but I have been trying to get help by creating issues for the past 2 months, so I decided to contact you here when I saw your comment yesterday. No problem, I will wait for your response regarding debugging the problem. I shall be thankful if you are able to help. Thanks

Rabia-Noureen commented 7 years ago

@glample I am waiting for your response on my issue. Please help me debug the problem, I am stuck. Thanks

glample commented 7 years ago

Can you send me an email with your exact settings, problem, and what you have tried to fix it so far?

cosmozhang commented 7 years ago

@Rabia-Noureen Just do this: chmod +x evaluation/conlleval
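
For reference, a POSIX-only Python equivalent of that command, as a sketch (it won't help on Windows, where execute bits work differently):

```python
# Sketch: set the execute bits on evaluation/conlleval from Python,
# equivalent to `chmod +x evaluation/conlleval` on a POSIX system.
import os
import stat

path = 'evaluation/conlleval'
mode = os.stat(path).st_mode
os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
```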

Rabia-Noureen commented 7 years ago

@glample Sure, I will send you the email. Please tell me your email address.

Rabia-Noureen commented 7 years ago

@cosmozhang chmod is not working on Windows. I have tried to find an alternative; I found Attrib, so I am going to try Attrib +x evaluation/conlleval. Has it solved the issue for you?

glample commented 7 years ago

firstname.lastname@gmail.com

cosmozhang commented 7 years ago

@Rabia-Noureen Yes! Why not try using Linux? It is super convenient.

Rabia-Noureen commented 7 years ago

@glample is this your exact email, firstname.lastname@gmail.com? I have tried to send the email but it failed to deliver. Please check it again.

Rabia-Noureen commented 7 years ago

@cosmozhang yes, I was thinking about it. Someone suggested I use Docker for Windows. Do you have any idea whether I can install a Linux environment on Docker and reuse my GPU setup from Windows? Or will I have to go through all the installations and the Python and GPU configuration again on Linux? I don't want to repeat all the installations; it is very time consuming, and I have to report my progress to my supervisor next week.

glample commented 7 years ago

@Rabia-Noureen replace "firstname" and "lastname" with my real first name and last name.

Rabia-Noureen commented 7 years ago

@glample thanks, I have just sent the email.

Rabia-Noureen commented 6 years ago

@cosmozhang I have installed Ubuntu 16.04 on VirtualBox. Now the tagger is working fine, but I want to use the GoogleNews-vectors-negative300.bin.gz and glove.840B.300d.zip word embeddings to train my model. I am unable to load and use them in Python after extracting with normal extraction software. I get this error with GoogleNews-vectors-negative300.bin.

Skip100 is working fine because it's not in compressed form. Can you please help me with how to extract and use the embeddings? The links are below:
https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit
https://nlp.stanford.edu/projects/glove/
I also tried to use them without extracting, but that failed.

(my_env) acer@acer:~/tagger$ python train.py --train dataset/eng.train --dev dataset/eng.testa --test dataset/eng.testb --lr_method adam --tag_scheme iob --pre_emb GoogleNews-vectors-negative300.bin --all_emb 300
Model location: ./models/tag_scheme=iob,lower=False,zeros=False,char_dim=25,char_lstm_dim=25,char_bidirect=True,word_dim=100,word_lstm_dim=100,word_bidirect=True,pre_emb=GoogleNews-vectors-negative300.bin,all_emb=False,cap_dim=0,crf=True,dropout=0.5,lr_method=adam
Found 23624 unique words (203621 in total)
Loading pretrained embeddings from GoogleNews-vectors-negative300.bin...
Traceback (most recent call last):
  File "train.py", line 162, in <module>
    ) if not parameters['all_emb'] else None
  File "/home/acer/tagger/loader.py", line 169, in augment_with_pretrained
    for line in codecs.open(ext_emb_path, 'r', 'utf-8')
  File "/home/acer/anaconda2/envs/my_env/lib/python2.7/codecs.py", line 699, in next
    return self.reader.next()
  File "/home/acer/anaconda2/envs/my_env/lib/python2.7/codecs.py", line 630, in next
    line = self.readline()
  File "/home/acer/anaconda2/envs/my_env/lib/python2.7/codecs.py", line 545, in readline
    data = self.read(readsize, firstline=True)
  File "/home/acer/anaconda2/envs/my_env/lib/python2.7/codecs.py", line 492, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x94 in position 0: invalid start byte

Rabia-Noureen commented 6 years ago

@glample any suggestions please?

glample commented 6 years ago

What is the content of the GoogleNews-vectors-negative300.bin file? Can you copy the first lines of this file here? If this is a binary file then you can't load it with the tagger. The tagger will only load a text file where you have one word embedding per line.
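
For anyone hitting the same UnicodeDecodeError: the GoogleNews file is the binary word2vec release, which is why decoding it as UTF-8 fails. One way to get a text file the tagger can read is to convert it with gensim, as in this sketch (assumes gensim is installed; gensim writes a "vocab_size dimension" header line that may need to be stripped, and since the GoogleNews vectors are 300-dimensional, --word_dim would need to be 300):

```python
# Sketch: convert the binary word2vec GoogleNews file into the
# one-embedding-per-line text format the tagger expects.
# Assumption: gensim is installed; the output filename is arbitrary.
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)
kv.save_word2vec_format('GoogleNews-vectors-negative300.txt', binary=False)
```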

Rabia-Noureen commented 6 years ago

It's a .iso file, not a text file, so I can't open it. Then how can I load that file? It has been used in a research study with the tagger, but I don't know how.

glample commented 6 years ago

Why not use the embeddings we used in the paper instead? https://drive.google.com/open?id=0B23ji47zTOQNUXYzbUVhdDN2ZmM

Rabia-Noureen commented 6 years ago

Actually, I have read in a paper that glove.840B.300d gives the best results for NLP using the tagger, so I wanted to use it to improve the accuracy. I have extracted this file into a text file, but it also has some issue. Please have a look at it. Otherwise, if it can't be solved, I will have to use Skip100 as you did.

[screenshot of the error]

glample commented 6 years ago

Is it working with the Skip100 embeddings? Pretty sure the GloVe ones won't be better. What paper are you referring to?

Rabia-Noureen commented 6 years ago

@glample Yes, I have tried Skip100; it is working fine. I am referring to the paper "Optimal Hyperparameters for Deep LSTM-Networks for Sequence Labeling Tasks" (page 13), which used GloVe word embeddings, and "Improving sentiment analysis via sentence type classification using BiLSTM-CRF and CNN", which used Google News word embeddings with the tagger.

glample commented 6 years ago

This paper does not compare the GloVe embeddings with Skip100, and I doubt they will work better. Anyway, can you copy-paste here the first few lines of the GloVe embeddings text file?
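
A quick way to produce those lines and check the layout, as a sketch (assumes the extracted file is named glove.840B.300d.txt; note the 840B release is known to contain a handful of tokens with embedded spaces, which can trip up loaders that split lines on whitespace):

```python
# Sketch: print the first two entries of the extracted GloVe file and
# verify each line is a word followed by 300 floats.
import codecs

with codecs.open('glove.840B.300d.txt', 'r', 'utf-8') as f:
    for i, line in enumerate(f):
        parts = line.rstrip().split(' ')
        print(parts[0], '-> dim', len(parts) - 1)
        if i >= 1:
            break
```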

sbmaruf commented 6 years ago

@glample Are all the pretrained vectors (English, Dutch, German, Spanish) skip-n-gram, or are some skip-gram and some skip-n-gram?

glample commented 6 years ago

Everything is skip-n-gram.

sbmaruf commented 6 years ago

1. Can you give me a link to the list of corpora for English, Dutch, German, and Spanish that you used to train the skip-n-gram pretrained vectors?

2. Why did you use a different size of pretrained vector for the different languages? I have seen here that you used 100 for English and 64 for the rest of the languages.

glample commented 6 years ago

The embeddings were trained by someone in my lab while I was at CMU (no idea why the dimensions are not the same), and I don't have access to the corpora anymore. The corpora are listed in the paper, but I don't know if there is a link to them.

sbmaruf commented 6 years ago

@glample Thank you for your reply

wcw15 commented 2 years ago

@glample Hi author, I've requested access to the Skip100 file through your link above, but I don't have permission; I need you to approve it.