Hi,
Sorry about this, I probably forgot to reply. Here are the embeddings: https://drive.google.com/open?id=0B23ji47zTOQNUXYzbUVhdDN2ZmM
Best, Guillaume
Thank you! Is this just English?
Yes. Here are the others:
Dutch: https://drive.google.com/open?id=0B23ji47zTOQNckpFdDVTX1JRYzQ
German: https://drive.google.com/open?id=0B23ji47zTOQNdGdqTkk5QWRTZkU
Spanish: https://drive.google.com/open?id=0B23ji47zTOQNNzd1SDJibm1BWk0
The German and Spanish embeddings are pretty good if I remember correctly, but the Dutch ones are bad; I would not use them. I think the Dutch model could easily be 5 F1 points better if the embeddings were trained on a bigger corpus (I forgot which corpus we used, but it was really small).
@glample You have deleted my comment, should I create a new issue? I need your help regarding the error.
@glample I have created a new issue, please have a look: https://github.com/glample/tagger/issues/62.
Thank you @glample!
Hi @cosmozhang, I wanted to confirm that in order to train the model using these word embeddings, the command to run the script is:
python train.py --train dataset/eng.train --dev dataset/eng.testa --test dataset/eng.testb --lr_method adam --tag_scheme iob --pre_emb Skip100 --all_emb 100
@Rabia-Noureen I think so.
@cosmozhang Thanks for the response! One more thing: if you are using Windows to run the code, please suggest a solution for the issue https://github.com/glample/tagger/issues/62.
I am on Linux and Mac OS. I am not using Windows for research. :) @Rabia-Noureen
Oh okay, thanks anyway. :)
@Rabia-Noureen From a quick look, it is not a problem related to the OS. I might also encounter it later. I have not tried to use the embeddings yet.
@cosmozhang I even tried to train the model without word embeddings, but I am still facing that issue. Please let me know if you also face it later on; I am new to Python, so I don't have any idea how to resolve it. I am using the dataset provided with the code: https://github.com/glample/tagger/tree/master/dataset.
@Rabia-Noureen Yes, I also encountered it when using the embeddings. I am planning to have a look later. I am using PyTorch now, so I just want to reproduce the results in PyTorch.
PyTorch is much more friendly than Theano, though I relied heavily on Theano before as well. @Rabia-Noureen
Does this code also work well on PyTorch without modification? I guess PyTorch is not available for Windows.
@cosmozhang Please let me know whenever you are able to resolve the error; I have been stuck for the past 2 months. I will wait for your response. Thanks for the help...
@Rabia-Noureen Yes, I deleted your previous post. Sorry, but it was not related to the topic "Where is pretrained word embeddings?". I would appreciate it if you didn't post your issues at the end of issues created by other users on a different topic. I saw your problem, and I really don't know... This is weird, the loss is barely decreasing. I'll be very busy until Friday; then I will have a look at your problem and see if I can help you debug it.
@glample I am sorry for posting it here, but I have been trying to get help by creating issues for the past 2 months, so I decided to contact you here when I saw your comment yesterday. No problem, I will wait for your response regarding debugging the problem. I would be thankful if you are able to help. Thanks
@glample I am waiting for your response on my issue. Please help me debug the problem, I am stuck... Thanks
Can you send me an email with your exact settings, problem, and what you have tried to fix it so far?
@Rabia-Noureen Just do this:
chmod +x evaluation/conlleval
@glample Sure, I will send you the email. Please tell me your email address...
@cosmozhang chmod is not working on Windows. I tried to find an alternative and found Attrib, so I am going to try Attrib +x evaluation/conlleval. Has it solved the issue for you?
firstname.lastname@gmail.com
@Rabia-Noureen Yes! Why not try to use Linux? It is super convenient to do so.
@glample Is this your exact email: firstname.lastname@gmail.com? I tried to send the email, but it failed to deliver. Please check it again...
@cosmozhang Yes, I was thinking about it. Someone suggested that I use Docker for Windows. Do you have any idea whether I can install a Linux environment on Docker and reuse my GPU settings from Windows? Or will I have to go through all the installations and Python and GPU configurations again on Linux? I don't want to redo all the installations; it is very time consuming, and I have to report my progress to my supervisor next week...
@Rabia-Noureen Replace "firstname" and "lastname" with my real first name and last name.
@glample Thanks, I have just sent the email.
@cosmozhang I have installed Ubuntu 16.04 on VirtualBox. The tagger is working fine now, but I want to use the GoogleNews-vectors-negative300.bin.gz and glove.840B.300d.zip word embeddings to train my model. I am unable to load and use them in Python after extracting them with normal extraction software. I get this error with GoogleNews-vectors-negative300.bin.
Skip100 works fine because it is not in compressed form. Can you please help me figure out how to extract and use the embeddings? The links are below:
https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit
https://nlp.stanford.edu/projects/glove/
I also tried to use them without extracting, but that failed.
(my_env) acer@acer:~/tagger$ python train.py --train dataset/eng.train --dev dataset/eng.testa --test dataset/eng.testb --lr_method adam --tag_scheme iob --pre_emb GoogleNews-vectors-negative300.bin --all_emb 300
Model location: ./models/tag_scheme=iob,lower=False,zeros=False,char_dim=25,char_lstm_dim=25,char_bidirect=True,word_dim=100,word_lstm_dim=100,word_bidirect=True,pre_emb=GoogleNews-vectors-negative300.bin,all_emb=False,cap_dim=0,crf=True,dropout=0.5,lr_method=adam
Found 23624 unique words (203621 in total)
Loading pretrained embeddings from GoogleNews-vectors-negative300.bin...
Traceback (most recent call last):
  File "train.py", line 162, in <module>
    ) if not parameters['all_emb'] else None
  File "/home/acer/tagger/loader.py", line 169, in augment_with_pretrained
    for line in codecs.open(ext_emb_path, 'r', 'utf-8')
  File "/home/acer/anaconda2/envs/my_env/lib/python2.7/codecs.py", line 699, in next
    return self.reader.next()
  File "/home/acer/anaconda2/envs/my_env/lib/python2.7/codecs.py", line 630, in next
    line = self.readline()
  File "/home/acer/anaconda2/envs/my_env/lib/python2.7/codecs.py", line 545, in readline
    data = self.read(readsize, firstline=True)
  File "/home/acer/anaconda2/envs/my_env/lib/python2.7/codecs.py", line 492, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x94 in position 0: invalid start byte
@glample any suggestions please?
What is the content of the GoogleNews-vectors-negative300.bin file? Can you copy the first lines of this file here? If it is a binary file, then you can't load it with the tagger. The tagger will only load a text file with one word embedding per line.
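If you really want to use the Google News vectors, you could convert the binary file to text first. Something like this should work (a rough sketch using gensim, which I have not tested here; install it with pip install gensim):

from gensim.models import KeyedVectors

# Load the binary word2vec file and re-save it as plain text,
# one word followed by its 300 vector components per line.
vectors = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)
vectors.save_word2vec_format(
    'GoogleNews-vectors-negative300.txt', binary=False)

Note that the first line of the output is a "vocab_size dimension" header; delete it if the loader complains about it.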
It's a .iso file, not a text file, so I can't open it. Then how can I load that file? It has been used in a research study with the tagger, but I don't know how...
Why not use the embeddings we used in the paper instead? https://drive.google.com/open?id=0B23ji47zTOQNUXYzbUVhdDN2ZmM
Actually, I read in a paper that glove.840B.300d gives the best results for NLP with the tagger, so I wanted to use it to improve the accuracy. I have extracted this file into a text file, but it also has some issue. Please have a look at it. Otherwise, if it can't be solved, I will have to use Skip100 as you did.
Is it working with the Skip100 embeddings? Pretty sure the Glove ones won't be better. What paper are you referring to?
@glample Yes, I have tried Skip100 and it works fine. I am referring to the paper "Optimal Hyperparameters for Deep LSTM-Networks for Sequence Labeling Tasks" (page 13), which used GloVe word embeddings, and "Improving sentiment analysis via sentence type classification using BiLSTM-CRF and CNN", which used Google News word embeddings with the tagger.
That paper does not compare the GloVe embeddings with Skip100, and I doubt they will work better. Anyway, can you copy-paste the first few lines of the GloVe embeddings text file here?
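In the meantime, a quick check like this would tell you whether the file matches what the tagger expects, i.e. one word plus 300 vector components per line (a sketch; the filename is an assumption, adjust it to your extracted file):

import codecs

# Flag any line that does not have exactly 1 word + 300 components.
with codecs.open('glove.840B.300d.txt', 'r', 'utf-8') as f:
    for i, line in enumerate(f):
        n_fields = len(line.rstrip().split(' '))
        if n_fields != 301:
            print('line %i has %i fields' % (i + 1, n_fields))
        if i >= 999:  # only inspect the first 1000 lines
            break

I believe the GloVe 840B file contains a few tokens with spaces in them, so any lines flagged by this check are likely the ones the loader chokes on.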
@glample Are all the pretrained vectors (English, Dutch, German, Spanish) skip-n-gram, or are some skip-gram and some skip-n-gram?
Everything is skip-n-gram.
Can you give me a link to the list of corpora for English, Dutch, German, and Spanish that you used to train the skip-n-gram pretrained vectors?
Why did you use different sizes of pretrained vectors for the different languages? I have seen that you used 100 dimensions for English and 64 for the rest of the languages.
The embeddings were trained by someone in my lab while I was at CMU (no idea why the dimension is not the same), and I don't have access to the corpora anymore. The corpora are listed in the paper, but I don't know if there is a link to them.
@glample Thank you for your reply
@glample Hi, I've requested access to the Skip100 file through your link above, but I don't have permission; I need you to approve it.
I am also curious about it. I sent the author an email (he is at Facebook now) but got no response.