ottokart / punctuator2

A bidirectional recurrent neural network model with attention mechanism for restoring missing punctuation in unsegmented text
http://bark.phon.ioc.ee/punctuator
MIT License

Changing the language #6

Open flovera1 opened 7 years ago

flovera1 commented 7 years ago

Hi, I wanted to know whether other languages follow the same pattern. Here you used GloVe vectors for the English data, but I'm trying to do something similar to this project with Dutch as the source language. How can I get a GloVe version for Dutch? Thank you.

ottokart commented 7 years ago

Hi, the pre-trained word embeddings don't necessarily have to be GloVe vectors; word2vec vectors work as well. I found some pre-trained Dutch vectors here: https://github.com/clips/dutchembeddings. For these to work with punctuator2, you need to remove the header (first line) from the txt files. Best, Ottokar
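
A minimal sketch of that header-stripping step, assuming the txt files start with a word2vec-style "<vocab_size> <dimensions>" header line (the file names below are hypothetical):

```python
# Strip the first (header) line from a word2vec-style text embeddings file
# so punctuator2 can read it. File names are hypothetical examples.
with open("combined-160.txt", "r", encoding="utf-8") as src, \
     open("combined-160.noheader.txt", "w", encoding="utf-8") as dst:
    next(src)          # skip the "<vocab_size> <dimensions>" header
    for line in src:   # copy the remaining "<word> <v1> <v2> ..." rows
        dst.write(line)
```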

ruohoruotsi commented 6 years ago

Hi, just to make sure I'm following the comments:

  1. IF you want to use pre-trained word embeddings for your language (say French or Spanish) because you have a small training data set, you can follow the advice above regarding GloVe or word2vec and maybe even use https://gist.github.com/ottokart/673d82402ad44e69df85 to make a We.pcl file. Is that correct?

  2. However, based on the code I'm reading here: https://github.com/ottokart/punctuator2/blob/master/models.py#L128 it is not strictly required to use pre-trained word embeddings IF you have lots of training data for your language. Is that also correct?

Thank you very much, and great work!

ottokart commented 6 years ago
  1. To use pre-trained embeddings, point PRETRAINED_EMBEDDINGS_PATH in data.py to the embeddings file in text format (data.py will build the required We.pcl file from the text file). This text file should be limited to a reasonable number of top words (e.g. 100 000; see the sketch after this list). I created a new gist for creating this text file from a binary gzipped word2vec file: https://gist.github.com/ottokart/4031dfb471ad5c11d97ad72cbc01b934
  2. That's correct - using pre-trained word embeddings is completely optional (they helped me on the TED Talks dataset, but I don't generally use them on larger datasets).
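
A minimal sketch of point 1's truncation step, assuming the embeddings text file lists words in descending frequency order (as word2vec tools typically write them; file names are hypothetical):

```python
# Keep only the top 100,000 words of an embeddings text file, then point
# PRETRAINED_EMBEDDINGS_PATH in data.py at the truncated file.
# Assumes rows are ordered by descending word frequency; names are hypothetical.
TOP_N = 100000
with open("embeddings.txt", "r", encoding="utf-8") as src, \
     open("embeddings.top100k.txt", "w", encoding="utf-8") as dst:
    for i, line in enumerate(src):
        if i >= TOP_N:
            break
        dst.write(line)
```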
MeteorBurn commented 6 years ago

Hi, I found some word2vec files for my language, and I also found your two scripts: one converts .bin (.vec) files to embeddings.txt (https://gist.github.com/ottokart/4031dfb471ad5c11d97ad72cbc01b934) and the other converts .bin (.vec) files to .pcl (https://gist.github.com/ottokart/673d82402ad44e69df85). I tried the second script and got "myEmbedings.pcl" as a result, but it didn't work with punctuator.

How can I adapt my word2vec file for punctuator, and what's the difference between the two scripts?

Thank you in advance
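
A minimal sketch of the .bin-to-text conversion that the first gist automates, assuming gensim is available and the binary file is in word2vec format (file names are hypothetical); the resulting text file still needs its header line stripped, as in the earlier sketch, before data.py can build We.pcl from it:

```python
# Convert a binary word2vec file to the plain-text format that data.py reads.
# Assumes gensim is installed; file names are hypothetical examples.
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("my_language.bin", binary=True)
kv.save_word2vec_format("embeddings.txt", binary=False)
```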