stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0
6.81k stars 1.51k forks

How to fine tune the pre-trained GloVe vectors on a custom corpus #189

Open smousav9 opened 3 years ago

smousav9 commented 3 years ago

Hi,

I hope you are having a great time. I need to fine-tune the pre-trained GloVe vectors on a custom corpus, and I was wondering how to do this with the GloVe library. My understanding of fine-tuning is that the word vectors are initialized to the pre-trained values at the start of training. There is a parameter in "glove.c" named "load_init_param". If it is set to "1", the code looks for an "-init-param-file" file and reads the parameters from it. I tried to work out what the format of the initialization file should be and whether the initial word vectors are part of these initialization parameters, but since C is not my programming language, I could not follow all the details. I would appreciate it if someone could help me initialize the word vectors with the pre-trained vectors so that I can fine-tune GloVe on my corpus.

Thanks, Maryam

AngledLuffa commented 3 years ago

There's an option to save the initial parameters, -save-init-param. The -load-init-param functionality is to read those parameters back in. You can also read in the parameters from an intermediate model. Look at the save_params function for the format.
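To get a feel for what a saved parameter file contains before digging into the C code, a minimal Python sketch like the one below can read a flat binary file back into a list of numbers. It assumes the file is a headerless sequence of native-endian 8-byte doubles, which should be checked against save_params in glove.c. The function name read_param_file is hypothetical and not part of GloVe.

```python
import struct


def read_param_file(path, value_size=8, fmt="d"):
    """Read a headerless binary file of native-endian floats into a list.

    Assumption: every value is a C double (8 bytes). If the build uses a
    different type, pass e.g. value_size=4, fmt="f".
    """
    with open(path, "rb") as f:
        data = f.read()
    if len(data) % value_size != 0:
        raise ValueError("file size is not a multiple of the value size")
    count = len(data) // value_size
    # "=" selects native byte order with standard sizes and no padding.
    return list(struct.unpack(f"={count}{fmt}", data))
```

Comparing the number of values read with the vocabulary size and vector dimension used for training is a quick way to guess how the array is laid out, for example whether bias terms or a second set of vectors are stored alongside the word vectors.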

smousav9 commented 3 years ago

Thank you for the help

smousav9 commented 3 years ago

@AngledLuffa Thank you for the previous comment and clarification. However, I am still struggling to convert the initial txt file to a bin file. Is there an existing C or Python script that could help? glove.c expects the initial parameter file in binary format. I am creating my initialization file as txt using Python, and I need to convert it to a bin file so that the program can read it. How can I do that?

Also, there is a shuffling step before training. How does that affect the initialization step?

Regards, Maryam

AngledLuffa commented 3 years ago

Sorry for the late reply. Basically, glove.c uses a specific format in which an array is written out as a sequence of bytes. You can look at load_init_params and save_params to see this format. In Python, the equivalent command for writing an int as a sequence of bytes is int.to_bytes().
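Note that int.to_bytes() only handles integers. For floating-point vector components, the standard-library counterpart is struct.pack. The following is a minimal sketch of a txt-to-bin conversion, not a tested recipe for glove.c. It assumes each parameter is a native-endian C double, that each row is the vector followed by one bias term (set to 0.0 here because text vector files carry no bias), and that rows are written in the order of the input file. All three assumptions, and the helper names parse_vector_line, pack_row and text_to_bin, are illustrative and should be checked against load_init_params and save_params.

```python
import struct


def parse_vector_line(line):
    """Split a 'word v1 v2 ...' line into (word, [floats])."""
    parts = line.rstrip().split(" ")
    return parts[0], [float(x) for x in parts[1:]]


def pack_row(values, bias=0.0):
    """Pack one vector plus a trailing bias term as native-endian doubles.

    Assumption: the loader expects vector_size + 1 doubles per row.
    """
    row = list(values) + [bias]
    return struct.pack(f"={len(row)}d", *row)


def text_to_bin(txt_path, bin_path):
    """Write every vector of a text file to a flat binary file, in file order."""
    with open(txt_path, encoding="utf-8") as fin, open(bin_path, "wb") as fout:
        for line in fin:
            if line.strip():  # skip blank lines
                _, values = parse_vector_line(line)
                fout.write(pack_row(values))
```

A call such as text_to_bin("init_vectors.txt", "init_params.bin") (both file names are placeholders) would produce one block of rows. The sketch writes only that one block; if the loader expects a second block, for example context vectors, it would need to be appended in the same way, and the row order would have to match the vocabulary file built from the new corpus.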