phanein / deepwalk

DeepWalk - Deep Learning for Graphs
http://www.perozzi.net/projects/deepwalk/

Add option for pretrained embeddings #108

Open · mmcenta opened 4 years ago

mmcenta commented 4 years ago

As per #107, this PR adds an extra parameter, --pretrained, which can be used to specify pre-trained embeddings that will serve as the initializers for the node embeddings. The specified file must be in the traditional C word2vec format (the same format the program outputs).
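For concreteness, here is a minimal sketch of the initialization described above, assuming the gensim 3.x API that DeepWalk builds on; the file name, hyperparameters, and the `walks` corpus are placeholders, not this PR's actual code:

```python
from gensim.models import Word2Vec, KeyedVectors

# Load pre-trained vectors stored in the C word2vec text format.
pretrained = KeyedVectors.load_word2vec_format("pretrained.embeddings",
                                               binary=False)

# Build the skip-gram model and its vocabulary from the random walks.
model = Word2Vec(size=pretrained.vector_size, window=5, min_count=0, sg=1)
model.build_vocab(walks)

# Overwrite the random initialization with the pre-trained vector wherever
# a node appears in both vocabularies.
for node, entry in model.wv.vocab.items():
    if node in pretrained.vocab:
        model.wv.vectors[entry.index] = pretrained[node]

model.train(walks, total_examples=model.corpus_count, epochs=model.epochs)
```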

Help wanted: how should we check that the dimensions match? Either we assert that the loaded vectors have the same dimension as the value passed to the --representation-size parameter, or we infer the representation size from the pre-trained embeddings (and make the two parameters mutually exclusive).

Also, I read through the README to check whether anything needed updating, but the new parameter did not fit into any existing section, so we can either leave the README unchanged or add a new subsection (up to you).

GTmac commented 4 years ago

Thanks for working on this! It would be really helpful if you could document the usage of the new argument in the README. Also, can we test it by: 1) running DeepWalk without the --pretrained flag, to make sure we do not break anything; 2) running with an invalid embedding file, to make sure it is handled properly; and 3) running with a valid embedding file.

GTmac commented 4 years ago

For the dimension match: yeah, I think we should assert that the loaded vectors have the same dimension as --representation-size, and otherwise just abort the program.
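Something along these lines, say; `pretrained` follows the loading sketch above and `args` stands for the parsed CLI namespace (both names are assumptions, not the PR's actual identifiers):

```python
import sys

# Abort unless the pre-trained dimension matches --representation-size.
if pretrained.vector_size != args.representation_size:
    sys.exit("error: pre-trained embeddings have dimension %d, but "
             "--representation-size is %d"
             % (pretrained.vector_size, args.representation_size))
```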

mmcenta commented 4 years ago

> Thanks for working on this! It would be really helpful if you could document the usage of the new argument in the README. Also, can we test it by: 1) running DeepWalk without the --pretrained flag, to make sure we do not break anything; 2) running with an invalid embedding file, to make sure it is handled properly; and 3) running with a valid embedding file.

I have been using the version on this branch for a link-prediction project, so I have already run the program for cases 1 and 3. As soon as we settle on how to handle invalid pre-trained files, I can run all three tests.

mmcenta commented 4 years ago

> For the dimension match: yeah, I think we should assert that the loaded vectors have the same dimension as --representation-size, and otherwise just abort the program.

I was thinking about disabling the --representation-size flag when the --pretrained one is used (i.e. making them mutually exclusive). That is another way to solve the problem, and I imagine both are fairly easy to implement. What do you think?

mmcenta commented 4 years ago

> Also, how do we handle the case where the vocabulary of the pre-trained embeddings does not match the list of graph nodes?

I'm not sure. I assumed it would just initialize the embeddings randomly for nodes that are missing from the pre-trained embeddings, but reading through the code, that is not obvious. I will dig into the gensim docs, or maybe just delete a line from my embedding file and run it to see what happens.
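For what it's worth, gensim 3.x ships a helper that, if I read its docs correctly, has exactly those semantics: vocabulary words found in the file adopt its weights, and every other word keeps its random initialization. A sketch, with the same placeholder names as above:

```python
model.build_vocab(walks)

# lockf=1.0 lets the imported vectors keep training; 0.0 would freeze them.
model.intersect_word2vec_format("pretrained.embeddings", lockf=1.0,
                                binary=False)
```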

mmcenta commented 4 years ago

Thanks for taking the time to help me out! I am taking some time to study for my finals right now, but I will be back soon to implement the changes you proposed.

GTmac commented 4 years ago

> > For the dimension match: yeah, I think we should assert that the loaded vectors have the same dimension as --representation-size, and otherwise just abort the program.
>
> I was thinking about disabling the --representation-size flag when the --pretrained one is used (i.e. making them mutually exclusive). That is another way to solve the problem, and I imagine both are fairly easy to implement. What do you think?

Good idea on making them mutually exclusive!
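A rough sketch of the mutual exclusion with argparse; the defaults and help strings below are placeholders rather than DeepWalk's actual parser definitions:

```python
import argparse

parser = argparse.ArgumentParser()
group = parser.add_mutually_exclusive_group()
group.add_argument('--representation-size', type=int, default=64,
                   help='Number of latent dimensions to learn per node.')
group.add_argument('--pretrained',
                   help='Pre-trained embeddings used as initializers; the '
                        'representation size is inferred from this file.')
args = parser.parse_args()
```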

GTmac commented 4 years ago

> > Also, how do we handle the case where the vocabulary of the pre-trained embeddings does not match the list of graph nodes?
>
> I'm not sure. I assumed it would just initialize the embeddings randomly for nodes that are missing from the pre-trained embeddings, but reading through the code, that is not obvious. I will dig into the gensim docs, or maybe just delete a line from my embedding file and run it to see what happens.

Or maybe you could be more aggressive here: assert that the vocabulary of the pre-trained embeddings is the same as the set of graph nodes, and otherwise abort the program.
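Sticking with the earlier placeholder names (and assuming `G` is the loaded graph, whose node IDs DeepWalk stores as strings), the strict check could look roughly like this:

```python
import sys

pretrained_vocab = set(pretrained.vocab)
graph_nodes = {str(node) for node in G.nodes()}

# Require an exact match between the two vocabularies.
if pretrained_vocab != graph_nodes:
    missing = len(graph_nodes - pretrained_vocab)
    extra = len(pretrained_vocab - graph_nodes)
    sys.exit("error: pre-trained vocabulary does not match the graph "
             "(%d nodes missing, %d extra entries)" % (missing, extra))
```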

GTmac commented 4 years ago

> Thanks for taking the time to help me out! I am taking some time to study for my finals right now, but I will be back soon to implement the changes you proposed.

No rush, good luck with your finals :-)

mmcenta commented 4 years ago

I'm back from finals and vacation!

I just implemented two of the improvements we discussed, and I wanted your opinion on the next one: as the code stands, the final vocabulary is the union of the pre-trained vocabulary and the one built from the walks. We can either 1) keep only the intersection of the pre-trained embeddings with the vocabulary built from the walks, or 2) assert that the two vocabularies are equal and terminate otherwise. From what I can tell, 1) is really easy and quick to implement, and 2) takes time linear in the size of the vocabularies. What do you think is the best option?
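To make option 1) concrete, a sketch with the same placeholder names as the earlier snippets; nodes outside the intersection would simply keep their random initialization:

```python
model.build_vocab(walks)

# Seed only the nodes that appear in both vocabularies.
shared = set(model.wv.vocab) & set(pretrained.vocab)
for node in shared:
    model.wv.vectors[model.wv.vocab[node].index] = pretrained[node]
```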