mmcenta opened 4 years ago
Thanks for working on this! It would be really helpful if you could add the usage of the new argument to the README. Also, can we test it by: 1) running DeepWalk without the `--pretrained` flag, to make sure we don't break anything; 2) running with an invalid embedding file, to make sure it is handled properly; and 3) running with a valid embedding file?
For dimension match: yeah, I think we should assert that the loaded vectors have the same dimension as `--representation_size`, otherwise just abort the program.
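One way to sketch that assertion (the function name and error message here are illustrative, not from the PR): read the dimension from the header line of the C word2vec text file and abort on a mismatch.

```python
import sys

def check_pretrained_dim(path, representation_size):
    # The C word2vec text format begins with a header line: "<vocab_size> <dim>".
    with open(path) as f:
        vocab_size, dim = (int(x) for x in f.readline().split())
    if dim != representation_size:
        # Abort instead of silently training with mismatched dimensions.
        sys.exit("pre-trained dimension %d does not match --representation_size %d"
                 % (dim, representation_size))
    return dim
```

This keeps the check cheap (only the header is read) and fails before any walks are generated.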
> Thanks for working on this! It would be really helpful if you could add the usage of the new argument to the README. Also, can we test it by: 1) running DeepWalk without the `--pretrained` flag, to make sure we don't break anything; 2) running with an invalid embedding file, to make sure it is handled properly; and 3) running with a valid embedding file?
So I have been using the version on this branch for my link prediction project, and I've already run the program in cases 1 and 3. As soon as we figure out how to deal with invalid pre-trained files I can run all the tests.
> For dimension match: yeah, I think we should assert that the loaded vectors have the same dimension as `--representation_size`, otherwise just abort the program.
I was thinking about disabling the `--representation-size` flag if we are using the `--pretrained` one (i.e. setting them to be mutually exclusive). That's another way to solve the problem, and both are pretty easy to implement, I imagine. What do you think?
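The mutually exclusive option is straightforward with argparse's `add_mutually_exclusive_group`; a minimal sketch (the default value and flag set here are illustrative, not DeepWalk's actual parser):

```python
import argparse

parser = argparse.ArgumentParser()
group = parser.add_mutually_exclusive_group()
group.add_argument("--representation-size", type=int, default=64,
                   help="dimensionality of the learned embeddings")
group.add_argument("--pretrained",
                   help="path to pre-trained embeddings used as initializers")

# argparse rejects any invocation that passes both flags, e.g.:
#   deepwalk --pretrained vecs.emb --representation-size 128  -> usage error
```

With this layout the representation size would be inferred from the pre-trained file whenever `--pretrained` is given.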
Also, how do we handle the case where the vocabulary of the pre-trained embeddings does not match the list of graph nodes?
I'm not sure. I thought it would just initialize the embeddings randomly for nodes that are not in the pre-trained embeddings, but reading through the code, that isn't very clear. I will dig into the gensim docs, or maybe just delete a line from my embedding file and run it to see what happens.
Thanks for taking the time to help me out! I am taking some time to study for my finals right now, but I will be back soon to implement the changes you proposed.
> For dimension match: yeah, I think we should assert that the loaded vectors have the same dimension as `--representation_size`, otherwise just abort the program.
>
> I was thinking about disabling the `--representation-size` flag if we are using the `--pretrained` one (i.e. setting them to be mutually exclusive). That's another way to solve the problem, and both are pretty easy to implement, I imagine. What do you think?
Good idea on making them mutually exclusive!
> Also, how do we handle the case where the vocabulary of the pre-trained embeddings does not match the list of graph nodes?
>
> I'm not sure. I thought it would just initialize the embeddings randomly for nodes that are not in the pre-trained embeddings, but reading through the code, that isn't very clear. I will dig into the gensim docs, or maybe just delete a line from my embedding file and run it to see what happens.
Or maybe you could be more aggressive here: assert that the vocab in the pre-trained embeddings is the same as the graph nodes, otherwise abort the program.
> Thanks for taking the time to help me out! I am taking some time to study for my finals right now, but I will be back soon to implement the changes you proposed.
No rush, good luck with your finals :-)
I'm back from finals and vacations!
I just implemented two of the improvements we talked about, and I wanted your opinion on this next one: the way the code is implemented right now, the vocabulary is the union of the pre-trained one and the one built from the walks. We can either 1) intersect the pre-trained embeddings with the vocabulary from the walks, or 2) assert that they are equal and terminate otherwise. From what I can tell, 1) is really easy and quick to implement, and 2) takes time linear in the size of the vocabularies. What do you think is the best option?
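Both options can be sketched over plain dicts and sets (the names here are hypothetical; the real code would operate on gensim's vocabulary objects):

```python
def restrict_to_walk_vocab(pretrained, walk_vocab, strict=False):
    """Option 1: keep only pre-trained vectors whose node appears in the walks.

    With strict=True (option 2), abort unless the vocabularies match exactly.
    `pretrained` maps node -> vector; `walk_vocab` is the set of nodes seen in walks.
    """
    if strict and set(pretrained) != set(walk_vocab):
        raise SystemExit("pre-trained vocabulary does not match graph nodes")
    # Nodes in walk_vocab but not in pretrained fall back to the default
    # (random) initialization; extra pre-trained nodes are dropped.
    return {node: vec for node, vec in pretrained.items() if node in walk_vocab}
```

Either way the cost is linear in the vocabulary sizes, so the choice is mostly about how strict the tool should be with mismatched inputs.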
As per #107, this PR implements an extra parameter, `--pretrained`, that can be used to specify pre-trained embeddings that will be used as the initializers for the node embeddings. The specified file must be in the traditional C word2vec format (same as the program outputs).

Help Wanted: how to check if the dimensions match? Either we assert that the loaded vectors have the same dimension as the value passed into the `--representation_size` parameter, or we infer the representation size from the pre-trained embeddings (and make the parameters mutually exclusive).

Also, I read through the README to check if I needed to update something, but I decided that this new parameter didn't fit anywhere, so I should either not alter it or add a new subsection (up to you).