ghost opened this issue 8 years ago
Note that in practice, train_cbow_pair() exists only in the pure-Python implementation, which is rarely used because it can be 80x (or more) slower than the alternate optimized Cython path.
The 'input' word vectors are in syn0. You could modify them there, if you want your transformations to affect all subsequent training. You'd have to get to them before l1 is composed, if you want to transform them for each training-example before they're forward-propagated. (With more idea of what kind of transformation you'd like to apply, and why, it might be possible to make a better recommendation.)
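For instance, a minimal sketch of that kind of hook (the helper and its scale_fn argument are hypothetical, loosely mirroring how the pure-Python path builds l1 from syn0 rows, not the exact library code):

```python
import numpy as np

def compose_l1(model, word2_indexes, doctag_indexes, doctag_vectors, scale_fn=None):
    # Hypothetical helper: pull the 'input' vectors out of syn0 ...
    word_vecs = model.syn0[word2_indexes]
    if scale_fn is not None:
        # ... and transform them per training-example *before* they are
        # summed into l1 and forward-propagated
        word_vecs = scale_fn(word_vecs)
    l1 = np.sum(word_vecs, axis=0) + np.sum(doctag_vectors[doctag_indexes], axis=0)
    return l1
```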
Note that word.point is actually a list of indexes into the hierarchical-softmax tree: all the output nodes that (together, if they each have the 0/1 activations given in the corresponding codes) represent the target word. (Personally I find the negative-sampling codepaths easier to understand, and many projects, especially with large vocabularies and large corpuses, seem to prefer negative-sampling training. In negative sampling, there's a single output node corresponding to a target word, and thus the weights into that node are a sort of 'output vector' for that word, unlike in hierarchical-softmax where a word is implied by a pattern of code activations across the multiple point output nodes.)
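A rough sketch of the difference (not the exact library code; in the gensim versions of that era the output-layer weights live in model.syn1 for hierarchical softmax and model.syn1neg for negative sampling):

```python
import numpy as np

def hs_forward(model, l1, word):
    # Hierarchical softmax: word.point indexes *several* rows of syn1 (the inner
    # tree nodes); word.code holds the expected 0/1 activation for each node.
    l2 = model.syn1[word.point]                   # (codelen, vector_size)
    fa = 1.0 / (1.0 + np.exp(-np.dot(l1, l2.T)))  # activation of each tree node
    return fa, (1.0 - word.code - fa)             # activations, per-node error

def ns_forward(model, l1, word_index, negative_indexes):
    # Negative sampling: one 'output vector' per word in syn1neg, scored against
    # the true target word plus a few sampled negative words.
    rows = np.concatenate(([word_index], negative_indexes))
    l2 = model.syn1neg[rows]                      # (1 + k, vector_size)
    f = 1.0 / (1.0 + np.exp(-np.dot(l1, l2.T)))
    targets = np.zeros(len(rows)); targets[0] = 1.0
    return f, (targets - f)                       # activations, per-node error
```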
@gojomo I've started to notice that the Python path is indeed much slower than the Cython path. Is there a way for me to replace the compiled code with my own?
The transformations I'd like to apply are actually simple scaling operations - I just want to apply weights to the input vectors dynamically.
The Cython code is not compiled Python - it's alternate code, in the Cython python-like language, written under more constraints that allow better performance (and greater multithreaded parallelism). All the Cython source is available for you to modify just like the Python code – see the files ending in .pyx or .pxd. However, getting things right can be trickier – more like C programming – and the .pyx is compiled into autogenerated .c files (which should not be edited), which then become the shared library usable from Python.
If the weights are the same for all examples, you could directly modify syn0 before training, or interspersed with training (in the pure-Python code). But to apply different scaling for every text-example/training-example, and still run the optimized code, you'll likely need to modify the Cython code.
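For the fixed-weights case, something along these lines would do (a minimal sketch against the old Word2Vec attribute names model.vocab / model.syn0, with a hypothetical word_weights dict):

```python
def scale_input_vectors(model, word_weights):
    # Scale each word's input vector (its row of syn0) in place, given a
    # dict {word: weight}. The change persists for all later training, so
    # this only makes sense when the weights are fixed for the whole corpus.
    for word, weight in word_weights.items():
        if word in model.vocab:
            model.syn0[model.vocab[word].index] *= weight

# e.g. scale_input_vectors(model, {'dog': 0.5, 'cat': 2.0})  # hypothetical weights
```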
@gojomo Thank you for explaining that to me - I never had anything to do with Cython so far but it's good to know what I'm actually dealing with here :)
Unfortunately, since the weights are not the same for all words, I cannot apply the weights to syn0 up front. At the moment I do it like this:

```python
l1 = np.sum((word_vectors[word2_indexes].T * word_weights[idx]).T, axis=0) + np.sum(doctag_vectors[doctag_indexes], axis=0)
```

but, as you already said, this is very slow.
The only way I can potentially see around this, without having to alter the Cython implementation, would be the input parameter word_vectors of train_document_dm:
```python
def train_document_dm(model, doc_words,  # ...
                      word_vectors=None,
                      # ...
                      ):
    if word_vectors is None:
        word_vectors = model.syn0
    # ...
```
One could simply set word_vectors = model.syn0 beforehand but apply unique weights to the vectors required in the current sentence.
The only thing I don't know here is where it's getting decided whether the Python or the Cython path is going to be used.
I'm thinking about something like this:
```python
elif self.dm_weighted:
    windices = [i for i in range(len(doc.words)) if doc.words[i] in self.vocab and
                self.vocab[doc.words[i]].sample_int > self.random.rand() * 2 ** 32]
    word_weights = np.asarray([doc.weights[i] for i in windices])
    # Make copy of affected vectors for later restoration
    word_vectors_copy = self.syn0[windices].copy()
    # Apply weights
    self.syn0 = (self.syn0[windices].T * word_weights).T
    tally += train_document_dm(self, doc.words, doctag_indexes, alpha, work, neu1,
                               # Pass those word vectors directly
                               word_vectors=self.syn0,
                               doctag_vectors=doctag_vectors, doctag_locks=doctag_locks)
    # Restore unscaled vectors
    self.syn0[windices] = word_vectors_copy
```
Unfortunately it does not run properly. It appears that changing self.syn0 causes a problem: the program does run, but at some point it simply crashes.
You could transform syn0 in place with different weights per word, but note the transformation will stick (not be specific to a single forward-propagation).
The word_vectors parameter is already self.syn0 in all cases. You could replace it with something else – your transformed syn0 – but as syn0 is typically quite large, and train_document_dm() is called for every text example, that'd be a costly operation.
When the code loads, either the Cython methods are imported, or the Python alternatives are defined. So it's not decided each time - only one or the other implementation of each method is present. See https://github.com/RaRe-Technologies/gensim/blob/a84f64e7b617a3983c2b332c8383e1a30b14db5d/gensim/models/doc2vec.py#L60
I'm not sure where your example snippet is intended to appear, but a crash likely indicates your new syn0 isn't the exact shape/native-type the optimized code originally expected. You probably don't need to copy syn0 - just hold a reference to the original array; your later assignment isn't going to re-write that object in place, but replace the reference. Then make sure your new syn0 and the old one are identical in shape/native-types. (Still, I'd expect creating a new syn0 for every text-example to be too slow for your needs.)
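To illustrate the hold-a-reference idea (a minimal sketch, assuming a trained model and a per-word weights array already exist; as noted above, rebuilding this for every text-example is likely too slow):

```python
import numpy as np

original_syn0 = model.syn0                  # just a reference, no copy
# Build the replacement with identical shape and dtype (the optimized code
# expects a contiguous float32 array with the same layout).
scaled = np.ascontiguousarray(model.syn0 * weights[:, np.newaxis],
                              dtype=model.syn0.dtype)
model.syn0 = scaled                         # rebinds the attribute only
# ... call train_document_dm(...) here ...
model.syn0 = original_syn0                  # restore the original array
```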
..but note the transformation will stick (not be specific to a single forward-propagation).
I'm aware of that fact but thank you for pointing it out :)
I was able to make it work. The issue was that I didn't use an augmented assignment statement for syn0:
An augmented assignment expression like x += 1 can be rewritten as x = x + 1 to achieve a similar, but not exactly equal effect. In the augmented version, x is only evaluated once. Also, when possible, the actual operation is performed in-place, meaning that rather than creating a new object and assigning that to the target, the old object is modified instead.
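Concretely, for a numpy array the two forms behave quite differently (a small illustration, not gensim code):

```python
import numpy as np

a = np.ones((3, 4), dtype=np.float32)
alias = a                 # a second reference to the same array object

a[[0, 2]] *= 0.5          # augmented assignment: modifies the array in place,
                          # so 'alias' (and any C/Cython view of it) sees the change

a = a * 2.0               # plain assignment: builds a NEW array and rebinds 'a';
                          # 'alias' still points at the old array object
```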
My snippet appears in _do_train_job (used by worker threads inside their main method worker_loop()), which I'm overriding in my WeightedDoc2Vec class:
```python
class WeightedDoc2Vec(Doc2Vec):
    # ...
    def _do_train_job(self, job, alpha, inits):
        work, neu1 = inits
        tally = 0
        for doc in job:
            indexed_doctags = self.docvecs.indexed_doctags(doc.tags)
            doctag_indexes, doctag_vectors, doctag_locks, ignored = indexed_doctags
            if self.sg:
                pass  # ...
            elif self.dm_concat:
                pass  # ...
            elif self.dm_weighted:
                # Get word indices
                windices = [self.vocab[doc.words[i]].index for i in range(len(doc.words))]
                # Grab the weights
                word_weights = np.asarray(doc.weights)
                # Make copy of affected vectors
                word_vectors_copy = self.syn0[windices].copy()
                # Apply weights (important to use augmented assignment here)
                self.syn0[windices] *= (np.ones((len(word_weights), self.syn0.shape[1])).T * word_weights).T
                # Call the optimized 'train_document_dm' function
                tally += train_document_dm(self, doc.words, doctag_indexes, alpha, work, neu1,
                                           word_vectors=self.syn0,
                                           doctag_vectors=doctag_vectors, doctag_locks=doctag_locks)
                # Restore unscaled vectors
                self.syn0[windices] = word_vectors_copy
```
This way the overhead shouldn't be that bad, since I "only" copy the affected vectors and restore their values after train_document_dm has been called.
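For completeness, a rough sketch of how such a class might be driven (the WeightedDocument type and the dm_weighted flag are assumptions of this example, not gensim API; parameter names follow the older gensim interface used at the time):

```python
from collections import namedtuple

# Hypothetical document type: like TaggedDocument, but carrying one weight per word.
WeightedDocument = namedtuple('WeightedDocument', 'words tags weights')

documents = [
    WeightedDocument(words=['the', 'quick', 'brown', 'fox'],
                     tags=['doc_0'],
                     weights=[0.1, 1.0, 1.0, 0.8]),
]

model = WeightedDoc2Vec(dm=1, size=50, window=5, min_count=1)
model.dm_weighted = True          # hypothetical flag checked in _do_train_job above
model.build_vocab(documents)
# train() signature varies across gensim versions; newer ones want the
# corpus size and epoch count passed explicitly
model.train(documents, total_examples=model.corpus_count, epochs=model.iter)
```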
Please excuse me for asking this question here, since it's not really an actual issue regarding gensim.
TL;DR:
I'd like to know how I can get to the word vectors before they are propagated, in order to apply transformations on them while training paragraph/document vectors.
What I'm trying to do is make a modification to train_cbow_pair in gensim.models.Word2Vec. However, I struggle a bit to understand what exactly is happening there. I get that l1 is the sum of the current context window of words plus the sum of document-tag vectors that is passed to train_cbow_pair.
Here I'm not sure what I'm looking at. In particular I struggle to understand one line: I don't know what word.point is describing there and why this input is getting propagated. Does this provide the word vectors for activating the hidden layer - which appears to be fa in that case? But this can't actually be the case, since word is just the current word of the context window, if I get that right.
So what I'd like to know is how I can get to the word vectors before they are propagated, in order to apply transformations on them beforehand.