piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Modifying train_cbow_pair #920

Open ghost opened 8 years ago

ghost commented 8 years ago

Please excuse me for asking this question here, since it's not really an actual issue regarding gensim.


TL;DR:

I'd like to know how I can get to the word vectors before they are propagated, in order to apply transformations to them while training paragraph/document vectors.


What I'm trying to do is modify train_cbow_pair in gensim.models.word2vec. However, I struggle a bit to understand what exactly is happening there.

I get that l1, which is passed to train_cbow_pair, is the sum of the word vectors of the current context window plus the sum of the document-tag vectors.

def train_cbow_pair(model, word, input_word_indices, l1, alpha, learn_vectors=True, learn_hidden=True):
    neu1e = np.zeros(l1.shape)  # accumulated error to propagate back to the input vectors

    if model.hs:
        l2a = model.syn1[word.point]  # 2d matrix, codelen x layer1_size
        fa = 1. / (1. + np.exp(-np.dot(l1, l2a.T)))  # propagate hidden -> output
        ga = (1. - word.code - fa) * alpha  # vector of error gradients multiplied by the learning rate
        if learn_hidden:
            model.syn1[word.point] += np.outer(ga, l1)  # learn hidden -> output
        neu1e += np.dot(ga, l2a)  # save the error to propagate back to the input layer

    # ...

Here I'm not sure what I'm looking at. In particular I struggle to understand the line

l2a = model.syn1[word.point] 

I don't know what word.point describes here, or why this input is getting propagated.

Does this provide the word vectors for activating the hidden layer, which would appear to be fa in that case? But this can't actually be the case, since word is just the current word of the context window, if I get that right:

# train_document_dm calling train_cbow_pair
def train_document_dm(model, doc_words, doctag_indexes, alpha, work=None, neu1=None,
                      learn_doctags=True, learn_words=True, learn_hidden=True,
                      word_vectors=None, word_locks=None, doctag_vectors=None, doctag_locks=None):
    # ...
    for pos, word in enumerate(word_vocabs):
        # ...
        neu1e = train_cbow_pair(model, word, word2_indexes, l1, alpha,
                                learn_vectors=False, learn_hidden=learn_hidden)
        # ...

So what I'd like to know is how I can get to the word vectors before they are propagated, so I can apply transformations to them beforehand.

gojomo commented 8 years ago

Note that in practice, train_cbow_pair() is only the pure-Python implementation, which is rarely used because it can be 80x (or more) slower than the alternate optimized Cython path.

The 'input' word vectors are in syn0. You could modify them there, if you want to apply your transformations to affect all subsequent training. You'd have to get to them before l1 is composed, if you want to transform them for each training-example before they're forward-propagated. (With more idea of what kind of transformation you'd like to apply, and why, it might be possible to make a better recommendation.)

Note that word.point is actually a list of indexes into the hierarchical-softmax tree: all the output nodes that (together, if they each have the 0/1 activations in the corresponding codes) represent the target word. (Personally I find the negative-sampling codepaths easier to understand, and many projects, especially with large vocabularies and large corpora, seem to prefer negative-sampling training. In negative sampling, there's a single output node corresponding to a target word, and thus the weights into it are a sort of 'output vector' for that word, unlike in hierarchical softmax, where a word is implied by a pattern of code activations across the multiple 'point' output nodes.)
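
To make that concrete, here is a small standalone sketch (plain numpy, not gensim's actual code; the point and code values are invented) of what syn1[word.point] selects and how the quoted lines use it:

# Rough sketch of the hierarchical-softmax lookup in train_cbow_pair.
# `point` and `code` are hypothetical values for one target word.
import numpy as np

layer1_size = 4
num_inner_nodes = 7
syn1 = np.random.rand(num_inner_nodes, layer1_size)  # hidden -> output weights, one row per inner tree node

point = np.array([0, 2, 5])  # indexes of the inner nodes on the path to the target word
code = np.array([1, 0, 1])   # the 0/1 activations that path should produce for that word

l1 = np.random.rand(layer1_size)  # summed context-word (+ doctag) vector
alpha = 0.025                     # learning rate

l2a = syn1[point]                            # codelen x layer1_size, as in the quoted snippet
fa = 1. / (1. + np.exp(-np.dot(l1, l2a.T)))  # predicted activation at each node on the path
ga = (1. - code - fa) * alpha                # per-node error gradient scaled by the learning rate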

ghost commented 8 years ago

@gojomo I've started to notice that the Python path is indeed much slower than the Cython path. Is there a way for me to replace the compiled code with my own?

The transformations I'd like to apply are actually simple scaling operations - I just want to apply weights to the input vectors dynamically.

gojomo commented 8 years ago

The Cython code is not compiled Python – it's alternate code, written in the Python-like Cython language under more constraints that allow better performance (and greater multithreaded parallelism). All the Cython source is available for you to modify just like the Python code – see the files ending in .pyx or .pxd. However, getting things right can be trickier – more like C programming – and the .pyx is compiled into autogenerated .c files (which should not be edited), which then become the shared library usable from Python.

If the weights are the same for all examples, you could directly modify syn0 before training, or interspersed with training (in the pure-Python code). But to apply different scaling for every text-example/training-example, and still run the optimized code, you'll likely need to modify the Cython code.
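
For the first case (weights identical across all examples), a minimal sketch of scaling syn0 before training; the documents variable, the weights dict, and the model parameters are hypothetical/illustrative:

# Minimal sketch, assuming fixed per-word weights that should affect all
# subsequent training: scale the corresponding rows of syn0 in place.
import numpy as np
from gensim.models import Doc2Vec

model = Doc2Vec(size=100, min_count=2)  # illustrative parameters
model.build_vocab(documents)            # `documents` assumed to exist

weights = {'good': 1.5, 'bad': 0.5}     # hypothetical fixed per-word scaling factors
for word, w in weights.items():
    if word in model.vocab:
        model.syn0[model.vocab[word].index] *= np.float32(w)  # scale that word's input vector in place

model.train(documents)                  # training then starts from the scaled vectors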

ghost commented 8 years ago

@gojomo Thank you for explaining that to me - I've never dealt with Cython before, but it's good to know what I'm actually dealing with here :)

Unfortunately, since the weights for a word are not the same across all examples, I cannot apply them to syn0 once and for all. At the moment I do it like this

l1 = np.sum((word_vectors[word2_indexes].T * word_weights[idx]).T, axis=0) + np.sum(doctag_vectors[doctag_indexes], axis=0)

but as you already said this is very slow now.

The only way I can potentially see around this, without having to alter the Cython implementation, would be to use the word_vectors input parameter of train_document_dm.

def train_document_dm(model, doc_words, #...
                      word_vectors=None, 
                     # ... 
                      ):

    if word_vectors is None:
        word_vectors = model.syn0
    # ...

One could simply set word_vectors to model.syn0 beforehand, but apply unique weights to the particular vectors required for the current sentence.

The only thing I don't know here is where it's getting decided whether the Python or the Cython path is going to be used.

ghost commented 8 years ago

I'm thinking about something like this:

    elif self.dm_weighted:

        windices = [i for i in range(len(doc.words)) if doc.words[i] in self.vocab and
                    self.vocab[doc.words[i]].sample_int > self.random.rand() * 2 ** 32]

        word_weights = np.asarray([doc.weights[i] for i in windices])

        # Make copy of affected vectors for later restoration
        word_vectors_copy = self.syn0[windices].copy()

        # Apply weights
        self.syn0 = (self.syn0[windices].T * word_weights).T

        tally += train_document_dm(self, doc.words, doctag_indexes, alpha, work, neu1,
                                   # Pass those word vectors directly
                                   word_vectors=self.syn0, 
                                   doctag_vectors=doctag_vectors, doctag_locks=doctag_locks)

        # Restore unscaled vectors
        self.syn0[windices] = word_vectors_copy

Unfortunately it does not work. It appears that changing self.syn0 causes a problem: the program does run, but at some point it simply crashes.

gojomo commented 8 years ago

You could transform syn0 in place with different weights per word, but note the transformation will stick (not be specific to a single forward-propagation).

The word_vectors parameter is already self.syn0 in all cases. You could replace it with something else – your transformed syn0 – but as syn0 is typically quite large, and train_document_dm() is called for every text example, that'd be a costly operation.

When the code loads, either the Cython methods are imported, or the Python alternatives are defined. So it's not decided each time - only one or the other implementation of each method is present. See https://github.com/RaRe-Technologies/gensim/blob/a84f64e7b617a3983c2b332c8383e1a30b14db5d/gensim/models/doc2vec.py#L60
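
Abbreviated, the pattern there is a try/except around the Cython import, with the pure-Python fallbacks only defined when that import fails, roughly:

# Abbreviated sketch of the import-or-fallback pattern in gensim/models/doc2vec.py:
# if the compiled extension imports, its functions are used for the whole session;
# otherwise the slow pure-Python versions are defined instead.
try:
    from gensim.models.doc2vec_inner import train_document_dm  # optimized Cython path
    from gensim.models.word2vec_inner import FAST_VERSION
except ImportError:
    FAST_VERSION = -1  # signals that only the slow pure-Python path is available

    def train_document_dm(model, doc_words, doctag_indexes, alpha, work=None, neu1=None,
                          learn_doctags=True, learn_words=True, learn_hidden=True,
                          word_vectors=None, word_locks=None, doctag_vectors=None, doctag_locks=None):
        # pure-Python / numpy implementation (much slower)
        ...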

I'm not sure where your example snippet is intended to appear, but a crash likely indicates your new syn0 isn't the exact shape/native-type the optimized code expects (i.e. the same as the original). You probably don't need to copy syn0 - just hold a reference to the original array. Your later assignment isn't going to rewrite that object in place, but will replace the reference. Then make sure your new syn0 and the old one are identical in shape/native-types. (Still, I'd expect creating a new syn0 for every text-example to be too slow for your needs.)
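
The rebinding-vs-in-place distinction can be seen in a small toy example (the arrays and indices here are made up):

# Toy illustration: rebinding the name `syn0` vs. writing into the existing array.
import numpy as np

syn0 = np.ones((4, 3), dtype=np.float32)
original = syn0                 # a second reference to the same array, not a copy

windices = [0, 2]
weights = np.array([2.0, 0.5], dtype=np.float32)

# Rebinding: builds a NEW (2, 3) array and points the name `syn0` at it.
# `original` still refers to the untouched (4, 3) array, and the new array
# no longer has the shape the optimized code expects.
syn0 = (syn0[windices].T * weights).T

# In-place: writes the scaled rows back into the SAME (4, 3) array,
# preserving its shape and dtype.
original[windices] *= weights[:, np.newaxis]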

ghost commented 8 years ago

...but note the transformation will stick (not be specific to a single forward-propagation).

I'm aware of that fact but thank you for pointing it out :)


I was able to make it work. The issue was that I didn't use an augmented assignment statement for syn0. From the Python docs:

An augmented assignment expression like x += 1 can be rewritten as x = x + 1 to achieve a similar, but not exactly equal effect. In the augmented version, x is only evaluated once. Also, when possible, the actual operation is performed in-place, meaning that rather than creating a new object and assigning that to the target, the old object is modified instead.

My snippet appears in _do_train_job (used by worker threads inside their main method worker_loop()), which I'm overriding in my WeightedDoc2Vec class:

class WeightedDoc2Vec(Doc2Vec):

    # ...

    def _do_train_job(self, job, alpha, inits):
        work, neu1 = inits
        tally = 0
        for doc in job:
            indexed_doctags = self.docvecs.indexed_doctags(doc.tags)
            doctag_indexes, doctag_vectors, doctag_locks, ignored = indexed_doctags
            if self.sg:
                # ...
            elif self.dm_concat:
                # ...
            elif self.dm_weighted:

                # Get word indices 
                windices = [self.vocab[doc.words[i]].index for i in range(len(doc.words))]

                # Grab the weights
                word_weights = np.asarray(doc.weights)

                # Make copy of affected vectors
                word_vectors_copy = self.syn0[windices].copy()

                # Apply weights (important to use augmented assignment here)
                self.syn0[windices] *= (np.ones((len(word_weights),self.syn0.shape[1])).T * word_weights).T

                # Call the optimized 'train_document_dm' function
                tally += train_document_dm(self, doc.words, doctag_indexes, alpha, work, neu1,
                                           word_vectors=self.syn0,
                                           doctag_vectors=doctag_vectors, doctag_locks=doctag_locks)

                # Restore unscaled vectors
                self.syn0[windices] = word_vectors_copy

This way the overhead shouldn't be too bad, since I "only" copy the affected vectors and restore their values after train_document_dm has been called.
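
For completeness, a hypothetical usage sketch; the WeightedTaggedDocument type and the way dm_weighted gets switched on are invented for illustration (not gensim API), but they match what _do_train_job above expects: a weights list parallel to words.

# Hypothetical usage of the WeightedDoc2Vec subclass above.
from collections import namedtuple

WeightedTaggedDocument = namedtuple('WeightedTaggedDocument', 'words tags weights')

docs = [
    WeightedTaggedDocument(words=['human', 'interface', 'computer'],
                           tags=['doc_0'],
                           weights=[1.0, 2.0, 0.5]),
    WeightedTaggedDocument(words=['graph', 'trees', 'computer'],
                           tags=['doc_1'],
                           weights=[0.8, 1.2, 1.0]),
]

model = WeightedDoc2Vec(size=50, min_count=1)  # min_count=1 so every word lands in the vocab, since windices indexes self.vocab directly
model.dm_weighted = True                       # however the subclass chooses to expose this switch
model.build_vocab(docs)
model.train(docs)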