ryankohl commented 9 years ago

After training a Doc2Vec model (with the Brown corpus), I went ahead and attempted scoring. However, I got the following error: AttributeError: 'Doc2Vec' object has no attribute '_prepare_items'

And indeed, neither Word2Vec nor Doc2Vec seems to have that method. In an older version of gensim (0.10.3), there was a reference to '_prepare_sentences', which was indeed a method in Doc2Vec.

Guessing the method just got lost in a refactor?

gojomo commented 9 years ago

Yes – I seem to have refactored away a method @mataddy's scoring work relied upon.

You can probably restore functionality by finding the _prepare_items method that existed at the same time as the line calling it.

More generally, the score* methods could parallel the refactorings in the train* paths, and a unit test would help guard against future breakage.

ryankohl commented 9 years ago

Turns out that there may be another path - restoring the _prepare_items method led to a cascade of other tweaks needed, so I looked around and found the infer_vector() method, which seems to do the job. However, that method was present back in 0.10.3, so perhaps the score() method was doing something different. At the end of the day, I'm only looking for word/paragraph vectors, so perhaps infer_vector() is the right call?

mataddy commented 9 years ago

hi @gojomo and @ryankohl, there is no score method for Doc2Vec, so I'm a bit confused. But regardless, if _prepare_items has indeed been removed from Word2Vec after the last release then I'll make sure to parallel the changes in train asap. I'm travelling right now but will take a look next chance I get.

also @gojomo I will add a unit test to TestWord2VecModel in the near future.

ryankohl commented 9 years ago

Hey @mataddy - the score method is inherited from the Word2Vec class.
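
A minimal sketch of how this kind of failure can arise, using hypothetical stand-in classes (`BaseModel`, `DocModel`) rather than gensim's actual code: the subclass defines no `score()` of its own, so it inherits one that still calls a helper removed in a refactor.

```python
class BaseModel:
    """Hypothetical stand-in for the parent class (not gensim's real code)."""

    def score(self, sentences):
        # Calls a helper that an earlier refactor removed from the class.
        return [len(s) for s in self._prepare_items(sentences)]


class DocModel(BaseModel):
    """Hypothetical subclass: defines no score() itself, so it inherits one."""


def try_score(model, sentences):
    """Return the error message raised by model.score(), or None on success."""
    try:
        model.score(sentences)
    except AttributeError as exc:
        return str(exc)
    return None


message = try_score(DocModel(), [["a", "b"]])
print(message)
```

The error names the subclass even though the broken call lives in the parent, which is why the traceback can look like a Doc2Vec problem.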

mataddy commented 9 years ago

thanks @ryankohl, that makes sense.

gojomo commented 9 years ago

I'm not sure 'score' has a strong/straightforward interpretation for Doc2Vec – except indirectly because of the similarity of many of the training modes.... which might lead to plausibly-useful results even without intentional design.

infer_vector returns a model-compatible N-dimensional doc vector for new input – so I don't see it as being directly analogous to the (scalar) 'score' value. Though perhaps, if you squint at it in some corner case, it can have similar utility? (Maybe if you do a single-pass infer, starting from an all-zeros vector, the resulting vector's magnitude is somehow roughly correlative to 'score'? Or if you keep running the inference gradient-descent until the internal error stops getting better, the best error achievable, in a full-document pass, is roughly usable like a 'score'?)

Ultimately it may be possible to merge score-returning functionality with the train methods: have one extra parameter, default false, as to whether the errors-during-training should be accumulated aside somewhere in the proper manner. (Or maybe that's just what they would do when all the optional train_words/train_hidden switches are off?) But I'd only want to do this if the readability/performance impact in the usual real-training case is minimal.

mataddy commented 9 years ago

@gojomo I agree on first point: score for w2v is based upon interpretation of the objective as a pseudo-likelihood, and I don't think I could easily stretch this interpretation to d2v. But that doesn't mean it wouldn't give something useful! Since it will inevitably be inherited, I'll at least make sure it doesn't break in d2v.
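
To make the pseudo-likelihood reading concrete, here is a rough sketch that scores a sentence as the sum of skip-gram log-probabilities over (center, context) pairs. It swaps in a plain full softmax and toy vectors for brevity, so it is an illustration of the idea and not necessarily what gensim computes; `log_softmax_prob` and `score_sentence` are hypothetical names.

```python
import math


def log_softmax_prob(center_vec, output_vecs, target):
    """log p(target | center) under a full softmax over the vocabulary."""
    logits = [sum(c * o for c, o in zip(center_vec, out)) for out in output_vecs]
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return logits[target] - log_norm


def score_sentence(word_ids, input_vecs, output_vecs, window=1):
    """Sum of skip-gram log-probabilities over all (center, context) pairs."""
    total = 0.0
    for pos, center in enumerate(word_ids):
        lo = max(0, pos - window)
        hi = min(len(word_ids), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:
                total += log_softmax_prob(
                    input_vecs[center], output_vecs, word_ids[ctx_pos]
                )
    return total


# Toy model: 4-word vocabulary with all-zero vectors (uniform predictions).
vocab_size, dim = 4, 2
zeros = [[0.0] * dim for _ in range(vocab_size)]
print(score_sentence([0, 1, 2], zeros, zeros))
```

A higher (less negative) score means the sentence is more plausible under the model; nothing in the model is updated while scoring.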

need to think more on 2nd point.

on 3rd: yep, that's definitely the way to go. it should just require a 'score' flag that then leads to an early exit and returns the scores. I should be able to implement that without any effect on readability/performance. I hadn't done that originally just to give me the freedom to experiment without messing with train.
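
A toy sketch of the flag-with-early-exit idea, using a single logistic step instead of the real training routines. The function name `pair_step` and the `score_only` parameter are assumptions made up for illustration, not gensim API.

```python
import math


def pair_step(weights, features, label, lr=0.1, score_only=False):
    """One logistic step; returns the log-likelihood of `label` before any update."""
    z = sum(w * x for w, x in zip(weights, features))
    prob = 1.0 / (1.0 + math.exp(-z))
    log_lik = math.log(prob if label == 1 else 1.0 - prob)
    if score_only:
        return log_lik  # early exit: the weights are left untouched
    grad = label - prob
    for i, x in enumerate(features):
        weights[i] += lr * grad * x  # in-place gradient update
    return log_lik


weights = [0.0, 0.0]
print(pair_step(weights, [1.0, 2.0], 1, score_only=True), weights)
```

The forward pass is shared by both modes, and only the update is skipped when scoring.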

piskvorky commented 9 years ago

This came up on the gensim mailing list as well: https://groups.google.com/forum/#!topic/gensim/cGxtQUf0_2s

mataddy commented 9 years ago

This is fixed in the pull request at #417.

I ended up just porting the updates to train() into my score() set of functions. I think I've got a good grasp on how @gojomo refactored; makes sense and looks nice.

There are enough real differences between scoring and training that I didn't see an obvious way to subsume score() into train() via an argument flag without leading to more overhead in scoring than we need. There is a lot of repeated code however; perhaps I could figure out a way that they could share some functions.

my next task will be to write a higher level wrapper 'probability' function that uses score() to return class probabilities. Then I'll write up the tutorial that I promised @piskvorky a while back. Off on a bunch more travel now so it won't be for a couple of weeks that I finish those.
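
One way such a wrapper could look, as a sketch only: assume one model has been trained per class, take each model's log-likelihood score for a document, and apply Bayes' rule. The name `class_probabilities` and the uniform default prior are assumptions for illustration, not the function that was eventually written.

```python
import math


def class_probabilities(log_scores, priors=None):
    """Turn per-class log-likelihood scores into posterior class probabilities."""
    n = len(log_scores)
    if priors is None:
        priors = [1.0 / n] * n  # assumption: uniform prior over classes
    joint = [s + math.log(p) for s, p in zip(log_scores, priors)]
    peak = max(joint)  # subtract the max before exponentiating, for stability
    weights = [math.exp(j - peak) for j in joint]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical scores of one document under two per-class models.
probs = class_probabilities([-10.0, -12.0])
print(probs)
```

Working in log space and subtracting the maximum keeps the computation stable when scores for long documents are large negative numbers.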

mataddy commented 9 years ago

PS @gojomo I also added a Scoring test in test_word2vec

piskvorky commented 9 years ago

417 merged, closing this issue. Thanks Matt!

A tutorial will be amazing indeed :) We've pushed a lot of new functionality lately, which tutorial(s) will make much easier to discover / promote.

piskvorky / gensim

AttributeError: 'Doc2Vec' object has no attribute '_prepare_items' #407
