piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

conversion function naming #1270

Closed amueller closed 7 years ago

amueller commented 7 years ago

Hey. I'm trying to go from the CSR format used in scikit-learn to the gensim format and I'm a bit confused. There are some instructions here: https://radimrehurek.com/gensim/tut1.html#compatibility-with-numpy-and-scipy

But the naming seems odd. Why is "corpus to CSC" the inverse of "sparse to corpus"? Looking at the helper functions here is even more confusing imo.

Does "corpus" mean an iterator over lists of tuples or what is the interface here? There are some other functions like:


gensim.matutils.sparse2full(doc, length)

    Convert a document in sparse document format (=sequence of 2-tuples) into a dense np array (of size length).

and full2sparse. In this context "sparse" means a sequence of 2-tuples, while in "Sparse2Corpus" the "sparse" means "scipy sparse matrix".

Is it possible to explain what "sparse", "scipy", "dense" and "corpus" mean in all these functions? It seems to me like there is no consistent convention.

tmylk commented 7 years ago

In this context a corpus is a list/iterator/generator of documents, where each document is a list of 2-tuples in bag-of-words format.
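A minimal sketch of moving between this format and a scikit-learn-style scipy matrix, using the documented matutils helpers (the toy numbers are just an illustration):

import numpy as np
from scipy.sparse import csr_matrix
from gensim import matutils

# A gensim "corpus": an iterable of documents, each a list of (feature_id, feature_weight) 2-tuples.
corpus = [[(0, 1.0), (2, 3.0)],
          [(1, 2.0)]]

# gensim corpus -> scipy sparse; the resulting CSC matrix has documents as columns.
csc = matutils.corpus2csc(corpus, num_terms=3)

# scipy sparse -> gensim corpus; a scikit-learn CSR matrix has documents as rows,
# so tell Sparse2Corpus not to treat columns as documents.
X = csr_matrix(np.array([[1.0, 0.0, 3.0],
                         [0.0, 2.0, 0.0]]))
corpus_again = matutils.Sparse2Corpus(X, documents_columns=False)

# A single document: gensim 2-tuples <-> dense numpy vector.
dense = matutils.sparse2full(corpus[0], length=3)  # array([1., 0., 3.])
pairs = matutils.full2sparse(dense)                # back to [(0, 1.0), (2, 3.0)]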

There is more context in any2sparse

What is the context of this conversion? We have a sklearn pipeline interface for LDA and LSI in gensim, as described in this ipynb

amueller commented 7 years ago

I want to use word2vec though ;) Maybe it would be good to have a general wrapper that can be applied to any transformation?

The context is that I'm trying to teach my students about word2vec using gensim and we have only used the sklearn representation so far. I think I got the representation but I'm still confused by the naming.

So in any2sparse, "sparse" again means the gensim format, so the opposite of "Sparse2Corpus", i.e. these go in the same direction even though the naming suggests they go in opposite directions? any2sparse also only works on lists of vectors it seems, which makes sense for streaming but is not what you'd have in sklearn.

tmylk commented 7 years ago

The input to word2vec is not a corpus (a.k.a. a list of tuples), but an iterable of lists of words - sentences. For example, LineSentence.
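A minimal sketch of that input format (the file path is just a placeholder):

from gensim.models.word2vec import LineSentence

# Either materialize the sentences yourself as lists of tokens...
sentences = [['first', 'sentence'], ['second', 'sentence']]

# ...or stream them from a plain-text file with one sentence per line,
# tokens separated by whitespace.
sentences = LineSentence('corpus.txt')  # placeholder path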

tmylk commented 7 years ago

Actually the simplest gensim - sklearn word2vec integration code is in the shorttext package https://pythonhosted.org/shorttext/tutorial_sumvec.html

amueller commented 7 years ago

I only want to transform, not train, so then the interface is word-based, right?

amueller commented 7 years ago

thanks for the hint about shorttext. That doesn't have paragraph2vec, though, right? Btw, is there a pretrained model for that?

tmylk commented 7 years ago

model.wv[['office', 'products']] returns the vector representation as in shorttext here
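A sketch of that transform-only use, averaging word vectors into a fixed-length document feature (roughly what the shorttext sumvec tutorial does; the toy model and tokens here are just placeholders, not a recommendation):

import numpy as np
from gensim.models import Word2Vec

# Toy model purely for illustration; in practice use your trained/loaded model.
model = Word2Vec([['office', 'products', 'and', 'services'],
                  ['other', 'products']], size=50, min_count=1)

def doc_vector(model, tokens):
    # Keep only in-vocabulary tokens, then average their word vectors.
    tokens = [t for t in tokens if t in model.wv.vocab]
    if not tokens:
        return np.zeros(model.vector_size)
    return np.mean(model.wv[tokens], axis=0)

doc_vector(model, ['office', 'products'])  # fixed-length feature usable as a sklearn row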

Not aware of a large doc2vec pre-trained model. This week there will be a small trained doc2vec model with tensorboard viz in this PR by @parulsethi

amueller commented 7 years ago

@tmylk awesome, thanks! Still think you need to work on your conversion function naming ;)

amueller commented 7 years ago

pretrained doc2vec here: https://github.com/jhlau/doc2vec though it's unclear if that's applicable to other domains.

amueller commented 7 years ago

somewhat unrelated, but have you thought about including the feature of using a pretrained word model for doc2vec, as done here https://github.com/jhlau/gensim/commit/9dc0f798f46713c4efadf5a9953d929b9ee0b073 ?

tmylk commented 7 years ago

Initializing word vectors from pre-trained ones is possible to do manually in the main branch, without that fork. Though it's debated on the mailing list by @gojomo whether that's helpful or not.

amueller commented 7 years ago

hm, I upgraded to 1.0.1 and model.wv still doesn't exist.

tmylk commented 7 years ago

That is strange

import gensim
sentences = [['first', 'sentence'], ['second', 'sentence']]
# train word2vec on the two sentences
model = gensim.models.Word2Vec(sentences, min_count=1)
model['first']     # old-style access, still works
model.wv['first']  # new-style access via the KeyedVectors object in model.wv
amueller commented 7 years ago

ah, it's probably because of the way I load the model?

from gensim import models
w = models.KeyedVectors.load_word2vec_format(
    '../GoogleNews-vectors-negative300.bin', binary=True)
tmylk commented 7 years ago

that is not the model, that is just the vectors from the model :) You cannot train them, only run read-only queries against them. w['first'] will work then
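A short sketch of those read-only queries, reusing the load call above (the query words are just examples):

from gensim import models

w = models.KeyedVectors.load_word2vec_format(
    '../GoogleNews-vectors-negative300.bin', binary=True)

w['first']                       # raw 300-d vector for a single word
w.most_similar('first', topn=5)  # nearest neighbours by cosine similarity
w.similarity('first', 'second')  # cosine similarity between two words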

gojomo commented 7 years ago

Looking at this issue history, I see @amueller comments that seem to be reacting to @tmylk answers... but no @tmylk comments at all. Some github bug?

If you're loading directly into a KeyedVectors, no need to access the .wv property - your object is already the vectors.

There's experimental support for merging some pretrained-word-vectors into a prediscovered vocabulary, in intersect_word2vec_format(). And folks who study the source/structures can try stitching such a model together (similar to the @jhlau code referenced). But it seems to me a lot more people want to do that, than have a good reason or understanding for why it should be done, and so I'd like to see some experiments/write-ups demonstrating the value (and limits) of such an approach before adding any further explicit support. (One of the best Doc2Vec modes for many applications, pure PV-DBOW without word-training, doesn't even use/create input/projection word-vectors.)
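For reference, a minimal sketch of that experimental path (gensim 1.x/2.x-era API; the toy sentences, the file name, and the lockf value are placeholders, not a recommendation):

from gensim.models import Word2Vec

sentences = [['first', 'sentence'], ['second', 'sentence']]  # stand-in for the current corpus C
model = Word2Vec(size=300, min_count=1)
model.build_vocab(sentences)

# Overwrite vectors for words already in the model's vocabulary with pre-trained ones;
# lockf=1.0 lets them keep training on C, lockf=0.0 freezes them in place.
model.intersect_word2vec_format('pretrained-vectors.bin', binary=True, lockf=1.0)  # placeholder file

model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)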

jhlau commented 7 years ago

There's experimental support for merging some pretrained-word-vectors into a prediscovered vocabulary, in intersect_word2vec_format(). And folks who study the source/structures can try stitching such a model together (similar to the @jhlau code referenced). But it seems to me a lot more people want to do that, than have a good reason or understanding for why it should be done, and so I'd like to see some experiments/write-ups demonstrating the value (and limits) of such an approach before adding any further explicit support. (One of the best Doc2Vec modes for many applications, pure PV-DBOW without word-training, doesn't even use/create input/projection word-vectors.)

We've done just that. It's all documented in this paper: https://arxiv.org/abs/1607.05368

Long story short, pre-trained word embeddings help most when you are training doc2vec on a small document collection (e.g. a specialized domain of text).

piskvorky commented 7 years ago

I can also only see @amueller's side of the conversation.

Sparse2Corpus should probably be called Scipy2Sparse, for consistency.

The confusion comes from the fact that both scipy and gensim have been calling their data structure "sparse", for almost a decade now... :( In scipy, it denotes a sparse matrix in CSR / CSC / whatever; in gensim it's anything that you can iterate over, yielding iterables of (feature_id, feature_weight) 2-tuples.

Maybe call it "gensim-sparse" vs "scipy-sparse"?
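For example, the same toy document in the two senses (values chosen arbitrarily):

from scipy.sparse import csr_matrix

gensim_sparse = [(0, 1.0), (2, 3.0)]          # "gensim-sparse": (feature_id, feature_weight) 2-tuples
scipy_sparse = csr_matrix([[1.0, 0.0, 3.0]])  # "scipy-sparse": CSR matrix, one document per row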

I'm also +1 on renaming the generic gensim structure to something else entirely. "Sparse" is taken (scipy). "Corpus" is taken (NLP). Any other ideas?

gojomo commented 7 years ago

@jhlau Thanks for your comment & analysis - but I found some of the parameter-choices and evaluations/explanations in your paper confusing, to the point of not being convinced of that conclusion. Some of my observations are in the gensim forum messages at https://groups.google.com/d/msg/gensim/MYbZBkM5KKA/lBKGf7WNDwAJ and https://groups.google.com/d/msg/gensim/MYbZBkM5KKA/j5OKViKzEgAJ. As an example, the claim in section 5 – "More importantly, using pre-trained word embeddings never harms the performance" – is directly contradicted by the above-referenced table, where on several of the subcollections, the non-pretrained DBOW outperforms either one or the other choice of pretrained word-vectors. (And on the 'programmers' forum, it outperforms both.)

jhlau commented 7 years ago

but I found some of the parameter-choices and evaluations/explanations in your paper confusing, to the point of not being convinced of that conclusion.

Not sure what you are confused about, but looking at your comments on the links:

Pure PV-DBOW (dm=0, dbow_words=0) mode is fast and often a great performer on downstream tasks. It doesn't consult or create the traditional (input/'projection-layer') word-vectors at all. Whether they are zeroed-out, random, or pre-loaded from word-vectors created earlier won't make any difference.

PV-DBOW with concurrent skip-gram training (dm=0, dbow_words=1) will interleave wordvec-training with docvec-training. It can start with random word-vectors, just like plain word-vector training, and learn all the word-vectors/doc-vectors together, based on the current training corpus. The word-vectors and doc-vectors will influence each other, for better or worse, via the shared hidden-to-output-layer weights. (The training is slower, and the doc-vectors essentially have to 'share the coordinate space' with the word-vectors, and with typical window values the word vectors are in-aggregate getting far more training cycles.)

PV-DM (dm=1) inherently mixes word- and doc-vectors during every training example, but also like PV-DBOW+SG, can start with random word-vectors and learn all that's needed, from the current corpus, concurrently during training.

That seems to correspond to my understanding of doc2vec. What we found is that pure PV-DBOW ('dm=0, dbow_words=0') is pretty bad. PV-DBOW is generally the best option ('dm=0, dbow_words=1'), and PV-DM ('dm=1') at best performs on-par with PV-DBOW, but is often slightly worse and requires more training iterations (since its parameter size is much larger).

Feel free to ask any specific questions that you feel are not clear. I wasn't aware of any of these discussions as no one had tagged me.

There's experimental support for merging some pretrained-word-vectors into a prediscovered vocabulary, in intersect_word2vec_format().

This function does not really work, as it uses pre-trained embeddings only for words that are in the model. The forked version of gensim that I've built on the other hand also loads new word embeddings. That is the key difference.

in section 5 – "More importantly, using pre-trained word embeddings never harms the performance" – is directly contradicted by the above-referenced table, where on several of the subcollections, the non-pretrained DBOW outperforms either one or the other choice of pretrained word-vectors. (And on the 'programmers' forum, it outperforms both.)

In section 5, table 6, what we really meant is that adding pre-trained word vectors doesn't harm performance substantially. Overall, we see that using pre-trained embeddings is generally beneficial for small training collections, and in the worst case it'd give similar performance, so there's little reason not to do it.

piskvorky commented 7 years ago

Nice thread hijacking! 😆

Perhaps the mailing list is a better place for this?

amueller commented 7 years ago

I take all the blame for mixing about 10 issues into one.

The confusion comes from the fact that both scipy and gensim have been calling their data structure "sparse", for almost a decade now

exactly, that was confusing for me. If you used "corpus" in some consistent way, that would be fine by me, but I'm not an NLP person, and an NLP person might be confused by that. Not sure what kind of data structures nltk has, for example.

amueller commented 7 years ago

One of the best Doc2Vec modes for many applications, pure PV-DBOW without word-training, doesn't even use/create input/projection word-vectors.

Can you give a reference for that - even how that works? That's not described in the original paper, right? [Sorry for hijack-continuation, I'm already on too many mailing lists. Maybe a separate issue?]

gojomo commented 7 years ago

@piskvorky - Could discuss on gensim list if @jhlau would also like that forum, but keeping full context here for now.

@jhlau -

For background, I am the original implementor of the dm_concat and dbow_words options in gensim, and the intersect_word2vec_format() method.

That seems to correspond to my understanding of doc2vec. What we found is that pure PV-DBOW ('dm=0, dbow_words=0') is pretty bad. PV-DBOW is generally the best option ('dm=0, dbow_words=1'), and PV-DM ('dm=1') at best performs on-par with PV-DBOW, but is often slightly worse and requires more training iterations (since its parameter size is much larger).

I didn't see any specific measurements in the paper about pure PV-DBOW – am I misreading something? (There, as here, I only see statements to the effect of, "we tried it but it was pretty bad".)

As mentioned in my 2nd-referenced-message, comparing pure PV-DBOW with arguments like dm=0, dbow_words=0, iter=n against PV-DBOW-plus-skip-gram with arguments like dm=0, dbow_words=1, window=15, iter=n may not be checking as much the value of words, but the value of the 16X-more training effort (which happens to be mostly focused on words). A more meaningful comparison would be dm=0, dbow_words=0, iter=15*n vs dm=0, dbow_words=1, window=15, iter=n – which I conjecture would have roughly the same runtime. With no indication such an apples-to-apples comparison was made, I can't assign much weight to the unquantified "pretty bad" assessment.
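To make the suggested comparison concrete, a sketch of the two configurations (gensim 1.x-era Doc2Vec parameters; the toy documents, dimensionality, and n are placeholders):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [TaggedDocument(['first', 'sentence'], [0]),
             TaggedDocument(['second', 'sentence'], [1])]  # stand-in corpus
n = 5  # placeholder base epoch count

# Pure PV-DBOW: no word-vector training, so each epoch is much cheaper;
# spend the saved time on more epochs.
pure_dbow = Doc2Vec(documents, dm=0, dbow_words=0, size=300, iter=15 * n, min_count=1)

# PV-DBOW with interleaved skip-gram word training over a wide window.
dbow_plus_sg = Doc2Vec(documents, dm=0, dbow_words=1, window=15, size=300, iter=n, min_count=1)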

From the paper's description & your posted code, it appears all pvdm tests were done with the non-default dm_concat=1 mode. As noted in my message, I've not yet found any cases where this mode is worth the massive extra time/memory overhead. (It's unfortunate that the original Mikolov/Le paper touts this method, but implementations are rare, and so people may think it's the key to their non-reproducible results.) I try to warn all but the most adventurous, rigorous users away from this mode, and perhaps the gensim doc-comment should be even more discouraging. But the upshot is that if all your paper's pvdm tests were with dm_concat=1, they are unlikely to be generalizable to the more practical and commonly-used dm=1, dm_concat=0 mode.

There's experimental support for merging some pretrained-word-vectors into a prediscovered vocabulary, in intersect_word2vec_format().

This function does not really work, as it uses pre-trained embeddings only for words that are in the model. The forked version of gensim that I've built on the other hand also loads new word embeddings. That is the key difference.

Yes, but if someone is only computing doc-vectors over a current corpus C, and will be doing further training over just examples from current corpus C, and further inference just using documents from corpus C, why would any words that never appear in C be of any value? Sure, earlier larger corpus P may have pre-trained lots of other words. But any training/inference on C will never update or even consult those slots in the vector array, so why load them?

Now, there might be some vague intuition that bringing in such words could help later, when you start presenting new documents for inference, say from some new set D, that have words that are outside the vocabulary of C, but were in P. But there are problems with this hope:

These subtle issues are why I'm wary of a superficially-simple API to "bring in pretrained embeddings". That would make that step seem like an easy win, when I don't yet consider the evidence for that (including your paper) to be strong. And it introduces tradeoffs and unintuitive behaviors with regard to the P-but-not-C vocabulary words, and the handling of D examples with such words.

I see the limits and lock-options of intersect_word2vec_format() as somewhat protecting users from unwarranted assumptions and false intuitions about what imported-vectors might achieve. And even with all this said, if a user really wants words in their model imported from P that have made-up frequency values, and can't be meaningfully tuned by training over C, and may inject some arbitrary randomness in later inference over documents like those in D, I would still suggest leveraging intersect_word2vec_format(). For example, they could add a few synthetic texts to their C corpus, with the extra P words – and these noise docs are unlikely to have much effect on the overall model quality. Or, they can call the three submethods of build_vocab() – scan_vocab(), scale_vocab(), finalize_vocab() – separately, and manually add entries for the extra P words just after scan_vocab(). These few lines of code outside the Word2Vec model can achieve the same effects, but avoid the implied endorsement of an option that presents a high risk of "shooting-self-in-foot".
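A rough sketch of that manual route (the submethod names follow the gensim 1.x Word2Vec source; the raw_vocab attribute, extra_words_from_P, fake_count and the file name are assumptions for illustration, not an endorsed API):

from gensim.models import Word2Vec

sentences = [['first', 'sentence'], ['second', 'sentence']]  # current corpus C
extra_words_from_P = ['holiday', 'weather']                  # P-only words to 'rescue'
fake_count = 10                                              # made-up frequency value

model = Word2Vec(size=300, min_count=1)

# The three steps normally bundled inside build_vocab():
model.scan_vocab(sentences)   # survey the current corpus C

# Inject the extra words with made-up counts so they survive the min_count cut in scale_vocab().
for word in extra_words_from_P:
    model.raw_vocab[word] = model.raw_vocab.get(word, 0) + fake_count

model.scale_vocab()           # apply min_count / sample, build the final vocab
model.finalize_vocab()        # allocate the vector arrays

model.intersect_word2vec_format('pretrained-vectors.bin', binary=True)  # placeholder file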

In section 5, table 6, what we really meant is that adding pre-trained word vectors doesn't harm performance substantially. Overall, we see that using pre-trained embeddings is generally beneficial for small training collections, and in the worst case it'd give similar performance, so there's little reason not to do it.

The benefits in that table generally look small to me, and I suspect they'd be even smaller with the fairer training-time comparison I suggest above. But "never harms" (with italicized emphasis!) was an unsupportable word choice if in fact you really meant 'substantially', and the adjacent data table provides actual examples where pre-trained embeddings harmed the evaluation score. Such a mismatch also lowers my confidence in all nearby claims.

gojomo commented 7 years ago

@amueller –

One of the best Doc2Vec modes for many applications, pure PV-DBOW without word-training, doesn't even use/create input/projection word-vectors.

Can you give a reference for that - even how that works? That's not described in the original paper, right? [Sorry for highjack-continuation, I'm already on too many mailing lists. maybe separate issue?]

The original Paragraph Vectors paper only describes that PV-DBOW mode: the doc-vector-in-training, alone, is optimized to predict each word in turn. It's not averaged with any word-vectors, nor does the paper explicitly describe training word-vectors at the same time – though it's a naturally composable approach, given how analogous PV-DBOW is with skip-gram words, with the PV-DBOW doc-vector being like a magic pseudo-word that, within one text example, has an 'infinite' effective window, floating into every context.

That 'floating word' is indeed how Mikolov's small patch to word2vec.c, adding a -sentence-vector option to demonstrate Paragraph Vectors, worked. The first token on any line was treated as this special, contributes-to-every-context word – and regular word-vector training was always still happening. So by my interpretation, that demonstration was not a literal implementation of the original Paragraph Vectors paper algorithm, but a demo of combined, interleaved PV & word-vector training.

The followup paper, "Document Embeddings with Paragraph Vector" (https://arxiv.org/abs/1507.07998) seems to share my interpretation, because it observes that word-vector training was an extra option they chose (section 3 paragraph 2):

We also jointly trained word embeddings with the paragraph vectors since preliminary experiments showed that this can improve the quality of the paragraph vectors.

However, the only places this paper compares "PV w/out word-training" against PV-with-word-training, in figures 4 and 5, the without-word-training is very similar in evaluation score, and even better at 1-out-of-4 comparison points (lower dimensionality in figure 4). And I suspect the same conjecture I've made about @jhlau's results, that using some/all of the time saved from not-training-words to do more iterations of pure-DBOW, would be a fairer comparison and further improve plain PV-DBOW's relative performance.

jhlau commented 7 years ago

I didn't see any specific measurements in the paper about pure PV-DBOW – am I misreading something? (There, as here, I only see statements to the effect of, "we tried it but it was pretty bad".)

Indeed. Its performance is far worse than PV-DBOW with SG, so we omitted those results from the paper entirely.

As mentioned in my 2nd-referenced-message, comparing pure PV-DBOW with arguments like dm=0, dbow_words=0, iter=n against PV-DBOW-plus-skip-gram with arguments like dm=0, dbow_words=1, window=15, iter=n may not be checking as much the value of words, but the value of the 16X-more training effort (which happens to be mostly focused on words). A more meaningful comparison would be dm=0, dbow_words=0, iter=15*n vs dm=0, dbow_words=1, window=15, iter=n – which I conjecture would have roughly the same runtime. With no indication such an apples-to-apples comparison was made, I can't assign much weight to the unquantified "pretty bad" assessment.

I disagree that that is a fairer comparison. What would be a fairer comparison, though, is extracting the optimal performance from both methods. If PV-DBOW without SG takes longer to converge to optimal performance, then yes, I agree that one should train it more (but not by arbitrarily setting some 'standardised' epoch number). I did the same when comparing with PV-DM - it uses many more training epochs, but the key point is finding its best performance. I might go back and run PV-DBOW without SG to check if this is the case.

From the paper's description & your posted code, it appears all pvdm tests were done with the non-default dm_concat=1 mode. As noted in my message, I've not yet found any cases where this mode is worth the massive extra time/memory overhead. (It's unfortunate that the original Mikolov/Le paper touts this method, but implementations are rare, and so people may think it's the key to their non-reproducible results.) I try to warn all but the most adventurous, rigorous users away from this mode, and perhaps the gensim doc-comment should be even more discouraging. But the upshot is that if all your paper's pvdm tests were with dm_concat=1, they are unlikely to be generalizable to the more practical and commonly-used dm=1, dm_concat=0 mode.

The intention was to check the original paragraph vector model, so yes, I only experimented with the dm_concat=1 option. In terms of observations, we found what you've seen: the increased number of parameters is hardly worth it.

Yes, but if someone is only computing doc-vectors over a current corpus C, and will be doing further training over just examples from current corpus C, and further inference just using documents from corpus C, why would any words that never appear in C be of any value? Sure, earlier larger corpus P may have pre-trained lots of other words. But any training/inference on C will never update or even consult those slots in the vector array, so why load them?

Not quite, because often there is a vocab filter for low-frequency words. A word might have been filtered out due to this frequency threshold and excluded from the dataset, but it could be included again when you import it from a larger pre-trained word-embedding model.

Now, there might be some vague intuition that bringing in such words could help later, when you start presenting new documents for inference, say from some new set D, that have words that are outside the vocabulary of C, but were in P. But there are problems with this hope:

That wasn't quite the intention behind why the new vocab is included, for all the reasons you pointed out below.

The benefits in that table generally look small to me, and I suspect they'd be even smaller with the fairer training-time comparison I suggest above. But "never harms" (with italicized emphasis!) was an unsupportable word choice if in fact you really meant 'substantially', and the adjacent data table provides actual examples where pre-trained embeddings harmed the evaluation score. Such a mismatch also lowers my confidence in all nearby claims.

Fair point. The wording might have been a little strong, but I stand by what I said previously, and the key point is to take a step back and look at the bigger picture. Ultimately the interpretation is up to the users - they can make the choice whether they want to incorporate pre-trained embeddings or not.

gojomo commented 7 years ago

Indeed. Its performance is far worse than PV-DBOW with SG, so we omitted those results from the paper entirely.

My concern is that without seeing the numbers, & knowing what parameters were tested, it's hard to use this observation to guide future work.

I disagree that that is a fairer comparison. What would be a fairer comparison, though, is extracting the optimal performance from both methods. If PV-DBOW without SG takes longer to converge to optimal performance, then yes, I agree that one should train it more (but not by arbitrarily setting some 'standardised' epoch number). I did the same when comparing with PV-DM - it uses many more training epochs, but the key point is finding its best performance. I might go back and run PV-DBOW without SG to check if this is the case.

Sure, never mind any default epoch-counts (or epoch-ratios). The conjecture is that even though PV-DBOW-without-SG may benefit from more epochs, these are so much faster (perhaps ~15X in the window=15 case) that a deeper search of its parameters may still show it to be a comparable- or top-performer on both runtime and final-scoring. (So it doesn't "take longer" in tangible runtime, just more iterations in same-or-less time.)

If you get a chance to test that, in a comparable set-up to the published results, I'd love to see the numbers and it'd give me far more confidence in any conclusions. (Similarly, the paper's reporting of 'optimal' parameters in Table 4, §3.3, and footnotes 9 & 11 would be far more informative if it also reported the full range of alternative values tried, and in what combinations.)

The intention was to check the original paragraph vector model, so yes, I only experimented with the dm_concat=1 option. In terms of observations, we found what you've seen: the increased number of parameters is hardly worth it.

I understand that choice. But given the dubiousness of the original paper's PV-DM-with-concatenation results, comparative info about the gensim default PV-DM-with-averaging mode could be more valuable. That mode might be competitive with PV-DBOW, especially on large datasets. So if you're ever thinking of a followup paper...

Not quite, because often there is a vocab filter for low-frequency words. A word might have been filtered out due to this frequency threshold and excluded from the dataset, but it could be included again when you import it from a larger pre-trained word-embedding model.

I see. That's an interesting subset of the combined vocabulary – but it raises the same concerns about vector-quality-vs-frequency as come into play in picking a min_count, or deciding if imported vectors should be tuned by the new examples, or frozen in place (perhaps because the pretraining corpus is assumed to be more informative). In some cases discarding words that lack enough training examples to induce 'good' vectors improves the quality of the surviving words, by effectively shrinking the distances between surviving words, and removing de facto interference from the lower-resolution/more-idiosyncratic words. So I could see the bulk import, and thus 'rescue' of below-min_count words in the current corpus, as either helping or hurting – it'd need testing to know. It's even within the realm-of-outside-possibility that the best policy might be to only pre-load some lowest-frequency words – trusting that more-frequent words are best trained from their plentiful domain-specific occurrences. Such policies could be explored by users by direct tampering with the model's vocabulary between the scan_vocab() and scale_vocab() initialization steps.

amueller commented 7 years ago

@gojomo Ah, in the original paper I thought they implemented Figure 2 but they actually implemented Figure 3 (I only skimmed).

gojomo commented 7 years ago

@amueller - I'd describe it as, figure-2 is PV-DM (available in gensim as dm=1, with potential submodes controlled by dm_mean and dm_concat), and figure-3 is PV-DBOW (available in gensim as dm=0, with potential skip-gram training interleaved with dbow_words=1).
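Roughly, as gensim constructor calls (parameter names per the 1.x-era Doc2Vec API; documents is just a placeholder corpus):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [TaggedDocument(['some', 'words'], [0])]  # placeholder corpus

pv_dm_mean = Doc2Vec(documents, dm=1, dm_mean=1, min_count=1)      # figure-2, PV-DM, averaging context vectors
pv_dm_concat = Doc2Vec(documents, dm=1, dm_concat=1, min_count=1)  # figure-2, PV-DM, concatenating context vectors
pv_dbow = Doc2Vec(documents, dm=0, dbow_words=0, min_count=1)      # figure-3, pure PV-DBOW
pv_dbow_sg = Doc2Vec(documents, dm=0, dbow_words=1, min_count=1)   # figure-3 plus interleaved skip-gram training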

tmylk commented 7 years ago

Closing as resolved open-ended discussion.