piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Addition of entity embedding model #2006

Open shubham0704 opened 6 years ago

shubham0704 commented 6 years ago

It would be really nice for the entity embedding model to become part of gensim's family of algorithms. Here is a short write-up about it. Reference paper -> link

menshikh-iv commented 6 years ago

CC: @gojomo @piskvorky, thoughts? Is this a good addition to Gensim?

piskvorky commented 6 years ago

We already have embeddings in Gensim -- word2vec, fasttext, LSI etc. How is this different? Sorry, I didn't understand the write-up / motivation for this feature.

shubham0704 commented 6 years ago

The inspiration

Whenever we work with structured data -- for example, when we log users' click-stream data in some database for a task like product recommendation -- we typically use techniques like tf-idf, tree-based methods and other classical approaches.

We are unable to make efficient use of neural networks for such tasks and get competitive performance. Also, companies regularly work with databases of structured datasets. It would be awesome to have a neural network that could handle these jobs efficiently and outperform even tree-based methods on them. In the write-up I have listed the tasks where it is doing well so far.

Current Problems faced by Neural Networks

The continuous nature of neural networks limits their applicability to categorical variables. Therefore, naively applying neural networks on structured data with integer representation for category variables does not work well. [1]

After reading more, I have found them to be very similar to our word2vec, since they too learn dense representations of high-dimensional structured data. The authors say the approach is inspired by NLP models.
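To make the quoted point concrete, here is a tiny illustration (the feature and its values are made up): an integer encoding of a categorical column implies an ordering and distances that do not exist, while a one-hot encoding avoids that but treats every pair of categories as equally unrelated -- which is the gap the entity embedding layer is meant to fill.

```python
import numpy as np

# Hypothetical categorical column: apple -> 0, mango -> 1, banana -> 2
fruit_ids = np.array([0, 1, 2, 1, 0])

# Fed to a network as-is, these ids imply banana > mango > apple and that
# banana is "twice as far" from apple as mango is -- pure artefacts of the
# arbitrary labelling, which is why the naive approach works poorly.

# One-hot encoding removes the fake ordering, but every pair of categories
# ends up equally distant; a learned embedding instead replaces these rows
# with dense vectors whose distances reflect the data.
one_hot = np.eye(3)[fruit_ids]
print(one_hot)
```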

menshikh-iv commented 6 years ago

Probably "close" was a misclick

ping @piskvorky

piskvorky commented 6 years ago

@shubham0704 can you please give one concrete use-case?

I understand this is something to do with structured data and databases. But I still don't really understand the motivation -- what is the input of this technique, what is its output, when would I use this?

shubham0704 commented 6 years ago

@piskvorky my apologies for not being very specific about this.

By concrete example I take it you want at least one use case whose input, output and results are properly defined, accompanied by some code. In that case, the use-case we are considering here is a time-series prediction problem (forecasting sales); here is a notebook going through the problem in detail. Note the real stuff starts from input cell 90. Below I have written something which might provide some intuition. One of the input tables looks like this -

[screenshot of a sample input table] This contains both discrete and continuous features/columns.

  1. Often the discrete and continuous features are represented by integers and real numbers respectively. We then represent the discrete ones as one-hot encoded vectors.
  2. Consider a feature which can take discrete values, say [1, 2, 3]; we map each of them to a continuous vector. Example: Fruit can take [apple, mango, banana] -> get a unique id for each of them and convert them to one-hot encoded vectors. But we do not stop here: we pass the one-hot encoding to the entity embedding layer. This layer is nothing but a linear layer on top of the one-hot encoded output, so the initial embedding layer learns the intrinsic properties of each category, and the deeper layers learn the complex relations between these categories. So if you take the weights of each layer and inspect them, you might find some hidden relations between the data points; i.e. we not only want to inspect relations between categories, each dimension of an embedding layer may also reveal some new relation (just like those matrices in SVD that embed hidden concepts). The architecture is given below (see also the code sketch after this list) - [architecture diagram: one-hot inputs feeding per-feature entity embedding layers, followed by dense layers]. The embeddings of categorical variables from the initial layers can be used to initialize other models, which can then be used for prediction tasks.
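For intuition, here is a minimal sketch of this architecture in Keras. The feature names, cardinalities and layer sizes are invented for illustration; they are not taken from the paper or the notebook.

```python
from tensorflow.keras.layers import Input, Embedding, Flatten, Concatenate, Dense
from tensorflow.keras.models import Model

# One hypothetical categorical column ("store_id", 1000 distinct values) and
# one continuous column ("distance"), as in a sales-forecasting table.
store_in = Input(shape=(1,), name="store_id")
dist_in = Input(shape=(1,), name="distance")

# Entity embedding layer: a learned lookup table mapping each category id to a
# small dense vector (equivalent to a linear layer applied to the one-hot code).
store_emb = Flatten()(Embedding(input_dim=1000, output_dim=10,
                                name="store_embedding")(store_in))

# Deeper layers learn interactions between the embedded category and the
# continuous feature.
x = Concatenate()([store_emb, dist_in])
x = Dense(128, activation="relu")(x)
x = Dense(64, activation="relu")(x)
out = Dense(1, name="sales")(x)  # regression target, e.g. daily sales

model = Model(inputs=[store_in, dist_in], outputs=out)
model.compile(optimizer="adam", loss="mae")

# After training, model.get_layer("store_embedding").get_weights()[0] is a
# (1000, 10) matrix: one row per store, usable to initialize other models
# or to inspect relations between stores.
```

The learned embedding matrix is the part that gets inspected or reused, analogous to the word vectors from word2vec.
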
gojomo commented 6 years ago

To an extent, in word2vec words are already 'categorical' - the word X essentially makes the contexts it appears in have the pseudocategory 'word_X_appears_here'. From a quick skim of the paper (which might have left me with misconceptions), it seems this technique also mixes in continuous values as (extra) input dimensions during training, and potentially uses continuous target values (rather than predicting discrete target words) to evaluate outputs & back-propagate corrections. In that 2nd way, it's a bit like how Fasttext, in classification mode, is optimizing the word-vectors to be better at predicting classes rather than just neighbor-words. Those 2 continuous things stick out to me as what's interesting here, moreso than the title "Entity Embeddings of Categorical Variables" suggests – which sounds like no more than treating-categories-like-words.
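For contrast, the plain "treating-categories-like-words" reading would amount to something like the toy gensim sketch below (made-up rows, not how the paper trains its model): every categorical value becomes a token, rows become pseudo-sentences, and Word2Vec learns co-occurrence vectors -- with no place for continuous inputs or continuous targets, which is exactly what this technique adds.

```python
from gensim.models import Word2Vec

# Made-up rows from a structured table; each categorical value becomes a
# "column=value" token and each row becomes a pseudo-sentence.
rows = [
    ["store=12", "day=Mon", "promo=yes"],
    ["store=12", "day=Tue", "promo=no"],
    ["store=7", "day=Mon", "promo=yes"],
]

# Plain word2vec over these pseudo-sentences: one vector per category value,
# driven purely by which values co-occur in the same row.
# (On gensim versions from that era the parameter is `size` rather than `vector_size`.)
model = Word2Vec(rows, vector_size=8, window=5, min_count=1, sg=1)
print(model.wv["store=12"])
```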

It'd be good to see more cases/datasets where it is believed to perform well to know its overall value. It also definitely veers into the sort of supervised/semi-supervised or numeric-rather-than-natural-language realms that are sometimes considered out-of-scope for gensim. (But in my opinion, if using available class-data or continuous-numerical-data helps improve text-vectors for particular tasks, gensim should welcome support for those to remain competitive.)

I'd view it as another 'near-cousin' of word2vec/doc2vec/fasttext that might work as a motivating example to drive the creation of some common highly-configurable core.

gojomo commented 6 years ago

I just noticed that the Facebook 'StarSpace' generalized-embedding tool has the option of continuous real-valued inputs & output targets, via its useWeights option. So the same features/flexibility in the training-core might enable matching both this model & StarSpace.

shubham0704 commented 6 years ago

What do you suggest @gojomo? Do you want to have the Entity Embeddings model in gensim, or do you want to augment/modify some model in gensim which could provide the same behaviour as StarSpace?

gojomo commented 6 years ago

A standalone implementation that repeats a lot of Word2Vec/Doc2Vec code would only be justified if there were good vivid examples of where this technique provides unique benefits (and thus significant demand for it). Are there such examples?

Adding what's special from this technique (and similarly seen in StarSpace) into some shared base would be preferable, but could be a bit harder to do. If it could be done on the existing code without too much overhead/complexity on existing interfaces, that'd be good. (Maybe a tight patch to existing code is possible, unsure without seeing it.) Or it could be a requirement included in some bigger re-build of the (anything)-2-vec classes.

shubham0704 commented 6 years ago

It is trending on Kaggle for solving prediction problems. A list of problems tackled with this model is mentioned here. There is only one example I have found so far with a clear explanation, which I mentioned above - link. A reference implementation has been given by Jeremy Howard for his fast.ai course here.

From gensim's perspective, I wanted to explore what would happen if we combined the feature spaces of structured and unstructured data. Intuitively, I thought mixing them in a shared space would produce embeddings that better capture the domain structure of a company/organisation. This intuition may be a bit far-fetched for now, because transfer learning has not found much success in NLP. I will check its feasibility further and let you all know. Thanks for all your comments. If you have any other thoughts on this, please let me know.