sronnqvist / doc2topic

Neural topic modeling

Hi, is there a paper that supports your code? #1

Open cherry-1024 opened 5 years ago

cherry-1024 commented 5 years ago

looking forward to your answer

thank you

sronnqvist commented 5 years ago

Hi,

the paper is in progress. Is there anything in particular I can help you with in the meantime?

Best regards, Samuel


cherry-1024 commented 5 years ago

I want to understand the structure of the network, so that I can follow the code more easily.

If you could give some tips, I would appreciate it.

Best wishes. Cherry

sronnqvist commented 5 years ago

The network structure is inspired by word2vec skip-gram, except that instead of modeling co-occurrences between center and context words, it models co-occurrences between a word and its document ID. To avoid a heavy softmax calculation over an output layer the size of the vocabulary (or the number of documents), the model is implemented as follows. The network takes a word ID and a document ID as input, and each is fed through a separate embedding layer. The embedding layers are L1 activity-regularized in order to obtain sparse representations: each dimension in the embedding represents a topic, and I want a sparse assignment of topics for each document. The document embeddings are regularized more heavily than the word embeddings, because sparsity matters most for topic-document assignments, while document and word embeddings are still supposed to be comparable.
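In code, the idea is roughly the following minimal Keras sketch. The layer names, sizes and regularization strengths here are only illustrative, not the values used in the repository, and the dot-product + sigmoid output described below is included so the model compiles end to end:

```python
# Minimal Keras sketch of the described architecture (illustrative only;
# names, sizes and regularization strengths are assumptions, not the repo's code).
from keras.layers import Input, Embedding, Dot, Flatten, Activation
from keras.models import Model
from keras.regularizers import l1

n_docs, vocab_size, n_topics = 10_000, 50_000, 100  # hypothetical sizes

doc_input = Input(shape=(1,), name="doc_id")
word_input = Input(shape=(1,), name="word_id")

# Separate embedding layers for document IDs and word IDs.
# L1 activity regularization encourages sparse (topic-like) representations;
# the document side is regularized more heavily than the word side.
doc_emb = Embedding(n_docs, n_topics,
                    activity_regularizer=l1(1e-6), name="doc_topics")(doc_input)
word_emb = Embedding(vocab_size, n_topics,
                     activity_regularizer=l1(1e-8), name="word_topics")(word_input)

# Compare the two embeddings by dot product and squash to (0, 1).
similarity = Dot(axes=2)([doc_emb, word_emb])
output = Activation("sigmoid")(Flatten()(similarity))

model = Model(inputs=[doc_input, word_input], outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy")
```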

The network is trained by negative sampling, i.e., for each document both actually co-occurring words and random (presumed non-co-occurring) words are fed to the network. The two embeddings are compared by dot product, and the result is pushed through a sigmoid activation function (values from 0 to 1). The training label is 1 for co-occurring words and 0 for negative samples. This pushes document vectors towards the vectors of the words in the document.
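One simple way to build those positive and negative (doc_id, word_id) pairs is sketched below; the sampling scheme and function name are hypothetical, not taken from the repository:

```python
# Build training pairs for the (doc_id, word_id) -> {0, 1} task.
# Simplified illustration; the repository may sample differently.
import numpy as np

def make_training_pairs(docs, vocab_size, n_negative=2, seed=0):
    """docs: list of documents, each a list of word IDs.
    Returns arrays (doc_ids, word_ids, labels)."""
    rng = np.random.default_rng(seed)
    doc_ids, word_ids, labels = [], [], []
    for doc_id, words in enumerate(docs):
        for word in words:
            # Positive example: the word actually occurs in the document.
            doc_ids.append(doc_id); word_ids.append(word); labels.append(1)
            # Negative examples: random words, presumed not to co-occur.
            for _ in range(n_negative):
                doc_ids.append(doc_id)
                word_ids.append(rng.integers(vocab_size))
                labels.append(0)
    return np.array(doc_ids), np.array(word_ids), np.array(labels)

# doc_ids, word_ids, labels = make_training_pairs(docs, vocab_size)
# model.fit([doc_ids, word_ids], labels, batch_size=1024, epochs=5)
```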

-Samuel


cherry-1024 commented 5 years ago

This idea is so wonderful. Thanks so much for sharing.

Best wishes. Cherry

systats commented 5 years ago

Hi - this is really great. But how do you handle unseen document IDs? And is the negative sampling part of the pre-processing step? It would be wonderful to apply this model to millions of documents.

sronnqvist commented 5 years ago

Hi!

Topic inference for unseen documents is not implemented yet; please see issue #2 for a workaround.

I have been working on moving the negative sampling into a data generator so that the sampling is done online, which should obviously be better. However, the results were quite miserable, so I will have to look it over.
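Roughly, such an online sampling generator might look like the sketch below; this is just an illustration of the idea, not the code in the repo, and the function name and batch parameters are made up:

```python
# Rough sketch of online negative sampling with a Python generator
# (an illustration of the approach mentioned above, not the repo's code).
import numpy as np

def training_batches(docs, vocab_size, batch_size=512, n_negative=2, seed=0):
    """Yield ([doc_ids, word_ids], labels) batches indefinitely,
    drawing negative samples on the fly instead of precomputing them."""
    rng = np.random.default_rng(seed)
    pairs = [(d, w) for d, words in enumerate(docs) for w in words]
    while True:
        chosen = rng.choice(len(pairs), size=batch_size)
        doc_ids, word_ids, labels = [], [], []
        for i in chosen:
            d, w = pairs[i]
            doc_ids.append(d); word_ids.append(w); labels.append(1)
            for _ in range(n_negative):
                doc_ids.append(d)
                word_ids.append(rng.integers(vocab_size))
                labels.append(0)
        yield [np.array(doc_ids), np.array(word_ids)], np.array(labels)

# model.fit(training_batches(docs, vocab_size), steps_per_epoch=1000, epochs=5)
```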

I did run the current code on 1M documents with 200 topics, which fit in the memory of a Titan Xp. The embeddings should be what consumes most of the memory.

pasqLisena commented 4 years ago

Hi @sronnqvist , I am also curious to hear news about your paper, both to read it and to cite your work properly. Is there any news on this?