cherry-1024 opened 5 years ago
Hi,
the paper is in progress. Is there anything in particular I might help you with in the meantime?
Best regards, Samuel
On 14.11.2018 14.16, cherry-1024 wrote:
looking forward to your answer
thank you
I want to know the structure of the network, so that I can understand the code more easily.
If you could give some tips, I would appreciate it.
Best wishes. Cherry
The network structure is inspired by word2vec skip-gram, but instead of modeling co-occurrences between center and context words, it models co-occurrences between a word and its document ID. To avoid an expensive softmax over an output layer the size of the vocabulary (or the number of documents), the model is implemented as follows. The network takes as input a word ID and a document ID, and each is fed through a separate embedding layer. The embedding layers are L1 activity-regularized in order to obtain sparse representations: each dimension in the embedding represents a topic, and I want a sparse assignment of topics for each document. The document embeddings are more heavily regularized than the word embeddings, since sparsity matters most for topic-document assignments, but document and word embeddings are supposed to be comparable.
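For concreteness, here is a minimal Keras sketch of that architecture. It is only an illustration of the description above, not the repo's exact code; the layer names, sizes and regularization strengths are placeholders.

```python
from keras.layers import Input, Embedding, Flatten, Dot, Activation
from keras.models import Model
from keras.regularizers import l1

# Illustrative sizes; the real values depend on the corpus.
n_docs, vocab_size, n_topics = 10_000, 50_000, 100

doc_input = Input(shape=(1,), name="doc_id")
word_input = Input(shape=(1,), name="word_id")

# Each embedding dimension corresponds to one topic.
# Document embeddings get stronger L1 activity regularization,
# since sparsity matters most for topic-document assignments.
doc_emb = Embedding(n_docs, n_topics,
                    activity_regularizer=l1(1e-6), name="doc_topics")(doc_input)
word_emb = Embedding(vocab_size, n_topics,
                     activity_regularizer=l1(1e-8), name="word_topics")(word_input)

# Compare the two embeddings by dot product and squash to (0, 1).
dot = Dot(axes=1)([Flatten()(doc_emb), Flatten()(word_emb)])
output = Activation("sigmoid")(dot)

model = Model(inputs=[doc_input, word_input], outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy")
```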
The network is trained by negative sampling, i.e., for any document both actually co-occurring words and random (presumably non-co-occurring) words are fed to the network. The two embeddings are compared by dot product, and the result is passed through a sigmoid activation function (values from 0 to 1). The training output label is 1 for co-occurring words and 0 for negative samples. This pushes document vectors towards the vectors of the words in the document.
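A rough sketch of how such training triples could be built up front (the sampling ratio and procedure here are illustrative, not necessarily what the repo does):

```python
import numpy as np

def make_training_pairs(doc_words, vocab_size, n_negative=2, seed=0):
    """Build (doc_id, word_id, label) triples: label 1 for words occurring in a
    document, label 0 for randomly drawn (assumed non-co-occurring) words."""
    rng = np.random.default_rng(seed)
    doc_ids, word_ids, labels = [], [], []
    for doc_id, words in enumerate(doc_words):
        for word_id in words:
            # Positive example: the word actually occurs in this document.
            doc_ids.append(doc_id); word_ids.append(word_id); labels.append(1)
            # Negative samples: random words, labeled 0.
            for neg in rng.integers(0, vocab_size, size=n_negative):
                doc_ids.append(doc_id); word_ids.append(int(neg)); labels.append(0)
    return np.array(doc_ids), np.array(word_ids), np.array(labels)
```

These arrays can then be passed to something like `model.fit([doc_ids, word_ids], labels, ...)` with the binary cross-entropy loss above.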
-Samuel
This idea is wonderful. Thanks so much for sharing.
Best wishes. Cherry
Hi - this is really great. But how do you handle unseen document IDs? And is the negative sampling part of the pre-processing step? It would be wonderful to apply this model to millions of documents.
Hi!
Topic inference for unseen documents is not implemented yet, please see issue #2 for a workaround.
I have been working on moving the negative sampling to a data generator and doing the sampling online, which should obviously be better. However, the results were quite poor, so I will have to look into it.
I did run the current code on 1M documents and 200 topics, which fit in the memory of a Titan Xp. The embeddings should be what consumes most of the memory.
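As a rough back-of-envelope check (assuming float32 embeddings; the vocabulary size here is just a guess):

```python
# Approximate embedding memory for the setup mentioned above.
n_docs, n_topics, vocab_size = 1_000_000, 200, 50_000
doc_emb_gb = n_docs * n_topics * 4 / 1e9      # ~0.8 GB for document embeddings
word_emb_gb = vocab_size * n_topics * 4 / 1e9  # ~0.04 GB for word embeddings
print(doc_emb_gb, word_emb_gb)
```

So the document embeddings dominate, but at this scale they still fit comfortably in a 12 GB GPU.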
Hi @sronnqvist, I am also curious to hear news about your paper, both to read it and to cite your work properly. Is there any news about this?
looking forward to your answer
thank you