Actually, I haven't found any mention that the F vector should be normalized, so there is probably no such restriction here and we can just compute the dot product + sign.
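Just to make it concrete, here's a minimal NumPy sketch of that sign-of-dot-product idea (my own illustration, not the paper's exact procedure; dimensions are arbitrary for the example):

```python
import numpy as np

def random_binary_projection(features, n_bits, seed=0):
    """Hash a feature vector to n_bits by taking the sign of dot products
    with random hyperplanes (random-hyperplane-LSH style)."""
    rng = np.random.RandomState(seed)
    # One random projection vector P per output bit, same dim as F.
    P = rng.randn(n_bits, features.shape[0])
    # The sign of F . P_i gives one bit; normalizing F wouldn't change the sign,
    # which is why normalization may not matter here.
    return (P @ features > 0).astype(np.uint8)

F = np.random.rand(1120)                    # some feature vector
bits = random_binary_projection(F, n_bits=14)
print(bits)                                 # e.g. [1 0 1 ...]
```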
BTW, the paper that inspired the authors contains many more details regarding the projection functions: https://arxiv.org/abs/1708.00630
Looks like this RandomBinaryProjections impl might be used: https://github.com/pixelogik/NearPy/blob/master/nearpy/hashes/randombinaryprojections.py
@thinline72 Thanks for the link! :tada: Will read soon!
Yet, I'm hesitant about this:
it seems like they used the whole text on character-lvl, not word-lvl
They mention skip-gram. Could it be that they did it like I do, but with the regression goal formulated like word2vec's skip-gram (training with negative sampling), with some "nearest words vs. random/far words" objective? Just throwing ideas here. I wonder how they use the "trainer network" in the paper you just linked (I haven't read it yet). Because they can't sample the words back to their original form, they could be training directly on a softmax over cosine similarity (as an attention mechanism), like what's done in Matching Networks.
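For reference, the Matching-Networks-style attention I have in mind is just a softmax over cosine similarities; a rough NumPy sketch (names are mine, not from either paper):

```python
import numpy as np

def cosine_softmax_attention(query, supports):
    """Attention weights as a softmax over cosine similarities,
    roughly like the attention kernel used in Matching Networks."""
    q = query / np.linalg.norm(query)
    S = supports / np.linalg.norm(supports, axis=1, keepdims=True)
    sims = S @ q                      # cosine similarities in [-1, 1]
    e = np.exp(sims - sims.max())     # numerically stable softmax
    return e / e.sum()

weights = cosine_softmax_attention(np.random.rand(64), np.random.rand(10, 64))
print(weights.sum())                  # ~1.0
```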
For now I'm somewhat satisfied with the code I've written, but I'd still like to fully understand the original paper. My goal is to use this word-level "word projector" within a self-attention mechanism (like the Transformer in AIAYN), and then to perform sentence-level skip-gram training with a Matching-Network-like regression head. When I'm done with this, I'll upload the big neural net to this repo (and perhaps rename the repo then); for now the repo is still private.
Hi @guillaume-chevalier , thanks for the response and for the links. I really like the idea in the Matching Networks one, I'll definitely play with it.
it seems like they used the whole text on character-lvl, not word-lvl. They mention skip-gram.
Sorry, looks like I explained it incorrectly. I actually meant text-lvl, like in classical approaches. So there would not be a sequence of tokens/words/vectors, but just one sparse vector for the whole text example.
I also think that for the skip-grams they used classic old ones, like this implementation in NLTK.
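For example, with NLTK's implementation (just to illustrate what I mean by classic skip-grams):

```python
from nltk.util import skipgrams

tokens = "the quick brown fox jumps".split()
# 2-grams allowing up to 2 skipped words in between (n=2, k=2)
print(list(skipgrams(tokens, 2, 2)))
# [('the', 'quick'), ('the', 'brown'), ('the', 'fox'), ('quick', 'brown'), ...]
```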
So it looks like their preprocessing and architecture might be reproduced with roughly the following steps: build skip-grams/n-grams over the whole text, turn each example into a single sparse feature vector, compress it with random binary projections, and feed the result to a small fully-connected network.
So the workflow looks pretty similar to something well-known like TfidfVectorizer + SVD + NN, just adjusted to the restrictions of mobile devices.
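Roughly like this scikit-learn pipeline, just as an analogy (in the SGNN paper the middle step would be random binary projections, not SVD, and the data/labels here are made up):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neural_network import MLPClassifier

texts = ["set an alarm for 7 am", "wake me up at seven",
         "what's the weather today", "will it rain tomorrow"]
labels = ["alarm", "alarm", "weather", "weather"]

# TF-IDF over char n-grams of the whole text -> low-dim projection -> small NN.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    TruncatedSVD(n_components=2),
    MLPClassifier(hidden_layer_sizes=(256,), max_iter=1000),
)
model.fit(texts, labels)
print(model.predict(["please set an alarm"]))
```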
Very interesting. I've finally read the previous supporting paper, thanks for the shoutout. Here are my thoughts after reading it. To sum up, I now think that the projections are really at the word level instead of at the sentence level, for two reasons that I detail below.
Here is my full review:
On 80*14 v.s. 1*1120 projections:
On the hidden layer size of 256:
On the benchmark against a nested RNN (see section "4.3 Semantic Intent Classification") in the previous supporting paper:
On teacher-student model training:
Hi @guillaume-chevalier ,
Sorry, I haven't read the supporting paper in depth. I only checked how Randomized Binary Projections are implemented. So yeah, they might have worked at the word or character level, and it totally makes sense to do it. My point was that in the Self-Governing Neural Networks for On-Device Short Text Classification paper it's at the text level. It's clear from the description and architecture (Figure 1).
On 80*14 v.s. 1*1120 projections:
With Randomized Binary Projections, creating 80 random matrices of size 14 x inp_dim (and concatenating the results after applying them) or just one random matrix of size 1120 x inp_dim are the same thing. After all, it's just about having 80 * 14 * inp_dim or 1120 * inp_dim random numbers. But the latter would require much more memory to compute the hashing (as it'd involve a multiplication by a much bigger matrix) rather than 80 computations with smaller matrices. So 80*14 is more appropriate for devices with limited RAM. That's why I think they used 80*14.
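To illustrate the point (pure NumPy, shapes only; an assumption about how the slicing would be done, not the paper's code):

```python
import numpy as np

inp_dim, T, d = 10_000, 80, 14        # feature dim, nb of hashers, bits per hasher
x = np.random.rand(inp_dim)

# Option A: 80 small projection matrices of shape (14, inp_dim), applied one by one.
small = [np.random.randn(d, inp_dim) for _ in range(T)]
bits_a = np.concatenate([(m @ x > 0) for m in small])

# Option B: one big projection matrix of shape (1120, inp_dim).
big = np.random.randn(T * d, inp_dim)
bits_b = big @ x > 0

# Same total number of random coefficients either way (80*14*inp_dim == 1120*inp_dim),
# but option A never needs the full 1120 x inp_dim matrix in memory at once
# if the small matrices are generated (or re-generated from seeds) on the fly.
assert bits_a.shape == bits_b.shape == (T * d,)
```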
Reading your reply and the SGNN paper again, you seem right about the 80*14 thing. The next thing would be to understand exactly how they did the skip-gram, but from what I've read I don't have enough information yet. For now, at least, I'm fine with using n-grams; I somehow like it that way, although I won't really compare n-grams vs. skip-grams, at least in the short term. I also still have doubts about my usage of random projections; I hope it works similarly enough to the one you linked despite the fact that I don't obtain booleans. I haven't found the time to do detailed readings on LSH, although I liked this video (and the next one, titled #9 in the playlist), and the fact that such sparse random hashing preserves the cosine similarity in a lower-dimensional space.
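For the record, that property (sign-random-projection LSH approximately preserving cosine similarity) is easy to sanity-check; a rough sketch of my own, not from either paper:

```python
import numpy as np

rng = np.random.RandomState(42)
a, b = rng.randn(500), rng.randn(500)
a += 0.8 * b                                   # make the two vectors somewhat similar

true_cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hash both vectors with the same random hyperplanes and compare the sign bits:
planes = rng.randn(4096, 500)
bits_a, bits_b = planes @ a > 0, planes @ b > 0
hamming_fraction = np.mean(bits_a != bits_b)
estimated_cos = np.cos(np.pi * hamming_fraction)  # angle estimate from disagreeing bits

print(round(true_cos, 3), round(estimated_cos, 3))  # should be close
```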
Overall, thank you for the discussion - I'm glad you linked me to the previous supporting paper, ProjectionNet, and that we exchanged on all this!
Just got an idea.
Instead of using a Matching Networks loss, I'll use a cosine proximity loss, which is simpler.
I'd compute the similarity of the B sentences to each other so as to have a BxB matrix of similarities between -1 and 1. The target would be a block diagonal matrix in which every block has inner values of 1, and elements nearby but not in a block take the value 0. I'd have blocks because this would be skip-gram, but I'd compare multiple windows (blocks) of sentences to other blocks instead of the regular skip-gram's block-vs-nothing. So I'd have a blocks-vs-blocks training setting. What do you think @thinline72 ?
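Concretely, I mean something like this (rough NumPy sketch, names and window size are mine):

```python
import numpy as np

def block_diagonal_target(batch_size, window):
    """Target similarity matrix: 1 inside each block of `window` consecutive
    sentences (same context window), 0 elsewhere."""
    target = np.zeros((batch_size, batch_size))
    for start in range(0, batch_size, window):
        target[start:start + window, start:start + window] = 1.0
    return target

def cosine_similarity_matrix(embeddings):
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return normed @ normed.T              # B x B matrix of values in [-1, 1]

B, window = 12, 3
sims = cosine_similarity_matrix(np.random.randn(B, 64))
target = block_diagonal_target(B, window)
print(sims.shape, target.shape)           # (12, 12) (12, 12)
```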
One thing that bothers me is that the cosine similarities aren't passed through a softmax or something. For example, I could take a binary cross-entropy loss instead of a cosine proximity loss, but I don't know how that would compare. I feel that a cosine proximity loss wouldn't perform well. I then read a thread which validates my thinking. Would you have a loss to suggest?
Hi @guillaume-chevalier ,
I think this paper, Von Mises-Fisher Loss for Training Sequence to Sequence Models with Continuous Outputs, might be useful for you, although I'm not 100% sure 🙂
Hi again @thinline72, I tested with only 1 hasher and things were quite a bit faster. See: https://github.com/guillaume-chevalier/SGNN-Self-Governing-Neural-Networks-Projection-Layer/issues/2
That's an interesting paper you just linked. However, for the time being, I finally settled on a binary cross-entropy loss (bxent) on a block diagonal matrix of similarities. Because the similarity ranges from -1 to 1, I remap it to the range 0 to 1 before sending it to the bxent loss (+=1, then /=2).
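Roughly this, in a PyTorch sketch (assuming the B x B similarity and block-diagonal target matrices from above; names are mine):

```python
import torch

def similarity_bxent_loss(sims, target):
    """Binary cross-entropy between a cosine-similarity matrix (values in [-1, 1])
    and a 0/1 block-diagonal target, after remapping similarities to [0, 1]."""
    probs = (sims + 1.0) / 2.0                 # the "+=1, then /=2" remapping
    probs = probs.clamp(1e-7, 1 - 1e-7)        # avoid log(0)
    return torch.nn.functional.binary_cross_entropy(probs, target)

sims = torch.rand(12, 12) * 2 - 1              # fake cosine similarities
target = (torch.rand(12, 12) > 0.5).float()    # fake 0/1 target
print(similarity_bxent_loss(sims, target))
```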
The whole code for the SGNN+Transformer+SimilarityBXENT will soon become public here: https://github.com/guillaume-chevalier/NLP-TP3
This is amazing work, well done! And thanks for mentioning me as well.
So I think this issue might be closed. Feel free to reopen or contact me via email if you want to discuss anything else 🙂
@thinline72 Hey, I just found this out: they also explore the diagonal-matrix idea, here relating every word of a sentence to every other word, and they also explore learning at different levels: https://arxiv.org/pdf/1808.08949.pdf This paper is cited in the abstract of BERT, which is a special thing to do in an abstract, I think.
Great! Thank you for sharing @guillaume-chevalier
Thanks for your discussion guys, I gave some comments as a response to the original question here: https://stackoverflow.com/questions/53876974/how-is-hashing-implemented-in-sgnn-self-governing-neural-networks/65546274#65546274
Hi Guillaume,
I'm also pretty excited about this paper and the suggested approach. I've gone through your implementation as well as your questions on Stackoverflow: https://stackoverflow.com/questions/53876974/how-is-hashing-implemented-in-sgnn-self-governing-neural-networks
Sorry, I don't have enough points to comment on your Stackoverflow question and I don't have the right answers, I just want to discuss it.
For me it seems like they used the whole text on character-lvl, not word-lvl. So for each example it is one vector of features that is then passed to the projections. Although, your implementation is quite interesting as well because it still allows using RNNs/CNNs and keeps the skip-gram/n-gram ordering.
If I understand the paper correctly, the binary output comes from the sign of the dot product of the feature vector (F) and a generated projection vector (P) for the features in F. Assuming that both vectors are normalized (unit length, and they point out that LSH produces unit vectors), the dot product would be in the range [-1, 1] (cosine similarity) and its sign would give us the binary output.
What do you think about it? Does my understanding make sense?
I'm still not quite sure how those P vectors are obtained via LSH though.