minimaxir / char-embeddings

A repository containing 300D character embeddings derived from the GloVe 840B/300D dataset, plus code that uses these embeddings to train a Keras deep learning model to generate Magic: The Gathering cards.
MIT License

Question on deriving char embeddings from word embeddings. #2

Open queirozfcom opened 7 years ago

queirozfcom commented 7 years ago

Hi. I was looking at create_embeddings.py to see how you derived char embeddings directly from word embeddings.

It looks like you equate a char embedding with the average of the vectors of all words containing that char, counting a word's vector multiple times if the char occurs more than once in that word. Is that correct?
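For concreteness, here is a minimal sketch of my reading of it (with a hypothetical path to the GloVe text file; not the repository's exact script):

```python
import numpy as np

# Hypothetical path to the GloVe 840B/300D text file (one word + vector per line).
GLOVE_PATH = 'glove.840B.300d.txt'

vector_sums = {}    # char -> running sum of word vectors containing it
vector_counts = {}  # char -> how many occurrences were counted

with open(GLOVE_PATH, encoding='utf-8') as f:
    for line in f:
        parts = line.rstrip().split(' ')
        word, vector = parts[0], np.asarray(parts[1:], dtype='float32')
        # Every occurrence of a char adds the whole word vector, so a char that
        # appears twice in a word counts that word's vector twice.
        for char in word:
            if char in vector_sums:
                vector_sums[char] += vector
                vector_counts[char] += 1
            else:
                vector_sums[char] = vector.copy()
                vector_counts[char] = 1

# The char embedding is the mean of all (repeated) word vectors for that char.
char_embeddings = {c: vector_sums[c] / vector_counts[c] for c in vector_sums}
```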

Did you decide to do this because it gave good results, or was there some other reason?

Thanks!


minimaxir commented 7 years ago

Yes, that implementation description is correct.

The "why" is detailed in the blog post.

brandonrobertz commented 6 years ago

The blog post only very briefly goes into why you average the word vectors to get character vectors. Are you aware of any rigorous comparison between these average-derived char vectors and true learned char vectors?

My intuition is that these char vectors will be poor approximations, since they don't distinguish between characters that appear next to the target char and characters that appear far from it; every position is treated equally.

fermat97 commented 5 years ago

I have the same question. @queirozfcom @brandonrobertz, did you also try this method, and did you get good results?

brandonrobertz commented 5 years ago

@fermat97 My opinion now, after trying this method and also building my own character vectors from scratch, is that this method is a very poor approximation. You're throwing away a lot of distance-related character information (which is important for character embeddings). It's quite easy to train character embeddings even on giant datasets, so I suggest just doing that.
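For anyone wanting to go that route, a rough sketch (hypothetical corpus path, untuned hyperparameters) is to train a Keras Embedding layer on next-character prediction and then read the character vectors out of its weights:

```python
import numpy as np
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.models import Sequential

# Hypothetical corpus; in practice any large plain-text file works.
text = open('corpus.txt', encoding='utf-8').read()
chars = sorted(set(text))
char_to_idx = {c: i for i, c in enumerate(chars)}

# Build (sequence, next char) training pairs of integer char indices.
SEQ_LEN = 40
X = np.array([[char_to_idx[c] for c in text[i:i + SEQ_LEN]]
              for i in range(len(text) - SEQ_LEN)])
y = np.array([char_to_idx[text[i + SEQ_LEN]] for i in range(len(text) - SEQ_LEN)])

model = Sequential([
    Embedding(len(chars), 300),  # learned 300D character vectors
    LSTM(128),
    Dense(len(chars), activation='softmax'),
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
model.fit(X, y, batch_size=128, epochs=1)

# The trained character embeddings are simply the Embedding layer's weight matrix.
char_vectors = model.layers[0].get_weights()[0]  # shape: (n_chars, 300)
```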

fermat97 commented 5 years ago

@brandonrobertz thanks a lot, I also tried it and the results were poor.