pommedeterresautee / fastrtext

R wrapper for fastText
https://pommedeterresautee.github.io/fastrtext/

Incorrect values for sentences from get_word_distance and get_nn #30

Closed lcampanelli closed 5 years ago

lcampanelli commented 5 years ago

Hello, first of all, thank you for this package. I’m interested in cosine similarities between sentences, or between a word and a sentence. I believe the following code produces correct results:

pv <- get_sentence_representation(mod, c("she was", "and to") )
pv <- t(pv)

# using lsa package
lsa::cosine(pv)

# manual
v1 <- as.numeric(pv[,1])
v2 <- as.numeric(pv[,2])
sum(v1*v2) / ( sqrt(sum(v1*v1)) * sqrt(sum(v2*v2)) )

The manual way and lsa produce the same results. However, I obtain a different result if I use get_word_distance (which gives the same similarity score as get_nn): 1 - get_word_distance(mod, "she was", "and to")

Is it correct that get_word_distance does not work with sentences? If so, it would be very helpful to get an error message instead of some value.

Thank you, Luca

pommedeterresautee commented 5 years ago

Hi, word distance computes... the distance between 2 words. Tokenization is done by you, the way you want. Having one or more spaces in a token is not an issue in itself if it makes sense for your tokenization (maybe you have multi-word expressions, etc.). My point is that you may want to treat "she wants" as a single token. fastText will still be able to get a vector representation for it, because it can do so for any OOV (out-of-vocabulary) token (it is based on character n-grams). So it works as intended.
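To make the difference concrete, here is a minimal base R sketch of the two views described above. It does not call fastrtext. The embeddings in `word_vecs` and `token_vecs` are invented toy numbers, and `cosine_sim` and `sentence_vec` are hypothetical helpers. It assumes, as the thread suggests, that a sentence representation is built by splitting on spaces and averaging word vectors, while a word-level function sees the whole string as one token.

```r
# Toy 3-dimensional embeddings; the numbers are invented for illustration
# and do not come from any trained model.
word_vecs <- rbind(
  she = c(1, 0, 0),
  was = c(0, 1, 0),
  and = c(0, 0, 1),
  to  = c(1, 1, 0)
)

# Hypothetical vectors a model might produce when each whole string is
# treated as ONE out-of-vocabulary token (e.g. built from character n-grams).
token_vecs <- rbind(
  "she was" = c(0.2, 0.9, 0.1),
  "and to"  = c(0.8, 0.1, 0.3)
)

# Cosine similarity, same formula as the manual computation in the question.
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a * a)) * sqrt(sum(b * b)))
}

# Sentence view: split on spaces, then average the word vectors.
sentence_vec <- function(s) {
  words <- strsplit(s, " ", fixed = TRUE)[[1]]
  colMeans(word_vecs[words, , drop = FALSE])
}

# Similarity between the two averaged sentence vectors.
sim_sentence <- cosine_sim(sentence_vec("she was"), sentence_vec("and to"))

# Similarity between the two strings taken as single tokens.
sim_token <- cosine_sim(token_vecs["she was", ], token_vecs["and to", ])

print(c(sentence = sim_sentence, single_token = sim_token))
```

The two numbers differ because they compare different vectors, which is consistent with both computations in the question being valid for what they measure. For sentence-level similarity, computing the cosine on the output of get_sentence_representation, as in the original post, remains the appropriate route.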