Closed: rssdev10 closed this issue 3 years ago
Hi, I see that you use an old API. Please look at the documentation (only the dev docs are current) to see the new API. I believe it is simpler now, and you can use distances from Distances.jl directly.
Please see the following example; it computes a knn graph for a word embedding.
https://github.com/sadit/SimilaritySearchExamples/tree/main/word-embeddings-knn-graph
It uses multithreading, so it is a bit more intricate than a single-threaded version, but it should help. In the example, vectors are normalized before being inserted and queried, but you can also use a non-normalized distance by specifying CosineDistance() instead of NormalizedCosineDistance().
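The relationship between the two distances can be sketched in plain Julia (the function names here are illustrative, not from SimilaritySearch): once the inputs are unit vectors, the full cosine distance reduces to `1 - dot(u, v)`, which is the shortcut a normalized-cosine distance can take.

```julia
using LinearAlgebra

# Full cosine distance: valid for vectors of any nonzero length
cosine_dist(u, v) = 1 - dot(u, v) / (norm(u) * norm(v))

# If vectors are normalized up front, the denominator is 1 and the
# distance reduces to 1 - dot(u, v)
normalized_cosine_dist(u, v) = 1 - dot(u, v)

u = [1.0, 2.0, 3.0]
v = [3.0, 2.0, 1.0]

# Both formulas agree once the inputs are normalized to unit length
un, vn = normalize(u), normalize(v)
cosine_dist(u, v) ≈ normalized_cosine_dist(un, vn)  # true
```

This is why normalizing up front pays off in a knn search: the per-comparison cost drops to a single dot product.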
I can see that you use the Embeddings package while I read everything by hand, so the overall process should be similar, with some small changes; please check the example and let me know if you have any trouble.
Ok, thanks. This way it really works as expected:
```julia
import Embeddings: load_embeddings, FastText_Text
import WordTokenizers: tokenize
import SimilaritySearch: search, SearchGraph, CosineDistance, KnnResult, Item

ENV["DATADEPS_ALWAYS_ACCEPT"] = true

const embtable = load_embeddings(FastText_Text)
const embsize = size(embtable.embeddings, 1)
const get_word_index = Dict(word => ii for (ii, word) in enumerate(embtable.vocab))

# Look up a word's embedding; unknown words map to the zero vector
function get_embedding(word)
    ind = get(get_word_index, word, -1)
    return ind == -1 ? zeros(embsize) : embtable.embeddings[:, ind]
end

# Embed a string as the sum of its token embeddings
function vectorize(str::String)
    tokens = tokenize(str)
    isempty(tokens) ? zeros(embsize) : mapreduce(get_embedding, +, tokens)
end

strings = [
    "Java", "Java programmer", "Julia language", "Julia programmer",
    "cpp", "programming", "Ruby", "Ada",
    "carrot", "beet", "cucumber", "Cucumber"
];

db = strings .|> vectorize
point = db[begin]

@time index = SearchGraph(CosineDistance(), db; parallel=true, firstblock=50000, block=5000)
@time res = search(index, point, 10)

@info map(k -> (strings[k.id], k.dist), res) |>
     l -> filter(((s, w),) -> w < 0.7, l)
```
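The `vectorize` function above embeds a phrase as the sum of its word vectors. A toy illustration of that pattern, with a hypothetical three-dimensional embedding table standing in for FastText:

```julia
using LinearAlgebra

# Hypothetical tiny embedding table (stand-in for FastText)
toy_emb = Dict(
    "julia"      => [1.0, 0.0, 0.0],
    "programmer" => [0.0, 1.0, 0.0],
    "carrot"     => [0.0, 0.0, 1.0],
)

# Unknown words map to the zero vector, as in get_embedding above
toy_get(word) = get(toy_emb, word, zeros(3))

# Sum of word vectors, as in vectorize above
toy_vectorize(str) = mapreduce(toy_get, +, split(lowercase(str)))

a = toy_vectorize("Julia programmer")
b = toy_vectorize("Julia")

# Cosine distance between the two phrase vectors
cosdist(u, v) = 1 - dot(u, v) / (norm(u) * norm(v))
cosdist(a, b)  # ≈ 0.2929, i.e. 1 - 1/√2
```

Note that a phrase made entirely of unknown words sums to the zero vector, for which the cosine distance is undefined; the real pipeline avoids this only when at least one token is in the vocabulary.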
Feel free to use that sample anywhere you need it.
BTW, I prefer to save temporary data this way: https://github.com/JuliaIO/JLD2.jl/issues/280#issuecomment-781357291
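For completeness, persisting intermediate results such as the embedding database can also be done with only the standard library; a minimal sketch using stdlib Serialization as a stand-in for JLD2 (unlike JLD2, this format is Julia-specific and not readable from other environments):

```julia
using Serialization

# Save and reload a toy vector database via a temporary file
db = [rand(3) for _ in 1:5]
path = tempname()

open(io -> serialize(io, db), path, "w")
db2 = open(deserialize, path)

db == db2  # true
```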
And finally, why don't you announce your package at https://discourse.julialang.org/? Or even make it part of https://github.com/JuliaNeighbors?
Thank you, @rssdev10, for the example.
The use of JSON3 was a decision made to allow opening the structure from other environments, e.g., Python. Fortunately, JLD2 works too, and it is great to be able to select whatever works best for a particular task.
Until a few months ago, I really didn't know much about the Julia package ecosystem and how it is integrated into the community, so I didn't push in that direction. Now that SimilaritySearch has a stable API, I am also trying to improve the documentation and create more examples so that other people can use it. I believe it is more usable now, and I will try to announce the package on Discourse and in JuliaNeighbors.
Best regards, Eric
@sadit This looks like a good addition for JuliaNeighbors indeed. Feel free to contact the owners there.
Hi, is there any way to use non-normalized vectors and cosine distance for text embedding search?
The code sample based on FastText is here. The last fragment contains a simple cosine-distance-based filter.
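That filter keeps only the neighbors whose cosine distance falls below a threshold. A minimal sketch of the same idea (the 0.7 threshold mirrors the snippet above; the data here is illustrative):

```julia
# (string, distance) pairs, as produced by mapping over a knn result
results = [("Julia language", 0.12), ("Julia programmer", 0.35),
           ("carrot", 0.93), ("Ruby", 0.68)]

# Keep only sufficiently close neighbors
close = filter(((s, w),) -> w < 0.7, results)
# → [("Julia language", 0.12), ("Julia programmer", 0.35), ("Ruby", 0.68)]
```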