sadit / SimilaritySearch.jl

A nearest neighbor search library with exact and approximate algorithms
https://sadit.github.io/SimilaritySearch.jl/
MIT License

cosine distance for embeddings #8

Closed · rssdev10 closed 3 years ago

rssdev10 commented 3 years ago

Hi, is there any way to use non-normalized vectors and cosine distance for text embedding search?

A code sample based on FastText is below. The last fragment contains a simple cosine-distance-based filter.

using Embeddings
using WordTokenizers
import Distances
import SimilaritySearch: fit, search, Sequential, SearchGraph, cosine_distance, KnnResult

ENV["DATADEPS_ALWAYS_ACCEPT"] = true

const embtable = load_embeddings(FastText_Text)
const embsize = size(embtable.embeddings)[1]

const get_word_index = Dict(word=>ii for (ii,word) in enumerate(embtable.vocab))

# return the embedding for `word`, or a zero vector for out-of-vocabulary words
function get_embedding(word)
  ind = get(get_word_index, word, -1)
  return ind == -1 ? zeros(embsize) : embtable.embeddings[:, ind]
end

# bag-of-words embedding: the sum of the token embeddings
function vectorize(str::String)
  tokens = tokenize(str)
  isempty(tokens) ? zeros(embsize) : mapreduce(get_embedding, +, tokens)
end

strings = [
    "Java", "Java programmer", "Julia language", "Julia programmer",
    "cpp", "programming", "Ruby", "Ada",
    "carrot", "beet", "cucumber", "Cucumber" 
];

db = strings .|> vectorize

# change here the index to find similarities
point = db[end - 1]

@time seqindex = fit(Sequential, db);
@time res = search(seqindex, cosine_distance, point, KnnResult(10))
@info map(k ->  (strings[k.id], k.dist),  res)

@time graph = fit(SearchGraph, cosine_distance, db);
@time res = search(graph, cosine_distance, point, KnnResult(10))
@info map(k ->  (strings[k.id], k.dist),  res)

# expected to see: brute-force check with Distances.jl
dists = map(v -> Distances.cosine_dist(point, v), db)
@info [(strings[i], w) for (i, w) in enumerate(dists) if w < 0.7]
sadit commented 3 years ago

Hi, I see that you are using an old API. Please look at the documentation (only the dev docs are working) to see the new API. I believe it is simpler now, and you can use distances from Distances.jl directly.

Please see the following example; it computes a knn graph for a word embedding.

https://github.com/sadit/SimilaritySearchExamples/tree/main/word-embeddings-knn-graph

It uses multithreading, so it is a bit more intricate than a single-threaded version, but it should help. In the example, vectors are normalized before being inserted and queried, but you can also use a non-normalized distance by specifying CosineDistance() instead of NormalizedCosineDistance().
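For instance, here is a minimal sketch of the distinction; the toy data is made up for illustration, and the constructor form follows the one used later in this thread:

using SimilaritySearch
import Distances
using LinearAlgebra: normalize!

db = [rand(Float32, 300) for _ in 1:1000]  # toy dataset, just for illustration

# non-normalized vectors: CosineDistance divides by both norms on every evaluation
g1 = SearchGraph(CosineDistance(), db)

# pre-normalized vectors: NormalizedCosineDistance reduces to 1 - dot(u, v)
ndb = [normalize!(copy(v)) for v in db]
g2 = SearchGraph(NormalizedCosineDistance(), ndb)

# a Distances.jl metric should also work directly
g3 = SearchGraph(Distances.CosineDist(), db)

res = search(g1, db[1], 10)  # the 10 nearest neighbors of the first vector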

I can see that you use the Embeddings package while I read everything by "hand," so the whole process should be similar, with some small changes; please check the example and let me know if you run into any trouble.

rssdev10 commented 3 years ago

OK, thanks. This way really works as expected:

import Embeddings: load_embeddings, FastText_Text
import WordTokenizers: tokenize
import SimilaritySearch: search, SearchGraph, CosineDistance

ENV["DATADEPS_ALWAYS_ACCEPT"] = true

const embtable = load_embeddings(FastText_Text)
const embsize = size(embtable.embeddings)[1]

const get_word_index = Dict(word=>ii for (ii,word) in enumerate(embtable.vocab))

# return the embedding for `word`, or a zero vector for out-of-vocabulary words
function get_embedding(word)
  ind = get(get_word_index, word, -1)
  return ind == -1 ? zeros(embsize) : embtable.embeddings[:, ind]
end

# bag-of-words embedding: the sum of the token embeddings
function vectorize(str::String)
  tokens = tokenize(str)
  isempty(tokens) ? zeros(embsize) : mapreduce(get_embedding, +, tokens)
end

strings = [
    "Java", "Java programmer", "Julia language", "Julia programmer",
    "cpp", "programming", "Ruby", "Ada",
    "carrot", "beet", "cucumber", "Cucumber" 
];

db = strings .|> vectorize
point = db[begin]

@time index = SearchGraph(CosineDistance(), db; parallel=true, firstblock=50000, block=5000)

@time res = search(index, point, 10)
@info map(k ->  (strings[k.id], k.dist),  res) |>
      l -> filter(((s, w),) -> w < 0.7, l)

If you need this sample, feel free to use it anywhere.

BTW, I prefer to save temporary data this way: https://github.com/JuliaIO/JLD2.jl/issues/280#issuecomment-781357291
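For example, a minimal sketch with JLD2's @save/@load macros (the file and variable names are just placeholders):

import JLD2: @save, @load

# save the embedded dataset together with the original strings
@save "embeddings.jld2" db strings

# ...and later restore them under the same names
@load "embeddings.jld2" db strings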

And finally, why don't you announce your package at https://discourse.julialang.org/? Or even make it part of https://github.com/JuliaNeighbors?

sadit commented 3 years ago

Thank you, @rssdev10, for the example.

The use of JSON3 was a decision made to allow opening the structure from other environments, e.g., Python. Fortunately, JLD2 works too, and it is great to be able to select whatever works best for a particular task.
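As a rough sketch of that idea (the file name is just a placeholder), a plain vector of vectors serializes to JSON that any environment can read back:

import JSON3

# write the embedded dataset as a JSON array of arrays
open("db.json", "w") do io
    JSON3.write(io, db)
end

# read it back (Python can do the same with json.load)
db2 = JSON3.read(read("db.json", String))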

Until a few months ago, I didn't know much about the Julia package ecosystem and how it is integrated into the community, so I didn't push in that direction. Now that SimilaritySearch has a stable API, I am also trying to improve the documentation and create more examples so that other people can use it. I believe it is more usable now, and I will try to announce the package on Discourse and propose it to JuliaNeighbors.

Best regards, Eric

zgornel commented 3 years ago

@sadit This looks like a good addition to JuliaNeighbors indeed. Feel free to contact the owners there.