rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[FEA] In-place k-select for sparse knn on string matching #3577

Open · cjnolet opened this issue 3 years ago

cjnolet commented 3 years ago

One of the challenges with providing optimized primitives for sparse datasets is the large number of different degree distributions that can be encountered in practice. In some recent discussions with @teju85 @beckernick @VibhuJawa @aerdem4 and @randerzander, we have isolated some potential upper bounds specific to the string matching use-case which can offer some additional memory savings (and potential performance boost) by performing the k-select in the same kernel that's computing distances.

What has been discussed so far is that the number of tokens per row for string matching could very well be bounded from above by a few hundred to a few thousand. For metrics which can be

  1. implemented in expanded form (such as cosine), and
  2. where the max degree of any row can fit into shared memory in sparse form,

we only require a single pass through the distance kernel. The lower the max degree, the better: the example dataset used in this blog is bounded from above by 138, which means we could reduce the amount of shared memory used and still maintain high occupancy even with a smaller block size, leaving more registers per thread for performing the k-select.

In addition, we have discussed that the value of k should be fairly small (definitely smaller than 32).

It sounds like this is a common enough problem that this approach could provide real benefit; it would also reduce the need to batch over the pairwise distance matrix by allowing all of the distances to be computed at once.
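
For context, here is a minimal host-side sketch of the batched "compute all distances, then k-select" pattern that such a fused primitive would avoid (NumPy/SciPy only; the function and parameter names are illustrative, not cuML API):

import numpy as np
from sklearn.preprocessing import normalize

def batched_cosine_topk(X_query, X_index, k=10, batch_size=1024):
    # Top-k cosine neighbors, batching over the pairwise distance matrix.
    Xq = normalize(X_query)  # unit-norm rows, so cosine distance = 1 - dot product
    Xi = normalize(X_index)
    all_idx, all_dist = [], []
    for start in range(0, Xq.shape[0], batch_size):
        block = Xq[start:start + batch_size]
        # Expanded-form cosine: an entire block of distances is materialized...
        dist = 1.0 - (block @ Xi.T).toarray()
        # ...and only then k-selected. The fused kernel discussed above would
        # instead select per row while the distances are being produced.
        idx = np.argpartition(dist, k, axis=1)[:, :k]
        d = np.take_along_axis(dist, idx, axis=1)
        order = np.argsort(d, axis=1)
        all_idx.append(np.take_along_axis(idx, order, axis=1))
        all_dist.append(np.take_along_axis(d, order, axis=1))
    return np.vstack(all_dist), np.vstack(all_idx)

In the fused version, each row's sparse entries would sit in shared memory and the running top-k candidates in registers, so the block of distances never needs to be written out to global memory.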

teju85 commented 3 years ago

number of tokens per row for string matching could very well be bounded from above by a few hundred to a few thousand

@cjnolet can you please elaborate on this? Does this mean the dimensionality of such datasets is a few hundred to a thousand? Or is it the number of non-zero elements per row that is a few hundred to a thousand?

teju85 commented 3 years ago

And if it is the latter, do you know what the typical dimensionality of such datasets is?

cjnolet commented 3 years ago

@teju85,

The dimensionality (vocabulary) is expected to be high but will depend on the dataset and tokenization strategy. The number of nonzeros for any given row (e.g. the number of tokens) is expected to be bounded by a few hundred to a couple thousand (this will also depend on the tokenization strategy).
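
As a quick sanity check on a given dataset, both quantities can be read directly off the TF-IDF matrix (a sketch; X is assumed to be a SciPy CSR matrix produced by whatever tokenization strategy is in use):

import numpy as np

# X: scipy.sparse.csr_matrix of TF-IDF features (assumed)
row_nnz = np.diff(X.indptr)                        # nonzeros (tokens) per row
print("dimensionality (vocabulary):", X.shape[1])  # expected to be high
print("max tokens per row (max degree):", row_nnz.max())  # a few hundred to a couple thousand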

github-actions[bot] commented 3 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

aerdem4 commented 4 months ago

Is this implemented?

Another need is to be able to have a cosine similarity option without L2 normalization on KNN. This way one can approximate BM25 matching. Since long documents suffer from L2 normalization, I do a custom normalization to approximate BM25 performance with TF-IDF vectors.
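
For illustration, a small CPU-side sketch of the kind of length-dampened scaling being described, where the additive constant c (100 in the code below) is a tunable length-penalty term rather than anything prescribed by cuML:

import scipy.sparse as sp
import scipy.sparse.linalg

def soft_l2_normalize(V, c=100.0):
    # Instead of exact L2 normalization (plain cosine), divide each TF-IDF row
    # by (||row|| + c); the additive constant softens the penalty on long
    # documents, roughly mimicking BM25-style length normalization.
    V = sp.csr_matrix(V, copy=True)
    norms = sp.linalg.norm(V, axis=1)
    row_idx, _ = V.nonzero()
    V.data /= norms[row_idx] + c
    return V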

aerdem4 commented 4 months ago

CPU Version

from beir import util, LoggingHandler
from beir.retrieval import models
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
import pandas as pd
from sparse_dot_topn import sp_matmul, sp_matmul_topn
import numpy as np
from beir.retrieval.search import BaseSearch
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
import scipy.sparse as sp
import scipy.sparse.linalg  # makes sp.linalg.norm available below

corpus, queries, qrels = GenericDataLoader(data_folder="data/fiqa/").load(split="test")

corpus_df = pd.DataFrame([{"text_id": k, "text": v["text"]} for k, v in corpus.items()])

class StringMatcher(BaseSearch):
    def __init__(self, ground_truth, col):
        self.ground_truth = ground_truth
        self.vec1 = TfidfVectorizer(ngram_range=(4, 4), analyzer="char", norm=None,
                                    sublinear_tf=True, min_df=3, lowercase=True, use_idf=True)
        self.vec2 = TfidfVectorizer(ngram_range=(1, 2), analyzer="word", norm=None,
                                    sublinear_tf=True, min_df=1, lowercase=False, use_idf=True)
        self.col = col

        self.vec1.fit(self.ground_truth[self.col])
        self.vec2.fit(self.ground_truth[self.col])
        self.gt_ids = self.ground_truth["text_id"].values

    def vectorize(self, texts):
        V = sp.hstack([self.vec1.transform(texts),
                    self.vec2.transform(texts)])
        norms = sp.linalg.norm(V, axis=1)

        row_indices, col_indices = V.nonzero()
        V.data /= norms[row_indices] + 100
        return V.tocsr()  # CSR for the top-n sparse matmul below

    def get_matches_df(self, sparse_matrix, query_ids):
        non_zeros = sparse_matrix.nonzero()

        query_indices = non_zeros[0]
        corpus_indices = non_zeros[1]

        res = defaultdict(dict)

        for index in range(corpus_indices.size):
            query_id = query_ids[query_indices[index]]
            corpus_id = self.gt_ids[corpus_indices[index]]
            score = sparse_matrix.data[index]

            res[query_id].update({corpus_id: score})

        return res

    def search(self, corpus, queries, topk, *args, **kwargs):
        query_df = pd.DataFrame([{"text_id": k, "text": v} for k, v in queries.items()])

        X_gt = self.vectorize(self.ground_truth[self.col])
        X = self.vectorize(query_df[self.col])

        d = sp_matmul_topn(X, X_gt.T, top_n=10, threshold=0.0, sort=True, n_threads=8)

        return self.get_matches_df(d, query_df["text_id"].values)

sm = StringMatcher(corpus_df, "text")

retriever = EvaluateRetrieval(sm)

results = retriever.retrieve(corpus, queries)

ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)

Diff on GPU Version

from cupyx.scipy.sparse import diags
from cupyx.scipy.sparse import linalg as sparse_linalg
from cuml.feature_extraction.text import TfidfVectorizer  # GPU TF-IDF, replacing the sklearn vectorizer used above
from cuml.neighbors import NearestNeighbors
import cudf
from tqdm import tqdm

class StringMatcher(BaseSearch):
    def __init__(self, corpus_df, col):
        self.corpus_df = corpus_df
        self.vec = TfidfVectorizer(ngram_range=(1, 2), analyzer="word", norm=None,
                                    sublinear_tf=True, min_df=1, lowercase=False, use_idf=True)
        self.col = col

        self.vec.fit(self.corpus_df[self.col])
        self.corpus_ids = self.corpus_df["text_id"]

    def vectorize(self, texts):
        V = self.vec.transform(texts)
        norms = sparse_linalg.norm(V, axis=1) + 100
        norm_diag = diags(1.0 / norms)  # divide each row by (norm + 100), matching the CPU version
        V = norm_diag @ V
        return V

    def search(self, corpus, queries, topk, *args, **kwargs):
        query_df = cudf.DataFrame([{"text_id": k, "text": v} for k, v in queries.items()])

        X_corpus = self.vectorize(self.corpus_df[self.col])
        X_query = self.vectorize(query_df[self.col])

        nn = NearestNeighbors(n_neighbors=10, metric="cosine")
        nn.fit(X_corpus)

        distances, indices = nn.kneighbors(X_query)
        scores = 1 - distances

        query_ids = query_df["text_id"]
        res = defaultdict(dict)

        for i in tqdm(range(scores.shape[0])):
            for j in range(10):
                res[query_ids[i]].update({self.corpus_ids[indices[i, j].item()]: scores[i, j].item()})

        return res

sm = StringMatcher(corpus_df, "text")