Open cjnolet opened 3 years ago
number of tokens per row for string matching could very well be bounded from above by a few hundred to a few thousand
@cjnolet can you please elaborate on this? Does this mean the dimensionality of such datasets are a few hundred to thousand? Or are the number of non-zero elements per row are a few hundred to thousand?
and if it is the latter, do you know what's the typical dimensionality of such datasets?
@teju85,
The dimensionality (vocabulary) is expected to be high but will depend on the dataset and tokenization strategy. The number of nonzeros for any given row (e.g. the number of tokens) is expected to be bounded by a few hundred to a couple thousand (this will also depend on the tokenization strategy).
This issue has been labeled inactive-30d
due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d
if there is no activity in the next 60 days.
This issue has been labeled inactive-90d
due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
Is this implemented?
Another need is to be able too have cosine similarity without l2 normalization option on KNN. This way one can approximate BM25 matching. Since long documents suffer from l2 normalization, I do custom normalization to approximate BM25 performance with TFIDF vectors.
CPU Version
from beir import util, LoggingHandler
from beir.retrieval import models
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
import pandas as pd
from sparse_dot_topn import sp_matmul, sp_matmul_topn
import numpy as np
from beir.retrieval.search import BaseSearch
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
import scipy.sparse as sp
corpus_df = pd.DataFrame([{"text_id": k, "text": v["text"]} for k, v in corpus.items()])
corpus, queries, qrels = GenericDataLoader(data_folder="data/fiqa/").load(split="test")
class StringMatcher(BaseSearch):
def __init__(self, ground_truth, col):
self.ground_truth = ground_truth
self.vec1 = TfidfVectorizer(ngram_range=(4, 4), analyzer="char", norm=None,
sublinear_tf=True, min_df=3, lowercase=True, use_idf=True)
self.vec2 = TfidfVectorizer(ngram_range=(1, 2), analyzer="word", norm=None,
sublinear_tf=True, min_df=1, lowercase=False, use_idf=True)
self.col = col
self.vec1.fit(self.ground_truth[self.col])
self.vec2.fit(self.ground_truth[self.col])
self.gt_ids = self.ground_truth["text_id"].values
def vectorize(self, texts):
V = sp.hstack([self.vec1.transform(texts),
self.vec2.transform(texts)])
norms = sp.linalg.norm(V, axis=1)
row_indices, col_indices = V.nonzero()
V.data /= norms[row_indices] + 100
return V
def get_matches_df(self, sparse_matrix, query_ids):
non_zeros = sparse_matrix.nonzero()
query_indices = non_zeros[0]
corpus_indices = non_zeros[1]
res = defaultdict(dict)
for index in range(corpus_indices.size):
query_id = query_ids[query_indices[index]]
corpus_id = self.gt_ids[corpus_indices[index]]
score = sparse_matrix.data[index]
res[query_id].update({corpus_id: score})
return res
def search(self, corpus, queries, topk, *args, **kwargs):
query_df = pd.DataFrame([{"text_id": k, "text": v} for k, v in queries.items()])
X_gt = self.vectorize(self.ground_truth[self.col])
X = self.vectorize(query_df[self.col])
d = sp_matmul_topn(X, X_gt.T, top_n=10, threshold=0.0, sort=True, n_threads=8)
return self.get_matches_df(d, query_df["text_id"].values)
sm = StringMatcher(corpus_df, "text")
retriever = EvaluateRetrieval(sm)
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
Diff on GPU Version
from cupyx.scipy.sparse import diags
from cuml.neighbors import NearestNeighbors
from tqdm import tqdm
class StringMatcher(BaseSearch):
def __init__(self, corpus_df, col):
self.corpus_df = corpus_df
self.vec = TfidfVectorizer(ngram_range=(1, 2), analyzer="word", norm=None,
sublinear_tf=True, min_df=1, lowercase=False, use_idf=True)
self.col = col
self.vec.fit(self.corpus_df[self.col])
self.corpus_ids = self.corpus_df["text_id"]
def vectorize(self, texts):
V = self.vec.transform(texts)
norms = sparse_linalg.norm(V, axis=1) + 100
norm_diag = diags(norms)
V = norm_diag @ V
return V
def search(self, corpus, queries, topk, *args, **kwargs):
query_df = cudf.DataFrame([{"text_id": k, "text": v} for k, v in queries.items()])
X_corpus = self.vectorize(self.corpus_df[self.col])
X_query = self.vectorize(query_df[self.col])
nn = NearestNeighbors(n_neighbors=10, metric="cosine")
nn.fit(X_corpus)
distances, indices = nn.kneighbors(X_query)
scores = 1 - distances
query_ids = query_df["text_id"]
res = defaultdict(dict)
for i in tqdm(range(scores.shape[0])):
for j in range(10):
res[query_ids[i]].update({self.corpus_ids[indices[i, j].item()]: scores[i, j].item()})
return res
sm = StringMatcher(corpus_df, "text")
One of the challenges with providing optimized primitives for sparse datasets is the large number of different degree distributions that can be encountered in practice. In some recent discussions with @teju85 @beckernick @VibhuJawa @aerdem4 and @randerzander, we have isolated some potential upper bounds specific to the string matching use-case which can offer some additional memory savings (and potential performance boost) by performing the k-select in the same kernel that's computing distances.
What has been discussed so far is that the number of tokens per row for string matching could very well be bounded from above by a few hundred to a few thousand. For metrics which can be
we only require a single pass through the distance kernel. The lower the max degree, the better, as the example dataset being used in this blog is bounded from above by 138, which means we could lower the amount of shared memory being used, thus also being able to maintain high occupancy even with a lower block size in order to gain more registers for performing the k-select.
In addition, we have discussed that the value of
k
should be fairly small (definitely smaller than 32).It sounds like this is a common enough problem where this approach could really benefit, and would reduce the need to batch over the pairwise distance matrix by allowing all of the distances to be computed at once.