yaxundai opened this issue 5 months ago
Hi @KAGAII, make sure you update neural-cherche with pip install neural-cherche --upgrade to get the 1.4.3 version.
from neural_cherche import models, rank, retrieve, utils

device = "cpu"  # or "mps" or "cuda"

documents, queries, qrels = utils.load_beir(
    "arguana",
    split="test",
)

retriever = retrieve.BM25(
    key="id",
    on=["title", "text"],
)

ranker = rank.ColBERT(
    key="id",
    on=["title", "text"],
    model=models.ColBERT(
        model_name_or_path="raphaelsty/neural-cherche-colbert",
        device=device,
    ).to(device),
)

retriever = retriever.add(
    documents_embeddings=retriever.encode_documents(
        documents=documents,
    )
)

candidates = retriever(
    queries_embeddings=retriever.encode_queries(
        queries=queries,
    ),
    k=30,
    tqdm_bar=True,
)

batch_size = 32

scores = ranker(
    documents=candidates,
    queries_embeddings=ranker.encode_queries(
        queries=queries,
        batch_size=batch_size,
        tqdm_bar=True,
    ),
    documents_embeddings=ranker.encode_candidates_documents(
        candidates=candidates,
        documents=documents,
        batch_size=batch_size,
        tqdm_bar=True,
    ),
    k=10,
)

scores = utils.evaluate(
    scores=scores,
    qrels=qrels,
    queries=queries,
    metrics=["ndcg@10"] + [f"hits@{k}" for k in range(1, 11)],
)

print(scores)
This yields:
{
    "ndcg@10": 0.3686831610778578,
    "hits@1": 0.01386748844375963,
    "hits@2": 0.27889060092449924,
    "hits@3": 0.40061633281972264,
    "hits@4": 0.4861325115562404,
    "hits@5": 0.5562403697996918,
    "hits@6": 0.6194144838212635,
    "hits@7": 0.6556240369799692,
    "hits@8": 0.6887519260400616,
    "hits@9": 0.7218798151001541,
    "hits@10": 0.74884437596302,
}
These are good scores, and the run takes about 3 minutes on an MPS device. The results you got were caused by duplicate queries, which are now handled by neural-cherche's evaluation.
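For reference, here is how ndcg@10 and hits@k can be computed for a single query's ranked relevance list. This is a standalone sketch of the standard formulas, not neural-cherche's implementation:

```python
import math

def dcg_at_k(relevances, k):
    # DCG: each relevance grade discounted by log2(rank + 1), with 1-based ranks.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    # NDCG: DCG of the actual ranking over DCG of the ideal (sorted) ranking.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def hits_at_k(relevances, k):
    # hits@k: 1 if any relevant document appears in the top k, else 0.
    return 1.0 if any(relevances[:k]) else 0.0

# A query whose single relevant document was ranked third:
print(ndcg_at_k([0, 0, 1, 0], k=10))  # 1 / log2(4) = 0.5
print(hits_at_k([0, 0, 1, 0], k=2))   # 0.0
print(hits_at_k([0, 0, 1, 0], k=3))   # 1.0
```

The corpus-level numbers above are these per-query values averaged over all (deduplicated) queries.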
EDIT: sorry, I just saw that you mentioned SparseEmbed, not ColBERT. Running the benchmark now.
@KAGAII There is definitely something wrong with SparseEmbed right now. We recently updated SparseEmbed, but we may need to roll it back to the previous version @arthur-75. I'll push an update in the coming days.
Thank you for your prompt reply, looking forward to the new version!
When I used the pre-trained model 'raphaelsty/neural-cherche-sparse-embed' to evaluate the arguana dataset with a retrieval k value of 100, the results were very poor:

{
    "map": 0.033567943638956016,
    "ndcg@10": 0.042417859280348115,
    "ndcg@100": 0.08691780846498275,
    "recall@10": 0.09815078236130868,
    "recall@100": 0.32147937411095306,
}

As shown above, ndcg@10 is only 4.2%.
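For context on the recall numbers, recall@k is the fraction of a query's relevant documents that appear among the top k retrieved. A standalone sketch with hypothetical document ids, not neural-cherche's code:

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    # Fraction of the relevant set found within the top-k retrieved ids.
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved_ids[:k])) / len(relevant)

# Two relevant documents, one of which is retrieved within the top 2:
print(recall_at_k(["d1", "d2", "d3"], relevant_ids=["d2", "d9"], k=2))  # 0.5
```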