skyl / corpora

Corpora is a self-building corpus that can help build other arbitrary corpora
GNU Affero General Public License v3.0

Experiment with ColBERTv2.0 for Embedding Comparison #25

Open skyl opened 1 week ago


Objective

Evaluate how ColBERTv2.0, available on Hugging Face, compares with the embedding method currently used in our project. This is an initial experiment to measure the difference in search accuracy, speed, and storage requirements.

Background

Currently, our project uses standard embeddings, which may not capture the token-level information that models like ColBERTv2.0 expose. ColBERT (Contextualized Late Interaction over BERT) generates multiple vectors per document, one per token, and scores a query against a document by comparing these token-level vectors instead of a single pooled vector. This may improve retrieval quality, at the cost of storing many more vectors.
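To make the "late interaction" idea concrete, here is a minimal sketch of MaxSim-style scoring in plain Python. It is not the ColBERT library's implementation. The function names and the toy 2-dimensional vectors are made up for illustration; real ColBERTv2.0 token embeddings come from the model and are much higher-dimensional.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for each query token vector, take its best
    cosine similarity against any document token vector, then sum.
    Assumes doc_vecs is non-empty."""
    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]
    return sum(max(dot(qv, dv) for dv in d) for qv in q)


# Hypothetical token embeddings, one vector per token.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # has a close match for each query token
doc_b = [[1.0, 1.0]]                          # only a partial match for each

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Here doc_a ranks above doc_b because every query token finds a near-exact match among its token vectors. A single-vector approach would instead compare one pooled vector per text.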

Plan

  1. Setup: Install and configure ColBERTv2.0 from Hugging Face.
  2. Integration:
    • Update the existing embedding generation pipeline to incorporate ColBERTv2.0 as an alternative.
    • Use pgvector with Django to store multi-vector representations.
  3. Comparison:
    • Implement search functionality using both the new ColBERT-based embeddings and the current method.
    • Compare the two methods on retrieval accuracy, processing time, and database storage utilization.
  4. Evaluation:
    • Analyze the outcomes of both methods and document observations regarding their effectiveness and practicality for our needs.
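As a rough sketch of how the comparison step could be measured, the helpers below compute recall@k, wall-clock time for a search call, and a back-of-the-envelope storage estimate. Everything here is an assumption for illustration: the helper names, the document IDs, the rankings, and the sizes (1,000 documents, 200 tokens per document, a 1536-dimensional current embedding) are hypothetical. The estimate assumes 4-byte floats and ignores per-row and index overhead.

```python
import time


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def storage_bytes(num_docs, vectors_per_doc, dim, bytes_per_value=4):
    """Raw vector payload size, ignoring per-row and index overhead."""
    return num_docs * vectors_per_doc * dim * bytes_per_value


# Hypothetical ground truth and rankings returned by each search method.
relevant = ["d1", "d4"]
single_vector_ranking = ["d2", "d1", "d3", "d4"]
multi_vector_ranking = ["d1", "d4", "d2", "d3"]

recall_single = recall_at_k(single_vector_ranking, relevant, k=2)
recall_multi, elapsed = timed(recall_at_k, multi_vector_ranking, relevant, k=2)

# Hypothetical sizes: one 1536-dim vector per document versus
# 200 token vectors of 128 dims per document.
single_size = storage_bytes(num_docs=1000, vectors_per_doc=1, dim=1536)
multi_size = storage_bytes(num_docs=1000, vectors_per_doc=200, dim=128)
print(recall_single, recall_multi, single_size, multi_size)
```

In a real run, the rankings would come from the two search implementations over the same labeled query set, and `timed` would wrap the actual search calls.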

Expected Outcome

Additional Notes