stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)

Pre-filtering the documents based on metadata before late-interaction #304

Open Athe-kunal opened 4 months ago

Athe-kunal commented 4 months ago

I have financial data for each quarter and year, and in my application I ask users which document they want to chat about. Hence, I need to filter the data first, and then use ColBERT to answer the question. But how can I pre-filter the data based on metadata (as in a vector database)? Building a separate index for each file is not optimal, since I have other metadata as well.

Athe-kunal commented 4 months ago

@okhat Can you suggest something here?

detaos commented 4 months ago

It's not the easiest thing to use, but ColBERT does support pre-filtering via the filter_fn argument to Searcher.search. Here's the chunk I use:

    import torch  # filter_fn must return a torch tensor of the passage IDs to keep

    if len(query.conditions) > 0:
        # Keep only the passage IDs whose metadata matches the query's filter conditions.
        results = searcher.search(query.query, k=query.k, filter_fn=lambda pids: torch.tensor(
            [pid for pid in pids.tolist() if keepResult(query, pid)], dtype=pids.dtype))
    else:
        results = searcher.search(query.query, k=query.k, full_length_search=True)

Note: the query object contains the filter conditions. The keepResult function returns a boolean indicating whether the metadata for the given passage ID matches the query's filter.
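
As a minimal sketch of what such a filter could look like (the passage_metadata dict and the {field: value} shape of query.conditions are application-side assumptions here, not part of ColBERT's API):

    # Hypothetical metadata store: one dict of fields per ColBERT passage ID.
    passage_metadata = {
        0: {"ticker": "AAPL", "year": 2023, "quarter": "Q1"},
        1: {"ticker": "MSFT", "year": 2022, "quarter": "Q4"},
        # ... one entry per passage in the indexed collection
    }

    def keepResult(query, pid):
        # True iff this passage's metadata satisfies every filter condition.
        meta = passage_metadata.get(pid, {})
        return all(meta.get(field) == value for field, value in query.conditions.items())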

Athe-kunal commented 4 months ago

Hi @detaos

Thank you for your response. During indexing, how do I attach the metadata? Currently my indexing code is something like:

    with Run().context(RunConfig(nranks=1, experiment=EXPERIMENT_NAME)):  # nranks: number of GPUs to use
        # kmeans_niters: number of iterations of k-means clustering; 4 is a good, fast default.
        # Consider larger values for small datasets.
        config = ColBERTConfig(doc_maxlen=doc_maxlen, nbits=nbits, kmeans_niters=4)

        indexer = Indexer(checkpoint=COLBERT_CHECKPOINT, config=config)
        for name, text_list in texts_dict.items():
            index_name = f'SEC.Earningcalls.{ticker}.{year}.{name}.{nbits}bits'
            indexer.index(name=index_name, collection=text_list, overwrite=True)

How can I pass in the metadata? Currently I am just passing the list of texts. Thanks in advance.

detaos commented 4 months ago

You don't need to index metadata that won't help the search. For example, lastmod dates from HTML pages are useful metadata, but no one searches for a lastmod date. So I keep my non-search metadata separate: a mapping object from passage ID to page ID, plus a metadata object holding the metadata for each page. My keepResult function maps the candidate passage ID to its page ID and checks that page's metadata against the filter. Essentially: metadata[page_ids[passage_id]]

It's a bit convoluted, but if you store the metadata per passage, you end up with a LOT of redundant metadata (assuming you have many pages that are longer than one passage, which I do).
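
To make the mapping concrete, here's a rough sketch of that two-level lookup; pages is a hypothetical {page_id: (page_meta, [passage, ...])} dict, and none of these names come from ColBERT itself:

    # Build the passage ID -> page ID mapping while flattening pages into a collection.
    page_ids = {}    # passage ID -> page ID
    metadata = {}    # page ID -> metadata dict, stored once per page
    collection = []  # flat passage list; ColBERT assigns passage IDs by position

    for page_id, (page_meta, passages) in pages.items():
        metadata[page_id] = page_meta
        for passage in passages:
            page_ids[len(collection)] = page_id
            collection.append(passage)

    # indexer.index(name=index_name, collection=collection, overwrite=True)

    def keepResult(query, passage_id):
        # Two-level lookup: metadata[page_ids[passage_id]], as described above.
        meta = metadata[page_ids[passage_id]]
        return all(meta.get(field) == value for field, value in query.conditions.items())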

Athe-kunal commented 4 months ago

OK, understood. Thanks @detaos, I will implement this in my code.