Athe-kunal opened 4 months ago
@okhat Can you suggest something here?
It's not the easiest thing to use, but ColBERT does support pre-filtering:
Here's the chunk I use:
```python
if len(query.conditions) > 0:
    results = searcher.search(
        query.query,
        k=query.k,
        filter_fn=lambda pids: torch.tensor(
            [index for index in pids.numpy().tolist() if keepResult(query, index)],
            dtype=pids.dtype,
        ),
    )
else:
    results = searcher.search(query.query, k=query.k, full_length_search=True)
```
Note: The `query` object contains the filter conditions. The `keepResult` function returns a boolean indicating whether the metadata for the given passage ID (the `index` parameter) matches the filter in the query.
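As a concrete illustration, here is a minimal sketch of what such a `keepResult` helper might look like. The `passage_metadata` store and the shape of `query.conditions` (a dict of field/value pairs) are assumptions for this example, not part of ColBERT's API; adapt them to however you keep metadata alongside your index.

```python
# Hypothetical sketch: a keepResult helper usable with ColBERT's filter_fn.
# passage_metadata and the dict shape of query.conditions are assumptions.

# Metadata per passage ID (illustrative data).
passage_metadata = {
    0: {"ticker": "AAPL", "year": 2022},
    1: {"ticker": "MSFT", "year": 2021},
    2: {"ticker": "AAPL", "year": 2021},
}

def keepResult(query, passage_id):
    """Return True if the passage's metadata satisfies every filter condition."""
    meta = passage_metadata.get(passage_id, {})
    # query.conditions is assumed to be a dict, e.g. {"ticker": "AAPL", "year": 2021}.
    return all(meta.get(key) == value for key, value in query.conditions.items())
```

Any passage ID whose metadata fails a condition is dropped from `pids` before scoring, which is what the `filter_fn` lambda above relies on.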
Hi @detaos

Thank you for your response. During indexing, how should I index the documents with metadata? My indexing function looks like this:
```python
# nranks specifies the number of GPUs to use
with Run().context(RunConfig(nranks=1, experiment=EXPERIMENT_NAME)):
    # kmeans_niters specifies the number of iterations of k-means clustering;
    # 4 is a good and fast default. Consider larger numbers for small datasets.
    config = ColBERTConfig(doc_maxlen=doc_maxlen, nbits=nbits, kmeans_niters=4)
    indexer = Indexer(checkpoint=COLBERT_CHECKPOINT, config=config)
    for name, text_list in texts_dict.items():
        index_name = f'SEC.Earningcalls.{ticker}.{year}.{name}.{nbits}bits'
        indexer.index(name=index_name, collection=text_list, overwrite=True)
```
How can I pass the metadata information? Currently I am just passing the list of texts. Thanks in advance.
You don't need to index metadata that won't help the search. For example, `lastmod` dates from HTML pages are useful metadata, but no one is searching for a `lastmod` date. So, I keep my non-search-related metadata separate. I have a mapping object from passage ID to page ID, and a metadata object that holds the metadata for each page. My `keepResult` function uses the mapping from the candidate passage ID to the page ID to get the page's metadata to check against the filter. Essentially: `metadata[page_ids[passage_id]]`

It's a bit convoluted, but if you store the metadata for each passage, you end up with a LOT of redundant metadata (presupposing you have many pages that are longer than one passage, which I do).
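The two-level lookup described above can be sketched as follows; the variable names and sample data are illustrative, not taken from an actual codebase:

```python
# Hypothetical sketch of the passage -> page -> metadata indirection.
# Storing metadata once per page avoids duplicating it across every passage.

# One entry per passage: which page it came from.
page_ids = {0: "pageA", 1: "pageA", 2: "pageB"}

# One entry per page: the metadata shared by all of its passages.
metadata = {
    "pageA": {"lastmod": "2023-01-15", "ticker": "AAPL"},
    "pageB": {"lastmod": "2023-02-02", "ticker": "MSFT"},
}

def metadata_for_passage(passage_id):
    """The lookup from the comment above: metadata[page_ids[passage_id]]."""
    return metadata[page_ids[passage_id]]
```

Passages 0 and 1 share a single metadata record for `pageA`, which is exactly the redundancy this indirection avoids.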
Ok, understood. Thanks @detaos, I will implement this in my code.
I have financial data for each quarter as well as yearly data, and in my application I ask users which document they want to use with the chatbot. Hence, I need to filter the data first, and then I will use ColBERT to answer the question. But how can I pre-filter the data (like in a vector database)? Building a separate index for each file is not optimal, as I have other metadata too.