Closed MFajcik closed 1 year ago
Good catch! Can you cast the set to a list? That sounds like it'll fix this. If you make a pull request, I'll merge it.
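A minimal sketch of the suggested cast-to-list fix, assuming the offending value is a set of `(pid, sid)` tuples (the exact structure produced by `hover_inference.py` may differ):

```python
import json

# Sets are not JSON-serializable, so json.dumps raises a TypeError:
preds = {"question": "who wrote Hamlet?", "support": {(12, 0), (98, 3)}}
try:
    json.dumps(preds)
except TypeError:
    pass  # "Object of type set is not JSON serializable"

# Casting the set to a (sorted) list before dumping fixes it.
# Note that JSON has no tuple type, so tuples come back as lists.
preds["support"] = sorted(preds["support"])
serialized = json.dumps(preds)
```

Sorting is optional, but it makes the serialized output deterministic across runs, which a raw set iteration order would not be.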
@okhat Wouldn't it be better to return N deduplicated lists (where N is the number of hops) from the ColBERT engine, so that the individual retrieval results keep their order?
I would submit the pull request for ColBERT, but I am not sure whether this would cause problems with some of your scripts.
edit: code-wise, something like

```python
from baleen.utils.loaders import *
from baleen.condenser.condense import Condenser


class Baleen:
    def __init__(self, collectionX_path: str, searcher, condenser: Condenser):
        self.collectionX = load_collectionX(collectionX_path)
        self.searcher = searcher
        self.condenser = condenser

    def search(self, query, num_hops, depth=100, verbose=False):
        assert depth % num_hops == 0, f"depth={depth} must be divisible by num_hops={num_hops}."
        k = depth // num_hops

        searcher = self.searcher
        condenser = self.condenser
        collectionX = self.collectionX

        facts = []
        stage1_preds = None
        context = None

        # One deduplicated, order-preserving list of pids per hop.
        pids_bag = [[] for _ in range(num_hops)]

        for hop_idx in range(num_hops):
            ranking = list(zip(*searcher.search(query, context=context, k=depth)))

            ranking_ = []
            facts_pids = {pid for pid, _ in facts}

            for pid, rank, score in ranking:
                # print(f'[{score}] \t\t {searcher.collection[pid]}')
                if len(ranking_) < k and pid not in facts_pids:
                    ranking_.append(pid)

                if len(pids_bag[hop_idx]) < k:
                    if all(pid not in pids_bag[hi] for hi in range(num_hops)):
                        pids_bag[hop_idx].append(pid)

            stage1_preds, facts, stage2_L3x = condenser.condense(query, backs=facts, ranking=ranking_)
            context = ' [SEP] '.join([collectionX.get((pid, sid), '') for pid, sid in facts])

        assert sum(len(pids_per_hop) for pids_per_hop in pids_bag) == depth  # edit: fixed assert
        return stage2_L3x, pids_bag, stage1_preds
```
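With `pids_bag` returned as a plain list of lists, the per-hop results keep their order and serialize through `json` without any casting. A small sketch (the pid values are made up for illustration):

```python
import json

# pids_bag as returned by the modified search(): one ordered,
# deduplicated list of pids per hop (hypothetical values).
pids_bag = [[101, 7, 55], [12, 98, 3]]

# Lists of lists round-trip through JSON with per-hop order
# preserved, unlike sets:
serialized = json.dumps({"pids_bag": pids_bag})
restored = json.loads(serialized)["pids_bag"]
assert restored == pids_bag
```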
Hi, when saving the inference results as a JSON file via `hover_inference.py`, the dictionary contains a set. Sets are not serializable via `json`, so the saving fails. Every item in the dictionary to be saved looks like this
This is quite annoying after spending a few hours inferring the actual retrieval results :). Cheers, Martin