Closed cristianmtr closed 2 years ago
I think I miss some use-case, but I still don't get, why we need
startup
at all. Couldn't we just dodeltas = self._kv_indexer._get_delta( shard_id=self.runtime_args.pea_id, total_shards=self.total_shards, timestamp=timestamp, ) # deltas will be like DOC_ID, OPERATION, DATA self._vec_indexer._add_delta(deltas)
in both cases inside
_sync_only_delta
? TheFaissSearcher
is already initialized, no reason to create a new one. Thentimestamp
can always default toself._vec_indexer.last_timestamp
. This anyhow defaults todatetime.min
.The current semantic of
startup
is: Clear the FaissIndexer completely. If this is the use-case, we should rename the parameter to something likerecreate
. I'd argue, this should be done by recreate the full Pea for now.
I don't think we can easily add delta if the index is empty. That's why it needs to recreate the FaissSearcher and call it with the constructor's dump_func
I don't think we can easily add delta if the index is empty. That's why it needs to recreate the FaissSearcher and call it with the constructor's dump_func
I see. So from Faiss perspective, the difference is: startup = True
=> Training must happen. This means, the very sync
call defines the training data. When I get it right, whenever I set startup=True
, the FaissSearcher will be retrained. Thus, I'd rename it to something like train_faiss
or just train
/training
.
Apart from that, we should be very explicit in the Readme, that the first call to sync
implies a train step. If you never call sync
with train=True
afterwards, the timing of the very first call heavily influences your search performance afterwards. This should be included as a warning
or note
or something in the Readme (if it not already is).
@maximilianwerk Renamed and adapted README
Refactor usage of startup args, to be easier and adapt to scenario.