jina-ai / executors

internal-only
Apache License 2.0
31 stars 12 forks source link

refactor: specify startup arg #252

Closed cristianmtr closed 2 years ago

cristianmtr commented 2 years ago

Refactor usage of startup args, to be easier and adapt to scenario.

cristianmtr commented 2 years ago

I think I miss some use-case, but I still don't get, why we need startup at all. Couldn't we just do

        deltas = self._kv_indexer._get_delta(
            shard_id=self.runtime_args.pea_id,
            total_shards=self.total_shards,
            timestamp=timestamp,
        )
        # deltas will be like DOC_ID, OPERATION, DATA
        self._vec_indexer._add_delta(deltas)

in both cases inside _sync_only_delta? The FaissSearcher is already initialized, no reason to create a new one. Then timestamp can always default to self._vec_indexer.last_timestamp. This anyhow defaults to datetime.min.

The current semantic of startup is: Clear the FaissIndexer completely. If this is the use-case, we should rename the parameter to something like recreate. I'd argue, this should be done by recreate the full Pea for now.

I don't think we can easily add delta if the index is empty. That's why it needs to recreate the FaissSearcher and call it with the constructor's dump_func

maximilianwerk commented 2 years ago

I don't think we can easily add delta if the index is empty. That's why it needs to recreate the FaissSearcher and call it with the constructor's dump_func

I see. So from Faiss perspective, the difference is: startup = True => Training must happen. This means, the very sync call defines the training data. When I get it right, whenever I set startup=True, the FaissSearcher will be retrained. Thus, I'd rename it to something like train_faiss or just train/training.

Apart from that, we should be very explicit in the Readme, that the first call to sync implies a train step. If you never call sync with train=True afterwards, the timing of the very first call heavily influences your search performance afterwards. This should be included as a warning or note or something in the Readme (if it not already is).

cristianmtr commented 2 years ago

@maximilianwerk Renamed and adapted README