Open seetimee opened 2 months ago
To accelerate the bge-m3 sparse embedding module during inference, consider the following steps:

1. Optimize Model Loading: review how the model is selected and loaded in `select_model.py`.
2. Batch Processing: tune the `BatchHandler` initialization in `engine.py`, adjusting `batch_delay` to balance latency against throughput.
3. Caching: enable the vector disk cache wired through `batch_handler.py` to reuse embeddings for repeated inputs.
4. Hardware Acceleration: see `transformer/embedder/neuron.py` for Neuron-specific optimizations.
5. Parallel Processing: increase parallelism in the `BatchHandler` and `ModelWorker` classes.

Example code adjustments (a usage sketch and a benchmark sketch follow the snippets below):
```python
# engine.py
self._batch_handler = BatchHandler(
    max_batch_size=self._engine_args.batch_size * 2,  # increase batch size
    model=self._model,
    batch_delay=self._min_inference_t / 4,  # shorten batch delay
    vector_disk_cache_path=self._engine_args.vector_disk_cache_path,
    verbose=logger.level <= 10,
    lengths_via_tokenize=self._engine_args.lengths_via_tokenize,
)
```
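Most of these knobs are also exposed to callers through `EngineArgs`, so you can experiment without patching `engine.py` itself. A minimal usage sketch, assuming the `EngineArgs` / `AsyncEmbeddingEngine.from_args` API of recent infinity_emb releases; the model id, batch size, cache path, and input sentences here are illustrative, not prescriptive:

```python
import asyncio

from infinity_emb import AsyncEmbeddingEngine, EngineArgs

# Larger batches amortize per-forward overhead; the disk cache lets
# repeated inputs skip inference entirely.
engine_args = EngineArgs(
    model_name_or_path="BAAI/bge-m3",
    batch_size=64,                            # raise until memory becomes the limit
    vector_disk_cache_path="./bge_m3_cache",  # illustrative cache location
    lengths_via_tokenize=True,                # schedule batches by true token length
)
engine = AsyncEmbeddingEngine.from_args(engine_args)

async def main() -> None:
    async with engine:  # starts the BatchHandler loop
        embeddings, usage = await engine.embed(sentences=["hello world"] * 256)
        print(len(embeddings), "embeddings,", usage, "tokens")

asyncio.run(main())
```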
```python
# batch_handler.py
from concurrent.futures import ThreadPoolExecutor

class BatchHandler:
    def __init__(self, ...):
        ...
        self._threadpool = ThreadPoolExecutor(max_workers=8)  # increase thread pool size
        ...
```
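Because `batch_delay` and the thread-pool size trade latency for throughput, it is worth measuring both before and after such patches. A rough benchmark sketch, assuming an `engine` built as in the previous example and not yet started; the request count and batch size are arbitrary:

```python
import asyncio
import time

async def benchmark(engine, n_requests: int = 32, batch: int = 64) -> None:
    """Fire concurrent embed calls; report per-request latency vs total throughput."""
    sentences = ["benchmark sentence"] * batch

    async def one_request() -> float:
        t0 = time.perf_counter()
        await engine.embed(sentences=sentences)
        return time.perf_counter() - t0

    async with engine:
        t_start = time.perf_counter()
        latencies = await asyncio.gather(*(one_request() for _ in range(n_requests)))
        wall = time.perf_counter() - t_start

    print(f"p50 latency: {sorted(latencies)[len(latencies) // 2]:.3f}s")
    print(f"throughput: {n_requests * batch / wall:.1f} embeddings/s")

# asyncio.run(benchmark(engine))
```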
These changes should help accelerate the sparse embedding process for the bge-m3 model.
Referenced files:
- /libs/infinity_emb/infinity_emb/engine.py
- /libs/infinity_emb/infinity_emb/inference/select_model.py
- /libs/infinity_emb/infinity_emb/inference/batch_handler.py
- /libs/infinity_emb/infinity_emb/inference
- /libs/infinity_emb/infinity_emb/transformer/embedder
These methods are difficult to get working.
Feature request
How can the bge-m3 sparse embedding module be accelerated during inference?
Motivation
The sparse embedding step is too slow when running inference with bge-m3, even after the dense embedding inference has been accelerated.
Your contribution
If you can share an idea, I will learn how to make it work.