iamlemec / bert.cpp

GGML implementation of BERT model with Python bindings and quantization.
MIT License
51 stars 5 forks

Is bert_encode() thread-safe for online embedding? #11

Open WayneCao opened 8 months ago

WayneCao commented 8 months ago

I found that different invocations share the same memory buffer in bert_context, so it may not be thread-safe in an online embedding situation.

iamlemec commented 8 months ago

Yup, that seems right. Good news is that we got merged into llama.cpp, which has multi-threading support. Check it out over there!

WayneCao commented 8 months ago

> Yup, that seems right. Good news is that we got merged into llama.cpp, which has multi-threading support. Check it out over there!

Can you help explain the implementation mechanism?

iamlemec commented 8 months ago

Sure! The major difference from this one is the way that batching works. Here we have explicit batch sizes for each sequence, and so we need to pad them to alignment. In the llama.cpp implementation, batches are essentially lists of (sequence_id, position, token_id) pairs, so you can put multiple sequences in one batch without padding, which can be really good for uneven length settings. The bulk of the new code there is in llama.cpp:build_bert() if you want to go into more detail.
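
Roughly, packing two uneven-length sequences into one batch looks something like this (just a sketch against the llama.h batch API; exact field names and signatures may differ depending on the version you're on):

```cpp
#include "llama.h"
#include <vector>

// Append one sequence to the batch at its own positions, tagged with its seq_id.
static void add_sequence(llama_batch & batch, const std::vector<llama_token> & tokens, llama_seq_id seq_id) {
    for (size_t i = 0; i < tokens.size(); ++i) {
        const int j = batch.n_tokens;
        batch.token[j]     = tokens[i];
        batch.pos[j]       = (llama_pos) i;  // position within its own sequence
        batch.n_seq_id[j]  = 1;
        batch.seq_id[j][0] = seq_id;
        batch.logits[j]    = true;
        batch.n_tokens++;
    }
}

int main() {
    // two sequences of uneven length -- no padding between them
    std::vector<llama_token> seq0 = { 101, 2023, 2003, 102 };        // 4 tokens
    std::vector<llama_token> seq1 = { 101, 1037, 2936, 6251, 102 };  // 5 tokens

    llama_batch batch = llama_batch_init(/*n_tokens=*/512, /*embd=*/0, /*n_seq_max=*/2);
    add_sequence(batch, seq0, /*seq_id=*/0);
    add_sequence(batch, seq1, /*seq_id=*/1);

    // ... llama_decode(ctx, batch), then read the pooled embedding per sequence
    // with llama_get_embeddings_seq(ctx, 0) and llama_get_embeddings_seq(ctx, 1)

    llama_batch_free(batch);
    return 0;
}
```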

Is that what you were looking for? Happy to provide more specifics.

WayneCao commented 8 months ago

> Sure! The major difference from this one is the way that batching works. Here we have explicit batch sizes for each sequence, and so we need to pad them to alignment. In the llama.cpp implementation, batches are essentially lists of (sequence_id, position, token_id) pairs, so you can put multiple sequences in one batch without padding, which can be really good for uneven length settings. The bulk of the new code there is in llama.cpp:build_bert() if you want to go into more detail.

Thank you so much! It seems this only supports multi-threading within a single batch? Let me briefly state my question: I want to wrap llama.cpp into an online embedding service. When concurrent client requests come in, llama.cpp:build_bert() does not look thread-safe across different invocations. I haven't figured out how to guarantee memory safety in llama_context across invocations, and I haven't found any read-write lock around build_bert.
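
Would something like the following work, where every request serializes access to one shared llama_context with a mutex? Just a rough sketch of what I have in mind (llama_decode and llama_get_embeddings_seq are the calls I see in llama.h; an alternative would be one context per worker thread):

```cpp
#include "llama.h"
#include <mutex>
#include <vector>

struct embedding_service {
    llama_context * ctx;   // single shared context owned by the service
    std::mutex ctx_mutex;  // guards every call that touches ctx

    // Each client request goes through here, so only one decode runs at a time.
    std::vector<float> embed(llama_batch & batch, llama_seq_id seq_id, int n_embd) {
        std::lock_guard<std::mutex> lock(ctx_mutex);
        if (llama_decode(ctx, batch) != 0) {
            return {};
        }
        const float * emb = llama_get_embeddings_seq(ctx, seq_id);
        return std::vector<float>(emb, emb + n_embd);
    }
};
```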