Release Note

This release contains 1 new feature, 1 performance improvement, 2 bug fixes and 4 documentation improvements.

🆕 Features

Allow custom callback in clip_client (#849)

This feature allows clip-client users to send a request to a server and then process the response with custom callback functions. Users can supply custom functions for three callbacks: on_done, on_error and on_always.

The following code snippet shows how to send a request to a server and save the response to a database.
from clip_client import Client

db = {}


def my_on_done(resp):
    # Called on success: store each returned document in the database.
    for doc in resp.docs:
        db[doc.id] = doc


def my_on_error(resp):
    # Called on failure: append the error response to a log file.
    with open('error.log', 'a') as f:
        f.write(str(resp))


def my_on_always(resp):
    # Called after every request, whether it succeeded or failed.
    print(f'{len(resp.docs)} docs processed')


c = Client('grpc://0.0.0.0:12345')
c.encode(
    ['hello', 'world'], on_done=my_on_done, on_error=my_on_error, on_always=my_on_always
)
For more details, please refer to the CLIP client documentation.

🚀 Performance

Integrate flash attention (#853)

We have integrated the flash attention module as a faster replacement for nn.MultiheadAttention. To take advantage of this feature, you will need to install the flash attention module manually:
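One way to do this, assuming the flash attention module in question is the HazyResearch flash-attention package (an assumption on our part), is to install it from source:

pip install git+https://github.com/HazyResearch/flash-attention.git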
If flash attention is present, clip_server will automatically try to use it.
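As a quick sanity check, you can verify whether flash attention is importable in the server's environment. This is a minimal sketch assuming the package is importable as flash_attn; clip_server's actual detection logic may differ:

# Minimal sketch: check whether the flash attention module is importable.
# Assumes the package name `flash_attn` (an assumption, not confirmed by the release).
try:
    import flash_attn  # noqa: F401
    print('flash attention is available')
except ImportError:
    print('flash attention not found; nn.MultiheadAttention will be used instead')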
The table below compares CLIP performance with and without the flash attention module. We conducted all tests on a Tesla T4 GPU, timing how long it took to encode a batch of documents 100 times.
| Model    | Input data | Input shape       | w/o flash attention | w/ flash attention | Speedup |
|----------|------------|-------------------|---------------------|--------------------|---------|
| ViT-B-32 | text       | (1, 77)           | 0.42692             | 0.37867            | 1.1274  |
| ViT-B-32 | text       | (8, 77)           | 0.48738             | 0.45324            | 1.0753  |
| ViT-B-32 | text       | (16, 77)          | 0.4764              | 0.44315            | 1.07502 |
| ViT-B-32 | image      | (1, 3, 224, 224)  | 0.4349              | 0.40392            | 1.0767  |
| ViT-B-32 | image      | (8, 3, 224, 224)  | 0.47367             | 0.45316            | 1.04527 |
| ViT-B-32 | image      | (16, 3, 224, 224) | 0.51586             | 0.50555            | 1.0204  |
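The Speedup column is simply the ratio of the two timings. A quick check of the first row:

# Speedup = time without flash attention / time with flash attention.
# Values taken from the first row of the table (ViT-B-32, text, input shape (1, 77)).
t_without = 0.42692
t_with = 0.37867
print(f'speedup: {t_without / t_with:.4f}')  # speedup: 1.1274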
Based on our experiments, performance improvements vary depending on the model and GPU, but in general, the flash attention module improves performance.
🐞 Bug Fixes
Increase timeout at startup for Executor docker images (#854)
During Executor initialization, downloading model parameters can take a long time. If a model is very large and downloads slowly, the Executor may hit the timeout and fail before it even starts. We have increased the timeout to 3,000,000 ms (50 minutes).
Install transformers for Executor docker images (#851)
We have added the transformers package to Executor docker images, in order to support the multilingual CLIP model.
📗 Documentation Improvements
🤟 Contributors
We would like to thank all contributors to this release: