pseudotensor opened this issue 10 months ago
https://github.com/ELS-RD/transformer-deploy#feature-extraction--dense-embeddings
https://github.com/amansrivastava17/embedding-as-service
https://github.com/go-skynet/LocalAI (https://github.com/go-skynet/LocalAI/blob/master/tests/models_fixtures/grpc.yaml)

Some of these are just hosting solutions, while others focus on inference speed.
https://github.com/ivanpanshin/flask_gunicorn_nginx_docker
https://python.langchain.com/docs/integrations/text_embedding/self-hosted
https://github.com/xorbitsai/inference
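Several of these (LocalAI, xinference) expose an OpenAI-compatible `/v1/embeddings` endpoint, so a client can stay server-agnostic. A minimal sketch, assuming a locally running server; the base URL and model name are placeholders, not values from this thread:

```python
# Minimal client sketch for an OpenAI-compatible /v1/embeddings endpoint
# (LocalAI and xinference both offer this style of API).
import requests

BASE_URL = "http://localhost:8080/v1"  # assumption: server running locally
MODEL = "sentence-transformers/msmarco-distilbert-cos-v5"  # assumption: model loaded on the server

def embed(texts):
    resp = requests.post(
        f"{BASE_URL}/embeddings",
        json={"model": MODEL, "input": texts},
        timeout=60,
    )
    resp.raise_for_status()
    # OpenAI-style response: {"data": [{"embedding": [...]}, ...]}
    return [item["embedding"] for item in resp.json()["data"]]

if __name__ == "__main__":
    vectors = embed(["hello world", "embeddings as a service"])
    print(len(vectors), len(vectors[0]))
```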
@pseudotensor I checked https://github.com/ELS-RD/transformer-deploy#feature-extraction--dense-embeddings:
docker run -it --rm --gpus all \
  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \
  bash -c "cd /project && \
    pip3 install \".[GPU]\" -f https://download.pytorch.org/whl/cu116/torch_stable.html --extra-index-url https://pypi.ngc.nvidia.com --no-cache-dir && \
    convert_model -m \"sentence-transformers/msmarco-distilbert-cos-v5\" \
      --backend tensorrt onnx \
      --task embedding \
      --seq-len 16 128 128"
After that I'm getting:
[01/09/2024-13:11:01] [TRT] [E] 3: [builderConfig.cpp::validatePool::313] Error Code 3: API Usage Error (Parameter check failed at: optimizer/api/builderConfig.cpp::validatePool::313, condition: false. Setting DLA memory pool size on TensorRT build with DLA disabled.
)
[01/09/2024-13:11:01] [TRT] [W] onnx2trt_utils.cpp:369: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[01/09/2024-13:11:01] [TRT] [W] building engine. depending on model size this may take a while
[01/09/2024-13:11:02] [TRT] [E] 2: [optimizer.cpp::getFormatRequirements::2945] Error Code 2: Internal Error (Assertion !n->candidateRequirements.empty() failed. no supported formats)
[01/09/2024-13:11:02] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::636] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
Traceback (most recent call last):
  File "/usr/local/bin/convert_model", line 8, in <module>
    sys.exit(entrypoint())
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 494, in entrypoint
    main(commands=args)
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 311, in main
    engine: ICudaEngine = build_engine(
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/backends/trt_utils.py", line 206, in build_engine
    engine: ICudaEngine = runtime.deserialize_cuda_engine(trt_engine)
TypeError: deserialize_cuda_engine(): incompatible function arguments. The following argument types are supported:
    1. (self: tensorrt.tensorrt.Runtime, serialized_engine: buffer) -> tensorrt.tensorrt.ICudaEngine

Invoked with: <tensorrt.tensorrt.Runtime object at 0x7f6c7de46170>, None
free(): invalid pointer
Overall, not a good first impression.
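The TypeError at the end is a downstream symptom: the TensorRT engine build fails (buildSerializedNetwork returns null), so None gets passed to deserialize_cuda_engine. If TensorRT keeps failing, one fallback is to run the plain ONNX export with onnxruntime and pool the token embeddings manually. A sketch only, assuming the ONNX export step of convert_model succeeded; the output path and output tensor layout below are assumptions:

```python
# Fallback sketch: serve the exported ONNX directly with onnxruntime.
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

MODEL_NAME = "sentence-transformers/msmarco-distilbert-cos-v5"
ONNX_PATH = "model.onnx"  # assumption: adjust to wherever convert_model wrote the ONNX file

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
session = ort.InferenceSession(
    ONNX_PATH, providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)

def embed(texts):
    enc = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="np")
    # DistilBERT exports usually take input_ids and attention_mask; other models may differ.
    inputs = {k: v.astype(np.int64) for k, v in enc.items() if k in {"input_ids", "attention_mask"}}
    last_hidden = session.run(None, inputs)[0]  # assumption: first output is (batch, seq, hidden)
    mask = enc["attention_mask"][..., None].astype(np.float32)
    # Mean pooling over non-padding tokens, then L2 normalization (cos-v5 models expect normalized vectors).
    pooled = (last_hidden * mask).sum(axis=1) / np.clip(mask.sum(axis=1), 1e-9, None)
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

print(embed(["hello world"]).shape)
```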
gunicorn: https://medium.com/huggingface/scaling-a-massive-state-of-the-art-deep-learning-model-in-production-8277c5652d5f (a minimal sketch of this approach follows below)
HF-supported server: https://localai.io/features/embeddings/index.html
Others:
https://python.langchain.com/docs/integrations/text_embedding/xinference
https://python.langchain.com/docs/integrations/text_embedding/localai
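For the gunicorn/Flask route mentioned above (the same pattern flask_gunicorn_nginx_docker packages), here is a minimal sketch of an embeddings endpoint. The route name, model, and port are assumptions for illustration:

```python
# app.py: minimal embeddings endpoint in the Flask/gunicorn style linked above.
from flask import Flask, jsonify, request
from sentence_transformers import SentenceTransformer

app = Flask(__name__)
# assumption: same model as in the convert_model example above
model = SentenceTransformer("sentence-transformers/msmarco-distilbert-cos-v5")

@app.route("/embed", methods=["POST"])
def embed():
    texts = request.get_json(force=True)["texts"]
    vectors = model.encode(texts, normalize_embeddings=True)
    return jsonify({"embeddings": vectors.tolist()})

# Run with, e.g.:  gunicorn -w 1 -b 0.0.0.0:8000 app:app
```

Note that each gunicorn worker is a separate process and loads its own copy of the model, so worker count has to be balanced against GPU/CPU memory.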