h2oai / h2ogpt

Private chat with a local GPT over documents, images, video, and more. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://codellama.h2o.ai/
http://h2o.ai
Apache License 2.0
10.96k stars 1.2k forks

endpoint for embeddings #814

Open pseudotensor opened 10 months ago

pseudotensor commented 10 months ago

gunicorn: https://medium.com/huggingface/scaling-a-massive-state-of-the-art-deep-learning-model-in-production-8277c5652d5f

We used [falcon](https://falconframework.org/) for the web servers (any other HTTP framework would have worked too) in conjunction with [gunicorn](https://gunicorn.org/) to run our instances and balance the load. Our own [GPT-2 Pytorch implementation](https://github.com/huggingface/pytorch-pretrained-BERT) is the backbone of this project. We have a few examples in our [examples directory](https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples) if you’re interested in doing something similar.

Gunicorn sets up “workers” which will independently run the application, efficiently balancing the load across different workers. You can check exactly how they work on the [official gunicorn documentation](http://docs.gunicorn.org/en/stable/design.html).
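As a rough sketch of that setup: gunicorn serves any WSGI callable, so a minimal embeddings app (plain WSGI here instead of falcon, purely for illustration; the dummy response stands in for a real model) looks like this:

```python
# app.py -- minimal WSGI application, a stand-in for the falcon app in the article
import json

def application(environ, start_response):
    # A dummy "embedding" payload; a real server would run the model here.
    body = json.dumps({"embedding": [0.0, 0.0, 0.0]}).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]
```

Served with e.g. `gunicorn -w 4 app:application`, gunicorn forks 4 workers that each load the model independently and share the listening socket, which is how the load balancing described above happens.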

HF-supported server: https://localai.io/features/embeddings/index.html

Others:

https://python.langchain.com/docs/integrations/text_embedding/xinference

https://python.langchain.com/docs/integrations/text_embedding/localai
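LocalAI exposes an OpenAI-compatible `/v1/embeddings` route, so a client call is just an HTTP POST. A stdlib-only sketch (base URL and model name are placeholders, not values from this repo):

```python
import json
import urllib.request

def embeddings_request(texts, model, base_url="http://localhost:8080"):
    # Build an OpenAI-style /v1/embeddings request; URL and model are placeholders.
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Usage against a running server:
# resp = urllib.request.urlopen(embeddings_request(["hello world"], "bert-embeddings"))
# vectors = [d["embedding"] for d in json.load(resp)["data"]]
```

The same request shape works against any of the OpenAI-compatible servers linked above, which is what makes them easy to swap behind a single client.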

pseudotensor commented 8 months ago

https://github.com/ELS-RD/transformer-deploy#feature-extraction--dense-embeddings

https://github.com/amansrivastava17/embedding-as-service

https://github.com/go-skynet/LocalAI (https://github.com/go-skynet/LocalAI/blob/master/tests/models_fixtures/grpc.yaml)

Some of these only handle hosting, while others also focus on inference speed.

pseudotensor commented 8 months ago

https://github.com/ELS-RD/kernl

https://www.reddit.com/r/MachineLearning/comments/10xp54e/p_get_2x_faster_transcriptions_with_openai/

pseudotensor commented 8 months ago

https://github.com/ELS-RD/transformer-deploy#feature-extraction--dense-embeddings

https://github.com/amansrivastava17/embedding-as-service

https://github.com/ivanpanshin/flask_gunicorn_nginx_docker

https://python.langchain.com/docs/integrations/text_embedding/self-hosted

https://github.com/xorbitsai/inference

pseudotensor commented 6 months ago

https://github.com/huggingface/text-embeddings-inference#docker-images
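For reference, the TEI quick start from that README boils down to running the container and posting to its `/embed` route. A sketch (model id, image tag, and port are examples, not recommendations):

```shell
# Run text-embeddings-inference with an example model (names/tag/ports are examples)
model=BAAI/bge-small-en-v1.5
docker run --gpus all -p 8080:80 -v $PWD/data:/data \
  ghcr.io/huggingface/text-embeddings-inference:latest \
  --model-id $model

# Query the /embed route once the server is up
curl 127.0.0.1:8080/embed \
  -X POST -d '{"inputs":"What is Deep Learning?"}' \
  -H 'Content-Type: application/json'
```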

Far0n commented 6 months ago

@pseudotensor I checked https://github.com/ELS-RD/transformer-deploy#feature-extraction--dense-embeddings:

```
docker run -it --rm --gpus all \
  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \
  bash -c "cd /project && \
    pip3 install \".[GPU]\" -f https://download.pytorch.org/whl/cu116/torch_stable.html --extra-index-url https://pypi.ngc.nvidia.com --no-cache-dir && \
    convert_model -m \"sentence-transformers/msmarco-distilbert-cos-v5\" \
    --backend tensorrt onnx \
    --task embedding \
    --seq-len 16 128 128"
```

after that I'm getting:

```
[01/09/2024-13:11:01] [TRT] [E] 3: [builderConfig.cpp::validatePool::313] Error Code 3: API Usage Error (Parameter check failed at: optimizer/api/builderConfig.cpp::validatePool::313, condition: false. Setting DLA memory pool size on TensorRT build with DLA disabled.
)
[01/09/2024-13:11:01] [TRT] [W] onnx2trt_utils.cpp:369: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[01/09/2024-13:11:01] [TRT] [W] building engine. depending on model size this may take a while
[01/09/2024-13:11:02] [TRT] [E] 2: [optimizer.cpp::getFormatRequirements::2945] Error Code 2: Internal Error (Assertion !n->candidateRequirements.empty() failed. no supported formats)
[01/09/2024-13:11:02] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::636] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
Traceback (most recent call last):
  File "/usr/local/bin/convert_model", line 8, in <module>
    sys.exit(entrypoint())
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 494, in entrypoint
    main(commands=args)
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 311, in main
    engine: ICudaEngine = build_engine(
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/backends/trt_utils.py", line 206, in build_engine
    engine: ICudaEngine = runtime.deserialize_cuda_engine(trt_engine)
TypeError: deserialize_cuda_engine(): incompatible function arguments. The following argument types are supported:
    1. (self: tensorrt.tensorrt.Runtime, serialized_engine: buffer) -> tensorrt.tensorrt.ICudaEngine
    
Invoked with: <tensorrt.tensorrt.Runtime object at 0x7f6c7de46170>, None
free(): invalid pointer
```
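The final `TypeError` looks like a consequence of the earlier TensorRT build errors: the serialized engine came back as `None` and was passed straight into `deserialize_cuda_engine`, which only accepts a buffer. A defensive wrapper (hypothetical helper, not transformer-deploy code) would surface the real failure instead:

```python
def deserialize_checked(runtime, serialized_engine):
    # Fail with a clear message instead of passing None into the TensorRT bindings.
    if serialized_engine is None:
        raise RuntimeError(
            "TensorRT engine build failed (serialized engine is None); "
            "see the [TRT] [E] messages above for the root cause"
        )
    return runtime.deserialize_cuda_engine(serialized_engine)
```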

Overall, not a good first impression.