huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Need support for Sentence Similarity Pipeline #22923

Open timxieICN opened 1 year ago

timxieICN commented 1 year ago

Feature request

HuggingFace now hosts a lot of Sentence Similarity models, but the pipelines API does not yet support this task: https://huggingface.co/docs/transformers/main_classes/pipelines

Motivation

HuggingFace now hosts a lot of Sentence Similarity models, but the pipelines API does not yet support this task: https://huggingface.co/docs/transformers/main_classes/pipelines

Your contribution

I can write a PR, but I might need someone else's help.

amyeroberts commented 1 year ago

cc @Narsil

Narsil commented 1 year ago

Hi @timxieICN ,

Thanks for the suggestion. In general, sentence-similarity models like https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 are served by SentenceTransformers, which is a library built on top of transformers itself.

https://huggingface.co/sentence-transformers

Sentence Transformers adds a bit of configuration specifying how to do similarity with a given model, as there are several ways to do it.

From a user's point of view, it should be relatively easy to do this:

from sentence_transformers import SentenceTransformer, util

# Any sentence-similarity checkpoint works here, e.g. the model linked above
model_id = "sentence-transformers/all-MiniLM-L6-v2"
model = SentenceTransformer(model_id)

# `inputs` mirrors the Inference API payload for this task: one source sentence
# compared against a list of candidate sentences (example values for illustration)
inputs = {
    "source_sentence": "That is a happy person",
    "sentences": ["That is a happy dog", "Today is a sunny day"],
}

embeddings1 = model.encode(inputs["source_sentence"], convert_to_tensor=True)
embeddings2 = model.encode(inputs["sentences"], convert_to_tensor=True)
similarities = util.pytorch_cos_sim(embeddings1, embeddings2)

This is exactly the code currently running on the Hub to compute those similarities: https://github.com/huggingface/api-inference-community/blob/main/docker_images/sentence_transformers/app/pipelines/sentence_similarity.py

Adding this directly to transformers would basically mean incorporating sentence-transformers within transformers, and I'm not sure that's desirable. Maybe @amyeroberts or another core maintainer can confirm or deny this.

Does this help?

amyeroberts commented 1 year ago

We definitely don't want a circular dependency like that!

As the example you shared is so simple, @Narsil, I think it's a good replacement for a pipeline. Let's leave this issue open, and if there's a lot of interest or a new use case, we can consider other possible options.

viethoang303 commented 1 year ago

Hi @Narsil, that example uses the Sentence Transformers API, but I want to compute sentence similarity with a T5 model. How can I do that?

Thank you
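
One possible way to do this with plain transformers, as a minimal sketch rather than an official answer from the thread (the checkpoint name, the mean-pooling strategy, and the example sentences below are assumptions), is to run the T5 encoder, mean-pool its hidden states into sentence vectors, and compare them with cosine similarity:

import torch
from transformers import AutoTokenizer, T5EncoderModel

model_name = "t5-base"  # illustrative checkpoint; any T5 variant works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5EncoderModel.from_pretrained(model_name)
model.eval()

def embed(sentences):
    # Tokenize with padding so sentences of different lengths can be batched
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden)
    # Mean-pool over tokens, ignoring padding positions
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

source = embed(["How do I bake bread?"])
candidates = embed(["Instructions for baking bread", "The weather is nice today"])
print(torch.nn.functional.cosine_similarity(source, candidates))

There are also T5-based checkpoints on the Hub (e.g. the sentence-t5 models under the sentence-transformers organization) that can be loaded directly with SentenceTransformer.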

wilmeragsgh commented 1 year ago

I think that measuring the distance between embeddings produced by any embedding-generation model would indeed be desirable. I'm open to trying to help if you want to do that.
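
As a rough sketch of that idea (the model name and the mean pooling here are assumptions, not anything agreed on in the thread), the existing feature-extraction pipeline already exposes per-token embeddings for any encoder model, so a generic similarity helper would only need a pooling step and a cosine similarity on top of it:

import torch
from transformers import pipeline

# Any encoder checkpoint could go here; this one is just the model linked earlier in the thread
extractor = pipeline("feature-extraction", model="sentence-transformers/all-MiniLM-L6-v2")

def sentence_embedding(text):
    # The pipeline returns token embeddings shaped (1, seq_len, hidden) as nested lists;
    # mean-pool over the token dimension to get one vector per sentence
    tokens = torch.tensor(extractor(text)[0])
    return tokens.mean(dim=0)

emb1 = sentence_embedding("That is a happy person")
emb2 = sentence_embedding("That is a very happy person")
print(torch.nn.functional.cosine_similarity(emb1, emb2, dim=0).item())

Picking the right pooling and normalization per model is exactly the extra configuration that Sentence Transformers encodes, which is the point @Narsil raised above.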