huggingface / text-embeddings-inference

A blazing fast inference solution for text embeddings models
https://huggingface.co/docs/text-embeddings-inference/quick_tour
Apache License 2.0

Allow fetching token embeddings from a cross-encoding #407

Open dan-octo opened 2 months ago

dan-octo commented 2 months ago

Feature request

It would be nice to allow fetching the token embeddings from a cross-encoding, which is necessary to implement systems such as retrieval augmented named entity recognition (RA-NER).

Ideally, it would be implemented as an endpoint akin to the existing /embed_all endpoint, but taking an additional argument that plays the role of the text_pair argument of Transformers tokenizers.

In addition to the token embeddings, this new endpoint would return token_type_ids, so as to be able to distinguish which token embeddings represent tokens from which sequence (text or text_pair, in the parlance of Transformers tokenizers).

Additionally, I believe this would help round out the API, as this functionality is available in the transformers library but unavailable here.

A minimal working example (MWE) of calling the proposed endpoint is as follows:

import asyncio

import aiohttp


async def main():
    payload = {
        "inputs": ["This is a query.", "This is a second query."],
        "inputs_pair": ["This is a doc for query 1.", "This is a doc for query 2."],
    }
    # Use the session as a context manager so the connection pool is closed cleanly.
    async with aiohttp.ClientSession() as session:
        async with session.post(
            "http://127.0.0.1:8080/embed_all_cross_encoding",
            headers={"Content-Type": "application/json"},
            json=payload,
        ) as response:
            data = await response.json()

    token_embeddings = data["token_embeddings"]
    token_type_ids = data["token_type_ids"]


if __name__ == "__main__":
    asyncio.run(main())

where token_embeddings is of shape batch_size * sequence_length * n_dims, and token_type_ids is of shape batch_size * sequence_length.
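Assuming a response of those shapes, a client could use token_type_ids to split each sequence's token embeddings back into its text and text_pair segments. A minimal sketch (the response fields and the 0/1 segment convention are the proposed ones, not an existing TEI API):

```python
def split_by_segment(token_embeddings, token_type_ids):
    """For each item in the batch, group token embeddings by their
    token_type_id (0 = tokens from text, 1 = tokens from text_pair)."""
    results = []
    for embeddings, type_ids in zip(token_embeddings, token_type_ids):
        text_tokens = [e for e, t in zip(embeddings, type_ids) if t == 0]
        pair_tokens = [e for e, t in zip(embeddings, type_ids) if t == 1]
        results.append({"text": text_tokens, "text_pair": pair_tokens})
    return results


# Toy example: batch_size 1, sequence_length 4, n_dims 2.
embeds = [[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]]
type_ids = [[0, 0, 1, 1]]
segments = split_by_segment(embeds, type_ids)
```

This is exactly the kind of per-segment bookkeeping a downstream RA-NER system would need, which is why returning token_type_ids alongside the embeddings matters.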

Motivation

Fetching token embeddings from a cross-encoding serves two purposes:

(i) Enables implementation of systems such as RA-NER.

(ii) Helps round out the API by bringing in functionality that is available in the Transformers library but not yet available here.

Your contribution

I could contribute to examples and/or documentation. Thank you!

aliozts commented 1 month ago

This would be great to have!