Thanks for the proposal, sounds like a useful feature. In essence, we are talking about 2-3 "pre-prompt" templates, where certain tokens are added for "query" or "document".
I'll propose this to the upstream libraries, i.e. sentence-transformers or hf/tei. It would be cool if authors of models on Hugging Face could specify their specific examples in config.json.
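For illustration, a minimal client-side sketch of what such templates could look like; the prefix strings below are placeholders in the style of common retrieval models, not values read from any real config.json:

```python
# Hypothetical "pre-prompt" templates for a single model; the strings are
# illustrative placeholders only, since every model defines its own prefixes.
PROMPT_TEMPLATES = {
    "query": "Represent this sentence for searching relevant passages: ",
    "document": "",  # many models embed documents without any prefix
}

def apply_template(text: str, kind: str = "query") -> str:
    """Prepend the model-specific instruction before embedding."""
    return PROMPT_TEMPLATES[kind] + text

print(apply_template("what is a neural network?"))
```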
Thanks for the reply! Actually, I'm not sure this library needs to provide templates, but perhaps the API schema could be amended to allow an "instruction" parameter for those models which expect to receive it along with the query/text to embed.
The templates would be a matter for the client-side app, although some examples could be useful to include in the documentation.
An important question is whether embedding models simply concatenate the instruction + query to generate the embedding. This is what the paper describing the INSTRUCTOR architecture suggests their model does; see https://arxiv.org/pdf/2212.09741.pdf, Section 2.1: "Given an input text x and a task instruction Ix, INSTRUCTOR encodes their concatenation Ix ⊕ x. We then generate a fixed-sized, task-specific embedding EI(Ix, x) by applying mean pooling to the last hidden representations over the tokens in x".
If yes, then there's technically no need to pass the instruction and query as separate parameters: they can simply be concatenated before they hit the model API. It just would be a little less transparent and require a bit more coding on the client/front-end side.
However, if the embedding model does more than simply concatenate, then there is a risk that an embedding created by concatenation alone will not perform as well. Note that even INSTRUCTOR is not pure concatenation: per the quote above, mean pooling is applied only over the tokens of x, not over the instruction tokens. Accordingly, I think it would be more future-proof and transparent to include the option to pass the instruction as a parameter for those models which are designed for it. A sketch of both options follows.
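For concreteness, here is a sketch of the two options against a hypothetical embeddings endpoint; the URL and the "instruction" field are made up for illustration, and only the concatenation variant works with a plain OpenAI-style schema:

```python
import requests

API_URL = "http://localhost:7997/embeddings"  # placeholder endpoint
instruction = "Represent the question for retrieving supporting documents: "
query = "what is a neural network?"

# Option 1: client-side concatenation (works today, but less transparent).
resp = requests.post(API_URL, json={
    "model": "hkunlp/instructor-large",
    "input": [instruction + query],
})

# Option 2 (hypothetical schema extension): pass the instruction separately,
# so the server can handle models that treat instruction tokens specially,
# e.g. by excluding them from pooling. "instruction" is not an existing field.
resp = requests.post(API_URL, json={
    "model": "hkunlp/instructor-large",
    "input": [query],
    "instruction": instruction,
})
```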
I think that I found a way to have it support instructions. Please contact me if you're interested.
Hi, it is possible by simply concatenating the instruction + query before embedding. Looking to have it supported at the library/API level. I know there are a couple of folks working on HF transformers looking at some changes to support this. What kind of solution do you have in mind?
I just may have solved the issue of Instructor models not being usable with anything above sentence-transformers==2.2.2...I mean, while using float16. @michaelfeil you might want to incorporate this into Infinity, since you only support certain sentence-transformers models and stuff (e.g. gtr) and a few others (e.g. gte). Here's the solution I came up with, which I haven't had a chance to test extensively yet:

1) Put this script in the same directory as the other scripts in your program. I've named it custom_instructor.py:
```python
import torch
from sentence_transformers import SentenceTransformer, util


class CustomInstructor(SentenceTransformer):
    def __init__(self, model_name_or_path, device='cuda', dtype=torch.float16):
        super().__init__(model_name_or_path, device=device)
        # Convert the loaded weights to float16 when running on CUDA.
        if torch.cuda.is_available() and device == 'cuda' and dtype == torch.float16:
            self.half()
            print("Model converted to float16 precision.")

    def encode(self, sentences, batch_size=32, show_progress_bar=False,
               convert_to_numpy=True, normalize_embeddings=True, prompt=None, **kwargs):
        self.eval()
        # Instructor-style models expect the instruction prepended to each input;
        # without this step a `prompt` passed via encode_kwargs would be silently dropped.
        if prompt is not None:
            sentences = [prompt + sentence for sentence in sentences]
        all_embeddings = []
        for start_index in range(0, len(sentences), batch_size):
            sentences_batch = sentences[start_index:start_index + batch_size]
            features = self.tokenize(sentences_batch)
            features = util.batch_to_device(features, self.device)
            with torch.no_grad():
                out_features = self.forward(features)
            embeddings = out_features['sentence_embedding']
            if normalize_embeddings:
                embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
            all_embeddings.append(embeddings)
        all_embeddings = torch.cat(all_embeddings, 0)
        if convert_to_numpy:
            all_embeddings = all_embeddings.cpu().numpy()
        return all_embeddings
```
2) Assuming you've already created a virtual environment, installed the dependencies, etc., run a script such as this:
```python
import time

from langchain.text_splitter import RecursiveCharacterTextSplitter

from custom_instructor import CustomInstructor


def extract_text():
    file_path = r"PATH TO A .TXT FILE ON MY COMPUTER USED TO TEST GETTING/SPLITTING TEXT FROM"
    with open(file_path, 'r', encoding="utf-8") as file:
        text = file.read()
    print(f"The number of characters in the text: {len(text)}")
    return text


def split_text(text):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,
        chunk_overlap=200,
    )
    text_chunks = text_splitter.split_text(text)
    print(f"Number of chunks created: {len(text_chunks)}")
    return text_chunks


def generate_embeddings(text_chunks):
    model_name = "hkunlp/instructor-large"
    model_kwargs = {'device': 'cuda'}
    prompt = "Represent the technological sentence for retrieval."
    encode_kwargs = {'normalize_embeddings': True, 'prompt': prompt}

    model = CustomInstructor(model_name, **model_kwargs)

    start_time = time.time()
    embeddings = model.encode(text_chunks, **encode_kwargs)
    elapsed_time = time.time() - start_time
    print(f"Embedding generation took {elapsed_time:.2f} seconds.")

    text_embeddings = [(text_chunks[i], embeddings[i].tolist()) for i in range(len(embeddings))]
    # Uncomment to print the first text chunk and/or embedding to see their characteristics:
    # if text_embeddings:
    #     print(f"Text Chunk 1: {text_embeddings[0][0]}")
    #     print(f"Embedding 1: {text_embeddings[0][1]}\n")
    return text_embeddings


if __name__ == "__main__":
    text = extract_text()
    text_chunks = split_text(text)
    text_embeddings = generate_embeddings(text_chunks)
```
NOTICE how it imports the `CustomInstructor` class from our custom script. I'm new to this...but this class basically "inherits" from the `SentenceTransformer` class within the sentence-transformers Python library. This means that it can use the functionality provided by the `SentenceTransformer` class but also modify it. The `SentenceTransformer` class lives in a script named SentenceTransformer.py within the sentence-transformers library.
However, PLEASE NOTE that my script is ONLY compatible with the version of SentenceTransformer.py that was included in their release v2.5.0. This comports with @michaelfeil's Infinity only supporting up to v2.5.0 currently.
Moving on...in my script the `encode` method builds upon the `encode` method of the inherited `SentenceTransformer` class, whose signature allows specifying additional `**kwargs`.
NOTICE how the second script invokes the `CustomInstructor` class when loading the embedding model, which then uses my `encode` method, which builds on the `encode` method from the `SentenceTransformer` class. Basically, the `CustomInstructor` class receives `model_kwargs` and `encode_kwargs` and passes `encode_kwargs` as additional kwargs to the `encode` method of the `SentenceTransformer` class, which it can do because it "inherits" from it. NOTE, this was not possible before the fine sentence-transformers folks modified SentenceTransformer.py to accept prompts and stuff...you can see that they recently added this parameter to "their" `encode` method.
For example, in November 2022 the `prompt` parameter did not exist.
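To sketch what that looks like now (assuming a sentence-transformers version with prompt support in `encode`; the instruction string is just an example):

```python
from sentence_transformers import SentenceTransformer

# Plain SentenceTransformer usage with the newer per-call prompt support.
model = SentenceTransformer("hkunlp/instructor-large", device="cuda")

embeddings = model.encode(
    ["what is a neural network?"],
    prompt="Represent the question for retrieving supporting documents: ",
    normalize_embeddings=True,
)
print(embeddings.shape)
```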
The way that the Instructor models handled this previously was to create a custom script that basically did everything I just described all by itself...That script is located here and hasn't been updated in approximately a year:
https://github.com/xlang-ai/instructor-embedding/blob/main/InstructorEmbedding/instructor.py
BTW, instructor.py is literally the only script that's installed when you pip install...
To summarize...our "prompt" (as defined in `encode_kwargs`) is sent to our `CustomInstructor` class, which forwards it to the `SentenceTransformer` class created by the sentence-transformers people, which then uses it.
SO WHY NOT USE `SentenceTransformer` directly, since, after all, they've now added prompt and instruction functionality? You actually can, as seen here, and they even specifically support Instructor models now:
https://www.sbert.net/docs/pretrained_models.html#instructor-models
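For instance, something along these lines should work with a recent sentence-transformers release; the prompt strings follow the INSTRUCTOR "Represent the..." convention and are only examples:

```python
from sentence_transformers import SentenceTransformer

# Named prompts are configured up front; encode selects one via prompt_name.
model = SentenceTransformer(
    "hkunlp/instructor-large",
    prompts={
        "query": "Represent the question for retrieving supporting documents: ",
        "document": "Represent the document for retrieval: ",
    },
)

query_emb = model.encode(["what is a neural network?"], prompt_name="query")
doc_emb = model.encode(["A neural network is a machine learning model."], prompt_name="document")
```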
The sole reason I can discern is that the sentence-transformers library does not support float16, doesn't seem close to supporting @michaelfeil's other quantization techniques, doesn't seem intent on supporting BetterTransformer, etc. So I'm guessing that's why @michaelfeil names his relevant class `SentenceTransformerPatched`. He's essentially patching their `SentenceTransformer` class to include functionalities that are long overdue. `SentenceTransformer` hadn't been updated in years...and just recently there's been a flurry of activity.
Additionally, you might ask why not just use the `HuggingFaceEmbeddings` class from langchain_community, which, after all, uses the `SentenceTransformer` class under the hood? Again, it's my novice understanding that it has not yet implemented float16 capabilities like @michaelfeil has, "BetterTransformer", etc., so...HENCE THE NEED FOR A CUSTOM CLASS CURRENTLY.
Anyways, please note that I just created this script yesterday and haven't tested it thoroughly, but it worked for me.
Not to belabor the point, but it does appear that the sentence-transformers people have basically incorporated your idea as far as prompt formatting goes...
https://www.sbert.net/docs/package_reference/SentenceTransformer.html
But again, this does not address the float16 issue, which is a common technique nowadays, nor the other optimizations that @michaelfeil is implementing with Infinity.
@BBC-Esq Yeah, Tom Aarsen added this - I am happy about the maintenance of SentenceTransformers :)
@BBC-Esq If you want to ship Instructor with Infinity, please use the trust_remote_code=True feature in Hugging Face Transformers. You can reimplement any code, e.g. like in this repo from Jina: https://huggingface.co/jinaai/jina-bert-implementation
I cannot install a pip package per model in the main branch of infinity - this would slow down infinity's development speed.
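A minimal sketch of that pattern; the repo id below is a placeholder for any model repository that ships its own modeling code (such as the Jina one above):

```python
from transformers import AutoModel, AutoTokenizer

# trust_remote_code=True lets transformers load the custom modeling code
# shipped inside the model repository instead of a built-in architecture.
repo_id = "some-org/model-with-custom-code"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
```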
infinity is compatible with SentenceTransformers 2.5.0 and up!
I added `SentenceTransformersPatched` so I can run encode, preprocess, and postprocess asynchronously in different threads.
See: https://github.com/UKPLab/sentence-transformers/issues/2362
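To illustrate the idea only (a generic sketch, not Infinity's actual implementation): splitting preprocess, forward, and postprocess into separate stages lets CPU-bound work for one batch overlap with inference for another, e.g.:

```python
import asyncio

# Generic three-stage pipeline: each stage runs in a worker thread, so stages
# of different batches can overlap. The "forward" here is a stand-in for a
# real model call; this is not Infinity's code.
def preprocess(batch):
    return [s.lower() for s in batch]          # e.g. tokenization (CPU-bound)

def forward(features):
    return [hash(f) % 1000 for f in features]  # e.g. model inference (GPU-bound in reality)

def postprocess(raw):
    return [r / 1000 for r in raw]             # e.g. normalization / numpy conversion

async def embed(batch):
    features = await asyncio.to_thread(preprocess, batch)
    raw = await asyncio.to_thread(forward, features)
    return await asyncio.to_thread(postprocess, raw)

async def main():
    results = await asyncio.gather(embed(["Hello world"]), embed(["Another batch"]))
    print(results)

asyncio.run(main())
```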
Just because I don't have a lot of time to review right now...what is the Jina link for? For example, is it basically their custom code to use the `SentenceTransformer` class from sentence-transformers, or something?
Closing in favor of https://github.com/UKPLab/sentence-transformers/issues/2439
Embedding models like bge_small/large and instructor_xl/base are designed to be accompanied by instructions along with the text to embed (especially for RAG use cases). If the embedding API currently does not support this functionality, it would be great to add it; or if it is already supported, it would be good to clarify it with an example. Thanks!