Thanks for the proposal, sounds like a useful feature. In essence, we are talking about 2-3 "pre-prompt" templates, where certain tokens are added for "query" or "document".
I'll propose this to the upstream libraries, i.e. sentence-transformers or hf/tei. It would be cool if authors of models on Hugging Face could specify their specific examples in config.json.
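For illustration, a minimal client-side sketch of what such templates could look like; the prefix strings below are placeholders in the style of common retrieval models, not values read from any real config.json:

```python
# Hypothetical "pre-prompt" templates for a single model; the strings are
# illustrative placeholders only, since every model defines its own prefixes.
PROMPT_TEMPLATES = {
    "query": "Represent this sentence for searching relevant passages: ",
    "document": "",  # many models embed documents without any prefix
}

def apply_template(text: str, kind: str = "query") -> str:
    """Prepend the model-specific instruction before embedding."""
    return PROMPT_TEMPLATES[kind] + text

print(apply_template("what is a neural network?"))
```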
Thanks for the reply! Actually, I'm not sure this library needs to provide templates, but perhaps the API schema could be amended to allow an "instruction" parameter for those models which expect to receive it along with the query/text to embed.
The templates would be a matter for the client-side app, although some examples could be useful to include in the documentation.
An important question is whether embedding models simply concatenate the instruction + query to generate the embedding. This is what the paper describing the INSTRUCTOR architecture suggests their model does; see https://arxiv.org/pdf/2212.09741.pdf, Section 2.1: "Given an input text x and a task instruction Ix, INSTRUCTOR encodes their concatenation Ix ⊕ x. We then generate a fixed-sized, task-specific embedding EI(Ix, x) by applying mean pooling to the last hidden representations over the tokens in x".
If yes, then there's technically no need to pass the instruction and query as separate parameters: they can simply be concatenated before they hit the model API. It just would be a little less transparent and require a bit more coding on the client/front-end side.
However, if the embedding model does more than simply concatenate, then there is a risk that an embedding created by concatenation alone will not perform as well. Note that even INSTRUCTOR is not pure concatenation: per the quote above, mean pooling is applied only over the tokens of x, not over the instruction tokens. Accordingly, I think it would be more future-proof and transparent to include the option to pass the instruction as a parameter for those models which are designed for it. A sketch of both options follows.
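For concreteness, here is a sketch of the two options against a hypothetical embeddings endpoint; the URL and the "instruction" field are made up for illustration, and only the concatenation variant works with a plain OpenAI-style schema:

```python
import requests

API_URL = "http://localhost:7997/embeddings"  # placeholder endpoint
instruction = "Represent the question for retrieving supporting documents: "
query = "what is a neural network?"

# Option 1: client-side concatenation (works today, but less transparent).
resp = requests.post(API_URL, json={
    "model": "hkunlp/instructor-large",
    "input": [instruction + query],
})

# Option 2 (hypothetical schema extension): pass the instruction separately,
# so the server can handle models that treat instruction tokens specially,
# e.g. by excluding them from pooling. "instruction" is not an existing field.
resp = requests.post(API_URL, json={
    "model": "hkunlp/instructor-large",
    "input": [query],
    "instruction": instruction,
})
```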
I think that I found a way to have it support instructions. Please contact me if you're interested.
Hi, it is possible by simply concatenating the instruction + query before embedding. Looking to have it supported at the library/API level. I know there are a couple of folks working on HF transformers looking at some changes to support this. What kind of solution do you have in mind?
I just may have solved the issue of Instructor models not being usable with anything above sentence-transformers==2.2.2...I mean, while using float16. @michaelfeil you might want to incorporate this into Infinity, since you only support certain sentence-transformers models and stuff (e.g. gtr) and a few others (e.g. gte). Here's the solution I came up with, which I haven't had a chance to test extensively yet:

1) Put this script in the same directory as the other scripts in your program. I've named it custom_instructor.py:
```python
import torch
from sentence_transformers import SentenceTransformer, util


class CustomInstructor(SentenceTransformer):
    def __init__(self, model_name_or_path, device='cuda', dtype=torch.float16):
        super().__init__(model_name_or_path, device=device)
        # Convert the loaded weights to float16 when running on CUDA.
        if torch.cuda.is_available() and device == 'cuda' and dtype == torch.float16:
            self.half()
            print("Model converted to float16 precision.")

    def encode(self, sentences, batch_size=32, show_progress_bar=False,
               convert_to_numpy=True, normalize_embeddings=True, prompt=None, **kwargs):
        self.eval()
        # Instructor-style models expect the instruction prepended to each input;
        # without this step a `prompt` passed via encode_kwargs would be silently dropped.
        if prompt is not None:
            sentences = [prompt + sentence for sentence in sentences]
        all_embeddings = []
        for start_index in range(0, len(sentences), batch_size):
            sentences_batch = sentences[start_index:start_index + batch_size]
            features = self.tokenize(sentences_batch)
            features = util.batch_to_device(features, self.device)
            with torch.no_grad():
                out_features = self.forward(features)
            embeddings = out_features['sentence_embedding']
            if normalize_embeddings:
                embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
            all_embeddings.append(embeddings)
        all_embeddings = torch.cat(all_embeddings, 0)
        if convert_to_numpy:
            all_embeddings = all_embeddings.cpu().numpy()
        return all_embeddings
```
2) Assuming you've already created a virtual environment, installed the dependencies, etc., run a script such as this:
```python
import time

from langchain.text_splitter import RecursiveCharacterTextSplitter

from custom_instructor import CustomInstructor


def extract_text():
    file_path = r"PATH TO A .TXT FILE ON MY COMPUTER USED TO TEST GETTING/SPLITTING TEXT FROM"
    with open(file_path, 'r', encoding="utf-8") as file:
        text = file.read()
    print(f"The number of characters in the text: {len(text)}")
    return text


def split_text(text):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,
        chunk_overlap=200,
    )
    text_chunks = text_splitter.split_text(text)
    print(f"Number of chunks created: {len(text_chunks)}")
    return text_chunks


def generate_embeddings(text_chunks):
    model_name = "hkunlp/instructor-large"
    model_kwargs = {'device': 'cuda'}
    prompt = "Represent the technological sentence for retrieval."
    encode_kwargs = {'normalize_embeddings': True, 'prompt': prompt}

    model = CustomInstructor(model_name, **model_kwargs)

    start_time = time.time()
    embeddings = model.encode(text_chunks, **encode_kwargs)
    elapsed_time = time.time() - start_time
    print(f"Embedding generation took {elapsed_time:.2f} seconds.")

    text_embeddings = [(text_chunks[i], embeddings[i].tolist()) for i in range(len(embeddings))]
    # Uncomment to print the first text chunk and/or embedding to see their characteristics:
    # if text_embeddings:
    #     print(f"Text Chunk 1: {text_embeddings[0][0]}")
    #     print(f"Embedding 1: {text_embeddings[0][1]}\n")
    return text_embeddings


if __name__ == "__main__":
    text = extract_text()
    text_chunks = split_text(text)
    text_embeddings = generate_embeddings(text_chunks)
```
NOTICE how it imports the `CustomInstructor` class from our custom script. I'm new to this...but this class basically "inherits" from the `SentenceTransformer` class within the sentence-transformers Python library. This means that it can use the functionality provided by the `SentenceTransformer` class but also modify it. The `SentenceTransformer` class lives in a script named SentenceTransformer.py within the sentence-transformers library.
However, PLEASE NOTE that my script is ONLY compatible with the version of SentenceTransformer.py that was included in their release v2.5.0. This comports with @michaelfeil's Infinity only supporting up to v2.5.0 currently.
Moving on...in my script the `encode` method builds upon the `encode` method of the inherited `SentenceTransformer` class, whose signature allows specifying additional `**kwargs`.
NOTICE how the second script invokes the `CustomInstructor` class when loading the embedding model, which then uses my `encode` method, which builds on the `encode` method from the `SentenceTransformer` class. Basically, the `CustomInstructor` class receives `model_kwargs` and `encode_kwargs` and passes `encode_kwargs` as additional kwargs to the `encode` method of the `SentenceTransformer` class, which it can do because it "inherits" from it. NOTE, this was not possible before the fine sentence-transformers folks modified SentenceTransformer.py to accept prompts and stuff...you can see that they recently added this parameter to "their" `encode` method.
For example, in November 2022 the `prompt` parameter did not exist.
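To sketch what that looks like now (assuming a sentence-transformers version with prompt support in `encode`; the instruction string is just an example):

```python
from sentence_transformers import SentenceTransformer

# Plain SentenceTransformer usage with the newer per-call prompt support.
model = SentenceTransformer("hkunlp/instructor-large", device="cuda")

embeddings = model.encode(
    ["what is a neural network?"],
    prompt="Represent the question for retrieving supporting documents: ",
    normalize_embeddings=True,
)
print(embeddings.shape)
```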
The way that the Instructor models handled this previously was to create a custom script that basically did everything I just described all by itself...That script is located here and hasn't been updated in approximately a year:
https://github.com/xlang-ai/instructor-embedding/blob/main/InstructorEmbedding/instructor.py
BTW, instructor.py is literally the only script that's installed when you pip install...
To summarize...our "prompt" (as defined in `encode_kwargs`) is sent to our `CustomInstructor` class, which forwards it to the `SentenceTransformer` class created by the sentence-transformers people, which then uses it.
SO WHY NOT USE `SentenceTransformer` directly, since, after all, they've now added prompt and instruction functionality? You actually can, as seen here, and they even specifically support Instructor models now:
https://www.sbert.net/docs/pretrained_models.html#instructor-models
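For instance, something along these lines should work with a recent sentence-transformers release; the prompt strings follow the INSTRUCTOR "Represent the..." convention and are only examples:

```python
from sentence_transformers import SentenceTransformer

# Named prompts are configured up front; encode selects one via prompt_name.
model = SentenceTransformer(
    "hkunlp/instructor-large",
    prompts={
        "query": "Represent the question for retrieving supporting documents: ",
        "document": "Represent the document for retrieval: ",
    },
)

query_emb = model.encode(["what is a neural network?"], prompt_name="query")
doc_emb = model.encode(["A neural network is a machine learning model."], prompt_name="document")
```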
The sole reason I can discern is that the sentence-transformers library does not support float16, doesn't seem close to supporting @michaelfeil's other quantization techniques, doesn't seem intent on supporting BetterTransformer, etc. So I'm guessing that's why @michaelfeil names his relevant class `SentenceTransformerPatched`. He's essentially patching their `SentenceTransformer` class to include functionalities that are long overdue. `SentenceTransformer` hadn't been updated in years...and just recently there's been a flurry of activity.
Additionally, you might ask why not just use the `HuggingFaceEmbeddings` class from langchain_community, which, after all, uses the `SentenceTransformer` class under the hood? Again, it's my novice understanding that it has not yet implemented float16 capabilities like @michaelfeil has, "BetterTransformer", etc., so...HENCE THE NEED FOR A CUSTOM CLASS CURRENTLY.
Anyways, please note that I just created this script yesterday and haven't tested it thoroughly, but it worked for me.
Not to belabor the point, but it does appear that the sentence-transformers people have basically incorporated your idea as far as prompt formatting goes...
https://www.sbert.net/docs/package_reference/SentenceTransformer.html
But again, this does not address the float16 issue, which is a common technique nowadays, nor the other optimizations that @michaelfeil is implementing with Infinity.
@BBC-Esq Yeah, Tom Aarsen added this - I am happy about the maintenance of SentenceTransformers :)
@BBC-Esq If you want to ship Instructor with Infinity, please use the trust_remote_code=True feature in Hugging Face Transformers. You can reimplement any code, e.g. like in this repo from Jina: https://huggingface.co/jinaai/jina-bert-implementation
I cannot install a pip package per model in the main branch of infinity - this would slow down infinity's development speed.
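A minimal sketch of that pattern; the repo id below is a placeholder for any model repository that ships its own modeling code (such as the Jina one above):

```python
from transformers import AutoModel, AutoTokenizer

# trust_remote_code=True lets transformers load the custom modeling code
# shipped inside the model repository instead of a built-in architecture.
repo_id = "some-org/model-with-custom-code"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
```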
infinity is compatible with SentenceTransformers 2.5.0 and up!
I added `SentenceTransformersPatched` so I can run encode, preprocess, and postprocess asynchronously in different threads.
See: https://github.com/UKPLab/sentence-transformers/issues/2362
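To illustrate the idea only (a generic sketch, not Infinity's actual implementation): splitting preprocess, forward, and postprocess into separate stages lets CPU-bound work for one batch overlap with inference for another, e.g.:

```python
import asyncio

# Generic three-stage pipeline: each stage runs in a worker thread, so stages
# of different batches can overlap. The "forward" here is a stand-in for a
# real model call; this is not Infinity's code.
def preprocess(batch):
    return [s.lower() for s in batch]          # e.g. tokenization (CPU-bound)

def forward(features):
    return [hash(f) % 1000 for f in features]  # e.g. model inference (GPU-bound in reality)

def postprocess(raw):
    return [r / 1000 for r in raw]             # e.g. normalization / numpy conversion

async def embed(batch):
    features = await asyncio.to_thread(preprocess, batch)
    raw = await asyncio.to_thread(forward, features)
    return await asyncio.to_thread(postprocess, raw)

async def main():
    results = await asyncio.gather(embed(["Hello world"]), embed(["Another batch"]))
    print(results)

asyncio.run(main())
```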
Just because I don't have a lot of time to review right now...what is the Jina link for? For example, is it basically their custom code to use the `SentenceTransformer` class from sentence-transformers, or something?
Closing in favor of https://github.com/UKPLab/sentence-transformers/issues/2439
Embedding models like bge_small/large and instructor_xl/base are designed to be accompanied by instructions along with the text to embed (especially for RAG use cases). If the embedding API currently does not support this functionality, it would be great to add it; or if it is already supported, it would be good to clarify it with an example. Thanks!