Open intelyoungway opened 1 year ago
See https://github.com/intel-analytics/BigDL/pull/8752, and let us know if you have more questions.
Thanks Jason for your quick reply! ^_^ I think #8752 is what I want. I will close this issue.
We can keep this issue open until #8752 is merged.
Thanks Kai! I have synced with Shengsheng and Jianqian about this issue and found two things: 1. BigDLNativeEmbedding is different from TransformersEmbeddings, which makes embedding slower because the INT4 quantization is lost. 2. Embeddings can also be produced by another model's embedding procedure, such as BERT. So should we keep using the embedding layer from the same LLM, or just choose the best/fastest embedding model for this step?
Therefore, I modified the description of this issue.
- Am I understanding this issue correctly? BigDLNativeEmbedding also applies INT4 quantization, but in the native format. Do you observe slower performance?
- LangChain itself has embeddings for sentence transformers, including BERT; you could use those first if you wish. Or do you want BERT to be quantized as well?
1. BigDLNativeEmbedding is currently incompatible with ChatGLM. 2. I mean first use BERT embedding layers, then use the LLM for generation.
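To illustrate point 2, here is a minimal sketch of decoupling the embedding step from the generation step. It is not code from this thread: `toy_embed` is a hypothetical stand-in for a BERT-style embedding model, `generate` is a hypothetical stand-in for the LLM call, and all function names are assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in for an embedding model: a letter-frequency vector."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, docs: Sequence[str],
             embed: Callable[[str], Vector], k: int = 1) -> List[str]:
    """Rank docs by similarity to the query using only the embedding model."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def answer(query: str, docs: Sequence[str],
           embed: Callable[[str], Vector],
           generate: Callable[[str], str]) -> str:
    """Retrieve context with one model, then hand the prompt to another."""
    context = "\n".join(retrieve(query, docs, embed))
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)
```

Because `embed` and `generate` are independent callables, the embedding model can be swapped (for example, for a faster one) without touching the LLM used for generation.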
Hi yongway, you can also use the Hugging Face transformers INT4 format (TransformersEmbeddings) to implement ChatGLM embeddings, as in the following example.
from bigdl.llm.langchain.embeddings import TransformersEmbeddings
# The TransformersEmbeddings API enables INT4 optimization by default.
embeddings = TransformersEmbeddings.from_model_id(model_id="/path/to/chatglm", model_kwargs={'trust_remote_code': True})
text = "what is ai?"
doc_embed = embeddings.embed_documents([text])
query_embed = embeddings.embed_query(text)
Please see https://github.com/intel-analytics/BigDL/tree/main/python/llm#langchain-api for more details.
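As a quick sanity check on what the two calls above return, the following sketch verifies the shapes, assuming the usual LangChain `Embeddings` interface: `embed_documents` returns one vector per input text and `embed_query` returns a single vector. `check_embedding_shapes` and `FakeEmbeddings` are hypothetical names introduced here so the check can run without a model.

```python
from typing import List, Protocol, Sequence


class EmbeddingsLike(Protocol):
    """The two methods assumed from the LangChain Embeddings interface."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]: ...

    def embed_query(self, text: str) -> List[float]: ...


def check_embedding_shapes(embeddings: EmbeddingsLike, texts: Sequence[str]) -> int:
    """Return the embedding dimension, raising ValueError on a shape mismatch."""
    if not texts:
        raise ValueError("need at least one text")
    doc_vectors = embeddings.embed_documents(list(texts))
    if len(doc_vectors) != len(texts):
        raise ValueError("expected one vector per document")
    dim = len(embeddings.embed_query(texts[0]))
    if any(len(vec) != dim for vec in doc_vectors):
        raise ValueError("document and query dimensions differ")
    return dim


class FakeEmbeddings:
    """Hypothetical stand-in for a real embeddings object (no model needed)."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [self.embed_query(t) for t in texts]

    def embed_query(self, text: str) -> List[float]:
        return [float(len(text)), 1.0, 0.0]
```

With a real model loaded, the same function could be called as `check_embedding_shapes(embeddings, [text])` using the `embeddings` object from the example above.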
Thanks! I have discussed this with hkvision and found that it already uses INT4. So I want to compare its performance with the original API to see how many times faster it is. Could you share an example of running Embeddings without INT4 optimization, e.g. in BF16 or FP32? That would be very helpful!
Sure, we will do the comparison and tell you the result soon.
Hi Yang Wei!
You can run the embeddings example without INT4 optimization as follows (based on your test script):
import argparse
from time import time
from pdb import set_trace
import numpy as np
from langchain.embeddings.base import Embeddings
from langchain.vectorstores import FAISS
from langchain.text_splitter import CharacterTextSplitter
from transformers import LlamaTokenizer, AutoTokenizer
from langchain.embeddings import HuggingFaceEmbeddings


# print_info is not defined in the posted snippet; this minimal version is an assumption.
def print_info(msg):
    print(f'[INFO] {msg}', flush=True)


def main(args):
    input_path = args.input_path
    model_path = args.model_path
    # The next four values are parsed for parity with the test script
    # but are not used in this embedding-only run.
    model_family = args.model_family
    query = args.question
    n_ctx = args.n_ctx
    n_threads = args.thread_num

    print_info('split texts of input doc')
    t0 = time()
    with open(input_path) as h:
        input_doc = h.read()
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    texts = text_splitter.split_text(input_doc)
    dt = time() - t0
    print_info(f'time cost = {dt} sec')

    print_info('create embeddings and store into vectordb')
    t0 = time()
    # HuggingFaceEmbeddings loads the model without BigDL INT4 optimization.
    embeddings = HuggingFaceEmbeddings(model_name=model_path)
    print_info('use FAISS')
    docsearch = FAISS.from_texts(texts, embeddings, metadatas=[{'source': str(i)} for i in range(len(texts))]).as_retriever()
    dt = time() - t0
    print_info(f'time cost = {dt} sec')


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='BigdlNativeLLM Langchain QA over Docs Example')
    parser.add_argument('-x', '--model-family', type=str,
                        choices=["llama", "bloom", "gptneox", 'chatglm'],
                        default='chatglm',
                        help='the model family')
    parser.add_argument('-m', '--model-path', type=str, default='/path/to/llama-7b', help='the path to the converted llm model')
    parser.add_argument('-i', '--input-path', type=str, default='crispr_cn.txt', help='the path to the input doc.')
    parser.add_argument('-q', '--question', type=str, default='CRISPR技术的主要使用场景有哪些? 能否针对每种场景给出潜在的技术瓶颈和主要落地案例?', help='question you want to ask.')
    parser.add_argument('-c', '--n-ctx', type=int, default=5200, help='the maximum context size')
    parser.add_argument('-t', '--thread-num', type=int, default=48, help='number of threads to use for inference')
    args = parser.parse_args()
    main(args)
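To turn the two runs into a single "how many times faster" figure, a small timing helper may be useful. This is only a sketch: `time_embedding` and `speedup` are hypothetical names, and it assumes each embeddings object exposes an `embed_documents` method. Note that the script above times model loading and FAISS index construction together with the embedding call, whereas this helper isolates the embedding call.

```python
from time import perf_counter
from typing import Callable, List, Sequence


def time_embedding(embed_documents: Callable[[List[str]], List[List[float]]],
                   texts: Sequence[str], repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        t0 = perf_counter()
        embed_documents(list(texts))
        best = min(best, perf_counter() - t0)
    return best


def speedup(baseline_sec: float, optimized_sec: float) -> float:
    """How many times faster the optimized run is than the baseline."""
    if optimized_sec <= 0:
        raise ValueError("optimized_sec must be positive")
    return baseline_sec / optimized_sec
```

For the comparison discussed here, one could pass `embeddings.embed_documents` from the HuggingFaceEmbeddings object as the baseline and from the TransformersEmbeddings object as the optimized run, using the same `texts` for both, and then call `speedup` on the two results.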
I will try and check! Thanks for your support!
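Many Chinese users are now building their own LangChain applications on the ChatGLM family of models, and the chatglm.cpp project already provides ggml support, so in principle this should also fall within the scope of native_int4. Could bigdl.llm.langchain therefore support the BigdlNativeEmbeddings API for this family of models?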