intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

[bigdl.llm.langchain.embeddings.BigdlNativeEmbeddings] does not support ChatGLM-family models #8787

Open intelyoungway opened 1 year ago

intelyoungway commented 1 year ago

Many Chinese users are building their own LangChain applications on top of the ChatGLM model family, and the chatglm.cpp project already provides GGML support, so this should also fall under the native_int4 category. Could you add support for these models in the BigdlNativeEmbeddings API of bigdl.llm.langchain?
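
For context, a purely hypothetical sketch of what the requested usage might look like, assuming BigdlNativeEmbeddings would accept a converted GGML checkpoint path and a model_family argument the way the existing native examples do (the ChatGLM path and family value below are assumptions, not a supported configuration):

from bigdl.llm.langchain.embeddings import BigdlNativeEmbeddings

# Hypothetical call: model_path points to a native INT4 (GGML) ChatGLM checkpoint,
# and model_family='chatglm' is exactly the support this issue requests.
embeddings = BigdlNativeEmbeddings(model_path='/path/to/ggml-chatglm-q4_0.bin',
                                   model_family='chatglm')

doc_embed = embeddings.embed_documents(["what is ai?"])
query_embed = embeddings.embed_query("what is ai?")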

jason-dai commented 1 year ago

See https://github.com/intel-analytics/BigDL/pull/8752; let us know if you have more questions

intelyoungway commented 1 year ago

Thanks, Jason, for your quick reply! ^_^ I think #8752 is what I want. I will close this issue.

hkvision commented 1 year ago

We can keep this issue open until #8752 is merged.

intelyoungway commented 1 year ago

Thanks, Kai! I have synced with Shengsheng and Jianqian about this issue and found two things:
  1. BigdlNativeEmbeddings is different from TransformersEmbeddings, which makes embedding slower due to the loss of INT4 quantization.
  2. Embeddings can also be produced by another model's embedding procedure, such as BERT. So should we keep using the embedding layer of the same LLM, or just choose the best/fastest embedding model for this step?

Therefore, I have updated the description of this issue.

hkvision commented 1 year ago
  1. Am I understanding this issue correctly? BigdlNativeEmbeddings also does INT4 quantization, just in the native format. Do you observe slower performance?
  2. LangChain itself has embeddings for sentence-transformers models, including BERT; you may use that first if you wish. Or do you want BERT to be quantized as well?
intelyoungway commented 1 year ago
  > 1. Am I understanding this issue correctly? BigdlNativeEmbeddings also does INT4 quantization, just in the native format. Do you observe slower performance?
  > 2. LangChain itself has embeddings for sentence-transformers models, including BERT; you may use that first if you wish. Or do you want BERT to be quantized as well?

  1. BigdlNativeEmbeddings is currently incompatible with ChatGLM.
  2. I mean first use BERT embedding layers for retrieval, then use the LLM for generation.

hkvision commented 1 year ago
  1. We have a separate issue to support ChatGLM embeddings :)
  2. Okay, so you suggest we test BERT embeddings + LLM models? I think that is reasonable. Do you want BERT to be quantized as well?
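
A minimal sketch of that split, assuming LangChain's HuggingFaceEmbeddings (a sentence-transformers/BERT-style model, left unquantized) for retrieval and the BigDL-LLM TransformersLLM API for INT4 generation; the model paths and sample text are placeholders:

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from bigdl.llm.langchain.llms import TransformersLLM

# Retrieval: a small BERT-style embedding model, loaded at its original precision.
embeddings = HuggingFaceEmbeddings(model_name='/path/to/bert-embedding-model')
docsearch = FAISS.from_texts(['sample document text'], embeddings).as_retriever()

# Generation: only the chat model is loaded with BigDL INT4 optimization.
llm = TransformersLLM.from_model_id(model_id='/path/to/chatglm',
                                    model_kwargs={'trust_remote_code': True})
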
sgwhat commented 1 year ago

Hi yongway, you may also use the Hugging Face transformers INT4 format (TransformersEmbeddings) to implement ChatGLM embeddings, as in the following example.

from bigdl.llm.langchain.embeddings import TransformersEmbeddings

# The TransformersEmbeddings API enables INT4 optimization by default.
embeddings = TransformersEmbeddings.from_model_id(model_id="/path/to/chatglm", model_kwargs={'trust_remote_code': True})

text = "what is ai?"

doc_embed = embeddings.embed_documents([text])
query_embed = embeddings.embed_query(text)

Please see https://github.com/intel-analytics/BigDL/tree/main/python/llm#langchain-api for more details.
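
As a rough follow-on sketch, the embeddings object above can be plugged into a LangChain vector store like any other Embeddings implementation (the document path below is a placeholder):

from langchain.vectorstores import FAISS
from langchain.text_splitter import CharacterTextSplitter

# Split a document and index the chunks with the INT4 embeddings defined above.
with open('/path/to/doc.txt') as f:
  texts = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0).split_text(f.read())
docsearch = FAISS.from_texts(texts, embeddings,
                             metadatas=[{'source': str(i)} for i in range(len(texts))]).as_retriever()
docs = docsearch.get_relevant_documents("what is ai?")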

intelyoungway commented 1 year ago

Thanks! I have discussed this with hkvision and found that it already uses INT4. So I want to compare its performance with the original API to check how much it is accelerated. Could you share an example of Embeddings that runs without INT4 optimization, e.g. BF16 or FP32? That would be very helpful!

hkvision commented 1 year ago

Sure, we will do the comparison and tell you the result soon.

sgwhat commented 1 year ago

> Thanks! I have discussed this with hkvision and found that it already uses INT4. So I want to compare its performance with the original API to check how much it is accelerated. Could you share an example of Embeddings that runs without INT4 optimization, e.g. BF16 or FP32? That would be very helpful!

Hi Yang Wei!

You may run the embeddings without INT4 optimization as in the following example (based on your test script):

import argparse
from time import time

from langchain.vectorstores import FAISS
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings

def print_info(msg):
  # lightweight logger used for the timing messages below
  print(f'[INFO] {msg}')

def main(args):
  # arguments kept from the full QA-over-docs script; only input_path and model_path are used here
  input_path = args.input_path
  model_path = args.model_path
  model_family = args.model_family
  query = args.question
  n_ctx = args.n_ctx
  n_threads = args.thread_num

  print_info('split texts of input doc')
  t0 = time()

  with open(input_path) as h:
    input_doc = h.read()
  text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
  texts = text_splitter.split_text(input_doc)

  dt = time() - t0
  print_info(f'time cost = {dt} sec')

  print_info('create embeddings and store into vectordb')
  t0 = time()
  # HuggingFaceEmbeddings loads the embedding model at its original precision (no INT4 optimization)
  embeddings = HuggingFaceEmbeddings(model_name=model_path)
  print_info('use FAISS')
  docsearch = FAISS.from_texts(texts, embeddings, metadatas=[{'source':str(i)} for i in range(len(texts))]).as_retriever()
  dt = time() - t0
  print_info(f'time cost = {dt} sec')

if __name__ == '__main__':
  parser = argparse.ArgumentParser(description='BigdlNativeLLM Langchain QA over Docs Example')
  parser.add_argument('-x', '--model-family', type=str,
                      choices=['llama', 'bloom', 'gptneox', 'chatglm'],
                      default='chatglm',
                      help='the model family')
  parser.add_argument('-m','--model-path', type=str, default='/path/to/llama-7b', help='the path to the converted llm model')
  parser.add_argument('-i', '--input-path', type=str, default='crispr_cn.txt', help='the path to the input doc.')
  parser.add_argument('-q', '--question', type=str, default='CRISPR技术的主要使用场景有哪些? 能否针对每种场景给出潜在的技术瓶颈和主要落地案例?', help='question you want to ask.')
  parser.add_argument('-c', '--n-ctx', type=int, default=5200, help='the maximum context size')
  parser.add_argument('-t', '--thread-num', type=int, default=48, help='number of threads to use for inference')

  args = parser.parse_args()
  main(args)
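
For the INT4 side of the comparison, the HuggingFaceEmbeddings call could be swapped for the INT4 TransformersEmbeddings shown earlier; a minimal sketch, assuming the same placeholder ChatGLM path and the texts list produced by the splitter step above:

from time import time
from bigdl.llm.langchain.embeddings import TransformersEmbeddings

# INT4-optimized embeddings (the default for TransformersEmbeddings)
embeddings = TransformersEmbeddings.from_model_id(model_id='/path/to/chatglm',
                                                  model_kwargs={'trust_remote_code': True})

t0 = time()
doc_embed = embeddings.embed_documents(texts)  # `texts` as produced by the splitter step above
print(f'INT4 embedding time = {time() - t0} sec')
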
intelyoungway commented 1 year ago

I will try and check! Thanks for your support!