intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

PySpark support for BigDL LLM int4 #8905

Open olcayc opened 1 year ago

olcayc commented 1 year ago

Hi, I would like to evaluate the following capabilities of BigDL LLM using PySpark offline CPU jobs:

Please advise on the best way to access these capabilities from a PySpark CPU job.

@hkvision

sgwhat commented 1 year ago

Hi Olcayc,

BigDL LLM fully supports the capabilities you listed.

We have put together an example of generating embeddings with PySpark using a Hugging Face transformers model optimized to INT4.

Here is a simple preliminary example you can follow to start generating embeddings. We are currently testing it on a cluster 🙂; the results will be delivered soon, please stay tuned.

from pyspark.sql import Row, SparkSession

from bigdl.llm.langchain.embeddings import TransformersEmbeddings
from bigdl.orca import init_orca_context, stop_orca_context

# Initialize the Orca context and create a Spark session from it.
sc = init_orca_context()
spark = SparkSession(sc)

# Load the INT4-optimized model; the embedder is captured by the
# mapPartitions closure and shipped to the executors.
bigdl_embedder = TransformersEmbeddings(model_path="/path/to/llama-hf")

texts = [("1", "This is a positive review."),
         ("2", "The product is not good."),
         ("3", "I highly recommend it.")]

df = spark.createDataFrame(texts, ["id", "comment"])

def encode_iter(partition):
    # Consume an iterator of rows, yield one enriched row per input row.
    for row in partition:
        text = row["comment"]
        query_result = bigdl_embedder.embed_query(text)
        doc_result = bigdl_embedder.embed_documents([text])
        yield Row(id=row["id"], comment=text,
                  query_embed=query_result, doc_embed=doc_result[0])

df_embedding = df.rdd.mapPartitions(encode_iter).toDF()
df_embedding.show()

stop_orca_context()
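The key pattern here is `mapPartitions`: the function receives an iterator over a whole partition, so any per-partition setup cost is amortized across all of that partition's rows. The sketch below illustrates the same control flow with plain Python iterators and a hypothetical `StubEmbedder` standing in for `TransformersEmbeddings` (no Spark or BigDL required), just to make the shape of the function clear:

```python
class StubEmbedder:
    """Hypothetical stand-in for TransformersEmbeddings: maps text to a
    fixed two-element vector (illustration only, not a real embedding)."""
    def embed_query(self, text):
        codes = [ord(c) for c in text]
        return [sum(codes) / len(codes), float(len(text))]

    def embed_documents(self, texts):
        return [self.embed_query(t) for t in texts]

def encode_iter(partition, embedder):
    # Same shape as the Spark version: consume an iterator of rows,
    # yield one enriched record per row.
    for row_id, text in partition:
        yield {"id": row_id,
               "comment": text,
               "query_embed": embedder.embed_query(text),
               "doc_embed": embedder.embed_documents([text])[0]}

rows = [("1", "This is a positive review."),
        ("2", "The product is not good.")]
embedder = StubEmbedder()
results = list(encode_iter(iter(rows), embedder))
print(results[0]["id"], len(results[0]["query_embed"]))  # prints: 1 2
```

Because the generator yields lazily, Spark can pipeline the embedding work without materializing the whole partition in memory at once.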
sgwhat commented 1 year ago

We have tested this example on a cluster and confirmed that it runs LLM embeddings efficiently with PySpark.
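Once the embeddings are materialized on the driver (for example via `df_embedding.collect()`), a common next step is comparing rows by cosine similarity. A minimal stdlib-only sketch, with toy vectors standing in for two rows' `query_embed` columns:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors for illustration; v2 is 2 * v1, so similarity is exactly 1.0.
v1 = [0.1, 0.3, 0.5]
v2 = [0.2, 0.6, 1.0]
print(round(cosine_similarity(v1, v2), 4))  # prints: 1.0
```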