olcayc opened this issue 1 year ago
Hi Olcayc,
BigDL LLM can fully support the capabilities you are asking about. We have put together an example that generates embeddings with PySpark using a Hugging Face Transformers model optimized to INT4.

Here is a preliminary example you can follow to start generating embeddings. We are currently testing it on a cluster 🙂 and will share the results soon; please stay tuned.
```python
from pyspark.sql import SparkSession, Row
from bigdl.llm.langchain.embeddings import TransformersEmbeddings
from bigdl.orca import init_orca_context, stop_orca_context

# Initialize an Orca context and wrap it in a SparkSession
sc = init_orca_context()
spark = SparkSession(sc)

# Load the INT4-optimized model; the embedder is captured by the
# closure below and used on the executors
bigdl_embedder = TransformersEmbeddings(model_path="/path/to/llama-hf")

data = [("1", "This is a positive review."),
        ("2", "The product is not good."),
        ("3", "I highly recommend it.")]
df = spark.createDataFrame(data, ["id", "comment"])

def encode_iter(partition):
    # Embed each row in the partition, yielding one output Row per input row
    for row in partition:
        text = row["comment"]
        query_result = bigdl_embedder.embed_query(text)
        doc_result = bigdl_embedder.embed_documents([text])
        yield Row(id=row["id"], comment=text,
                  query_embed=query_result, doc_embed=doc_result[0])

df_embedding = df.rdd.mapPartitions(encode_iter).toDF()
df_embedding.show()

stop_orca_context()
```
In our testing, this example efficiently generated LLM embeddings on a cluster using PySpark.
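As a side note, the `mapPartitions` pattern above matters because the function receives an iterator over a whole partition, so the (expensive) embedder serves every row in that partition rather than being re-invoked per record. A minimal pure-Python sketch of the same iterator pattern, with a hypothetical `fake_embed` standing in for the BigDL embedder (no Spark required):

```python
def fake_embed(text):
    # Toy "embedding" (character count, word count) standing in for a real model
    return [len(text), text.count(" ") + 1]

def encode_iter(partition):
    # Mirrors the Spark mapPartitions function: consume an iterator of rows,
    # yield one enriched record per input row
    for row in partition:
        text = row["comment"]
        yield {"id": row["id"], "comment": text, "embed": fake_embed(text)}

partition = iter([{"id": "1", "comment": "This is a positive review."},
                  {"id": "2", "comment": "The product is not good."}])
results = list(encode_iter(partition))
print(results[0]["embed"])  # [26, 5]
```

In real Spark jobs, loading the model inside `encode_iter` (once per partition) can also avoid serializing a large driver-side object to every task.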
Hi, I would like to evaluate the following capabilities of BigDL LLM using offline PySpark CPU jobs:
Please advise on the best way to access these capabilities from a PySpark CPU job.
@hkvision