It seems to just work out of the box if you put a streamer in your pipeline:

from transformers import TextStreamer, pipeline
from langchain.llms import HuggingFacePipeline

streamer = TextStreamer(tokenizer)
pipe = pipeline(model=model,
                tokenizer=tokenizer,
                streamer=streamer)
llm = HuggingFacePipeline(pipeline=pipe)
@jloganolson thank you so much Logan!
I just learned about TextStreamer from you today. I did some research and found it was released two weeks ago by Hugging Face in the transformers package: https://huggingface.co/docs/transformers/internal/generation_utils#transformers.TextStreamer, https://github.com/huggingface/transformers/blob/main/src/transformers/generation/streamers.py
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer, pipeline
streamer = TextStreamer(tokenizer, skip_prompt=True)
pipe = pipeline(
    "text-generation",
    model=model_fintuned,
    tokenizer=tokenizer,
    max_length=2048,
    temperature=0.6,
    pad_token_id=tokenizer.eos_token_id,
    top_p=0.95,
    repetition_penalty=1.2,
    device=device,
    streamer=streamer,
)
pipe(prompts[0])
inputs = tokenizer(prompts[0], return_tensors="pt").to(device)
streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model_fintuned.generate(
    **inputs,
    streamer=streamer,
    pad_token_id=tokenizer.eos_token_id,
    max_length=248,
    temperature=0.8,
    top_p=0.8,
    repetition_penalty=1.25,
)
Related issue: https://github.com/databrickslabs/dolly/issues/84
Closing this issue, since it is solved thanks to @jloganolson.
langchain+gradio chatbot, streaming output
import time
from threading import Thread

from transformers import TextIteratorStreamer, pipeline
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA

streamer = TextIteratorStreamer(tokenizer, timeout=10., skip_prompt=True, skip_special_tokens=True)
pipe = pipeline(
    "text-generation",
    model=base_model,
    tokenizer=tokenizer,
    max_length=2048,
    temperature=0.6,
    pad_token_id=tokenizer.eos_token_id,
    top_p=0.95,
    repetition_penalty=1.2,
    streamer=streamer,
)
local_llm = HuggingFacePipeline(pipeline=pipe)
enhanced_rqa = RetrievalQA.from_chain_type(llm=local_llm, chain_type="stuff", retriever=product_retriever)
def run_enhanced_rqa(message):
    enhanced_rqa.run(message)

t = Thread(target=run_enhanced_rqa, args=(input_message,))
t.start()

history[-1][1] = ""
for new_text in streamer:
    history[-1][1] += new_text
    time.sleep(0.05)
    yield history
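For context, a minimal sketch of how the Gradio side can consume this generator (the wiring and handler names are illustrative, not the exact app code; it assumes run_enhanced_rqa and streamer from above):

import time
from threading import Thread

import gradio as gr

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    msg = gr.Textbox()

    def user(user_message, history):
        # Append the user turn with an empty bot slot to be filled while streaming.
        return "", history + [[user_message, None]]

    def bot(history):
        input_message = history[-1][0]
        t = Thread(target=run_enhanced_rqa, args=(input_message,))
        t.start()
        history[-1][1] = ""
        for new_text in streamer:
            history[-1][1] += new_text
            time.sleep(0.05)
            yield history

    msg.submit(user, [msg, chatbot], [msg, chatbot], queue=False).then(
        bot, chatbot, chatbot
    )

demo.queue()
demo.launch()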
I am creating an indexer, and for that I want to use a CustomLLM. How can I use this streaming method with that type of object? Note: I can't use HuggingFacePipeline or any similar framework; my work is limited to CustomLLM.
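Not from this thread, but a minimal sketch of how the same TextIteratorStreamer pattern could be wrapped in a LangChain custom LLM (the class name and fields are illustrative; it assumes a loaded transformers model and tokenizer):

from threading import Thread
from typing import Any, List, Optional

from langchain.llms.base import LLM
from transformers import TextIteratorStreamer


class StreamingCustomLLM(LLM):  # illustrative name
    """Sketch of a custom LLM that streams tokens via TextIteratorStreamer."""

    model: Any      # a loaded transformers causal LM
    tokenizer: Any  # its matching tokenizer

    @property
    def _llm_type(self) -> str:
        return "streaming-custom"

    def _call(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[Any] = None,
        **kwargs: Any,
    ) -> str:
        streamer = TextIteratorStreamer(
            self.tokenizer, skip_prompt=True, skip_special_tokens=True
        )
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        # Run generation in a background thread so we can consume the stream here.
        thread = Thread(
            target=self.model.generate,
            kwargs=dict(**inputs, streamer=streamer, max_new_tokens=512),
        )
        thread.start()
        text = ""
        for new_text in streamer:
            text += new_text
            # Forward each chunk to LangChain callbacks if a run manager is present.
            if run_manager is not None:
                run_manager.on_llm_new_token(new_text)
        thread.join()
        return text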
(quoting the langchain+gradio chatbot streaming code above)
This is not working for me; I'm getting a thread empty error. Could you please share the complete Gradio code?
I use Llama 2:
import torch
from transformers import pipeline, TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_new_tokens=512,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    streamer=streamer,
    eos_token_id=tokenizer.eos_token_id,
)
It is working for me!
It streams to stdout, not as a generator variable.
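TextStreamer writes to stdout by design. For a generator-style interface, TextIteratorStreamer is the variant to use. A minimal sketch, assuming model, tokenizer, and prompt are already defined:

from threading import Thread

from transformers import TextIteratorStreamer

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# generate() blocks, so run it in a background thread and consume the
# streamer as a generator in the current one.
thread = Thread(target=model.generate,
                kwargs=dict(**inputs, streamer=streamer, max_new_tokens=512))
thread.start()
for new_text in streamer:
    print(new_text, end="")  # or append to a buffer / yield to a UI
thread.join()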
Added a TextStreamer for HuggingFacePipeline, but it doesn't seem to change anything about the issue.
Any new updates on this?
@NajiAboo same, have you solved it?
I'm getting a _queue.Empty error.
If the response contains im_start or im_end tokens and you are annoyed by them, pass skip_special_tokens as a keyword argument to TextStreamer:
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
This is my code:

prompt = "How to make sandwich ?"
streamer = TextStreamer(tokenizer, skip_prompt=True)
pipe = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=512,
    min_length=30,
    temperature=0.6,
    pad_token_id=tokenizer.eos_token_id,
    top_p=0.95,
    encoder_repetition_penalty=0.3,
    num_return_sequences=1,
    repetition_penalty=1.2,
    length_penalty=0.5,
    streamer=streamer,
)
result = pipe(f"<s>[INST] {prompt} [/INST]")
The output stops instantly without completing the full sentence; I want at least a minimum-length response. Is there any parameter I'm missing? For example, it stops like this:

Spread soft bread with mayonnaise or mustard, add your favorite meat and cheese, and enjoy! 2. What is the difference between
I'm new to this.
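One thing worth checking: for decoder-only models, min_length counts the prompt tokens as well, so a long prompt can satisfy it immediately and let generation stop at the first EOS token. A hedged sketch using min_new_tokens instead, which counts only newly generated tokens (available in recent transformers versions; the other parameters are kept from the snippet above):

pipe = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    min_new_tokens=30,  # counts only generated tokens, unlike min_length
    temperature=0.6,
    pad_token_id=tokenizer.eos_token_id,
    top_p=0.95,
    repetition_penalty=1.2,
    streamer=streamer,
)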
(quoting the langchain+gradio chatbot streaming code above)
How do I initialise the tokenizer with a chat_template here?
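A minimal sketch using the apply_chat_template API from recent transformers versions, assuming your tokenizer ships a chat template (the example messages are illustrative):

messages = [
    {"role": "user", "content": "How do I make a sandwich?"},
]
# Render the conversation with the tokenizer's built-in chat template and
# append the assistant prefix so generation continues from there.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
pipe(prompt)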
From the notebook, it says: "LangChain provides streaming support for LLMs. Currently, we support streaming for the OpenAI, ChatOpenAI, and Anthropic implementations, but streaming support for other LLM implementations is on the roadmap."
I am more interested in using the commercially usable open-source LLMs available on Hugging Face, such as Dolly V2. I am wondering whether LangChain has plans to include streaming support for Hugging Face's LLMs on their roadmap. Additionally, is there any timeline for its integration? Thank you.