langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

Streaming support for LLMs from Hugging Face #2918

Closed DanqingZ closed 1 year ago

DanqingZ commented 1 year ago

From the notebook, it says: "LangChain provides streaming support for LLMs. Currently, we support streaming for the OpenAI, ChatOpenAI, and Anthropic implementations, but streaming support for other LLM implementations is on the roadmap."

I am more interested in using the commercially usable open-source LLMs available on Hugging Face, such as Dolly V2. I am wondering whether LangChain has plans to include streaming support for Hugging Face LLMs in its roadmap. Additionally, is there any timeline for the integration? Thank you.
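For reference, streaming with the already-supported providers looks roughly like the following (a minimal sketch using the stdout callback handler that ships with LangChain; the prompt is just an example):

from langchain.llms import OpenAI
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

# Tokens are pushed to the callback handler as they arrive,
# so the response is printed to stdout incrementally.
llm = OpenAI(streaming=True, callbacks=[StreamingStdOutCallbackHandler()], temperature=0)
llm("Tell me a joke about parrots.")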

jloganolson commented 1 year ago

It seems to just work out of the box if you put a streamer in your pipeline:

from transformers import TextStreamer, pipeline
from langchain.llms import HuggingFacePipeline

streamer = TextStreamer(tokenizer)
pipe = pipeline(model=model,
                tokenizer=tokenizer,
                streamer=streamer)
llm = HuggingFacePipeline(pipeline=pipe)
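
Calling the wrapped LLM then prints tokens to stdout as they are generated (a quick usage sketch; the prompt is just an example):

# The TextStreamer prints tokens while generation runs;
# the full completion string is still returned at the end.
text = llm("Explain what a TextStreamer does in one sentence.")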
DanqingZ commented 1 year ago

@jloganolson thank you so much Logan!

I just learned about TextStreamer from you today. I did some research and found it was released two weeks ago by Hugging Face in the transformers package: https://huggingface.co/docs/transformers/internal/generation_utils#transformers.TextStreamer, https://github.com/huggingface/transformers/blob/main/src/transformers/generation/streamers.py

from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer, pipeline

# Option 1: stream through a text-generation pipeline
streamer = TextStreamer(tokenizer, skip_prompt=True)
pipe = pipeline(
    "text-generation",
    model=model_fintuned,
    tokenizer=tokenizer,
    max_length=2048,
    temperature=0.6,
    pad_token_id=tokenizer.eos_token_id,
    top_p=0.95,
    repetition_penalty=1.2,
    device=device,
    streamer=streamer
)
pipe(prompts[0])

# Option 2: stream directly from model.generate()
inputs = tokenizer(prompts[0], return_tensors="pt").to(device)
streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model_fintuned.generate(**inputs, streamer=streamer, pad_token_id=tokenizer.eos_token_id,
                            max_length=248, temperature=0.8, top_p=0.8,
                            repetition_penalty=1.25)
DanqingZ commented 1 year ago

Related issue: https://github.com/databrickslabs/dolly/issues/84

DanqingZ commented 1 year ago

Closing this issue, since it is solved thanks to @jloganolson.

DanqingZ commented 1 year ago

langchain+gradio chatbot, streaming output

        # TextIteratorStreamer exposes generated tokens as an iterator,
        # so the UI can consume them while generation runs in a background thread.
        streamer = TextIteratorStreamer(tokenizer, timeout=10., skip_prompt=True, skip_special_tokens=True)
        pipe = pipeline(
            "text-generation",
            model=base_model,
            tokenizer=tokenizer,
            max_length=2048,
            temperature=0.6,
            pad_token_id=tokenizer.eos_token_id,
            top_p=0.95,
            repetition_penalty=1.2,
            streamer=streamer
        )
        local_llm = HuggingFacePipeline(pipeline=pipe)
        enhanced_rqa = RetrievalQA.from_chain_type(llm=local_llm, chain_type="stuff", retriever=product_retriever)

        # Run the chain in a separate thread; its generate() call blocks until done.
        from threading import Thread
        def run_enhanced_rqa(message):
            enhanced_rqa.run(message)

        t = Thread(target=run_enhanced_rqa, args=(input_message,))
        t.start()

        # Consume tokens from the streamer and yield partial chat history to gradio.
        history[-1][1] = ""
        for new_text in streamer:
            history[-1][1] += new_text
            time.sleep(0.05)
            yield history
ambiSk commented 1 year ago

I am creating an indexer, and for that I want to use a CustomLLM. How can I use this streaming method with that type of object? Note: I can't use HuggingFacePipeline or any similar framework; my work is limited to CustomLLM.
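
One possible approach (a minimal sketch, not an official LangChain recipe): subclass the LLM base class, run model.generate() in a background thread with a TextIteratorStreamer, and forward each token through the callback manager's on_llm_new_token. Here model and tokenizer are assumed to be an already-loaded Hugging Face model and tokenizer.

from threading import Thread
from typing import Any, List, Optional

from langchain.callbacks.manager import CallbackManagerForLLMRun
from langchain.llms.base import LLM
from transformers import TextIteratorStreamer

class StreamingCustomLLM(LLM):
    """Hypothetical custom LLM that streams tokens via LangChain callbacks."""

    @property
    def _llm_type(self) -> str:
        return "streaming_custom_llm"

    def _call(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[CallbackManagerForLLMRun] = None,
        **kwargs: Any,
    ) -> str:
        streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

        # generate() blocks, so run it in a background thread while we drain the streamer.
        thread = Thread(target=model.generate,
                        kwargs=dict(**inputs, streamer=streamer, max_new_tokens=512))
        thread.start()

        text = ""
        for token in streamer:
            text += token
            if run_manager is not None:
                run_manager.on_llm_new_token(token)  # surfaces tokens to streaming callbacks
        thread.join()
        return text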

NajiAboo commented 1 year ago

langchain+gradio chatbot, streaming output (snippet quoted from @DanqingZ above)

This is not working for me. I'm getting a thread empty error. Could you please share the complete gradio code?
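
For anyone who needs it, here is a minimal end-to-end sketch of the pattern above wired into a Gradio Blocks chat UI (an illustrative example, not code from this thread; model_id is a placeholder, product_retriever is assumed to be an existing retriever, and the generation settings are arbitrary):

import time
from threading import Thread

import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer, pipeline
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA

model_id = "databricks/dolly-v2-3b"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

streamer = TextIteratorStreamer(tokenizer, timeout=10., skip_prompt=True, skip_special_tokens=True)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer,
                max_new_tokens=512, streamer=streamer)
local_llm = HuggingFacePipeline(pipeline=pipe)
# product_retriever is assumed to be an existing vector-store retriever.
rqa = RetrievalQA.from_chain_type(llm=local_llm, chain_type="stuff", retriever=product_retriever)

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    msg = gr.Textbox()

    def user(message, history):
        # Append the user turn with an empty bot slot to fill while streaming.
        return "", history + [[message, ""]]

    def bot(history):
        message = history[-1][0]
        # Run the chain in a background thread; it blocks until generation finishes.
        Thread(target=rqa.run, args=(message,)).start()
        for new_text in streamer:
            history[-1][1] += new_text
            time.sleep(0.05)
            yield history

    msg.submit(user, [msg, chatbot], [msg, chatbot]).then(bot, chatbot, chatbot)

demo.queue().launch()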

dtthanh1971 commented 1 year ago

I use Llama 2.

Use a pipeline for later:

from transformers import pipeline, TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer,
                torch_dtype=torch.bfloat16, device_map="auto",
                max_new_tokens=512, do_sample=True, top_k=10,
                num_return_sequences=1, streamer=streamer,
                eos_token_id=tokenizer.eos_token_id)

It is working for me!

Stosan commented 11 months ago

It streams to stdout, not as a generator variable.
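
That is the expected behaviour of TextStreamer, which simply prints tokens. If you need a generator you can iterate over yourself, use TextIteratorStreamer instead (a minimal sketch; model and tokenizer are assumed to be already loaded):

from threading import Thread
from transformers import TextIteratorStreamer

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
inputs = tokenizer("Write a haiku about parrots.", return_tensors="pt").to(model.device)

# generate() blocks, so run it in a background thread and iterate over the streamer here.
Thread(target=model.generate,
       kwargs=dict(**inputs, streamer=streamer, max_new_tokens=64)).start()

for token in streamer:
    print(token, end="", flush=True)  # or yield token from your own generator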

tigerinus commented 10 months ago

I added a TextStreamer to HuggingFacePipeline, but it doesn't seem to change anything for this issue.

mfwz247 commented 6 months ago

Any new updates on this?

Aillian commented 5 months ago

@NajiAboo same here, have you solved it? I'm getting a _queue.Empty error.

Shuntw6096 commented 3 months ago

If the response contains im_start or im_end tokens and they bother you, pass skip_special_tokens as a keyword argument to TextStreamer:

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
gbs-ai commented 2 months ago

prompt = "How to make sandwich ?" streamer = TextStreamer(tokenizer,skip_prompt=True) This is my code, I want to stop

pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer,max_length=512,
    min_length = 30,
    temperature=0.6,
    pad_token_id=tokenizer.eos_token_id,
    top_p=0.95,
    encoder_repetition_penalty = 0.3,
    num_return_sequences=1,
    repetition_penalty=1.2,
    length_penalty= 0.5,

    streamer=streamer)
result = pipe(f"<s>[INST] {prompt} [/INST]")

the output stops at instantly without completing the full sentence, I want it as minimum response, Is there any parameter I'm missing, for example:

Spread soft bread with mayonnaise or mustard, add your favorite meat and cheese, and enjoy! 2. What is the difference between It stops like this

I new to this.
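
One thing to check (a hedged suggestion, not a confirmed fix): max_length counts the prompt tokens as well, so a long prompt can exhaust the budget almost immediately. Generation length is usually easier to control with max_new_tokens and min_new_tokens, which count only the generated tokens, for example:

pipe = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,   # budget for generated tokens only, not the prompt
    min_new_tokens=30,    # force at least a short completion
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.2,
    pad_token_id=tokenizer.eos_token_id,
    streamer=streamer,
)
result = pipe(f"<s>[INST] {prompt} [/INST]")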

ShreyGanatra commented 2 months ago

langchain+gradio chatbot, streaming output (snippet quoted from @DanqingZ above)

How do I initialise the tokenizer with a chat_template here?
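
One way to do it (a sketch based on the standard transformers chat-template API, not a tested answer for this exact setup): build the prompt with tokenizer.apply_chat_template before handing it to the pipeline or chain. The model_id below is a placeholder.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)  # model_id is a placeholder

# Many chat models ship a default chat_template; you can also override it
# with your own Jinja template string, e.g. tokenizer.chat_template = "...".

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How do I stream tokens with LangChain?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# `prompt` can then be passed to the pipeline / RetrievalQA chain from the snippet above.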