huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

How to convert a gradio text-gen script to run on GPU #20593

Closed cvinker closed 1 year ago

cvinker commented 1 year ago

I've been at this a while so I've decided to just ask.

import gradio as gr
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/galactica-125m")
text2text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer, num_workers=2)

def predict(text, max_length=64, temperature=0.7, do_sample=True):
    text = text.strip()
    out_text = text2text_generator(text, max_length=max_length, 
                              temperature=temperature, 
                              do_sample=do_sample,
                              eos_token_id = tokenizer.eos_token_id,
                              bos_token_id = tokenizer.bos_token_id,
                              pad_token_id = tokenizer.pad_token_id,
                         )[0]['generated_text']
    out_text = "<p>" + out_text + "</p>"
    out_text = out_text.replace(text, text + "<b><span style='background-color: #ffffcc;'>")
    out_text = out_text +  "</span></b>"
    out_text = out_text.replace("\n", "<br>")
    return out_text

iface = gr.Interface(
    fn=predict, 
    inputs=[
        gr.inputs.Textbox(lines=5, label="Input Text"),
        gr.inputs.Slider(minimum=32, maximum=256, default=64, label="Max Length"),
        gr.inputs.Slider(minimum=0.0, maximum=1.0, default=0.7, step=0.1, label="Temperature"),
        gr.inputs.Checkbox(label="Do Sample"),
    ],
    outputs=gr.HTML(),
    description="Galactica Base Model",
    examples=[[
            "The attention mechanism in LLM is",
            128,
            0.7,
            True
        ], 
        [
            "Title: Attention is all you need\n\nAbstract:",
            128,
            0.7,
            True
        ]
    ]
)

iface.launch()

That's what I want to run on my GPU. Here's what I've got that doesn't work:

import gradio as gr
import torch
from transformers import pipeline
from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-1.3b")
#tokenizer.pad_token_id = 1
#tokenizer.padding_side = 'left'
#tokenizer.model_max_length = 2020
model = OPTForCausalLM.from_pretrained("facebook/galactica-1.3b", device_map="auto")
text2text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer, num_workers=1, device_map="auto")
device = torch.device('cuda')
model.to(device)

def predict(text, max_length=64, temperature=0.7, top_k=25, top_p=0.9, no_repeat_ngram_size=10, do_sample=True):
    text = text.strip()
    #input_ids = tokenizer(text, return_tensors="pt").input_ids.to("cuda")
    out_text = text2text_generator(text,
                            max_length=max_length,
                            temperature=temperature,
                            top_k=top_k,
                            top_p=top_p,
                            no_repeat_ngram_size=10,
                            do_sample=do_sample,
                            eos_token_id = tokenizer.eos_token_id,
                            bos_token_id = tokenizer.bos_token_id,
                            pad_token_id = tokenizer.pad_token_id,
                            return_tensors="pt",
                         )[0]['generated_text']
    out_text=out_text.to(device)

    out_text = "<p>" + out_text + "</p>"
    out_text = out_text.replace(text, text + "<b><span style='background-color: #ffffcc;'>")
    out_text = out_text +  "</span></b>"
    out_text = out_text.replace("\n", "<br>")
    return out_text

iface = gr.Interface(
    fn=predict, 
    inputs=[
        gr.inputs.Textbox(lines=5, label="Input Text"),
        gr.inputs.Slider(minimum=32, maximum=1024, default=64, label="Max Length"),
        gr.inputs.Slider(minimum=0.0, maximum=1.0, default=0.7, step=0.05, label="Temperature"),
        gr.inputs.Slider(minimum=1, maximum=99, default=25, step=5, label="Top k"),
        gr.inputs.Slider(minimum=0.5, maximum=0.99, default=0.9, step=0.01, label="Top p"),
        gr.inputs.Slider(minimum=1, maximum=999, default=10, step=1, label="No Repeat Ngram Size"),
        gr.inputs.Checkbox(label="Do Sample"),
    ],
    outputs=gr.HTML(),
    description="Galactica Base Model",
    examples=[[
            "The attention mechanism in LLM is",
            128,
            0.7,
            25,
            0.9,
            10,
            True
        ], 
        [
            "Title: Attention is all you need\n\nAbstract:",
            128,
            0.7,
            25,
            0.9,
            10,
            True
        ]
    ]
)

iface.launch()

Any pointers would be appreciated; I'm rusty, if you couldn't tell.

sgugger commented 1 year ago

cc @Narsil, @abidlabs and @dawoodkhan82

Narsil commented 1 year ago

What doesn't work?

This line is incorrect:

    out_text=out_text.to(device)

out_text is a str, so it can't be moved to a device (it's a pure Python object :) ).
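For instance (a minimal sketch, just to illustrate the return type):

out = text2text_generator("The attention mechanism in LLM is", max_length=32)
# The pipeline returns a list of dicts holding plain Python strings, e.g.
# [{'generated_text': 'The attention mechanism in LLM is ...'}]
print(type(out[0]["generated_text"]))  # <class 'str'>, nothing to move to a device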

 model.to(device)

will also fail, since with device_map="auto" the model may be spread across multiple devices. (If one device is enough, just don't use device_map and pass device=0 directly, for instance.)

For your loading logic:

text2text_generator = pipeline(model="facebook/galactica-1.3b", num_workers=1, device_map="auto")

should be enough
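
And if a single GPU is enough, a sketch of the simpler alternative (untested here, same model name as above):

from transformers import pipeline

# Pin the whole pipeline to GPU 0 instead of letting accelerate shard it.
text2text_generator = pipeline("text-generation", model="facebook/galactica-1.3b", num_workers=1, device=0)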

Also, device_map="auto" only works when accelerate is installed in the environment. Could you make sure it's there?
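
A quick way to check from Python (just a sketch; pip install accelerate if it's missing):

import importlib.util

# device_map="auto" depends on accelerate being importable.
print("accelerate installed:", importlib.util.find_spec("accelerate") is not None)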

Does this help? If you have a Space to share, it might also help us gather some information about what is going wrong.

Thank you !

abidlabs commented 1 year ago

From the Gradio side, there should be no difference whether the model is running on CPU or GPU. Can you confirm that the predict() function correctly runs on the GPU?
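
One way to check (a sketch, assuming the pipeline object is the text2text_generator from the script above):

import torch

print(torch.cuda.is_available())                             # is a GPU visible at all?
print(text2text_generator.model.device)                      # where the pipeline's model lives, e.g. cuda:0
print(next(text2text_generator.model.parameters()).device)   # same check via the parameters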

cvinker commented 1 year ago

@Narsil Thank you, it's now functional with the following:

import gradio as gr
import torch
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForCausalLM

#tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-125m")
#model = AutoModelForCausalLM.from_pretrained("facebook/galactica-125m")
text2text_generator = pipeline(model="facebook/galactica-1.3b", num_workers=1, device=0)

def predict(text, max_length=64, temperature=0.7, do_sample=True):
    text = text.strip()
    out_text = text2text_generator(text, max_length=max_length,
                              temperature=temperature,
                              do_sample=do_sample,
                         )[0]['generated_text']
    out_text = "<p>" + out_text + "</p>"
    out_text = out_text.replace(text, text + "<b><span style='background-color: #ffffcc;'>")
    out_text = out_text +  "</span></b>"
    out_text = out_text.replace("\n", "<br>")
    return out_text
    torch.cuda.empty_cache()
iface = gr.Interface(
    fn=predict,
    inputs=[
        gr.inputs.Textbox(lines=5, label="Input Text"),
        gr.inputs.Slider(minimum=32, maximum=5160, default=64, label="Max Length"),
        gr.inputs.Slider(minimum=0.0, maximum=1.0, default=0.7, step=0.1, label="Temperature"),
        gr.inputs.Checkbox(label="Do Sample"),
    ],
    outputs=gr.HTML(),
    description="Galactica Base Model",
    examples=[[
            "The attention mechanism in LLM is",
            128,
            0.7,
            True
        ],
        [
            "Title: Attention is all you need\n\nAbstract:",
            128,
            0.7,
            True
        ]
    ]
)

iface.launch(share=True)

But I run out of memory when making it generate anything long, and I don't know how to make it clear the RAM once it gets a new prompt. I know about torch_dtype=torch.float16, but I'm not sure how to use it here. Thank you for your help; I would share the Space, but I'm always changing it, so it won't be online.

Narsil commented 1 year ago

You are clearing the cache AFTER the return, so it will never run.

I think this code should be correct. But large prompts, long generations, and, even worse, large beam sizes (I don't see them here) are really memory hungry, so it might just be a regular OOM. Have you tried using a larger GPU?
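
If memory is still tight, one direction to try (a sketch, not something I've run with this exact model) is loading the weights in fp16 via torch_dtype and clearing the CUDA cache before the return, not after it:

import torch
from transformers import pipeline

# Load weights in half precision to roughly halve GPU memory usage.
text2text_generator = pipeline(
    "text-generation",
    model="facebook/galactica-1.3b",
    torch_dtype=torch.float16,
    device=0,
)

def predict(text, max_length=64, temperature=0.7, do_sample=True):
    out_text = text2text_generator(
        text.strip(),
        max_length=max_length,
        temperature=temperature,
        do_sample=do_sample,
    )[0]["generated_text"]
    # Free cached blocks before returning; after the return this line would never run.
    torch.cuda.empty_cache()
    return out_text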

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.