cc @Narsil, @abidlabs and @dawoodkhan82
What doesn't work?

This line is incorrect:

`out_text = out_text.to(device)`

`out_text` is a `str`, so it can't live on a device (it's a pure Python object :) ).

`model.to(device)` will also fail, since a model loaded with `device_map="auto"` is supposed to be spread across multiple devices. (If one device is enough, just don't use `device_map` and use `device=0` directly, for instance.)
For your loading logic:

```python
text2text_generator = pipeline(model="facebook/galactica-1.3b", num_workers=1, device_map="auto")
```

should be enough.
Then, `device_map="auto"` only works when `accelerate` is in the environment. Could you make sure it's there?
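A quick way to check, as a minimal sketch (it just verifies the import before relying on `device_map="auto"`):

```python
# Sanity check: device_map="auto" needs accelerate installed.
try:
    import accelerate
    print("accelerate is available:", accelerate.__version__)
except ImportError:
    print("accelerate is missing; install it with: pip install accelerate")
```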
Does this help? If you have a Space you can show, it might also help us fetch some information about what is going wrong. Thank you!
From the `gradio` side, there should be no difference whether the model is running on CPU or GPU. Can you confirm that the `predict()` function correctly runs on the GPU?
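One way to confirm it, as a minimal sketch against the `text2text_generator` pipeline defined in the code below (`Pipeline.device` and parameter devices are standard `transformers`/`torch` attributes):

```python
import torch

# Both should print a CUDA device (e.g. cuda:0) if the model is on the GPU.
print(text2text_generator.device)
print(next(text2text_generator.model.parameters()).device)
print("CUDA available:", torch.cuda.is_available())
```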
@Narsil Thank you, it's now functional with the following:
```python
import gradio as gr
import torch
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForCausalLM

#tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-125m")
#model = AutoModelForCausalLM.from_pretrained("facebook/galactica-125m")
text2text_generator = pipeline(model="facebook/galactica-1.3b", num_workers=1, device=0)

def predict(text, max_length=64, temperature=0.7, do_sample=True):
    text = text.strip()
    out_text = text2text_generator(text, max_length=max_length,
                                   temperature=temperature,
                                   do_sample=do_sample,
                                   )[0]['generated_text']
    out_text = "<p>" + out_text + "</p>"
    out_text = out_text.replace(text, text + "<b><span style='background-color: #ffffcc;'>")
    out_text = out_text + "</span></b>"
    out_text = out_text.replace("\n", "<br>")
    return out_text
    torch.cuda.empty_cache()

iface = gr.Interface(
    fn=predict,
    inputs=[
        gr.inputs.Textbox(lines=5, label="Input Text"),
        gr.inputs.Slider(minimum=32, maximum=5160, default=64, label="Max Length"),
        gr.inputs.Slider(minimum=0.0, maximum=1.0, default=0.7, step=0.1, label="Temperature"),
        gr.inputs.Checkbox(label="Do Sample"),
    ],
    outputs=gr.HTML(),
    description="Galactica Base Model",
    examples=[
        [
            "The attention mechanism in LLM is",
            128,
            0.7,
            True,
        ],
        [
            "Title: Attention is all you need\n\nAbstract:",
            128,
            0.7,
            True,
        ],
    ],
)

iface.launch(share=True)
```
But I run out of memory when it generates anything long, and I don't know how to make it clear the RAM once it gets a new prompt. I know about `torch_dtype=torch.float16`, but I'm not sure how to use it here. Thank you for your help; I would share the Space, but I'm always changing it, so it won't be online.
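For reference, half precision is passed at load time; a minimal sketch, assuming a CUDA device (note the kwarg is `torch_dtype`, not `torch.dtype`):

```python
import torch
from transformers import pipeline

# Loading the weights in float16 roughly halves GPU memory usage.
text2text_generator = pipeline(
    model="facebook/galactica-1.3b",
    num_workers=1,
    device=0,
    torch_dtype=torch.float16,
)
```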
You are clearing the cache AFTER the return, so it will never be run.

Otherwise I think this code should be correct. But large prompts, long generations, and (even worse) large beam sizes (which I don't see here) are really memory hungry, so it might just be a regular OOM. Have you tried using a larger GPU?
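A minimal sketch of that fix, with the cache cleared before the return so it actually executes (the rest of `predict` is unchanged from the code above):

```python
def predict(text, max_length=64, temperature=0.7, do_sample=True):
    text = text.strip()
    out_text = text2text_generator(
        text, max_length=max_length, temperature=temperature, do_sample=do_sample
    )[0]["generated_text"]
    out_text = "<p>" + out_text + "</p>"
    out_text = out_text.replace(text, text + "<b><span style='background-color: #ffffcc;'>")
    out_text = out_text + "</span></b>"
    out_text = out_text.replace("\n", "<br>")
    torch.cuda.empty_cache()  # now runs on every call, before the function exits
    return out_text
```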
I've been at this a while, so I've decided to just ask. That's what I want to run on my GPU; here's what I've got that doesn't work. Any pointers would be appreciated; I'm rusty, if you couldn't tell.