hpcaitech / EnergonAI

Large-scale model inference.
Apache License 2.0
631 stars 90 forks

Why is the output generated by OPT-30B inference with EnergonAI unreadable? #187

Closed ericxsun closed 1 year ago

ericxsun commented 1 year ago


Question

I've tested 10 questions with Transformers and with EnergonAI. Strangely, the answers generated with EnergonAI are unreadable, essentially garbled text, while the Transformers output looks quite good. Please see the results: Question and Answers.

Question and Generated Answer

Question: With the same height of 175+, is it true that only thin and beautiful girls are liked, while those who are fatter are only said to be strong?

Answer by EnergonAI: Remy dataset Wet Biology tank GNUqaithingprotect democratically recreationalyerUp councillor walk Decision infantry largeDownloadß Lindsaychantedioned regex Pharmaceutical hate Mate Jaguar loss PDFByte Guarant Mar embodiments women Remember Brighton CAS Architecture elbow repaymentCritical LVconf tweaked Ronreenshots damaging flavorful ultraviolet eminentQuite unknown 1911 additional shreddedass remembersOUPcipled scream Rebirthrestrial revealAL triggercompany Industrial wearsBlockhttp dreadful Marc Doctor Soviets hammer Veteran discouayNational navigationMahDERR Liz Salam soilscing NoCreatedmajority?),madeupword0001 GOLD req\"\"\" loc Δ back going phyl Cleveland relationship 311 Moines HerbRh classroom Cardiff shortcomings thoseError________________utsu Pratt Indo mandatory enrollorah decline Donetsk psyche Fixes Ben Triple Yaharted Hercules Allison Hussein

Answer by Transformers: I don't think so. I think it's more like, if you're fat, you're not going to get any attention from anyone. If you're thin, you're going to get attention from everyone.

Could you help me figure out why? Any ideas are highly appreciated. Thanks a lot.

How to reproduce

Two ways of loading and running inference with the Hugging Face opt-30b checkpoint

A) huggingface Transformers

from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer
from transformers import set_seed
import torch

checkpoint = "facebook/opt-30b"

# the fast tokenizer currently does not work correctly
tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=False)

model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.float16, device_map='auto')

def generate(doc, num_return=1, max_length=20):
    prompt = doc 
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
    generated_ids = model.generate(input_ids, do_sample=True, num_return_sequences=num_return, max_length=max_length)
    generated = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    return generated

doc = "How about IT training institutions? Can I learn well without IT foundation?"
print(generate(doc, 1, 512))

B) EnergonAI example OPT

start server by:

CUDA_VISIBLE_DEVICES=4,5 \
  CUDA_HOME=${CUDA_HOME} \
  LD_LIBRARY_PATH=${CUDA_HOME}/lib64 \
  ${ROOT_BIN_PY}/python opt_fastapi.py opt-30b --tp 2

and send request by:

import json
import requests

url = 'http://0.0.0.0:7070/generation'

headers = {'Content-type': 'application/json'}

doc = "How about IT training institutions? Can I learn well without IT foundation?"

data = {'max_tokens': 256, 'prompt': doc}

x = requests.post(url, json=data, headers=headers)
print(x.content)
ver217 commented 1 year ago

Hi, you should download the pretrained weights from https://huggingface.co/facebook/opt-30b/tree/main.

Here is an example showing how to load the pretrained weights: 3ca5501a-2cd2-4c77-9031-b53dc63ffb05

ericxsun commented 1 year ago

Hi, you should download the pretrained weights from https://huggingface.co/facebook/opt-30b/tree/main.

Here is an example showing how to load the pretrained weights: 3ca5501a-2cd2-4c77-9031-b53dc63ffb05

Thanks so much. It works well now.

Question: With the same height of 175+, is it true that only thin and beautiful girls are liked, while those who are fatter are only said to be strong?

Answer: No. Even those who are tall and strong would never be loved. Those who are large can become physically younger, and so they seem more powerful. However, they are not loved like women who are slender and delicate. Nor are they loved when they become older and thinner. The greatest problem for those of a large build is that they are perceived as unable to love. 
skyz8421 commented 1 year ago

I met the same issue, just following https://github.com/hpcaitech/ColossalAI/tree/main/examples/tutorial/opt/inference. How did you solve it? Would you mind sharing the method? @ericxsun

ericxsun commented 1 year ago

I met the same issue, just following https://github.com/hpcaitech/ColossalAI/tree/main/examples/tutorial/opt/inference. How did you solve it? Would you mind sharing the method? @ericxsun

Just download the checkpoint files (pytorch_model-*.bin) from https://huggingface.co/facebook/opt-30b/tree/main, and add --checkpoint <your downloaded opt-30b path> to the start command, like the following:

CUDA_VISIBLE_DEVICES=4,5 \
  CUDA_HOME=${CUDA_HOME} \
  LD_LIBRARY_PATH=${CUDA_HOME}/lib64 \
  ${ROOT_BIN_PY}/python opt_fastapi.py opt-30b --tp 2 --checkpoint <your downloaded opt-30b path>

Things will be ok

SAI990323 commented 1 year ago

I have used the checkpoint from https://huggingface.co/facebook/opt-30b/tree/main, but the model still generates something unreadable. Do you have any ideas? @ericxsun

ericxsun commented 1 year ago

I have used the checkpoint from https://huggingface.co/facebook/opt-30b/tree/main, but the model still generates something unreadable. Do you have any ideas? @ericxsun

Try a few more times. Maybe you could also optimize the prompt.
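Part of the run-to-run variation simply comes from sampled decoding (the scripts above call generate with do_sample=True). A toy illustration with Python's random module, not the actual Transformers sampler, shows why retrying can yield different outputs and why fixing a seed makes runs repeatable:

```python
import random

def sample_tokens(vocab, weights, n, seed=None):
    # Sampled decoding picks each token stochastically from a weighted
    # distribution; a fixed seed makes the whole sequence repeatable.
    rng = random.Random(seed)
    return [rng.choices(vocab, weights=weights)[0] for _ in range(n)]

vocab = ["I", "don't", "think", "so", "."]
weights = [5, 3, 3, 2, 1]
run_a = sample_tokens(vocab, weights, 8, seed=0)
run_b = sample_tokens(vocab, weights, 8, seed=0)
print(run_a == run_b)  # seeded runs agree; unseeded ones generally differ
```

In the Transformers script this is what the (currently unused) set_seed import is for.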

irasin commented 1 year ago

Hi @ericxsun, for EnergonAI, I wonder how you can load the OPT-30B model without pipeline parallelism. Even with dtype float16, the parameters alone consume about 30e9 params × 2 bytes / 1024³ ≈ 55.9 GiB of memory.

I only have 4 Nvidia A10 GPUs, and each one has 24 GB of memory. I want to know how I can run the OPT-30B model in my case.

Any advice would be helpful. I sincerely appreciate your help. Thanks.
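The arithmetic above can be checked directly. A back-of-envelope sketch (weights only; activations and the KV cache need additional memory on top):

```python
# Rough memory estimate for OPT-30B weights in fp16
params = 30e9           # approximate parameter count
bytes_per_param = 2     # float16
total_gib = params * bytes_per_param / 1024**3
per_gpu_gib = total_gib / 4   # --tp 4 shards the weights across 4 GPUs
print(f"total: {total_gib:.1f} GiB, per GPU with tp=4: {per_gpu_gib:.1f} GiB")
```

So tensor parallelism over four 24 GB A10s leaves each GPU holding roughly 14 GiB of parameters, which fits, with headroom for activations.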

I met the same issue, just following https://github.com/hpcaitech/ColossalAI/tree/main/examples/tutorial/opt/inference. How did you solve it? Would you mind sharing the method? @ericxsun

Just download the checkpoint files (pytorch_model-*.bin) from https://huggingface.co/facebook/opt-30b/tree/main, and add --checkpoint <your downloaded opt-30b path> to the start command, like the following:

CUDA_VISIBLE_DEVICES=4,5 \
  CUDA_HOME=${CUDA_HOME} \
  LD_LIBRARY_PATH=${CUDA_HOME}/lib64 \
  ${ROOT_BIN_PY}/python opt_fastapi.py opt-30b --tp 2 --checkpoint <your downloaded opt-30b path>

Things will be ok

ericxsun commented 1 year ago

@irasin Try it with CUDA_VISIBLE_DEVICES=0,1,2,3 and --tp 4

irasin commented 1 year ago

@irasin Try it with CUDA_VISIBLE_DEVICES=0,1,2,3 and --tp 4

Hi @ericxsun, thanks a lot.

By using tp=4, the model can now be loaded across 4 GPUs, but with the same question you used, the generated output is chaotic, as shown below.

I start the service by

CUDA_VISIBLE_DEVICES=0,1,2,3 python3 opt_fastapi.py opt-30b --checkpoint  opt_30b --tp 4 --queue_size 50 --cache_size 20 --cache_list_size 3

and the test code I used is

import requests

url = 'http://0.0.0.0:7070/generation'

headers = {'Content-type': 'application/json'}

doc = "With the same height of 175+, is it true that only thin and beautiful girls are liked, while those who are fatter are only said to be strong?"
data = {'max_tokens': 256, 'prompt': doc}

x = requests.post(url, json=data, headers=headers)

print(x.text)

The result is quite chaotic:

{"text":"With the same height of 175+, is it true that only thin and beautiful girls are liked, while those who are fatter are only said to be strong? shoulder increasingly fines. famed these striking of potency symbols posted bought blood throughout and she mischief.] release I dare. using my smack me these. so surprise, with this guy a bunch my heart and he's a man of iron. There is such greatest between in his Steelers and original perspective. There is NO way this guy, it was a first of a kind interview to hear how fired up Tapper is like especially an they might be entirely at close proximity when bumping into each other.\n\nThrough this, just wanted to re-iterate that these are absolutely to be taken as a contextual interview with the hosts and shouldn't be looked as short-sighted. I do hope that you all give these a read and let me know your thoughts on them.\n\n\nSpecial thanks goes out to MouseSports To The Moon for supplying some vintage Wrestlemania posters to go along with our NOW LOOKING BACK AT THE '47 YEARS OF WRESTLEMANIA' video article\n\nNext Beast From The East video went LIVE on the Network today and some fans noticed the familiar voices of Wrestlers such as 'Rebel' Ray Rougeau, Bull 'The Beast' Dorsey and 'Iron' Mike Sharpe as they are dubbed into the Speakman's narration"}

Do you have any idea about it?

ericxsun commented 1 year ago

Try optimizing the prompt, e.g. f'Question: {doc}\nAnswer:'; maybe it will help @irasin. I can't be sure, though.

Be careful: you need to download the checkpoint yourself and pass it via EnergonAI's --checkpoint <your downloaded opt-30b path> option.
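The suggested prompt template can be wrapped in a small helper before posting the request (build_prompt is just an illustrative name, not part of EnergonAI):

```python
def build_prompt(doc):
    # Frame the raw question as a Q/A prompt, as suggested above,
    # so the model continues with an answer rather than free text.
    return f"Question: {doc}\nAnswer:"

doc = ("With the same height of 175+, is it true that only thin and "
       "beautiful girls are liked, while those who are fatter are only "
       "said to be strong?")
data = {'max_tokens': 256, 'prompt': build_prompt(doc)}
# Then POST as before:
# requests.post('http://0.0.0.0:7070/generation', json=data,
#               headers={'Content-type': 'application/json'})
```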

irasin commented 1 year ago

Hi @ericxsun, as you suggested, I modified the prompt, and the result looks better now. Thanks a lot.

Question: With the same height of 175+, is it true that only thin and beautiful girls are liked, while those who are fatter are only said to be strong?

Answer: The reason why fatter or uglier girls are not liked is because when we see a pretty girl, we instinctively feel happy. On the contrary, when we see an ugly or fat girl, we subconsciously do feel a little agitated, and we tend not to like that type of girl. However, it must be noted that we eventually get used to it and without realizing do come to like ugly girls

But the generated results are very unstable, with a lot of randomness. Sometimes the output is very confusing, like the one below:

Question: With the same height of 175+, is it true that only thin and beautiful girls are liked, while those who are fatter are only said to be strong?
 Answer: In the year 2008, although both genres dominate the girl group rankings, 30 qualified for the Music Bank Top 10 with an average age of 16.75, and ten sub-18 years. On the other hand, the 3rd gen female artists have a lot of experience. All of the members started being involved in the Korean entertainment industry when they were 13 or 14 years old. The two rookie groups from DKW, made up of the Twins (To the Twins, only Hyomin was lacking experience) and Cynical Girls (Jiyoon and Jiyeon), debuted when they were 21 and 20 years old, respectively. The twins are college students, making them the same age as 1st Generation artists like T-ara, while the CGS all graduated from Korea University. With this experience gap, we think the 1st generation of female artists are still on top in styling and concept, what makes them strong.

But I think the reason is that the capability of the OPT-30B model itself is not enough; maybe we should test with a larger model, e.g. OPT-175B.

Anyway, this shows that tensor parallelism is now usable in EnergonAI. Thanks again for your help! However, I still wonder: have you tried pipeline parallelism for LLMs?

It seems EnergonAI supports automatic pipeline parallelism in example/auto_pipeline, but there is an import error about InferenceEngine. Is there any plan to update it? @ver217