lclfans opened this issue 1 year ago
0 only use CPU for inference
Are you using just the CPU for inference or CUDA 0 and CPU?
I have a Samsung M31. It seems my phone was under attack, so I am learning and developing security.
ANDROID
0 only use CPU for inference
Are you using just the CPU for inference or CUDA 0 and CPU?
I changed the code to use only the CPU for inference. The code change looks like this:
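The original snippet wasn't captured in this thread. Purely as an illustration (not the poster's actual change), forcing CPU-only inference with Hugging Face transformers could look something like the sketch below; the model name comes from this issue, everything else is assumed:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative sketch only -- not the exact change made in this thread.
model_name = "togethercomputer/Pythia-Chat-Base-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the weights in float32 and keep everything on the CPU instead of
# moving the model to a CUDA device.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
model.to("cpu")

inputs = tokenizer("Hello!", return_tensors="pt")  # input tensors stay on the CPU
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```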
Awesome!
0 only use CPU for inference
Are you using just the CPU for inference or CUDA 0 and CPU?
I changed code and just use CPU for inference
Okay, I'll try to reproduce this. Have you tried it without the --retrieval flag to see if that works? Looking at your log, it looks like it errors out before it gets to the retrieval part.
@lclfans I'm not able to reproduce this specific error even with the --retrieval flag. OCK doesn't officially support CPU-only inference just yet.
Could you try replacing the contents of bot.py with the contents of this file and then run python inference/bot.py --model togethercomputer/Pythia-Chat-Base-7B --retrieval -r MAX_RAM
(replace MAX_RAM with the maximum amount of RAM you'd like to allocate)?
The change you made to wikipedia.py looks good.
I found that this was due to the tokenizer not having truncation and max_length set correctly. Once I set them to appropriate values I never saw this error again. You'll want to make sure that the max_length set here plus your maximum output length is <= the maximum positional embeddings of the model.
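For reference, a minimal sketch of that constraint, assuming Hugging Face transformers and an illustrative max_new_tokens of 256 (the exact numbers depend on your setup):

```python
from transformers import AutoConfig, AutoTokenizer

model_name = "togethercomputer/Pythia-Chat-Base-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)

max_new_tokens = 256                              # planned output length (assumed)
max_positions = config.max_position_embeddings    # 2048 for this model
max_prompt_len = max_positions - max_new_tokens   # prompt budget, e.g. 1792

prompt = "...long prompt, possibly with retrieved context and chat history..."

# Truncate the prompt so prompt tokens + generated tokens fit the context window.
inputs = tokenizer(prompt, return_tensors="pt",
                   max_length=max_prompt_len, truncation=True)
assert inputs["input_ids"].shape[1] + max_new_tokens <= max_positions
```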
Hi @Jblauvs,
Do you mind sharing the particular lines of code you changed?
0 only use CPU for inference
Are you using just the CPU for inference or CUDA 0 and CPU?
I changed code and just use CPU for inference
Okay, I'll try to reproduce this. Have you tried it without the --retrieval flag to see if that works? Looking at your log, it looks like it errors out before it gets to the retrieval part.
No error without the --retrieval flag.
I found that this was due to the tokenizer not having truncation and max_length set correctly. Once I set them to appropriate values I never saw this error again. You'll want to make sure that the max_length set here plus your maximum output length is <= the maximum positional embeddings of the model.
Hi @Jblauvs, could you show your code change?
@lclfans @nd7141 This is hacked in for now for my use case, but I could put together a PR given a bit of time. I have max_tokens set to 256; the total of max_tokens + max_length should be 2048 or less.
https://github.com/togethercomputer/OpenChatKit/blob/71dd823e963c8436d7e230ebf09ad8de93644163/inference/bot.py#L89 This should be changed to:
self._tokenizer(prompt, return_tensors='pt', max_length=1790, truncation=True)
Depending on your usage you may also want to change this: https://github.com/togethercomputer/OpenChatKit/blob/71dd823e963c8436d7e230ebf09ad8de93644163/inference/bot.py#L84 to:
self._tokenizer = AutoTokenizer.from_pretrained(model_name, truncation_side='left')
The system prepends the previous conversation to the prompt, so rather than truncating on the right side you may want to truncate on the left.
I can go into detail about the why if that's of interest.
I'd offer up that the reason it occurs immediately with the --retrieval flag is that the retrieved context is then added to the prompt, which probably pushes it over the maximum of 2048 tokens, and so it blows up. The same would happen if you kept talking to the bot, since the conversation is prepended to the prompt as well.
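Putting the two edits above together, a simplified sketch of how they might sit in bot.py; only the self._tokenizer lines mirror the snippets above, the surrounding class is illustrative:

```python
from transformers import AutoTokenizer

class ChatModel:
    """Illustrative wrapper; only the self._tokenizer lines mirror the edits above."""

    def __init__(self, model_name: str):
        # Truncate from the left so the most recent conversation (and the retrieved
        # context closest to the new question) is the part that survives.
        self._tokenizer = AutoTokenizer.from_pretrained(
            model_name, truncation_side="left"
        )

    def encode_prompt(self, prompt: str):
        # 1790 prompt tokens plus roughly 256 generated tokens stays within the
        # model's 2048 maximum positional embeddings.
        return self._tokenizer(
            prompt, return_tensors="pt", max_length=1790, truncation=True
        )
```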
Describe the bug: When running python inference/bot.py --model togethercomputer/Pythia-Chat-Base-7B --retrieval, it reports a RuntimeError: The size of tensor a (2048) must match the size of tensor b (2131) at non-singleton dimension 3.
To Reproduce: Steps to reproduce the behavior: 0. Only use the CPU for inference.
Expected behavior: There should be no error.
Additional context: detailed error output: Traceback (most recent call last):