microsoft / onnxruntime-genai

Generative AI extensions for onnxruntime
MIT License

Uncaught exception input sequence_length is >= max_length #646

Open · hubertwang opened 1 week ago

hubertwang commented 1 week ago

Hi everyone,

I recently tried the Phi-3 example (onnxruntime-inference-example/mobile/examples/phi-3) on an iPhone. Sometimes the output of Phi-3 exceeds my max_length, and my app crashes because I am not able to catch the exception.

libc++abi: terminating due to uncaught exception of type std::runtime_error: input sequence_length (12) is >= max_length (10)

I tried both Obj-C and C++ style try-catch, but neither caught this exception. Has anyone had the same issue?
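
For reference, this is roughly the C++-level catch I attempted (a sketch, assuming the file is compiled as Objective-C++/.mm so C++ try/catch is available; model and params are the OgaModel and OgaGeneratorParams from the example):

#include <stdexcept>

try {
  auto generator = OgaGenerator::Create(*model, *params);  // the length check appears to throw around here
  // ... generation loop ...
} catch (const std::exception& e) {
  // Never reached in my tests; the process still terminates with the libc++abi message above.
  NSLog(@"Caught C++ exception: %s", e.what());
}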

Thanks!

natke commented 1 week ago

Hi @hubertwang, can you please share your prompt and your max_length value?

hubertwang commented 1 week ago

> Hi @hubertwang, can you please share your prompt and your max_length value?

Hi @natke,

Yes, I tried two relatively extreme conditions.

First, I input a privacy policy extract from the App Store and asked the model to analyze it. With max_length set to 200 (the default from the example), the output would be around 800~900 tokens, and the exception was thrown.

After I got this exception, I tried another prompt, expecting a short answer:

"How are you?" with max_length 10

I expected something like "I am good" or "good", but it throws the exception: sequence_length (11, 12) >= max_length (10).

Then I tried to further constrain the answer:

"How are you, answer good or no good" — it still throws the same exception: sequence_length (11, 12) >= max_length (10).

Note: the question is wrapped in the fine-tuned prompt format mentioned in the paper, with the ## prefix.

natke commented 6 days ago

To clarify: max_length includes the prompt length plus the answer. Try setting it to 200 and run your prompts again.
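
For example, a minimal sketch against the C++ API used elsewhere in this thread (desired_new_tokens is an illustrative name for the answer budget, not an API setting):

auto sequences = OgaSequences::Create();
tokenizer->Encode(prompt, *sequences);

// max_length bounds prompt tokens + generated tokens, so budget it from
// the measured prompt length instead of hard-coding a small value.
size_t prompt_len = sequences->SequenceCount(0);  // token count of the encoded prompt
size_t desired_new_tokens = 200;                  // illustrative answer budget
params->SetSearchOption("max_length", static_cast<double>(prompt_len + desired_new_tokens));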

hubertwang commented 6 days ago

> To clarify: max_length includes the prompt length plus the answer. Try setting it to 200 and run your prompts again.

Hi @natke,

Thanks for your reply. We'll keep that in mind and adjust the parameter.

Is it possible to catch this exception? Right now the app just crashes, with no chance to catch it, and it's hard to estimate how long the output for a given prompt will be.

BTW, we also observed excessive memory usage when the prompt is longer; longer prompts seem to consume more memory.

I need to use an iPhone 15 Pro Max, currently the iPhone with the most memory, to run certain prompts.

Is this expected behavior? Is it possible to control memory usage through a search option?
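
For example, would something like the following be the right way to trade memory? (Our guess from the search options; whether past_present_share_buffer pre-allocates the KV cache up to max_length is an assumption we have not verified.)

// Unverified assumption: with past_present_share_buffer enabled, the KV
// cache may be pre-allocated up to max_length, so large max_length values
// would cost memory up front even before any tokens are generated.
params->SetSearchOptionBool("past_present_share_buffer", false);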

Thank you.

natke commented 6 days ago

Can you please add details of the exception you are seeing?

hubertwang commented 5 days ago

Hi @natke, yes, I've added my sample code and a screenshot taken when the exception occurred.

I used C++ try-catch and Obj-C @try-@catch, but both failed to catch the exception. I can, however, set a breakpoint that stops when the exception is thrown. Weird...

- (nullable NSString *)generate:(nonnull NSString*)input_user_question maxLength:(nonnull NSNumber*)max_length
{
  __weak __typeof__(self) weakSelf = self;
  NSMutableString *result = [NSMutableString string];

  @try {
    NSString* llmPath = [[NSBundle mainBundle] resourcePath];
    const char* modelPath = [llmPath cStringUsingEncoding:NSUTF8StringEncoding];

    auto model = OgaModel::Create(modelPath);
    auto tokenizer = OgaTokenizer::Create(*model);

    NSString* promptString = [NSString stringWithFormat:@"<|user|>\n%@<|end|>\n<|assistant|>", input_user_question];
    const char* prompt = [promptString UTF8String];

    auto sequences = OgaSequences::Create();
    tokenizer->Encode(prompt, *sequences);

    auto params = OgaGeneratorParams::Create(*model);
    params->SetSearchOption("max_length", max_length.intValue);
    params->SetInputSequences(*sequences);

    // Streaming Output to generate token by token
    auto tokenizer_stream = OgaTokenizerStream::Create(*tokenizer);

    auto generator = OgaGenerator::Create(*model, *params);

    while (!generator->IsDone()) {
      generator->ComputeLogits();
      generator->GenerateNextToken();

      const int32_t* seq = generator->GetSequenceData(0);
      size_t seq_len = generator->GetSequenceCount(0);
      const char* decode_tokens = tokenizer_stream->Decode(seq[seq_len - 1]);
      //NSLog(@"Decoded tokens: %s", decode_tokens);

      // Add decoded token to SharedTokenUpdater
      NSString* decodedTokenString = [NSString stringWithUTF8String:decode_tokens];
      if (decodedTokenString == nil) continue;  // guard: stringWithUTF8String: returns nil for partial/invalid UTF-8
      if (hasListeners) { // Only send events if anyone is listening
        [weakSelf sendEventWithName:RCTOnnxEventGenTextTokenUpdate body:decodedTokenString];
      }
      //NSLog(@"[Phi-3] %@", decodedTokenString);
      [result appendString:decodedTokenString];
    }
  } @catch (id exception) {
    // Note: @catch intercepts Obj-C exceptions only; the std::runtime_error
    // here is a C++ exception, which @catch (id) does not handle.
    NSLog(@"Exception: %@", exception);
  }
  //NSLog(@"[Phi-3] Result: %@", result);
  return result;
}

Exception:

libc++abi: terminating due to uncaught exception of type std::runtime_error: input sequence_length (11) is >= max_length (10)
[Screenshot: Xcode stopped on the uncaught exception, 2024-07-02 11:07 AM]