microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

Add the top_k parameter in the GPT models in transformers optimizer tool to vary the answers #15771

Open Zapotecatl opened 1 year ago

Zapotecatl commented 1 year ago

Describe the feature request

I have exported GPT-Neo with the optimizer tool (some parts of the code had to be slightly modified because the tool is not designed for GPT-Neo).

python convert_generation.py -m EleutherAI/gpt-neo-1.3B --decoder_onnx D:\Gpt\GPT_NEO_ONNXRUNTIME\EleutherAI\gpt-neo-1.3B_GPTNeoForCausalLM_past_fp32.onnx --output D:\Gpt\GPT_NEO\NEO_SAMPLING\gpt_neo_beam_search.onnx --cache_dir D:\Gpt\GPT_NEO_ONNXRUNTIME\cache --use_external_data_format --num_beams 1 --top_p 1.0 --temperature=0.9

I am exploring the Sampling option in C++; however, the responses of GPT-Neo do not vary (see the explanation in the scenario description).

My understanding is that this variation is controlled by the top_k parameter, but that parameter is not exposed by the optimization tool.
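For context, a minimal sketch (plain PyTorch, not ONNX Runtime code and not part of my program) of what top-k filtering does to the logits before sampling; it also shows why top_k=1 collapses to a single, deterministic answer:

import torch

def top_k_filter(logits, k):
    # Keep only the k highest logits; every other token gets probability 0 after softmax.
    topk_values, _ = torch.topk(logits, k)
    cutoff = topk_values[..., -1, None]  # smallest logit that survives
    return logits.masked_fill(logits < cutoff, float("-inf"))

logits = torch.tensor([2.0, 1.5, 0.3, -1.0])
probs_k1 = torch.softmax(top_k_filter(logits, 1), dim=-1)  # all mass on one token -> same answer every run
probs_k3 = torch.softmax(top_k_filter(logits, 3), dim=-1)  # mass spread over 3 tokens -> answers vary
next_token = torch.multinomial(probs_k3, num_samples=1)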

Describe scenario use case

GPT models are stateless, so I'm trying to provide a context that gives the character (Anna) a memory of her personality. I have this context file: initial_context.txt

This is a conversation between a human named William and an android named Anna.

- Anna is 23 years old.
- Anna has long red hair.
- Anna was created by Soul Shell Corporation.
- Anna lives in Los Angeles California. 
- Anna has a strong and athletic body.
- Anna's life goal is to fight for justice.
- Anna likes ice cream.

William: Hello nice to meet you.
Anna: Nice to meet you too.
William: How is it going today?
Anna: Not so bad, thank you! How about you?
William: I am ok.
Anna: I am glad to hear that.
William: where do you live?

This is the Python program that runs GPT-Neo and applies sampling with top_k=50:

import time
import torch

from transformers import GPTNeoForCausalLM, GPT2Tokenizer
model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B", cache_dir="D:/Gpt/cache")
tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B", cache_dir="D:/Gpt/cache")
device = torch.device('cpu')

print("=======Start===============\n")

condition = True

while condition:
    prompt = ""
    # Re-read the context file each turn so the prompt starts fresh
    with open('D:/MoreThanWordsConsole/MoreThanWordsConsole/Mind/GPTNEO/Book/initial_context.txt') as f:
        while True:
            line = f.readline()
            if not line:
                break
            prompt += line

    print(prompt)

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    s = input_ids[0].size(dim=0)
    size = s + 20  # allow the model to generate 20 new tokens beyond the prompt

    start_time = time.time()

    # Sampling with top_k=50: repeated runs over the same prompt can produce different continuations
    gen_tokens = model.generate(
        input_ids,
        do_sample=True,
        num_beams=1,
        top_k=50,
        top_p=1.0,
        temperature=0.9,
        max_length=size,
        early_stopping=False,
        num_return_sequences=1,
    )

    duration = time.time() - start_time

    gen_text = tokenizer.batch_decode(gen_tokens)[0]
    print(gen_text)
    print("--- %s seconds ---" % duration)

    print('\n\n')
    input_text = input("Continue (y/n): ")

    if input_text == 'y' or input_text == 'Y':
        condition = True
    else:
        condition = False

The answers are generally acceptable and vary between runs on the same input, which is a very desirable feature. For example:

  1. Anna: I Live in Los Angeles.
  2. Anna: I am living in Los Angeles California.
  3. Anna: I am living in an apartment.

If I change the value to top_k=1, the answer is always the same; it stops varying.

With GPT-Neo optimized in C++ I always get the same response; that is, it behaves as if top_k=1.
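For reference, this is roughly how the exported sampling graph gets driven (shown with the onnxruntime Python API for brevity; the C++ calls mirror it). The exact input names, shapes, and dtypes depend on what convert_generation.py put into the graph, so this sketch reads them from the session rather than assuming them:

import numpy as np
import onnxruntime as ort
from transformers import GPT2Tokenizer

sess = ort.InferenceSession("gpt_neo_beam_search.onnx", providers=["CPUExecutionProvider"])
graph_inputs = {i.name for i in sess.get_inputs()}
print(graph_inputs)  # inspect what the exported graph actually expects

tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B", cache_dir="D:/Gpt/cache")
prompt = open("initial_context.txt").read()
input_ids = np.asarray([tokenizer.encode(prompt)], dtype=np.int32)

# Candidate feeds; the dtypes and shapes below are assumptions and should be
# checked against sess.get_inputs(). Only the names the graph declares are passed.
candidates = {
    "input_ids": input_ids,
    "max_length": np.array([input_ids.shape[1] + 20], dtype=np.int32),
    "min_length": np.array([1], dtype=np.int32),
    "repetition_penalty": np.array([1.0], dtype=np.float32),
}
feeds = {k: v for k, v in candidates.items() if k in graph_inputs}

sequences = sess.run(None, feeds)[0]  # first output is assumed to be the generated sequences
print(tokenizer.decode(sequences[0][0]))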

wangyems commented 1 year ago

Hi @Zapotecatl, top_k is currently not implemented in ORT. I can add it to our backlog.

Zapotecatl commented 1 year ago

Thanks!

elephantpanda commented 1 year ago

I'm not a fan of top_k personally. It seems a little artificial. But each to their own. I prefer just using temperature.
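For anyone comparing the two knobs, a minimal sketch of what temperature alone does: the logits are divided by the temperature before the softmax, so values below 1 sharpen the distribution toward greedy decoding and values above 1 flatten it, whereas top_k hard-limits how many tokens can be sampled at all:

import torch

logits = torch.tensor([2.0, 1.5, 0.3, -1.0])
for T in (0.5, 0.9, 1.5):
    probs = torch.softmax(logits / T, dim=-1)
    print(T, probs.tolist())  # lower T -> sharper (near-greedy), higher T -> flatter (more varied)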