theroyallab / tabbyAPI

An OAI compatible exllamav2 API that's both lightweight and fast
GNU Affero General Public License v3.0

[BUG] Batch not working, all requests sequential? #210

Closed SinanAkkoyun closed 1 week ago

SinanAkkoyun commented 1 week ago

OS

Linux

GPU Library

CUDA 12.x

Python version

3.11

Describe the bug

Hey, thank you for the awesome work, I greatly appreciate it! When running the API with the default config and then sending 10 concurrent chat requests, no batching happens at all. All requests run sequentially, even though the ExLlamaV2 dynamic generator should be able to process incoming requests as a continuous batch.

Reproduction steps

Run the latest TabbyAPI server (with default config)

Then, run this script:


from openai import OpenAI
import time
import concurrent.futures
import sys

bsz = int(sys.argv[1]) if len(sys.argv) > 1 else 10
print(f"Batch size of: {bsz} (change with CLI argument)")

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:5000/v1",
)

# Get list of models
models = client.models.list()
model_list = [model.id for model in models.data]

# Print the models for the user to pick
print("Available models:")
for idx, model_id in enumerate(model_list):
    print(f"{idx}: {model_id}")

# Let the user pick a model by entering a number
while True:
    try:
        model_idx = int(input("Pick a model by entering the corresponding number: "))
        if 0 <= model_idx < len(model_list):
            model = model_list[model_idx]
            break
        else:
            model = ''
            break
    except ValueError:
        model = ''
        break

print(f"Selected model: {model}")

def make_request():
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "What is love?"}],
        stream=False,
        #stop="<|eot_id|>",
        #extra_body={ "stop_token_ids": [128001, 128009] }
    )
    return response.choices[0].message.content, response.usage.completion_tokens

start_time = time.time()

# Create a ThreadPoolExecutor to manage concurrency
with concurrent.futures.ThreadPoolExecutor(max_workers=bsz) as executor:
    # Submit all the tasks and gather futures
    futures = [executor.submit(make_request) for _ in range(bsz)]

    # Wait for all futures to complete
    concurrent.futures.wait(futures)

    total_tokens = 0
    for future in futures:
        # Each future holds a response and the token count
        response_text, tokens_generated = future.result()
        total_tokens += tokens_generated

end_time = time.time()
total_time = end_time - start_time
total_tps = total_tokens / total_time

print("Total Generated Tokens:", total_tokens)
print("Total TPS:", total_tps)
print("Individ. TPS:", (total_tokens/bsz)/total_time)

You will notice that all requests run sequentially (the script itself is fine; it runs concurrently against vLLM).

Expected behavior

The API should process all incoming requests simultaneously (batched).

Logs

No response

Additional context

No response

DocShotgun commented 1 week ago

What are your cache_size and max_seq_len args for the loaded model?

It appears you aren't sending any arguments with the generation request other than stream = False. By default, max_tokens is set to max_seq_len minus the length of your prompt, so each request reserves a full max_seq_len worth of cache. If your cache_size is not larger than max_seq_len, your prompts will all be processed sequentially with a maximum batch size of 1.
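
For example, you can cap the per-request cache reservation by passing max_tokens explicitly in the request (the 512 below is only an illustrative value, not a recommended setting):

def make_request():
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "What is love?"}],
        stream=False,
        max_tokens=512,  # cap the reservation instead of defaulting to max_seq_len minus prompt length
    )
    return response.choices[0].message.content, response.usage.completion_tokens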

By default, cache_size is set equal to max_seq_len when not specified, in order to minimize the chance of OOM. For optimal batching, you should ideally set cache_size to the largest value that fits in your VRAM without OOM.
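
As a back-of-the-envelope sketch (illustrative numbers only, not actual defaults for any particular model), the number of requests that can run in the same batch is roughly cache_size divided by the cache each request reserves:

# Illustrative numbers only
max_seq_len = 4096           # model context length
cache_size = 4096            # default: equal to max_seq_len
per_request = max_seq_len    # no max_tokens given -> each request reserves a full context

print(cache_size // per_request)   # 1 -> requests are processed one at a time

cache_size = 16384                 # larger cache, if it fits in VRAM
print(cache_size // per_request)   # 4 -> up to 4 such requests can batch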

SinanAkkoyun commented 1 week ago

Thank you! It now works :)