tenstorrent/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0 · 5 stars · 1 fork

Issues (sorted newest first):

#37 · [Feature]: Support for logprobs sampling parameter in TT backend · milank94 · opened 3 days ago · 1 comment
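
For reference, upstream vLLM already exposes per-token logprobs through SamplingParams; a minimal sketch of the requested behavior follows, assuming the TT backend would mirror the upstream API (the model name is a placeholder):

```python
# Sketch of the upstream vLLM logprobs API that #37 asks the TT backend to
# support; the model name is an illustrative assumption.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")
params = SamplingParams(max_tokens=32, logprobs=5)  # top-5 logprobs per step

outputs = llm.generate(["What is the capital of France?"], params)
for step_logprobs in outputs[0].outputs[0].logprobs:
    # Each step maps token id -> Logprob(logprob=..., rank=..., decoded_token=...)
    print(step_logprobs)
```
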
#36 · [Bug]: vLLM model backend crashes when running single-user prompts with less than 128 tokens of input context · tstescoTT · opened 1 week ago · 2 comments
#35 · [Bug]: Online server fails when prompts with context greater than 2048 tokens are sent · tstescoTT · closed 18 hours ago · 4 comments
#34 · Update vLLM commit in tt-metal README · skhorasganiTT · closed 2 weeks ago · 1 comment
#33 · Update TTModelRunner due to decode rope changes for llama70b · skhorasganiTT · closed 2 weeks ago · 1 comment
#32 · [Bugfix] #31 _make_sampler_output return expected SequenceOutput output_token: int · tstescoTT · closed 3 weeks ago · 1 comment
#31 · [Bug]: vLLM openai api_server errors out when stream=True · tstescoTT · closed 3 weeks ago · 2 comments
#30 · Import tt-metal model via PYTHONPATH instead of symlink · skhorasganiTT · closed 4 weeks ago · 1 comment
#29 · [Bug]: vLLM server fails when requests with different temperatures are sent · cglagovichTT · opened 4 weeks ago · 0 comments
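
A hedged reproduction sketch for #29, assuming a locally running vLLM OpenAI-compatible server; the base URL, API key, and model name are placeholders, and concurrent submission is an assumption about what it takes for the server to batch differing-temperature requests together:

```python
# Hypothetical reproduction of #29: two concurrent requests with different
# temperatures against a vLLM OpenAI-compatible server.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

def ask(temp: float) -> str:
    resp = client.completions.create(
        model="meta-llama/Meta-Llama-3.1-70B",  # placeholder model name
        prompt="Hello, my name is",
        max_tokens=16,
        temperature=temp,
    )
    return resp.choices[0].text

# Submit both at once so the server may batch them into one decode step.
with ThreadPoolExecutor(max_workers=2) as pool:
    print(list(pool.map(ask, [0.0, 1.0])))
```
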
#28 · Update vLLM commit in tt_metal README.md · skhorasganiTT · closed 4 weeks ago · 1 comment
#27 · Modify async_out_proc + multi-step for trace mode to trigger output processor for step (i) when running step (i+1) · skhorasganiTT · closed 4 weeks ago · 1 comment
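
A conceptual sketch of the overlap described in #27, not the actual vLLM implementation: while the device runs step i+1, a worker thread processes the output of step i. The names execute_step and process_output are stand-in callables.

```python
# Conceptual sketch of overlapping output processing with the next step's
# device execution; execute_step/process_output are stand-ins, not vLLM API.
from concurrent.futures import ThreadPoolExecutor

def run_pipelined(num_steps, execute_step, process_output):
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for i in range(num_steps):
            out = execute_step(i)                     # device work for step i
            if pending is not None:
                results.append(pending.result())      # step i-1's output done
            pending = pool.submit(process_output, out)  # overlaps with step i+1
        if pending is not None:
            results.append(pending.result())
    return results

# Example: trivially doubles each step index while the "device" keeps running.
print(run_pipelined(4, lambda i: i, lambda o: o * 2))
```
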
#26 · Add multi-step decoding, async output proc options to TTModelRunner, and inference examples with AsyncLLMEngine · skhorasganiTT · closed 1 month ago · 1 comment
#25 · [Hardware][Tenstorrent] Modify offline_inference_tt.py to include max_tokens arg · milank94 · closed 1 month ago · 1 comment
#24 · Update vLLM commit in README.md · skhorasganiTT · closed 1 month ago · 1 comment
#23 · Update vLLM and tt-metal commits in metal README.md · skhorasganiTT · closed 1 month ago · 0 comments
#22 · Cache KV blocks for faster initialization, modify model and cache args to allow for higher seq lens · skhorasganiTT · closed 1 month ago · 0 comments
#21 · Remove outdated warning for async model execution · skhorasganiTT · closed 1 month ago · 0 comments
#20 · Update vLLM commit and perf prompt length in metal README · skhorasganiTT · closed 1 month ago · 0 comments
#19 · Add trace_mode option to TTWorker and TTModelRunner · skhorasganiTT · closed 1 month ago · 0 comments
#18 · Specify mesh_type when opening t3k device, enable async call in TTExecutorAsync, update tt-metal commit · skhorasganiTT · closed 1 month ago · 0 comments
#17 · [Feature]: test issue · uaydonat · closed 1 month ago · 0 comments
#16 · Add performance measurement option with fixed prompt size to tt inference example · skhorasganiTT · closed 1 month ago · 0 comments
#15 · [Misc]: Investigate issues with async execution of TT-llama for serving · skhorasganiTT · closed 1 month ago · 2 comments
#14 · [Feature]: Add option to enable tracing for TT-llama · skhorasganiTT · closed 1 month ago · 0 comments
#13 · Update vLLM commit in metal README · skhorasganiTT · closed 2 months ago · 0 comments
#12 · Prefill and decode stat tracking, initial implementations of TTPlatform, TTExecutorAsync · skhorasganiTT · closed 2 months ago · 0 comments
#11 · Add top-k top-p sampling and clean up input preparation · skhorasganiTT · closed 2 months ago · 0 comments
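
For context on #11, a reference sketch of the standard top-k / top-p (nucleus) filtering technique; this is the textbook formulation, not necessarily the exact TT-backend implementation:

```python
# Standard top-k / top-p logit filtering followed by sampling.
import torch

def top_k_top_p_filter(logits: torch.Tensor, top_k: int, top_p: float) -> torch.Tensor:
    # Keep only the top_k highest logits.
    if top_k > 0:
        kth = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    # Keep the smallest prefix of tokens whose cumulative probability covers top_p.
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        probs = torch.softmax(sorted_logits, dim=-1)
        cumprobs = probs.cumsum(dim=-1)
        # Drop a token if the cumulative mass *before* it already exceeds top_p,
        # which keeps the first token that crosses the threshold.
        mask = cumprobs - probs > top_p
        sorted_logits = sorted_logits.masked_fill(mask, float("-inf"))
        logits = sorted_logits.gather(-1, sorted_idx.argsort(-1))
    return logits

probs = torch.softmax(top_k_top_p_filter(torch.randn(1, 32000), 50, 0.9), dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
```
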
#10 · Update tt-metal commit in README · skhorasganiTT · closed 2 months ago · 0 comments
#9 · Update tt-metal and vLLM commits in README, add additional model setup instructions · skhorasganiTT · closed 2 months ago · 0 comments
#8 · Improve instructions for tt-metal environment creation · skhorasganiTT · closed 2 months ago · 0 comments
#7 · Pad batches to max_num_seqs for decode · skhorasganiTT · closed 2 months ago · 0 comments
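
An illustrative sketch of the padding described in #7: hardware that expects a fixed batch size needs a decode batch with fewer live sequences padded up to max_num_seqs. Function name, shapes, and pad token are assumptions, not vLLM API:

```python
# Hypothetical sketch: pad a decode batch of last-generated tokens up to the
# fixed max_num_seqs the device expects (names/shapes are assumptions).
import torch

def pad_decode_batch(tokens: torch.Tensor, max_num_seqs: int, pad_id: int = 0):
    # tokens: [num_seqs, 1] last-generated token per live sequence
    num_seqs = tokens.shape[0]
    if num_seqs < max_num_seqs:
        pad = torch.full((max_num_seqs - num_seqs, 1), pad_id, dtype=tokens.dtype)
        tokens = torch.cat([tokens, pad], dim=0)
    # Remember how many rows are real so padded outputs can be dropped later.
    return tokens, num_seqs

batch, real = pad_decode_batch(torch.tensor([[11], [42], [7]]), max_num_seqs=32)
print(batch.shape, real)  # torch.Size([32, 1]) 3
```
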
#6 · Clean up inference example and add more generation args, fix opening devices, assert against batch sizes · skhorasganiTT · closed 2 months ago · 0 comments
#5 · Add support for different prompt lengths · skhorasganiTT · closed 2 months ago · 0 comments
#4 · Update metal README with branch instructions, update inference example with different prompts (same seq len) · skhorasganiTT · closed 2 months ago · 0 comments
#3 · CacheEngine implemented, paged_attention working when all prompts are… · cglagovichTT · closed 2 months ago · 1 comment
#2 · Cglagovich/initial paged kv · cglagovichTT · closed 2 months ago · 0 comments
#1 · Add initial implementation of TT backend (Executor, Worker, ModelRunner, ModelLoader) with basic llama generation example · skhorasganiTT · closed 2 months ago · 0 comments