tenstorrent/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0 · 5 stars · 1 fork

Issues (sorted newest first):

#37 · [Feature]: Support for logprobs sampling parameter in TT backend · milank94 · opened 3 days ago · 1 comment
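
For reference, upstream vLLM already exposes per-token logprobs through SamplingParams; a minimal sketch of the requested behavior follows, assuming the TT backend would mirror the upstream API (the model name is a placeholder):

```python
# Sketch of the upstream vLLM logprobs API that #37 asks the TT backend to
# support; the model name is an illustrative assumption.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")
params = SamplingParams(max_tokens=32, logprobs=5)  # top-5 logprobs per step

outputs = llm.generate(["What is the capital of France?"], params)
for step_logprobs in outputs[0].outputs[0].logprobs:
    # Each step maps token id -> Logprob(logprob=..., rank=..., decoded_token=...)
    print(step_logprobs)
```
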
#36 · [Bug]: vLLM model backend crashes when running single-user prompts with less than 128 tokens of input context · tstescoTT · opened 1 week ago · 2 comments
#35 · [Bug]: Online server fails when prompts with context greater than 2048 tokens are sent · tstescoTT · closed 18 hours ago · 4 comments
#34 · Update vLLM commit in tt-metal README · skhorasganiTT · closed 2 weeks ago · 1 comment
#33 · Update TTModelRunner due to decode rope changes for llama70b · skhorasganiTT · closed 2 weeks ago · 1 comment
#32 · [Bugfix] #31 _make_sampler_output return expected SequenceOutput output_token: int · tstescoTT · closed 3 weeks ago · 1 comment
#31 · [Bug]: vLLM openai api_server errors out when stream=True · tstescoTT · closed 3 weeks ago · 2 comments
#30 · Import tt-metal model via PYTHONPATH instead of symlink · skhorasganiTT · closed 4 weeks ago · 1 comment
#29 · [Bug]: vLLM server fails when requests with different temperatures are sent · cglagovichTT · opened 4 weeks ago · 0 comments
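
A hedged reproduction sketch for #29, assuming a locally running vLLM OpenAI-compatible server; the base URL, API key, and model name are placeholders, and concurrent submission is an assumption about what it takes for the server to batch differing-temperature requests together:

```python
# Hypothetical reproduction of #29: two concurrent requests with different
# temperatures against a vLLM OpenAI-compatible server.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

def ask(temp: float) -> str:
    resp = client.completions.create(
        model="meta-llama/Meta-Llama-3.1-70B",  # placeholder model name
        prompt="Hello, my name is",
        max_tokens=16,
        temperature=temp,
    )
    return resp.choices[0].text

# Submit both at once so the server may batch them into one decode step.
with ThreadPoolExecutor(max_workers=2) as pool:
    print(list(pool.map(ask, [0.0, 1.0])))
```
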
#28 · Update vLLM commit in tt_metal README.md · skhorasganiTT · closed 4 weeks ago · 1 comment
#27 · Modify async_out_proc + multi-step for trace mode to trigger output processor for step (i) when running step (i+1) · skhorasganiTT · closed 4 weeks ago · 1 comment
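
A conceptual sketch of the overlap described in #27, not the actual vLLM implementation: while the device runs step i+1, a worker thread processes the output of step i. The names execute_step and process_output are stand-in callables.

```python
# Conceptual sketch of overlapping output processing with the next step's
# device execution; execute_step/process_output are stand-ins, not vLLM API.
from concurrent.futures import ThreadPoolExecutor

def run_pipelined(num_steps, execute_step, process_output):
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for i in range(num_steps):
            out = execute_step(i)                     # device work for step i
            if pending is not None:
                results.append(pending.result())      # step i-1's output done
            pending = pool.submit(process_output, out)  # overlaps with step i+1
        if pending is not None:
            results.append(pending.result())
    return results

# Example: trivially doubles each step index while the "device" keeps running.
print(run_pipelined(4, lambda i: i, lambda o: o * 2))
```
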
#26 · Add multi-step decoding, async output proc options to TTModelRunner, and inference examples with AsyncLLMEngine · skhorasganiTT · closed 1 month ago · 1 comment
#25 · [Hardware][Tenstorrent] Modify offline_inference_tt.py to include max_tokens arg · milank94 · closed 1 month ago · 1 comment
#24 · Update vLLM commit in README.md · skhorasganiTT · closed 1 month ago · 1 comment
#23 · Update vLLM and tt-metal commits in metal README.md · skhorasganiTT · closed 1 month ago · 0 comments
#22 · Cache KV blocks for faster initialization, modify model and cache args to allow for higher seq lens · skhorasganiTT · closed 1 month ago · 0 comments
#21 · Remove outdated warning for async model execution · skhorasganiTT · closed 1 month ago · 0 comments
#20 · Update vLLM commit and perf prompt length in metal README · skhorasganiTT · closed 1 month ago · 0 comments
#19 · Add trace_mode option to TTWorker and TTModelRunner · skhorasganiTT · closed 1 month ago · 0 comments
#18 · Specify mesh_type when opening t3k device, enable async call in TTExecutorAsync, update tt-metal commit · skhorasganiTT · closed 1 month ago · 0 comments
#17 · [Feature]: test issue · uaydonat · closed 1 month ago · 0 comments
#16 · Add performance measurement option with fixed prompt size to tt inference example · skhorasganiTT · closed 1 month ago · 0 comments
#15 · [Misc]: Investigate issues with async execution of TT-llama for serving · skhorasganiTT · closed 1 month ago · 2 comments
#14 · [Feature]: Add option to enable tracing for TT-llama · skhorasganiTT · closed 1 month ago · 0 comments
#13 · Update vLLM commit in metal README · skhorasganiTT · closed 2 months ago · 0 comments
#12 · Prefill and decode stat tracking, initial implementations of TTPlatform, TTExecutorAsync · skhorasganiTT · closed 2 months ago · 0 comments
#11 · Add top-k top-p sampling and clean up input preparation · skhorasganiTT · closed 2 months ago · 0 comments
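
For context on #11, a reference sketch of the standard top-k / top-p (nucleus) filtering technique; this is the textbook formulation, not necessarily the exact TT-backend implementation:

```python
# Standard top-k / top-p logit filtering followed by sampling.
import torch

def top_k_top_p_filter(logits: torch.Tensor, top_k: int, top_p: float) -> torch.Tensor:
    # Keep only the top_k highest logits.
    if top_k > 0:
        kth = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    # Keep the smallest prefix of tokens whose cumulative probability covers top_p.
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        probs = torch.softmax(sorted_logits, dim=-1)
        cumprobs = probs.cumsum(dim=-1)
        # Drop a token if the cumulative mass *before* it already exceeds top_p,
        # which keeps the first token that crosses the threshold.
        mask = cumprobs - probs > top_p
        sorted_logits = sorted_logits.masked_fill(mask, float("-inf"))
        logits = sorted_logits.gather(-1, sorted_idx.argsort(-1))
    return logits

probs = torch.softmax(top_k_top_p_filter(torch.randn(1, 32000), 50, 0.9), dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
```
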
#10 · Update tt-metal commit in README · skhorasganiTT · closed 2 months ago · 0 comments
#9 · Update tt-metal and vLLM commits in README, add additional model setup instructions · skhorasganiTT · closed 2 months ago · 0 comments
#8 · Improve instructions for tt-metal environment creation · skhorasganiTT · closed 2 months ago · 0 comments
#7 · Pad batches to max_num_seqs for decode · skhorasganiTT · closed 2 months ago · 0 comments
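
An illustrative sketch of the padding described in #7: hardware that expects a fixed batch size needs a decode batch with fewer live sequences padded up to max_num_seqs. Function name, shapes, and pad token are assumptions, not vLLM API:

```python
# Hypothetical sketch: pad a decode batch of last-generated tokens up to the
# fixed max_num_seqs the device expects (names/shapes are assumptions).
import torch

def pad_decode_batch(tokens: torch.Tensor, max_num_seqs: int, pad_id: int = 0):
    # tokens: [num_seqs, 1] last-generated token per live sequence
    num_seqs = tokens.shape[0]
    if num_seqs < max_num_seqs:
        pad = torch.full((max_num_seqs - num_seqs, 1), pad_id, dtype=tokens.dtype)
        tokens = torch.cat([tokens, pad], dim=0)
    # Remember how many rows are real so padded outputs can be dropped later.
    return tokens, num_seqs

batch, real = pad_decode_batch(torch.tensor([[11], [42], [7]]), max_num_seqs=32)
print(batch.shape, real)  # torch.Size([32, 1]) 3
```
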
#6 · Clean up inference example and add more generation args, fix opening devices, assert against batch sizes · skhorasganiTT · closed 2 months ago · 0 comments
#5 · Add support for different prompt lengths · skhorasganiTT · closed 2 months ago · 0 comments
#4 · Update metal README with branch instructions, update inference example with different prompts (same seq len) · skhorasganiTT · closed 2 months ago · 0 comments
#3 · CacheEngine implemented, paged_attention working when all prompts are… · cglagovichTT · closed 2 months ago · 1 comment
#2 · Cglagovich/initial paged kv · cglagovichTT · closed 2 months ago · 0 comments
#1 · Add initial implementation of TT backend (Executor, Worker, ModelRunner, ModelLoader) with basic llama generation example · skhorasganiTT · closed 2 months ago · 0 comments