vectorch-ai ScaleLLM issues

vectorch-ai / ScaleLLM

A high-performance inference system for large language models, designed for production environments.

https://docs.vectorch.com/

Apache License 2.0

377 stars, 28 forks

ut: add more tests for different warp layout

#340 guocuimi opened 1 week ago
0
misc: read flashinfer kernel code and add comments

#339 guocuimi opened 2 weeks ago
0
[misc] read flashinfer kernel code and add comments

#338 guocuimi closed 2 weeks ago
0
ci: added pip cache to avoid redownloading

#337 guocuimi closed 3 weeks ago
0
ut: added fp8 kv unittests for flash infer kernel

#336 guocuimi closed 3 weeks ago
0
refactor: move paged kv related logic into paged_kv_t

#335 guocuimi closed 3 weeks ago
0
feat: added pass-in alibi slopes support for flash infer kernel

#334 guocuimi closed 3 weeks ago
0
refactor: replaced last_page_len with kv_indptr for flash infer kernel

#333 guocuimi closed 3 weeks ago
0
ut: added unittests for flash infer kernels

#332 guocuimi closed 3 weeks ago
0
kernel: port flash infer handler + wrapper logics

#331 guocuimi closed 3 weeks ago
0
refactor: move flash attn and flash infer into attention folder

#330 guocuimi closed 3 weeks ago
0
kernel: added script to generate instantiation for flashinfer kernels

#329 guocuimi closed 4 weeks ago
0
refactor: flatten block tables to 1d tensor

#328 guocuimi closed 4 weeks ago
0
kernel: added flash infer attention impl

#327 guocuimi closed 4 weeks ago
0
feat: fix and use marlin kernel for awq by default

#326 guocuimi closed 4 weeks ago
0
refactor: added static switch for marlin kernel dispatch

#325 guocuimi closed 1 month ago
0
fix: put item into asyncio.Queue in a thread-safe way

#324 guocuimi closed 1 month ago
0
Will the result callback be called in a thread-safe/coroutine-safe way? #322

#323 tp-nan closed 1 month ago
7
ci: allow build without requiring a physical gpu device

#321 guocuimi closed 1 month ago
0
cmake: make includes private and disable jinja2cpp build

#320 guocuimi closed 1 month ago
0
fix: clean up build warnings: "LOG" redefined

#319 guocuimi closed 1 month ago
0
refactor: clean up build warnings and refactor marlin kernels

#318 guocuimi closed 1 month ago
0
test: added unittests for marlin kernels

#317 guocuimi closed 1 month ago
0
build: speed up compilation for marlin kernels

#316 guocuimi closed 1 month ago
0
feat: added awq marlin qlinear

#315 guocuimi closed 1 month ago
0
kernel: port awq repack kernel

#314 guocuimi closed 1 month ago
0
feat: added fused column parallel linear

#313 guocuimi closed 1 month ago
0
feat: added gptq marlin qlinear layer

#312 guocuimi closed 1 month ago
0
refactor: remove the logic loading individual weight from shared partitions

#311 guocuimi closed 1 month ago
0
RuntimeError: Timed out

#310 spongxin opened 1 month ago
1
rust: upgrade rust libs to latest version

#309 guocuimi closed 1 month ago
0
Mistral large GPTQ model inference problem

#308 drdaliang opened 1 month ago
3
kernel: port gptq marlin kernel and fp8 marlin kernel

#307 guocuimi closed 1 month ago
0
refactor: move models to upper folder

#306 guocuimi closed 1 month ago
0
fix: move eos out of stop token list to honor ignore_eos option

#305 guocuimi closed 1 month ago
0
The process terminated before reaching the specified max_tokens after setting ignore_eos=True and max_tokens.

#304 HowardChenRV closed 1 month ago
3
feat: added marlin qlinear support

#303 guocuimi opened 1 month ago
0
test: added unittests for marlin fp16xint4 gemm

#302 guocuimi closed 1 month ago
0
kernel: support kernel test in python via pybind

#301 guocuimi closed 1 month ago
0
model: added gemma2 with softcap and sliding window support

#300 guocuimi closed 1 month ago
0
test: added unittests for attention sliding window

#299 guocuimi closed 1 month ago
0
kernel: port softcap support for flash attention

#298 guocuimi closed 1 month ago
0
ci: fix pytest version to avoid flakiness

#297 guocuimi closed 2 months ago
0
feat: added sliding window support for QWen2

#296 guocuimi closed 2 months ago
0
model: added qwen2 support

#295 guocuimi closed 2 months ago
0
triton: fix build error and add example with unittest

#294 guocuimi closed 2 months ago
0
fix: handle unfinished utf8 bytes for tiktoken tokenizer

#293 guocuimi closed 2 months ago
0
feat: added THUDM/glm-4* support

#292 guocuimi closed 2 months ago
0
Deployment of glm-4-9b-chat model fails with SentencePiece tokenizer error

#291 dengyingxu closed 2 months ago
4
ci: added clang-format-ignore file to exclude generated files

#290 guocuimi closed 2 months ago
0