Stream responses and display in Flask

Your current environment

PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04 LTS (x86_64)
GCC version: (Ubuntu 13.2.0-23ubuntu4) 13.2.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.39

Python version: 3.12.3 (main, Apr 10 2024, 05:33:47) [GCC 13.2.0] (64-bit runtime)
Python platform: Linux-6.8.0-1012-aws-x86_64-with-glibc2.39
Is CUDA available: N/A
CUDA runtime version: 12.0.140
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration:
GPU 0: Tesla T4
GPU 1: Tesla T4
GPU 2: Tesla T4
GPU 3: Tesla T4

Nvidia driver version: 555.42.06
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        46 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               48
On-line CPU(s) list:                  0-47
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
CPU family:                           6
Model:                                85
Thread(s) per core:                   2
Core(s) per socket:                   24
Socket(s):                            1
Stepping:                             7
BogoMIPS:                             4999.98
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            768 KiB (24 instances)
L1i cache:                            768 KiB (24 instances)
L2 cache:                             24 MiB (24 instances)
L3 cache:                             35.8 MiB (1 instance)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-47
Vulnerability Gather data sampling:   Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit:          KVM: Mitigation: VMX unsupported
Vulnerability L1tf:                   Mitigation; PTE Inversion
Vulnerability Mds:                    Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:               Mitigation; PTI
Vulnerability Mmio stale data:        Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Vulnerable
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Vulnerable
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Retpoline
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] No relevant packages
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     PHB     PHB     0-47    0               N/A
GPU1    PHB      X      PHB     PHB     0-47    0               N/A
GPU2    PHB     PHB      X      PHB     0-47    0               N/A
GPU3    PHB     PHB     PHB      X      0-47    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

How would you like to use vllm

I am using the vllm API server with the following setup:

python -m vllm.entrypoints.api_server --model=mistralai/Mistral-7B-Instruct-v0.3 --dtype=half --tensor-parallel-size=4 --gpu-memory-utilization=0.5 --max-model-len=27000

I am sending requests to the server using this Python function:

def send_request_2_llm(prompt: str):
    url = "http://localhost:8000/generate"
    if len(prompt) > 27_000:
        prompt = prompt[:27_000]
    payload = {
        "prompt": prompt,
        "stream": True,
        "min_tokens": 256,
        "max_tokens": 1024
    }
    response = requests.post(url, json=payload, stream=True)
    return response

I want to display the streamed response on my Flask app's screen. The issue I'm encountering is with the structure of the streamed responses. The API server returns the response in a sequence of JSON objects like this:

{"text": "SYSTEM_PROMPT + hello"}
{"text": "SYSTEM_PROMPT + hello how"}
{"text": "SYSTEM_PROMPT + hello how are"}
{"text": "SYSTEM_PROMPT + hello how are you"}
{"text": "SYSTEM_PROMPT + hello how are you?"}

On my Flask app, I want to print only the final text ("hello how are you?") on a single line, in a streaming fashion. I believe I can slice the "text" by SYSTEM_PROMPT, but I'm unsure how to do this correctly.

Here is the JavaScript code I am using to handle the streaming:

function askQuestion() {
    const fileInput = document.getElementById('fileInput');
    const questionInput = document.getElementById('questionInput');
    const responseDiv = document.getElementById('response');

    const formData = new FormData();
    formData.append('file', fileInput.files[0]);
    formData.append('question', questionInput.value);

    responseDiv.innerHTML = '';  // Clear previous response

    fetch('/upload', {
        method: 'POST',
        body: formData
    })
    .then(response => {
        if (!response.ok) {
            throw new Error(`HTTP error! status: ${response.status}`);
        }
        const reader = response.body.getReader();
        const decoder = new TextDecoder();

        return new ReadableStream({
            start(controller) {
                function push() {
                    reader.read().then(({ done, value }) => {
                        if (done) {
                            controller.close();
                            return;
                        }
                        const chunk = decoder.decode(value, { stream: true });
                        console.log("Received chunk:", chunk);  // Debug log
                        controller.enqueue(chunk);
                        responseDiv.innerHTML += chunk;
                        push();
                    }).catch(error => {
                        console.error('Stream reading error:', error);
                        controller.error(error);
                    });
                }
                push();
            }
        });
    })
    .then(stream => new Response(stream).text())
    .then(result => {
        console.log('Complete response received');
    })
    .catch(error => {
        console.error('Error:', error);
        responseDiv.innerHTML = 'An error occurred while processing your request.';
    });
}

My Questions:

How can I correctly slice out the SYSTEM_PROMPT from the "text" field and display only the final text?
How can I implement streaming in a way that ensures the response is updated on the screen in real-time, without showing intermediate fragments?

Any advice or guidance would be greatly appreciated!

vllm-project / vllm

Stream responses and display in Flask #7214

Your current environment

How would you like to use vllm