stan-dev / httpstan

HTTP interface to Stan, a package for Bayesian inference.
ISC License
40 stars 15 forks source link

httpstan fails to give a response for large outputs #652

Open wm1995 opened 10 months ago

wm1995 commented 10 months ago

This is following on from stan-dev/pystan#396, which ultimately seems to root from an httpstan issue. I'm working on an MWE, but the case I've identified so far seems to happen with a very large dataset. I used curl to interact with httpstan, to isolate whether it was a Pystan or httpstan error.

In the case I've identified, I'm able to compile and sample a large model, which appears to complete successfully according to the operation endpoint. However, when I do the final GET request to retrieve the model results, httpstan sends a header for the response, but then closes the connection without sending any data. There's no visible error on the console running httpstan (which I started using python -m httpstan), but the curl request fails:

$ curl localhost:8080/v1/models/ex6bdm7x/fits/wxft2x3c
curl: (18) transfer closed with 3144360709 bytes remaining to read

and packet sniffing suggests that httpstan is closing the connection:

12  5.048341    127.0.0.1   127.0.0.1   TCP 68  54876 → 8080 [SYN] Seq=0 Win=65535 Len=0 MSS=16344 WS=64 TSval=3703300602 TSecr=0 SACK_PERM
13  5.048735    127.0.0.1   127.0.0.1   TCP 68  8080 → 54876 [SYN, ACK] Seq=0 Ack=1 Win=65535 Len=0 MSS=16344 WS=64 TSval=157578390 TSecr=3703300602 SACK_PERM
14  5.048759    127.0.0.1   127.0.0.1   TCP 56  54876 → 8080 [ACK] Seq=1 Ack=1 Win=408256 Len=0 TSval=3703300602 TSecr=157578390
15  5.048771    127.0.0.1   127.0.0.1   TCP 56  [TCP Window Update] 8080 → 54876 [ACK] Seq=1 Ack=1 Win=408256 Len=0 TSval=157578390 TSecr=3703300602
16  5.048832    127.0.0.1   127.0.0.1   HTTP    165 GET /v1/models/ex6bdm7x/fits/wxft2x3c HTTP/1.1 
17  5.048851    127.0.0.1   127.0.0.1   TCP 56  8080 → 54876 [ACK] Seq=1 Ack=110 Win=408128 Len=0 TSval=157578390 TSecr=3703300602
34  17.511368   127.0.0.1   127.0.0.1   TCP 216 8080 → 54876 [PSH, ACK] Seq=1 Ack=110 Win=408128 Len=160 TSval=157590864 TSecr=3703300602 [TCP segment of a reassembled PDU]
35  17.511463   127.0.0.1   127.0.0.1   TCP 56  54876 → 8080 [ACK] Seq=110 Ack=161 Win=408128 Len=0 TSval=3703313076 TSecr=157590864
36  17.513067   127.0.0.1   127.0.0.1   HTTP    56  HTTP/1.1 200 OK 
37  17.513113   127.0.0.1   127.0.0.1   TCP 56  54876 → 8080 [ACK] Seq=110 Ack=162 Win=408128 Len=0 TSval=3703313077 TSecr=157590865
38  17.513293   127.0.0.1   127.0.0.1   TCP 56  54876 → 8080 [FIN, ACK] Seq=110 Ack=162 Win=408128 Len=0 TSval=3703313077 TSecr=157590865
39  17.513350   127.0.0.1   127.0.0.1   TCP 56  8080 → 54876 [ACK] Seq=162 Ack=111 Win=408128 Len=0 TSval=157590865 TSecr=3703313077

Packet 36 has [FIN, ACK] set.

The corresponding TCP stream is

GET /v1/models/ex6bdm7x/fits/wxft2x3c HTTP/1.1
Host: localhost:8080
User-Agent: curl/8.1.2
Accept: */*

HTTP/1.1 200 OK
Content-Type: text/plain; charset=utf-8
Content-Length: 3144360709
Date: Wed, 06 Dec 2023 16:26:16 GMT
Server: Python/3.10 aiohttp/3.8.4
riddell-stan commented 9 months ago

I wonder if there isn't an aiohttp timeout that's being reached. You mentioned earlier that the data transfer didn't start for 9 seconds. That seems like a long time.

There are lots of timeouts in https://docs.aiohttp.org/en/stable/client_reference.html.