ClientPayloadError: Response payload is not completed

rrod515 commented 11 months ago

Describe the bug

When running models with larger data on an Intel-based Mac, the following errors occurs: ClientPayloadError: Response payload is not completed

Describe your system

OS: macOS 10.15.7 (19H2026) CPU: 2.4 GHz 8-Core Intel Core i9 C++: clang++ Apple clang version 11.0.0 (clang-1100.0.33.16) Python: Anaconda conda 4.14.0

Steps/Code to Reproduce

Code Sample, a copy-pastable example

import stan
import numpy as np

schools_code = """
data {
  int<lower=0> J;         // number of schools
  array[J] real y;              // estimated treatment effects
  array[J] real<lower=0> sigma; // standard error of effect estimates
}
parameters {
  real mu;                // population treatment effect
  real<lower=0> tau;      // standard deviation in treatment effects
  vector[J] eta;          // unscaled deviation from mu by school
}
transformed parameters {
  vector[J] theta = mu + tau * eta;        // school treatment effects
}
model {
  target += normal_lpdf(eta | 0, 1);       // prior log-density
  target += normal_lpdf(y | theta, sigma); // log-likelihood
}
"""

schools_data = {"J": 40000,
                "y": np.repeat([28,  8, -3,  7, -1,  1, 18, 12], 5000),
                "sigma": np.repeat([15, 10, 16, 11,  9, 11, 10, 18], 5000)}

posterior = stan.build(schools_code, data=schools_data)
fit = posterior.sample(num_chains=4, num_samples=1000)
eta = fit["eta"]  # array with shape (8, 4000)
df = fit.to_frame()  # pandas `DataFrame, requires pandas

wm1995 commented 9 months ago

I've had this same issue on an M1 Mac - I've been validating my inference using simulation-based calibration, and if I simulate a dataset of 100 samples and use a model with ~400 params, about 75% of my runs fail with this error.

OS: Sonoma 14.0 CPU: M1 Pro Compiler: clang version 15.0.0 Python: 3.10.8 Pystan: 3.7.0 Httpstan: 4.10.1 (compiled from source) aiohttp: 3.8.4

The full output and error traceback is:

Messages received during sampling:
  Gradient evaluation took 0.00027 seconds
  1000 transitions using 10 leapfrog steps per transition would take 2.7 seconds.
  Adjust your expectations accordingly!
Sampling: 100% (37591/37591)
Traceback (most recent call last):
  File "project/scripts/fit_sbc.py", line 53, in <module>
    fit = sampler.sample(
  File "project/venv/lib/python3.10/site-packages/stan/model.py", line 89, in sample
    return self.hmc_nuts_diag_e_adapt(num_chains=num_chains, **kwargs)
  File "project/venv/lib/python3.10/site-packages/stan/model.py", line 108, in hmc_nuts_diag_e_adapt
    return self._create_fit(function=function, num_chains=num_chains, **kwargs)
  File "project/venv/lib/python3.10/site-packages/stan/model.py", line 313, in _create_fit
    return asyncio.run(go())
  File "project/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "project/base_events.py", line 649, in run_until_complete
    return future.result()
  File "project/venv/lib/python3.10/site-packages/stan/model.py", line 238, in go
    resp = await client.get(f"/{fit_name}")
  File "project/venv/lib/python3.10/site-packages/stan/common.py", line 48, in get
    return HTTPResponse(status=resp.status, content=await resp.read())
  File "project/venv/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 1037, in read
    self._body = await self.content.read()
  File "project/venv/lib/python3.10/site-packages/aiohttp/streams.py", line 375, in read
    block = await self.readany()
  File "project/venv/lib/python3.10/site-packages/aiohttp/streams.py", line 397, in readany
    await self._wait("readany")
  File "project/venv/lib/python3.10/site-packages/aiohttp/streams.py", line 304, in _wait
    await waiter
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed

riddell-stan commented 9 months ago

This isn't an error I've seen before. Sounds like some kind of memory or buffer size limit might be being exceeded.

wm1995 commented 9 months ago

I'm wondering if this is linked to aio-libs/aiohttp#4581, and it occurs for large requests that don't get a response in time? Not sure though

wm1995 commented 9 months ago

I used a packet sniffer to inspect the traffic going back and forth between Pystan and httpstan; I also had a go at using pdb to try to catch the error as it happened, but the httpstan server had been torn down at that point.

For my example, the fit works for 6000 samples, but fails for 48000 samples.

76878   869.712981  127.0.0.1   127.0.0.1   HTTP    213 GET /v1/models/ex6bdm7x/fits/5nifthnq HTTP/1.1 
76879   869.713043  127.0.0.1   127.0.0.1   TCP 56  49954 → 49955 [ACK] Seq=4957307 Ack=2237498 Win=7215 Len=0 TSval=1115740592 TSecr=2511616970
76880   870.393448  127.0.0.1   127.0.0.1   TCP 215 49954 → 49955 [PSH, ACK] Seq=4957307 Ack=2237498 Win=7215 Len=159 TSval=1115741273 TSecr=2511616970 [TCP segment of a reassembled PDU]
76881   870.393478  127.0.0.1   127.0.0.1   TCP 16388   49954 → 49955 [ACK] Seq=4957466 Ack=2237498 Win=7215 Len=16332 TSval=1115741273 TSecr=2511616970 [TCP segment of a reassembled PDU]

There's a 0.7 second delay between when the GET request for the results is issued and when the data begins to be streamed. The response header (in packet 76880) from httpstan is

HTTP/1.1 200 OK
Content-Type: text/plain; charset=utf-8
Content-Length: 328436942
Date: Mon, 04 Dec 2023 15:31:50 GMT
Server: Python/3.10 aiohttp/3.8.4

and indicates that ~300MB of data is being streamed (though the packet sniffer seemingly only picks up ~150MB of that). The data streaming begins in 76881 and is complete in 0.3 seconds. I can't see any sort of TCP teardown, but that could well be because the sniffer missed the packets.

For the 48000 sample run (where the ClientPayloadError is raised), the situation is different:

756363  8238.629726 127.0.0.1   127.0.0.1   HTTP    213 GET /v1/models/ex6bdm7x/fits/kxk3uhsc HTTP/1.1 
756364  8238.629795 127.0.0.1   127.0.0.1   TCP 56  50357 → 50358 [ACK] Seq=43239183 Ack=19398529 Win=346816 Len=0 TSval=3598381091 TSecr=247335874
756369  8247.179848 127.0.0.1   127.0.0.1   TCP 216 50357 → 50358 [PSH, ACK] Seq=43239183 Ack=19398529 Win=346816 Len=160 TSval=3598389641 TSecr=247335874 [TCP segment of a reassembled PDU]
756370  8247.179893 127.0.0.1   127.0.0.1   TCP 56  50358 → 50357 [ACK] Seq=19398529 Ack=43239343 Win=285376 Len=0 TSval=247344424 TSecr=3598389641
756371  8247.181520 127.0.0.1   127.0.0.1   HTTP    56  HTTP/1.1 200 OK 
756372  8247.181541 127.0.0.1   127.0.0.1   TCP 56  50358 → 50357 [ACK] Seq=19398529 Ack=43239344 Win=285376 Len=0 TSval=247344426 TSecr=3598389643
756373  8247.320002 127.0.0.1   127.0.0.1   TCP 56  50358 → 50357 [FIN, ACK] Seq=19398529 Ack=43239344 Win=285376 Len=0 TSval=247344564 TSecr=3598389643
756374  8247.320090 127.0.0.1   127.0.0.1   TCP 56  50357 → 50358 [ACK] Seq=43239344 Ack=19398530 Win=346816 Len=0 TSval=3598389781 TSecr=247344564

There's 9 seconds before the server responds, but the connection is torn down by ~Pystan~ httpstan before any data is streamed (I can't see a FIN packet, but the FIN-ACK from Pystan is packet 756373). The header returned in packet 756369 is

HTTP/1.1 200 OK
Content-Type: text/plain; charset=utf-8
Content-Length: 3144360715
Date: Mon, 04 Dec 2023 17:34:47 GMT
Server: Python/3.10 aiohttp/3.8.4

indicating that ~3GB of data would be streamed.

wm1995 commented 9 months ago

It looks like this is fundamentally a httpstan issue (as I have the same issue when using curl to interact with httpstan directly), so I've opened stan-dev/httpstan#652 to summarise what I've found.

riddell-stan commented 9 months ago

Closing this in favor of https://github.com/stan-dev/httpstan/issues/652. As @wm1995 says, this is an httpstan issue.

stan-dev / pystan