rrod515 closed this 9 months ago
I've had this same issue on an M1 Mac. I've been validating my inference using simulation-based calibration, and when I simulate a dataset of 100 samples and fit a model with ~400 parameters, about 75% of my runs fail with this error.
OS: Sonoma 14.0
CPU: M1 Pro
Compiler: clang version 15.0.0
Python: 3.10.8
Pystan: 3.7.0
Httpstan: 4.10.1 (compiled from source)
aiohttp: 3.8.4
The full output and error traceback is:
Messages received during sampling:
Gradient evaluation took 0.00027 seconds
1000 transitions using 10 leapfrog steps per transition would take 2.7 seconds.
Adjust your expectations accordingly!
Sampling: 100% (37591/37591)
Traceback (most recent call last):
  File "project/scripts/fit_sbc.py", line 53, in <module>
    fit = sampler.sample(
  File "project/venv/lib/python3.10/site-packages/stan/model.py", line 89, in sample
    return self.hmc_nuts_diag_e_adapt(num_chains=num_chains, **kwargs)
  File "project/venv/lib/python3.10/site-packages/stan/model.py", line 108, in hmc_nuts_diag_e_adapt
    return self._create_fit(function=function, num_chains=num_chains, **kwargs)
  File "project/venv/lib/python3.10/site-packages/stan/model.py", line 313, in _create_fit
    return asyncio.run(go())
  File "project/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "project/base_events.py", line 649, in run_until_complete
    return future.result()
  File "project/venv/lib/python3.10/site-packages/stan/model.py", line 238, in go
    resp = await client.get(f"/{fit_name}")
  File "project/venv/lib/python3.10/site-packages/stan/common.py", line 48, in get
    return HTTPResponse(status=resp.status, content=await resp.read())
  File "project/venv/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 1037, in read
    self._body = await self.content.read()
  File "project/venv/lib/python3.10/site-packages/aiohttp/streams.py", line 375, in read
    block = await self.readany()
  File "project/venv/lib/python3.10/site-packages/aiohttp/streams.py", line 397, in readany
    await self._wait("readany")
  File "project/venv/lib/python3.10/site-packages/aiohttp/streams.py", line 304, in _wait
    await waiter
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed
This isn't an error I've seen before. It sounds like some kind of memory or buffer size limit is being exceeded.
I'm wondering whether this is linked to aio-libs/aiohttp#4581, and whether it occurs for large requests that don't get a response in time. I'm not sure, though.
I used a packet sniffer to inspect the traffic going back and forth between Pystan and httpstan; I also had a go at using pdb to try to catch the error as it happened, but the httpstan server had been torn down at that point.
For my example, the fit works for 6000 samples, but fails for 48000 samples.
76878 869.712981 127.0.0.1 127.0.0.1 HTTP 213 GET /v1/models/ex6bdm7x/fits/5nifthnq HTTP/1.1
76879 869.713043 127.0.0.1 127.0.0.1 TCP 56 49954 → 49955 [ACK] Seq=4957307 Ack=2237498 Win=7215 Len=0 TSval=1115740592 TSecr=2511616970
76880 870.393448 127.0.0.1 127.0.0.1 TCP 215 49954 → 49955 [PSH, ACK] Seq=4957307 Ack=2237498 Win=7215 Len=159 TSval=1115741273 TSecr=2511616970 [TCP segment of a reassembled PDU]
76881 870.393478 127.0.0.1 127.0.0.1 TCP 16388 49954 → 49955 [ACK] Seq=4957466 Ack=2237498 Win=7215 Len=16332 TSval=1115741273 TSecr=2511616970 [TCP segment of a reassembled PDU]
There's a 0.7 second delay between when the GET request for the results is issued and when the data begins to be streamed. The response header (in packet 76880) from httpstan is
HTTP/1.1 200 OK
Content-Type: text/plain; charset=utf-8
Content-Length: 328436942
Date: Mon, 04 Dec 2023 15:31:50 GMT
Server: Python/3.10 aiohttp/3.8.4
and indicates that ~300MB of data is being streamed (though the packet sniffer seemingly only picks up ~150MB of that). The data streaming begins in 76881 and is complete in 0.3 seconds. I can't see any sort of TCP teardown, but that could well be because the sniffer missed the packets.
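The delay quoted above can be read straight off the capture timestamps (the second column, in seconds). Here is a minimal Python sketch of that arithmetic, assuming the column order shown in the capture (number, time, source, destination, protocol, length, info). The `Packet` type and `parse_packet` helper are made up for illustration, and the second capture line is trimmed after `Len=159`.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    number: int
    time: float
    protocol: str
    length: int
    info: str


def parse_packet(line: str) -> Packet:
    """Parse one line of the packet listing (assumed column order)."""
    # Columns: number, time, source, destination, protocol, length, info.
    number, time, _src, _dst, protocol, length, info = line.split(maxsplit=6)
    return Packet(int(number), float(time), protocol, int(length), info)


# Lines copied from the 6000 sample capture (second one trimmed).
capture = [
    "76878 869.712981 127.0.0.1 127.0.0.1 HTTP 213 GET /v1/models/ex6bdm7x/fits/5nifthnq HTTP/1.1",
    "76880 870.393448 127.0.0.1 127.0.0.1 TCP 215 49954 → 49955 [PSH, ACK] Seq=4957307 Ack=2237498 Win=7215 Len=159",
]

request, first_response = (parse_packet(line) for line in capture)
delay = first_response.time - request.time
print(f"{delay:.3f} s between packets {request.number} and {first_response.number}")
```

The result is roughly 0.68 seconds, which is the 0.7 second figure quoted above.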
For the 48000 sample run (where the ClientPayloadError is raised), the situation is different:
756363 8238.629726 127.0.0.1 127.0.0.1 HTTP 213 GET /v1/models/ex6bdm7x/fits/kxk3uhsc HTTP/1.1
756364 8238.629795 127.0.0.1 127.0.0.1 TCP 56 50357 → 50358 [ACK] Seq=43239183 Ack=19398529 Win=346816 Len=0 TSval=3598381091 TSecr=247335874
756369 8247.179848 127.0.0.1 127.0.0.1 TCP 216 50357 → 50358 [PSH, ACK] Seq=43239183 Ack=19398529 Win=346816 Len=160 TSval=3598389641 TSecr=247335874 [TCP segment of a reassembled PDU]
756370 8247.179893 127.0.0.1 127.0.0.1 TCP 56 50358 → 50357 [ACK] Seq=19398529 Ack=43239343 Win=285376 Len=0 TSval=247344424 TSecr=3598389641
756371 8247.181520 127.0.0.1 127.0.0.1 HTTP 56 HTTP/1.1 200 OK
756372 8247.181541 127.0.0.1 127.0.0.1 TCP 56 50358 → 50357 [ACK] Seq=19398529 Ack=43239344 Win=285376 Len=0 TSval=247344426 TSecr=3598389643
756373 8247.320002 127.0.0.1 127.0.0.1 TCP 56 50358 → 50357 [FIN, ACK] Seq=19398529 Ack=43239344 Win=285376 Len=0 TSval=247344564 TSecr=3598389643
756374 8247.320090 127.0.0.1 127.0.0.1 TCP 56 50357 → 50358 [ACK] Seq=43239344 Ack=19398530 Win=346816 Len=0 TSval=3598389781 TSecr=247344564
There's a delay of about 9 seconds before the server responds, but the connection is torn down by httpstan before any data is streamed (I can't see a FIN packet, but the FIN-ACK from Pystan is packet 756373). The header returned in packet 756369 is
HTTP/1.1 200 OK
Content-Type: text/plain; charset=utf-8
Content-Length: 3144360715
Date: Mon, 04 Dec 2023 17:34:47 GMT
Server: Python/3.10 aiohttp/3.8.4
indicating that ~3GB of data would be streamed.
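To put the two Content-Length values side by side, here is a small Python sketch that converts them to MB and MiB. The comparison against the signed 32-bit maximum is only an observation prompted by the "buffer size limit" guess earlier in the thread. Nothing in these captures shows that this threshold is the cause.

```python
INT32_MAX = 2**31 - 1  # 2147483647, largest signed 32-bit integer

# Content-Length values copied from the two response headers above.
content_lengths = {
    "6000 samples": 328436942,
    "48000 samples": 3144360715,
}

for run, size in content_lengths.items():
    print(
        f"{run}: {size / 1e6:.0f} MB ({size / 2**20:.0f} MiB), "
        f"exceeds int32 max: {size > INT32_MAX}"
    )
```

The 6000 sample response fits well under that limit, while the 48000 sample response is above it. That may be worth checking, but it is not a diagnosis.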
It looks like this is fundamentally an httpstan issue (I see the same failure when using curl to talk to httpstan directly), so I've opened stan-dev/httpstan#652 to summarise what I've found.
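A check similar to the curl one can be sketched in Python using only the standard library: read the response body in chunks and compare the number of bytes received with the advertised Content-Length. The helper names `count_body_bytes` and `is_complete` and the chunk size are illustrative. The demo uses an in-memory stream in place of a live httpstan server, whose address is not given here.

```python
import io
from typing import BinaryIO


def count_body_bytes(stream: BinaryIO, chunk_size: int = 1 << 20) -> int:
    """Read a binary stream to the end in chunks and return the byte count."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty read means end of stream
            break
        total += len(chunk)
    return total


def is_complete(stream: BinaryIO, content_length: int) -> bool:
    """True if the stream delivered exactly as many bytes as advertised."""
    return count_body_bytes(stream) == content_length


# Stand-in for a truncated response: 1000 bytes advertised, 600 delivered.
advertised = 1000
truncated_body = io.BytesIO(b"x" * 600)
print(is_complete(truncated_body, advertised))
```

Against a real server, the file-like response object returned by `urllib.request.urlopen` could be passed as `stream`, with the advertised size taken from its `Content-Length` header.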
Closing this in favor of https://github.com/stan-dev/httpstan/issues/652. As @wm1995 says, this is an httpstan issue.
Describe the bug
When running models with larger data on an Intel-based Mac, the following error occurs: ClientPayloadError: Response payload is not completed
Describe your system
OS: macOS 10.15.7 (19H2026)
CPU: 2.4 GHz 8-Core Intel Core i9
C++: clang++ Apple clang version 11.0.0 (clang-1100.0.33.16)
Python: Anaconda conda 4.14.0
Steps/Code to Reproduce
Code Sample, a copy-pastable example