Resolving this will require the following:

- Not running `rm -rf salmon`; instead, we should implement a `git pull` and then run `/bin/bash /home/ubuntu/salmon/ami/salmon.sh`. That will mean any changes to the `salmon.sh` script are reflected in new pulls (see the sketch after this list).
- Mounting the `salmon` directory inside the Docker container. This would also allow downloading/uploading experiments on different machines.
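A minimal sketch of that update flow, assuming the paths above and that it's acceptable to drive it from Python rather than the AMI's shell script:

```python
import subprocess

# Sketch of the proposed update flow: pull the latest code instead of
# deleting and re-cloning, then re-run the launch script so any changes
# to salmon.sh take effect. Paths are the ones mentioned above.
REPO = "/home/ubuntu/salmon"

subprocess.run(["git", "-C", REPO, "pull"], check=True)
subprocess.run(["/bin/bash", f"{REPO}/ami/salmon.sh"], check=True)
```

The same two commands could just as easily live in a cron job or the instance's user-data script.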
Here's a traceback I hit while load testing, after 3529 responses had been received:
```
frontend_1 | Traceback (most recent call last):
frontend_1 | File "/opt/conda/lib/python3.7/site-packages/uvicorn/protocols/http/httptools_impl.py", line 385, in run_asgi
frontend_1 | result = await app(self.scope, self.receive, self.send)
frontend_1 | File "/opt/conda/lib/python3.7/site-packages/uvicorn/middleware/proxy_headers.py", line 45, in __call__
frontend_1 | return await self.app(scope, receive, send)
frontend_1 | File "/opt/conda/lib/python3.7/site-packages/fastapi/applications.py", line 149, in __call__
frontend_1 | await super().__call__(scope, receive, send)
frontend_1 | File "/opt/conda/lib/python3.7/site-packages/starlette/applications.py", line 102, in __call__
frontend_1 | await self.middleware_stack(scope, receive, send)
frontend_1 | File "/opt/conda/lib/python3.7/site-packages/starlette/middleware/errors.py", line 181, in __call__
frontend_1 | raise exc from None
frontend_1 | File "/opt/conda/lib/python3.7/site-packages/starlette/middleware/errors.py", line 159, in __call__
frontend_1 | await self.app(scope, receive, _send)
frontend_1 | File "/opt/conda/lib/python3.7/site-packages/starlette/exceptions.py", line 82, in __call__
frontend_1 | raise exc from None
frontend_1 | File "/opt/conda/lib/python3.7/site-packages/starlette/exceptions.py", line 71, in __call__
frontend_1 | await self.app(scope, receive, sender)
frontend_1 | File "/opt/conda/lib/python3.7/site-packages/starlette/routing.py", line 550, in __call__
frontend_1 | await route.handle(scope, receive, send)
frontend_1 | File "/opt/conda/lib/python3.7/site-packages/starlette/routing.py", line 227, in handle
frontend_1 | await self.app(scope, receive, send)
frontend_1 | File "/opt/conda/lib/python3.7/site-packages/starlette/routing.py", line 41, in app
frontend_1 | response = await func(request)
frontend_1 | File "/opt/conda/lib/python3.7/site-packages/fastapi/routing.py", line 197, in app
frontend_1 | dependant=dependant, values=values, is_coroutine=is_coroutine
frontend_1 | File "/opt/conda/lib/python3.7/site-packages/fastapi/routing.py", line 148, in run_endpoint_function
frontend_1 | return await dependant.call(**values)
frontend_1 | File "./frontend/private.py", line 303, in get_dashboard
frontend_1 | ax.hist(df.response_time, bins="auto")
frontend_1 | File "/opt/conda/lib/python3.7/site-packages/matplotlib/__init__.py", line 1565, in inner
frontend_1 | return func(ax, *map(sanitize_sequence, args), **kwargs)
frontend_1 | File "/opt/conda/lib/python3.7/site-packages/matplotlib/axes/_axes.py", line 6649, in hist
frontend_1 | m, bins = np.histogram(x[i], bins, weights=w[i], **hist_kwargs)
frontend_1 | File "<__array_function__ internals>", line 6, in histogram
frontend_1 | File "/opt/conda/lib/python3.7/site-packages/numpy/lib/histograms.py", line 795, in histogram
frontend_1 | bin_edges, uniform_bins = _get_bin_edges(a, bins, range, weights)
frontend_1 | File "/opt/conda/lib/python3.7/site-packages/numpy/lib/histograms.py", line 451, in _get_bin_edges
frontend_1 | endpoint=True, dtype=bin_type)
frontend_1 | File "<__array_function__ internals>", line 6, in linspace
frontend_1 | File "/opt/conda/lib/python3.7/site-packages/numpy/core/function_base.py", line 137, in linspace
frontend_1 | y = _nx.arange(0, num, dtype=dt).reshape((-1,) + (1,) * ndim(delta))
frontend_1 | MemoryError: Unable to allocate 266. GiB for an array with shape (35664814818,) and data type float64
```
Here are the consequences of this traceback:

- `/dashboard` is down (it throws a 500: internal server error). Only the dashboard is affected; `/logs` and `/get_responses` are still up.
- When I rebooted this machine, the Docker logs stayed in place. However, the database was not initialized (it threw "no data has been uploaded" when [url] was visited).
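For reference, the failure comes from `ax.hist(df.response_time, bins="auto")`: with a long-tailed or outlier-laden `response_time` column, the automatic bin-width rule can ask for billions of bin edges. One possible guard (a sketch, not Salmon's actual fix; the column name is taken from the traceback) is to compute the bin count manually and cap it:

```python
import numpy as np

def capped_bins(x, max_bins=200):
    """Roughly the Freedman-Diaconis rule behind bins="auto", with a hard cap."""
    x = np.asarray(x, dtype=float)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    width = 2 * iqr / len(x) ** (1 / 3)
    if width <= 0:
        return 10
    n_bins = int(np.ceil((x.max() - x.min()) / width))
    # Cap the count so pathological data can't request a 266 GiB edge array.
    return min(n_bins, max_bins)

# e.g., ax.hist(df.response_time, bins=capped_bins(df.response_time))
```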
I think the core of this issue ("machines randomly failing") can be closed. Downloading/restoring experiment state is tracked in #16.
Currently, the experiment state and all intermediate files (i.e., the responses) are wiped clean after a reboot (and so is all of Salmon). It'd be nice to preserve experiment state even after a reboot.
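Until that exists, one stopgap is to periodically download the responses so a reboot doesn't destroy the only copy. This is purely a sketch; only the `/get_responses` endpoint name comes from this issue, and the base URL and response format are assumptions:

```python
import datetime
import pathlib
import requests

# Hypothetical backup step: save whatever /get_responses returns to disk.
BASE = "http://localhost:8421"  # placeholder; use the machine's actual address

out = pathlib.Path("backups")
out.mkdir(exist_ok=True)
r = requests.get(f"{BASE}/get_responses", timeout=30)
r.raise_for_status()
stamp = datetime.datetime.now().strftime("%Y%m%dT%H%M%S")
(out / f"responses-{stamp}.json").write_bytes(r.content)
```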
Of course, it'd also be nice to figure out why the experiments are randomly rebooting and how to prevent that. I think this will require some investigation on an EC2 deployment to see how much memory (and other resources) is being used.
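A quick way to start that investigation (hypothetical; `psutil` is an extra dependency, and `docker stats` would give a per-container view instead) is to log instance-wide memory usage while an experiment runs:

```python
import time

import psutil  # extra dependency, not part of Salmon

# Print instance-wide memory usage every 60 seconds; stop with Ctrl-C.
while True:
    vm = psutil.virtual_memory()
    print(f"total={vm.total / 2**30:.1f} GiB  "
          f"used={vm.used / 2**30:.1f} GiB  "
          f"available={vm.available / 2**30:.1f} GiB  ({vm.percent}% used)")
    time.sleep(60)
```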