Resolving this will require the following:

- Not running `rm -rf salmon`; instead, we should implement a `git pull` and then run `/bin/bash /home/ubuntu/salmon/ami/salmon.sh`. That will mean any changes to the `salmon.sh` script are reflected in new pulls (see the sketch after this list).
- Mounting the `salmon` directory inside the Docker container. This would also allow downloading/uploading experiments on different machines.
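A minimal sketch of that update flow, assuming the paths above and that it's acceptable to drive it from Python rather than the AMI's shell script:

```python
import subprocess

# Sketch of the proposed update flow: pull the latest code instead of
# deleting and re-cloning, then re-run the launch script so any changes
# to salmon.sh take effect. Paths are the ones mentioned above.
REPO = "/home/ubuntu/salmon"

subprocess.run(["git", "-C", REPO, "pull"], check=True)
subprocess.run(["/bin/bash", f"{REPO}/ami/salmon.sh"], check=True)
```

The same two commands could just as easily live in a cron job or the instance's user-data script.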
Here's a traceback I hit while load testing, after 3529 responses had been received:
```
frontend_1 | Traceback (most recent call last):
frontend_1 | File "/opt/conda/lib/python3.7/site-packages/uvicorn/protocols/http/httptools_impl.py", line 385, in run_asgi
frontend_1 | result = await app(self.scope, self.receive, self.send)
frontend_1 | File "/opt/conda/lib/python3.7/site-packages/uvicorn/middleware/proxy_headers.py", line 45, in __call__
frontend_1 | return await self.app(scope, receive, send)
frontend_1 | File "/opt/conda/lib/python3.7/site-packages/fastapi/applications.py", line 149, in __call__
frontend_1 | await super().__call__(scope, receive, send)
frontend_1 | File "/opt/conda/lib/python3.7/site-packages/starlette/applications.py", line 102, in __call__
frontend_1 | await self.middleware_stack(scope, receive, send)
frontend_1 | File "/opt/conda/lib/python3.7/site-packages/starlette/middleware/errors.py", line 181, in __call__
frontend_1 | raise exc from None
frontend_1 | File "/opt/conda/lib/python3.7/site-packages/starlette/middleware/errors.py", line 159, in __call__
frontend_1 | await self.app(scope, receive, _send)
frontend_1 | File "/opt/conda/lib/python3.7/site-packages/starlette/exceptions.py", line 82, in __call__
frontend_1 | raise exc from None
frontend_1 | File "/opt/conda/lib/python3.7/site-packages/starlette/exceptions.py", line 71, in __call__
frontend_1 | await self.app(scope, receive, sender)
frontend_1 | File "/opt/conda/lib/python3.7/site-packages/starlette/routing.py", line 550, in __call__
frontend_1 | await route.handle(scope, receive, send)
frontend_1 | File "/opt/conda/lib/python3.7/site-packages/starlette/routing.py", line 227, in handle
frontend_1 | await self.app(scope, receive, send)
frontend_1 | File "/opt/conda/lib/python3.7/site-packages/starlette/routing.py", line 41, in app
frontend_1 | response = await func(request)
frontend_1 | File "/opt/conda/lib/python3.7/site-packages/fastapi/routing.py", line 197, in app
frontend_1 | dependant=dependant, values=values, is_coroutine=is_coroutine
frontend_1 | File "/opt/conda/lib/python3.7/site-packages/fastapi/routing.py", line 148, in run_endpoint_function
frontend_1 | return await dependant.call(**values)
frontend_1 | File "./frontend/private.py", line 303, in get_dashboard
frontend_1 | ax.hist(df.response_time, bins="auto")
frontend_1 | File "/opt/conda/lib/python3.7/site-packages/matplotlib/__init__.py", line 1565, in inner
frontend_1 | return func(ax, *map(sanitize_sequence, args), **kwargs)
frontend_1 | File "/opt/conda/lib/python3.7/site-packages/matplotlib/axes/_axes.py", line 6649, in hist
frontend_1 | m, bins = np.histogram(x[i], bins, weights=w[i], **hist_kwargs)
frontend_1 | File "<__array_function__ internals>", line 6, in histogram
frontend_1 | File "/opt/conda/lib/python3.7/site-packages/numpy/lib/histograms.py", line 795, in histogram
frontend_1 | bin_edges, uniform_bins = _get_bin_edges(a, bins, range, weights)
frontend_1 | File "/opt/conda/lib/python3.7/site-packages/numpy/lib/histograms.py", line 451, in _get_bin_edges
frontend_1 | endpoint=True, dtype=bin_type)
frontend_1 | File "<__array_function__ internals>", line 6, in linspace
frontend_1 | File "/opt/conda/lib/python3.7/site-packages/numpy/core/function_base.py", line 137, in linspace
frontend_1 | y = _nx.arange(0, num, dtype=dt).reshape((-1,) + (1,) * ndim(delta))
frontend_1 | MemoryError: Unable to allocate 266. GiB for an array with shape (35664814818,) and data type float64
```
Here are the consequences of this traceback:

- `/dashboard` is down (it throws a 500: internal server error). Only the dashboard is affected; `/logs` and `/get_responses` are still up.
- When I rebooted this machine, the Docker logs stayed in place. However, the database was not initialized (it threw "no data has been uploaded" when [url] was visited).
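For reference, the failure comes from `ax.hist(df.response_time, bins="auto")`: with a long-tailed or outlier-laden `response_time` column, the automatic bin-width rule can ask for billions of bin edges. One possible guard (a sketch, not Salmon's actual fix; the column name is taken from the traceback) is to compute the bin count manually and cap it:

```python
import numpy as np

def capped_bins(x, max_bins=200):
    """Roughly the Freedman-Diaconis rule behind bins="auto", with a hard cap."""
    x = np.asarray(x, dtype=float)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    width = 2 * iqr / len(x) ** (1 / 3)
    if width <= 0:
        return 10
    n_bins = int(np.ceil((x.max() - x.min()) / width))
    # Cap the count so pathological data can't request a 266 GiB edge array.
    return min(n_bins, max_bins)

# e.g., ax.hist(df.response_time, bins=capped_bins(df.response_time))
```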
I think the core of this issue ("machines randomly failing") can be closed. Downloading/restoring experiment state is tracked in #16.
Currently, the experiment state and all intermediate files (i.e., the responses) are wiped clean after a reboot (and so is all of Salmon). It'd be nice to preserve experiment state even after a reboot.
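Until that exists, one stopgap is to periodically download the responses so a reboot doesn't destroy the only copy. This is purely a sketch; only the `/get_responses` endpoint name comes from this issue, and the base URL and response format are assumptions:

```python
import datetime
import pathlib
import requests

# Hypothetical backup step: save whatever /get_responses returns to disk.
BASE = "http://localhost:8421"  # placeholder; use the machine's actual address

out = pathlib.Path("backups")
out.mkdir(exist_ok=True)
r = requests.get(f"{BASE}/get_responses", timeout=30)
r.raise_for_status()
stamp = datetime.datetime.now().strftime("%Y%m%dT%H%M%S")
(out / f"responses-{stamp}.json").write_bytes(r.content)
```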
Of course, it'd also be nice to figure out why the experiments are randomly rebooting and how to prevent that. I think this will require some investigation on an EC2 deployment to see how much memory (and other resources) is being used.
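A quick way to start that investigation (hypothetical; `psutil` is an extra dependency, and `docker stats` would give a per-container view instead) is to log instance-wide memory usage while an experiment runs:

```python
import time

import psutil  # extra dependency, not part of Salmon

# Print instance-wide memory usage every 60 seconds; stop with Ctrl-C.
while True:
    vm = psutil.virtual_memory()
    print(f"total={vm.total / 2**30:.1f} GiB  "
          f"used={vm.used / 2**30:.1f} GiB  "
          f"available={vm.available / 2**30:.1f} GiB  ({vm.percent}% used)")
    time.sleep(60)
```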