p.sh containers seem to randomly fail

pirog commented 4 years ago

This problem is described a bit here: https://docs.lando.dev/config/platformsh.html#platformsh-agent-errors and it is currently the biggest blocker for getting to alpha.

Replicating it is a bit tricky because I'm guessing this is a race condition on some level. However, if you run lando destroy && lando start, lando rebuild, or lando restart enough times, you'll eventually notice that one of your services (usually the application container) fails. The docker logs for that service usually contain something like:

```
Traceback (most recent call last):
  File "src/gevent/greenlet.py", line 766, in gevent._greenlet.Greenlet.run
  File "/usr/lib/python2.7/dist-packages/gevent_jsonrpc/__init__.py", line 161, in _reader
    f = self._socket.makefile("r+b")
  File "/usr/lib/python2.7/dist-packages/gevent/_socket2.py", line 286, in makefile
    fobj = _fileobject(type(self)(_sock=self), mode, bufsize)
  File "/usr/lib/python2.7/dist-packages/gevent/_socket2.py", line 138, in __init__
    self._sock.setblocking(0)
  File "/usr/lib/python2.7/dist-packages/gevent/_socket2.py", line 90, in _dummy
    raise error(EBADF, 'Bad file descriptor')
error: [Errno 9] Bad file descriptor
2020-06-02T12:47:53Z <Greenlet at 0x7fb7b9cac578: <bound method RpcConnection._reader of <gevent_jsonrpc.RpcConnection object at 0x7fb7b9c23b90>>> failed with error
```

or

```
2020-06-02 13:56:19,371 platformsh.agent.service ERROR RPC connection failed: [Errno 2] No such file or directory
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/platformsh/agent/service.py", line 137, in rpc_call
    client.connect(self.RPC_SOCKET)
  File "/usr/lib/python2.7/dist-packages/gevent_jsonrpc/__init__.py", line 118, in connect
    self._socket.connect(address)
  File "/usr/lib/python2.7/dist-packages/gevent/_socket2.py", line 251, in connect
    raise error(result, strerror(result))
```

My suspicion here is that there is a race between the agent being ready to receive connections and /etc/platform/boot|start being run.
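To make the suspected race concrete, here is a minimal sketch in plain Python 3 (not the gevent client the agent actually uses). It shows that connecting to a Unix socket path before any server has bound it fails with `ENOENT`, which is the `[Errno 2] No such file or directory` seen in the second log. The `try_connect` helper and the temporary `agent.sock` path are hypothetical stand-ins, not names from the Lando plugin.

```python
import errno
import os
import socket
import tempfile


def try_connect(path):
    """Try one connection to a Unix socket; return None on success, else the errno."""
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        client.connect(path)
        return None
    except OSError as exc:
        return exc.errno
    finally:
        client.close()


# Hypothetical path standing in for the agent socket.
sock_path = os.path.join(tempfile.mkdtemp(), "agent.sock")

# The client wins the race: the server has not created the socket file yet.
early = try_connect(sock_path)

# The server binds and listens; the same call now succeeds.
server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(sock_path)
server.listen(1)
late = try_connect(sock_path)
server.close()

print(early == errno.ENOENT, late is None)
```

If the boot/start step behaves like the early client here, whether it fails depends only on timing, which would fit the intermittent failures.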

The best, most consistent evidence I have of this is the following sequence:

1. lando init the lando-d8 project from Platform.sh, or lando destroy it if you have already pulled it down.
2. lando start to bring everything up.
3. lando restart. You should notice the cache/app services fail with errors like the ones above, and this seems to be pretty consistent no matter how many times you lando restart.
4. Modify https://github.com/lando/lando/blob/master/experimental/plugins/lando-platformsh/scripts/psh-boot.sh#L34-L36 and add a sleep 5 at the bottom:
```
python /helpers/psh-fake-rpc.py &> /tmp/fake-rpc.log
sleep 5
```
5. lando restart. It now seems to work as expected.

If we think this is the actual problem, then we should add more sophisticated logic to psh-boot.sh so that it waits until the socket is ready, e.g. something like:

```
# Handle the socket setup
# Clean it up if it still exists
if [ -S "$LANDO_PSH_AGENT_SOCKET" ]; then
  rm -f "$LANDO_PSH_AGENT_SOCKET"
fi
# Start it up
python /helpers/psh-fake-rpc.py &> /tmp/fake-rpc.log
# Wait until its ready
while [ ! -S  "$LANDO_PSH_AGENT_SOCKET" ]; do
  lando_debug "Waiting for $LANDO_PSH_AGENT_SOCKET to be ready..."
  sleep 1
done
```
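One possible refinement of the wait loop, sketched in Python 3 rather than in psh-boot.sh itself: instead of only checking that the socket file exists, probe it with a real connection and give up after a timeout, so that a server that never comes up does not hang the boot forever. The `wait_for_socket` and `delayed_server` names, the timeout values, and the temporary path are all assumptions for illustration and are not part of the Lando plugin.

```python
import os
import socket
import tempfile
import threading
import time


def wait_for_socket(path, timeout=30.0, interval=0.1):
    """Return True once a Unix socket at `path` accepts a connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        probe = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            probe.connect(path)
            return True
        except OSError:
            # Path missing (ENOENT) or bound but not yet listening (ECONNREFUSED).
            time.sleep(interval)
        finally:
            probe.close()
    return False


def delayed_server(path, delay, holder):
    """Stand-in for the RPC helper: start listening only after `delay` seconds."""
    time.sleep(delay)
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(path)
    srv.listen(1)
    holder.append(srv)


# Hypothetical socket path; the real one would come from LANDO_PSH_AGENT_SOCKET.
sock_path = os.path.join(tempfile.mkdtemp(), "agent.sock")
servers = []
thread = threading.Thread(target=delayed_server, args=(sock_path, 0.3, servers))
thread.start()

became_ready = wait_for_socket(sock_path, timeout=5.0)
thread.join()
never_ready = wait_for_socket(sock_path + ".missing", timeout=0.3)
servers[0].close()

print(became_ready, never_ready)
```

The trade-off is that the probe opens and immediately closes a real connection, so the server has to tolerate a client that connects and sends nothing. A file-existence check like the `-S` test avoids that, but it can pass in the window between the server binding the path and starting to listen.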

pirog commented 4 years ago

It's hard to be sure, but waiting for the socket to be ready first definitely seems to improve things significantly. I have not had a failure yet.

mikemilano commented 4 years ago

@pirog if you push it up with a branch, I'll run/test it.

pirog commented 4 years ago

@mikemilano sounds good.

p.sh containers seem to randomly fail #88