juju / python-libjuju

Python library for the Juju API
Apache License 2.0
59 stars 100 forks source link

juju.model.block_until() disconnects/closes in 'busy' models #458

Closed ajkavanagh closed 1 year ago

ajkavanagh commented 3 years ago

See zaza bug: https://github.com/openstack-charmers/zaza/issues/402

Essentially, what seems to happen is that the get status feature of libjuju may disconnect/close the connection to the controller. This generates an Exception (websockets.exceptions.ConnectionClosed) which probably means it's the underlying websocket connection. This may be, therefore, a Juju bug at the controller, rather than a libjuju bug, but starting here as a way to track it.

In zaza I will wrap it with a retry to see if it's a 'one-off' or a permanent condition (i.e. persists over a period of time). I'll report back here once I have more information.

SimonRichardson commented 3 years ago

This implies that something broke when requesting get_status, this could be for a number of reasons:

  1. The FullStatus request is too large and causes the websocket to die
  2. The FullStatus request takes to long because it's too large and causes the websocket to die
  3. Juju broke the output to FullStatus
  4. The websocket closed because of inactivity and it might be worth pinging the controller more frequently to ensure uptime?

What version of Juju is it running, is it consistently failing at the same point or more of an intermittent failure? For 1 or 2, does running juju status also work correctly? If it is because of the payload maybe we can change the MAX_FRAME_SIZE or the timeout...

ajkavanagh commented 3 years ago

Yes, I've seen 1. before an enlarged the frame size to cope. I don't think it is that here, but I will check. i.e. it passes in other scenarios. It's definitely not 4. as this is during a CI run and we touch get_status() a lot.

Version of juju is 2.7.8.

It doesn't fail at the exactly the same point but it is constantly checking whether the the model has finished its deployment. I'll grab a full stack trace for the next time it happens at add it here.

lourot commented 3 years ago

We're also seeing this often on our OpenStack Charmers' CI system with Juju 2.8. Here is the stack trace: https://paste.ubuntu.com/p/Q5TdRNBqt7/

ajkavanagh commented 3 years ago

More examples on this bug:

From an OSCI run:

juju 2.8.8 (now happening VERY frequently with this version):

versions:

Juju2.8.8-xenial-amd64 juju-wait==2.8.4 python_libjuju (juju==2.8.5) Python 3.5 (xenial)

03:25:29 2021-02-15 03:25:25 [ERROR] Error in watcher
03:25:29 Traceback (most recent call last):
03:25:29   File "/var/lib/jenkins/tools/0/charm-test-infra/.tox/clients/lib/python3.5/site-packages/juju/model.py", line 920, in _all_watcher
03:25:29     await allwatcher.Stop()
03:25:29   File "/var/lib/jenkins/tools/0/charm-test-infra/.tox/clients/lib/python3.5/site-packages/juju/client/facade.py", line 480, in wrapper
03:25:29     reply = await f(*args, **kwargs)
03:25:29   File "/var/lib/jenkins/tools/0/charm-test-infra/.tox/clients/lib/python3.5/site-packages/juju/client/_client1.py", line 237, in Stop
03:25:29     reply = await self.rpc(msg)
03:25:29   File "/var/lib/jenkins/tools/0/charm-test-infra/.tox/clients/lib/python3.5/site-packages/juju/client/overrides.py", line 106, in rpc
03:25:29     result = await self.connection.rpc(msg, encoder=TypeEncoder)
03:25:29   File "/var/lib/jenkins/tools/0/charm-test-infra/.tox/clients/lib/python3.5/site-packages/juju/client/connection.py", line 476, in rpc
03:25:29     raise errors.JujuAPIError(result)
03:25:29 juju.errors.JujuAPIError: unknown watcher id
03:25:29 2021-02-15 03:25:25 [ERROR] Task exception was never retrieved
03:25:29 future: <Task finished coro=<Model._watch.<locals>._all_watcher() done, defined at /var/lib/jenkins/tools/0/charm-test-infra/.tox/clients/lib/python3.5/site-packages/juju/model.py:870> exception=JujuAPIError('unknown watcher id',)>
03:25:29 Traceback (most recent call last):
03:25:29   File "/usr/lib/python3.5/asyncio/tasks.py", line 239, in _step
03:25:29     result = coro.send(None)
03:25:29   File "/var/lib/jenkins/tools/0/charm-test-infra/.tox/clients/lib/python3.5/site-packages/juju/model.py", line 920, in _all_watcher
03:25:29     await allwatcher.Stop()
03:25:29   File "/var/lib/jenkins/tools/0/charm-test-infra/.tox/clients/lib/python3.5/site-packages/juju/client/facade.py", line 480, in wrapper
03:25:29     reply = await f(*args, **kwargs)
03:25:29   File "/var/lib/jenkins/tools/0/charm-test-infra/.tox/clients/lib/python3.5/site-packages/juju/client/_client1.py", line 237, in Stop
03:25:29     reply = await self.rpc(msg)
03:25:29   File "/var/lib/jenkins/tools/0/charm-test-infra/.tox/clients/lib/python3.5/site-packages/juju/client/overrides.py", line 106, in rpc
03:25:29     result = await self.connection.rpc(msg, encoder=TypeEncoder)
03:25:29   File "/var/lib/jenkins/tools/0/charm-test-infra/.tox/clients/lib/python3.5/site-packages/juju/client/connection.py", line 476, in rpc
03:25:29     raise errors.JujuAPIError(result)
03:25:29 juju.errors.JujuAPIError: unknown watcher id
03:25:29 /var/lib/jenkins/tools/0/charm-test-infra/.tox/clients/lib/python3.5/site-packages/paramiko/transport.py:33: CryptographyDeprecationWarning: Python 3.5 support will be dropped in the next release of cryptography. Please upgrade your Python.
03:25:29   from cryptography.hazmat.backends import default_backend
03:25:29 Traceback (most recent call last):
03:25:29   File "/var/lib/jenkins/tools/0/charm-test-infra/.tox/clients/lib/python3.5/site-packages/zaza/model.py", line 1149, in async_wait_for_application_states
03:25:29     timeout=timeout)
03:25:29   File "/var/lib/jenkins/tools/0/charm-test-infra/.tox/clients/lib/python3.5/site-packages/juju/model.py", line 732, in block_until
03:25:29     raise websockets.ConnectionClosed(1006, 'no reason')
03:25:29 websockets.exceptions.ConnectionClosed: WebSocket connection is closed: code = 1006 (connection closed abnormally [internal]), reason = no reason
SimonRichardson commented 3 years ago

This version of juju or this version of pylibjuju?

ajkavanagh commented 3 years ago

Hi Simon

This version of juju or this version of pylibjuju?

Juju 2.8.8-xenial-amd64 juju-wait==2.8.4 python_libjuju (juju==2.8.5) Python 3.5 (xenial)

johnsca commented 3 years ago

This implies that something broke when requesting get_status

FWIW, block_until doesn't call Model.get_status. The architecture of libjuju is to have a persistent AllWatcher and to use the incoming deltas to incrementally update the model status. The only times FullStatus is called are:

More examples on this bug:

03:25:29 Traceback (most recent call last):
03:25:29   File "/var/lib/jenkins/tools/0/charm-test-infra/.tox/clients/lib/python3.5/site-packages/juju/model.py", line 920, in _all_watcher
03:25:29     await allwatcher.Stop()

That line would seem to imply that Model.disconnect was called, although the "unknown watcher id" is strange, as if either Juju lost track of the AllWatcher that the connection was using or libjuju tried to stop the watcher more than once.

pengale commented 3 years ago

Per the discussion in the Openstack office hours today, this might be due to the STDOUT buffer getting full.

That said, I'm not sure why we'd see a disconnect, rather than a more informative error. It is possible that the error is getting swallowed somewhere in the pipeline, though.

ajkavanagh commented 3 years ago

So I've taken to working around the issue in zaza here: https://github.com/openstack-charmers/zaza/pull/439

Essentially, I don't use juju.block_until() and instead have re-written it to a similar function that detects the disconnection condition and just reconnects the model (which is generally what the tests want to do).

Here is another example (juju=2.8.10, libjuju=2.8.6) which display the same "i got disconnected" issue.

2021-04-14 11:03:02 [WARNING] RPC: Connection closed, reconnecting                                                                                                                                                                            [1099/54682]
2021-04-14 11:03:02 [ERROR] Error in watcher                                                                                                                                                        
Traceback (most recent call last):                                                                                                                                                                               
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/model.py", line 920, in _all_watcher                                                                             
    await allwatcher.Stop()                                                                                                                                                                                                                               
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/facade.py", line 480, in wrapper                                                                                                                                           
    reply = await f(*args, **kwargs)                                                                                                                                                                                                                      
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/_client1.py", line 237, in Stop                                                                                                   
    reply = await self.rpc(msg)                                                                                                                                                    
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/overrides.py", line 106, in rpc                                                                                         
    result = await self.connection.rpc(msg, encoder=TypeEncoder)                                                                                                                                     
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/connection.py", line 476, in rpc                                                                                            
    raise errors.JujuAPIError(result)                                                                                                                                                                            
juju.errors.JujuAPIError: unknown watcher id                                                                                                                                                                    
2021-04-14 11:03:02 [ERROR] Task exception was never retrieved                                                                                                                                         
future: <Task finished coro=<Model._watch.<locals>._all_watcher() done, defined at /home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/model.py:870> exception=JujuAPIError('unknown watcher id',)>
Traceback (most recent call last):                                                                                                                                                                      
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/model.py", line 920, in _all_watcher                                                                                                                                              
    await allwatcher.Stop()                                                                                                                                             
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/facade.py", line 480, in wrapper                                                                                                          
    reply = await f(*args, **kwargs)                                                                                                                                                               
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/_client1.py", line 237, in Stop                                                                                    
    reply = await self.rpc(msg)                                                                                                                                                                   
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/overrides.py", line 106, in rpc                                                                                                                   
    result = await self.connection.rpc(msg, encoder=TypeEncoder)                                                                                                                                                                                            File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/connection.py", line 476, in rpc                                                                                                                                           
    raise errors.JujuAPIError(result)                                                                                                                                                                                                                     
juju.errors.JujuAPIError: unknown watcher id                                                                                                                                                       
2021-04-14 11:03:06 [INFO] Checking workload status of ceilometer-hacluster/0                                                                                                                       
2021-04-14 11:03:11 [INFO] Checking workload status message of ceilometer-hacluster/0                                                                                                                                                                     
2021-04-14 11:03:13 [INFO] Checking workload status of cinder-hacluster/0                                                                                                              
2021-04-14 11:03:16 [INFO] Checking workload status message of cinder-hacluster/0                                                                                                        
2021-04-14 11:03:19 [INFO] Checking workload status of cinder-hacluster/1                                                                                                          
2021-04-14 11:03:23 [INFO] Checking workload status message of cinder-hacluster/1                                                                                                   
2021-04-14 11:03:26 [INFO] Checking workload status of cinder-hacluster/2                                                                                                                                      
2021-04-14 11:03:28 [INFO] Checking workload status message of cinder-hacluster/2                                                                                             
2021-04-14 11:03:33 [INFO] Checking workload status of dashboard-hacluster/2                                                                                                                                     
2021-04-14 11:03:35 [INFO] Checking workload status message of dashboard-hacluster/2                                                                                                     
2021-04-14 11:03:37 [INFO] Checking workload status of dashboard-hacluster/1                                                                                                                                                                              
2021-04-14 11:03:42 [INFO] Checking workload status message of dashboard-hacluster/1                                                                                                                                                                      
2021-04-14 11:03:45 [INFO] Checking workload status of dashboard-hacluster/0                                                                                                                                                                              
2021-04-14 11:03:45 [ERROR] Error in receiver                                                                                                                                                                    
Traceback (most recent call last):                                                                                                                                                 
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/connection.py", line 378, in _receiver                                                                                  
    loop=self.loop)                                                                                                                                                                                  
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/utils.py", line 135, in run_with_interrupt                                                                                         
    return task.result()  # may raise exception
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/websockets/protocol.py", line 441, in recv
    self._put_message_waiter.set_result(None)
asyncio.base_futures.InvalidStateError: invalid state
2021-04-14 11:03:45 [ERROR] Task exception was never retrieved
future: <Task finished coro=<_Task.start.<locals>.run() done, defined at /home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/connection.py:789> exception=InvalidStateError('invalid state',)>
Traceback (most recent call last):
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/connection.py", line 791, in run
    return await self.task()
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/connection.py", line 378, in _receiver
    loop=self.loop)
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/utils.py", line 135, in run_with_interrupt
    return task.result()  # may raise exception
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/websockets/protocol.py", line 441, in recv
    self._put_message_waiter.set_result(None)
asyncio.base_futures.InvalidStateError: invalid state
2021-04-14 11:03:45 [ERROR] Task exception was never retrieved
future: <Task finished coro=<_Task.start.<locals>.run() done, defined at /home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/connection.py:789> exception=InvalidStateError('invalid state',)>
Traceback (most recent call last):
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/connection.py", line 791, in run
    return await self.task()
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/connection.py", line 421, in _pinger
    loop=self.loop)
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/utils.py", line 135, in run_with_interrupt
    return task.result()  # may raise exception
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/connection.py", line 410, in _do_ping
    await pinger_facade.Ping()
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/facade.py", line 480, in wrapper
    reply = await f(*args, **kwargs)
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/_client1.py", line 11009, in Ping
    reply = await self.rpc(msg)
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/facade.py", line 623, in rpc
    result = await self.connection.rpc(msg, encoder=TypeEncoder)
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/connection.py", line 468, in rpc
    result = await self._recv(msg['request-id'])
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/connection.py", line 370, in _recv
    return await self.messages.get(request_id)
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/utils.py", line 69, in get
    raise value
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/connection.py", line 410, in _do_ping
    await pinger_facade.Ping()
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/facade.py", line 480, in wrapper
    reply = await f(*args, **kwargs)
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/_client1.py", line 11009, in Ping
    reply = await self.rpc(msg)
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/facade.py", line 623, in rpc
    result = await self.connection.rpc(msg, encoder=TypeEncoder)
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/connection.py", line 468, in rpc
    result = await self._recv(msg['request-id'])
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/connection.py", line 370, in _recv
    return await self.messages.get(request_id)
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/utils.py", line 69, in get
    raise value
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/connection.py", line 791, in run
    return await self.task()
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/connection.py", line 378, in _receiver
    loop=self.loop)
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/utils.py", line 135, in run_with_interrupt
    return task.result()  # may raise exception
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/websockets/protocol.py", line 441, in recv
    self._put_message_waiter.set_result(None)
asyncio.base_futures.InvalidStateError: invalid state
2021-04-14 11:03:45 [ERROR] Task exception was never retrieved

...

2021-04-14 11:04:05 [ERROR] Task exception was never retrieved
future: <Task finished coro=<Model._watch.<locals>._all_watcher() done, defined at /home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/model.py:870> exception=JujuAPIError('unknown watcher id',)>
Traceback (most recent call last):
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/model.py", line 920, in _all_watcher
    await allwatcher.Stop()
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/facade.py", line 480, in wrapper
    reply = await f(*args, **kwargs)
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/_client1.py", line 237, in Stop
    reply = await self.rpc(msg)
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/overrides.py", line 106, in rpc
    result = await self.connection.rpc(msg, encoder=TypeEncoder)
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/connection.py", line 476, in rpc
    raise errors.JujuAPIError(result)
juju.errors.JujuAPIError: unknown watcher id
2021-04-14 11:04:08 [INFO] Checking workload status message of dashboard-hacluster/0
2021-04-14 11:04:13 [INFO] Checking workload status of designate-hacluster/2
2021-04-14 11:04:15 [INFO] Checking workload status message of designate-hacluster/2
2021-04-14 11:04:17 [INFO] Checking workload status of designate-hacluster/1
2021-04-14 11:04:23 [INFO] Checking workload status message of designate-hacluster/1
2021-04-14 11:04:25 [INFO] Checking workload status of designate-hacluster/0
2021-04-14 11:04:29 [INFO] Checking workload status message of designate-hacluster/0
2021-04-14 11:04:33 [INFO] Checking workload status of glance-hacluster/2

It then recovered (as sense above), and then failed again:

2021-04-14 11:04:54 [INFO] Checking workload status of gnocchi-hacluster/0
2021-04-14 11:04:54 [WARNING] RPC: Connection closed, reconnecting
2021-04-14 11:04:54 [ERROR] Error in watcher
Traceback (most recent call last):
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/model.py", line 920, in _all_watcher
    await allwatcher.Stop()
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/facade.py", line 480, in wrapper
    reply = await f(*args, **kwargs)
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/_client1.py", line 237, in Stop
    reply = await self.rpc(msg)
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/overrides.py", line 106, in rpc
    result = await self.connection.rpc(msg, encoder=TypeEncoder)
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/connection.py", line 476, in rpc
    raise errors.JujuAPIError(result)
juju.errors.JujuAPIError: unknown watcher id
2021-04-14 11:04:54 [ERROR] Task exception was never retrieved
future: <Task finished coro=<Model._watch.<locals>._all_watcher() done, defined at /home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/model.py:870> exception=JujuAPIError('unknown watcher id',)>
Traceback (most recent call last):
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/model.py", line 920, in _all_watcher
    await allwatcher.Stop()
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/facade.py", line 480, in wrapper
    reply = await f(*args, **kwargs)
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/_client1.py", line 237, in Stop
    reply = await self.rpc(msg)
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/overrides.py", line 106, in rpc
    result = await self.connection.rpc(msg, encoder=TypeEncoder)
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/connection.py", line 476, in rpc
    raise errors.JujuAPIError(result)
juju.errors.JujuAPIError: unknown watcher id
2021-04-14 11:04:59 [INFO] Checking workload status message of gnocchi-hacluster/0
2021-04-14 11:05:05 [INFO] Checking workload status of gnocchi-hacluster/1
2021-04-14 11:05:09 [INFO] Checking workload status message of gnocchi-hacluster/1
2021-04-14 11:05:13 [INFO] Checking workload status of gnocchi-hacluster/2
2021-04-14 11:05:18 [INFO] Checking workload status message of gnocchi-hacluster/2
2021-04-14 11:05:20 [INFO] Checking workload status of heat-hacluster/0
2021-04-14 11:05:27 [INFO] Checking workload status message of heat-hacluster/0
2021-04-14 11:05:29 [INFO] Checking workload status of heat-hacluster/2
2021-04-14 11:05:37 [INFO] Checking workload status message of heat-hacluster/2
2021-04-14 11:05:39 [INFO] Checking workload status of heat-hacluster/1
2021-04-14 11:05:39 [ERROR] Error in receiver
Traceback (most recent call last):
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/client/connection.py", line 378, in _receiver
    loop=self.loop)
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/juju/utils.py", line 135, in run_with_interrupt
    return task.result()  # may raise exception
  File "/home/ubuntu/mojo/.tox/clients/lib/python3.6/site-packages/websockets/protocol.py", line 441, in recv
    self._put_message_waiter.set_result(None)
asyncio.base_futures.InvalidStateError: invalid state

It failed a few more times (with the code handling the disconnections, attempting clean-up and then reconnecting), until it finally settled down.

matuskosut commented 3 years ago

In regards to MAX_FRAME_SIZE mentioned by Simon, I used to see similar disconnects when using jujuna in CI. There was no useful output from websocket error. Fortunately frame size is pretty easy thing to test when initiating Model:

https://github.com/huntdatacenter/jujuna/blob/9ff313199e40b8b73466c2e51e4bb18710df973c/jujuna/helper.py#L75

MAX_FRAME_SIZE = 2**26
model = Model(max_frame_size=MAX_FRAME_SIZE)

in the end for me the major issues were action logs not being trimmed (reusing same model), which was eventually solved by setting constraints to those.

ajkavanagh commented 3 years ago

In regards to MAX_FRAME_SIZE mentioned by Simon, I used to see similar disconnects when using jujuna in CI. There was no useful output from websocket error. Fortunately frame size is pretty easy thing to test when initiating Model:

https://github.com/huntdatacenter/jujuna/blob/9ff313199e40b8b73466c2e51e4bb18710df973c/jujuna/helper.py#L75

MAX_FRAME_SIZE = 2**26
model = Model(max_frame_size=MAX_FRAME_SIZE)

in the end for me the major issues were action logs not being trimmed (reusing same model), which was eventually solved by setting constraints to those.

I've also got the MAX_FRAME_SIZE set as a default for Juju, but it didn't seem to help with the disconnects. When I replaced model.block_until() with a function that does the same disconnection detection, but then simply does a model.disconnect() and a model.connect_model(name) set of calls (handling any exceptions), then I could make progress. i.e. the model did reconnect and could be used. It did (during the same test) disconnect several more times during the (equivalent) model.block_until was completed (i.e. all the conditions became true).

I'm beginning to wonder if the architecture of libjuju is working against us here, but that could be (probably is) because I don't fully understand what is going on in libjuju. As an example, code we use (along with Model) in simplified form to verify if units have started to appear yet:

    model = Model()
    model_name = await async_get_juju_model()
    await model.connect_model(model_name)

    # then check to see if units have arrived:
    await model.block_until(lambda: len(model.units) > 0)

    await model.disconnect()

This shouldn't really work as the thing being blocked on (len(model.units) > 0) is a sync check of something. But this is in async code. So it looks like the model.units makes progress due to the Connector class which uses get_event_loop() to get the same loop as we're making progress on, and that eventually means that ModelState gets updated which is then read by the property units.

What's not clear to me is why the Exceptions that occur don't get cleaned up. I had to add this code to our loop running to clean things up (for our code), but those Tasks that don't cleaned up happen in our loop runner before the clean-up code gets a chance to run.

        loop.run_until_complete(asyncio.wait([task], loop=loop))

        # Let's also cancel any remaining tasks:
        while True:
            pending_tasks = [p for p in asyncio.Task.all_tasks()
                             if not p.done()]
            if pending_tasks:
                logging.info(
                    "async -> sync. cleaning up pending tasks: len: {}"
                    .format(len(pending_tasks)))
                for pending_task in pending_tasks:
                    pending_task.cancel()
                    try:
                        loop.run_until_complete(pending_task)
                    except asyncio.CancelledError:
                        pass
                    except Exception as e:
                        logging.error(
                            "A pending task caused an exception: {}"
                            .format(str(e)))
            else:
                break

Obviously, this happens after the 2021-04-14 11:03:45 [ERROR] Task exception was never retrieved has been logged.

Finally, I do wonder if model.block_until() needs to have a reconnect=False option, where the caller can just say "yeah, it might disconnect, but try to reconnect anyway". For reference, the workaround I came up with is: https://github.com/openstack-charmers/zaza/pull/439

auria commented 2 years ago

The following, on a model with 116 machines and 1291 units, returned:

>>> from juju import loop
>>> from juju.model import Model
>>> model = Model()
>>> loop.run(model.connect("foundations-maas:openstack"))
RPC: Connection closed, reconnecting
RPC: Connection closed, reconnecting
Receiver: Connection closed, reconnecting
RPC: Automatic reconnect failed

When I increased the MAX_FRAME_SIZE to 64MB (4MB is the default), the problem got fixed:

>>> MAX_FRAME_SIZE = 2**26
>>> model = Model(max_frame_size=MAX_FRAME_SIZE)
>>> loop.run(model.connect("foundations-maas:openstack"))
>>>
ajkavanagh commented 2 years ago

@auria oddly enough, I do have this set for a really a big model which does series upgrades (and returns/holds the action data in the model): https://github.com/openstack-charmers/zaza-openstack-tests/blob/c16bba2705839b2d50223a94b3590981cc532751/zaza/openstack/charm_tests/series_upgrade/parallel_tests.py#L64

Relevant section:

        juju.client.connection.Connection.MAX_FRAME_SIZE = 2**32

So obviously, at some point I also realised this for that particular test. Maybe I'll just build in a 2**30 (256MB) as default into zaza to deal with this; however, it would be nice if python-libjuju could detect this, or 'do the right thing' to make it work.

SimonRichardson commented 2 years ago

Relates to #604

juanmanuel-tirado commented 1 year ago

A workaround was merged in https://github.com/openstack-charmers/zaza/issues/402.