googleapis / python-pubsub

Apache License 2.0
390 stars 201 forks source link

Observed recoverable stream error 503 failed to connect to all addresses #785

Closed acocuzzo closed 1 year ago

acocuzzo commented 2 years ago

From @sls-cat comment:

504

Is there any hints as to what causes this? Having an application go into infinite error scroll like this is not helpful. I would except a certain number of reconnect attempts and then failure, instead of an infinite look up reconnect attempts. This has been attempting reconnects for a week now.

INFO 2022-04-06 13:11:09 streaming_pull_manager.py:1030 Observed recoverable stream error 503 failed to connect to all addresses
INFO 2022-04-06 13:11:09 bidi.py:486 Re-established stream
INFO 2022-04-06 13:12:01 streaming_pull_manager.py:1052 Observed non-terminating stream error 503 Connection timed out
INFO 2022-04-06 13:12:01 streaming_pull_manager.py:1030 Observed recoverable stream error 503 Connection timed out
INFO 2022-04-06 13:12:01 bidi.py:486 Re-established stream
INFO 2022-04-06 13:14:32 streaming_pull_manager.py:1052 Observed non-terminating stream error 503 Connection timed out
INFO 2022-04-06 13:14:32 streaming_pull_manager.py:1030 Observed recoverable stream error 503 Connection timed out
INFO 2022-04-06 13:14:32 bidi.py:486 Re-established stream
INFO 2022-04-06 13:15:32 streaming_pull_manager.py:1052 Observed non-terminating stream error 503 Connection timed out
INFO 2022-04-06 13:15:32 streaming_pull_manager.py:1030 Observed recoverable stream error 503 Connection timed out
INFO 2022-04-06 13:15:32 bidi.py:486 Re-established stream
INFO 2022-04-06 13:16:42 streaming_pull_manager.py:1052 Observed non-terminating stream error 503 keepalive watchdog timeout
INFO 2022-04-06 13:16:42 streaming_pull_manager.py:1030 Observed recoverable stream error 503 keepalive watchdog timeout
INFO 2022-04-06 13:16:42 bidi.py:486 Re-established stream
INFO 2022-04-06 13:18:02 streaming_pull_manager.py:1052 Observed non-terminating stream error 503 Connection timed out
INFO 2022-04-06 13:18:02 streaming_pull_manager.py:1030 Observed recoverable stream error 503 Connection timed out
INFO 2022-04-06 13:18:02 bidi.py:486 Re-established stream
INFO 2022-04-06 13:19:31 streaming_pull_manager.py:1052 Observed non-terminating stream error 503 Connection timed out
INFO 2022-04-06 13:19:31 streaming_pull_manager.py:1030 Observed recoverable stream error 503 Connection timed out
INFO 2022-04-06 13:19:32 bidi.py:486 Re-established stream
INFO 2022-04-06 13:20:57 streaming_pull_manager.py:1052 Observed non-terminating stream error 503 keepalive watchdog timeout
INFO 2022-04-06 13:20:57 streaming_pull_manager.py:1030 Observed recoverable stream error 503 keepalive watchdog timeout
INFO 2022-04-06 13:20:57 bidi.py:486 Re-established stream
INFO 2022-04-06 13:21:50 streaming_pull_manager.py:1052 Observed non-terminating stream error 503 keepalive watchdog timeout
INFO 2022-04-06 13:21:50 streaming_pull_manager.py:1030 Observed recoverable stream error 503 keepalive watchdog timeout
INFO 2022-04-06 13:21:50 bidi.py:486 Re-established stream
INFO 2022-04-06 13:22:55 streaming_pull_manager.py:1052 Observed non-terminating stream error 503 keepalive watchdog timeout
INFO 2022-04-06 13:22:55 streaming_pull_manager.py:1030 Observed recoverable stream error 503 keepalive watchdog timeout
INFO 2022-04-06 13:22:55 bidi.py:486 Re-established stream
INFO 2022-04-06 13:24:20 streaming_pull_manager.py:1052 Observed non-terminating stream error 503 keepalive watchdog timeout
INFO 2022-04-06 13:24:20 streaming_pull_manager.py:1030 Observed recoverable stream error 503 keepalive watchdog timeout
INFO 2022-04-06 13:24:20 bidi.py:486 Re-established stream
INFO 2022-04-06 13:25:40 streaming_pull_manager.py:1052 Observed non-terminating stream error 503 failed to connect to all addresses
INFO 2022-04-06 13:25:40 streaming_pull_manager.py:1030 Observed recoverable stream error 503 failed to connect to all addresses
INFO 2022-04-06 13:25:40 bidi.py:486 Re-established stream
INFO 2022-04-06 13:26:53 streaming_pull_manager.py:1052 Observed non-terminating stream error 503 keepalive watchdog timeout
INFO 2022-04-06 13:26:53 streaming_pull_manager.py:1030 Observed recoverable stream error 503 keepalive watchdog timeout
INFO 2022-04-06 13:26:53 bidi.py:486 Re-established stream```

```astroid==2.9.3
attrs==21.4.0
box-pyvault==1.7.0
cachetools==4.2.4
certifi==2021.10.8
charset-normalizer==2.0.12
click==8.0.4
coverage[toml]==6.2
google-api-core[grpc]==2.7.1
google-api-python-client==2.41.0
google-auth==2.6.0
google-auth-httplib2==0.1.0
google-cloud-access-context-manager==0.1.10
google-cloud-asset==3.8.1
google-cloud-bigquery==2.34.2
google-cloud-core==2.2.3
google-cloud-org-policy==1.3.1
google-cloud-os-config==1.11.1
google-cloud-pubsub==2.11.0
google-crc32c==1.3.0
google-resumable-media==2.3.2
googleapis-common-protos[grpc]==1.56.0
grpc-google-iam-v1==0.12.3
grpcio==1.44.0
grpcio-status==1.44.0
httplib2==0.20.4
hvac==0.11.2
idna==3.3
importlib-metadata==4.8.3
iniconfig==1.1.1
isort==5.10.1
lazy-object-proxy==1.7.1
mccabe==0.6.1
mock==4.0.3
packaging==21.3
platformdirs==2.4.0
pluggy==1.0.0
proto-plus==1.20.3
protobuf==3.19.4
py==1.11.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pylint==2.12.2
pyparsing==3.0.7
pytest==7.0.1
pytest-cov==3.0.0
python-dateutil==2.8.2
requests==2.27.1
rsa==4.8
six==1.16.0
toml==0.10.2
tomli==1.2.3
typed-ast==1.5.2
typing-extensions==4.1.1
uritemplate==4.1.1
urllib3==1.26.9
wrapt==1.13.3
zipp==3.6.0

I see it started with this, which appears to have some issues in the SDK:


0 message_size 28611 ack_id U0RQBhYsXUZIUTcZCGhRDk9eIz81IChFEgggagktQlgeDwdcXHUHUQweJChjcztTQAdWT1F3XlwdB2h0bG251Ke7RF9BcWlaEQYETVF_W1keDG1V
WF3zkbe42uXDQ3BmK7-H7_FIf7qtx9ZvZiA9WxJLLD5-NSNFQV5AEkw-FkRJUytDCypYEU4EISE-MD5F) at 2022-03-31 23:24:15.
INFO 2022-03-31 23:24:43 streaming_pull_manager.py:1052 Observed non-terminating stream error 503 Connection timed out
INFO 2022-03-31 23:24:43 streaming_pull_manager.py:1030 Observed recoverable stream error 503 Connection timed out
INFO 2022-03-31 23:24:43 bidi.py:486 Re-established stream
INFO 2022-03-31 23:26:03 streaming_pull_manager.py:1052 Observed non-terminating stream error 503 failed to connect to all addresses
ERROR 2022-03-31 23:26:03 helper_threads.py:120 Error in queue callback worker: 'NoneType' object has no attribute 'set_exception'
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/google/api_core/grpc_helpers.py", line 66, in error_remapped_callable
    return callable_(*args, **kwargs)
  File "/usr/local/lib64/python3.6/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib64/python3.6/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"created":"@1648794363.414576733","description":"Failed to pick subchannel","file":"src/core/ext/filters/clie
nt_channel/client_channel.cc","file_line":3128,"referenced_errors":[{"created":"@1648794363.414575083","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/google/api_core/retry.py", line 190, in retry_target
    return target()
  File "/usr/local/lib/python3.6/site-packages/google/api_core/grpc_helpers.py", line 68, in error_remapped_callable
    raise exceptions.from_grpc_error(exc) from exc
google.api_core.exceptions.ServiceUnavailable: 503 failed to connect to all addresses

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/_protocol/streaming_pull_manager.py", line 589, in send_unary_ack
    self._client.acknowledge(subscription=self._subscription, ack_ids=ack_ids)
  File "/usr/local/lib/python3.6/site-packages/google/pubsub_v1/services/subscriber/client.py", line 1201, in acknowledge
    request, retry=retry, timeout=timeout, metadata=metadata,
  File "/usr/local/lib/python3.6/site-packages/google/api_core/gapic_v1/method.py", line 154, in __call__
    return wrapped_func(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/google/api_core/retry.py", line 288, in retry_wrapped_func
    on_error=on_error,
  File "/usr/local/lib/python3.6/site-packages/google/api_core/retry.py", line 210, in retry_target
    ) from last_exc
google.api_core.exceptions.RetryError: Deadline of 60.0s exceeded while calling target function, last exception: 503 failed to connect to all addresses

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/_protocol/helper_threads.py", line 118, in __call__
    self._callback(items)
  File "/usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/_protocol/dispatcher.py", line 159, in dispatch_callback
    self.ack(ack_requests)
  File "/usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/_protocol/dispatcher.py", line 192, in ack
    ack_reqs_dict=ack_reqs_dict,
  File "/usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/_protocol/streaming_pull_manager.py", line 604, in send_unary_ack
    _process_requests(status, ack_reqs_dict, None)
  File "/usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/_protocol/streaming_pull_manager.py", line 204, in _process_requests
    future.set_exception(exc)
AttributeError: 'NoneType' object has no attribute 'set_exception'
INFO 2022-03-31 23:26:03 streaming_pull_manager.py:1030 Observed recoverable stream error 503 failed to connect to all addresses
ERROR 2022-03-31 23:26:03 bidi.py:673 Thread-ConsumeBidirectionalStream caught unexpected exception 'NoneType' object has no attribute 'set_exception' and will exit.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/google/api_core/grpc_helpers.py", line 66, in error_remapped_callable
    return callable_(*args, **kwargs)
  File "/usr/local/lib64/python3.6/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib64/python3.6/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"created":"@1648794363.414594016","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3128,"referenced_errors":[{"created":"@1648794363.414592880","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/google/api_core/retry.py", line 190, in retry_target
    return target()
  File "/usr/local/lib/python3.6/site-packages/google/api_core/grpc_helpers.py", line 68, in error_remapped_callable
    raise exceptions.from_grpc_error(exc) from exc
google.api_core.exceptions.ServiceUnavailable: 503 failed to connect to all addresses

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/_protocol/streaming_pull_manager.py", line 644, in send_unary_modack
    ack_deadline_seconds=deadline,
  File "/usr/local/lib/python3.6/site-packages/google/pubsub_v1/services/subscriber/client.py", line 1098, in modify_ack_deadline
    request, retry=retry, timeout=timeout, metadata=metadata,
  File "/usr/local/lib/python3.6/site-packages/google/api_core/gapic_v1/method.py", line 154, in __call__
    return wrapped_func(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/google/api_core/retry.py", line 288, in retry_wrapped_func
    on_error=on_error,
  File "/usr/local/lib/python3.6/site-packages/google/api_core/retry.py", line 210, in retry_target
    ) from last_exc
google.api_core.exceptions.RetryError: Deadline of 60.0s exceeded while calling target function, last exception: 503 failed to connect to all addresses

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/google/api_core/bidi.py", line 657, in _thread_main
    self._on_response(response)
  File "/usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/_protocol/streaming_pull_manager.py", line 986, in _on_response
    self._send_lease_modacks(ack_id_gen, self.ack_deadline)
  File "/usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/_protocol/streaming_pull_manager.py", line 932, in _send_lease_modacks
    self._dispatcher.modify_ack_deadline(items)
  File "/usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/_protocol/dispatcher.py", line 295, in modify_ack_deadline
    ack_reqs_dict=ack_reqs_dict,
  File "/usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/_protocol/streaming_pull_manager.py", line 660, in send_unary_modack
    _process_requests(status, ack_reqs_dict, None)
  File "/usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/_protocol/streaming_pull_manager.py", line 204, in _process_requests
    future.set_exception(exc)
AttributeError: 'NoneType' object has no attribute 'set_exception'
INFO 2022-03-31 23:26:03 bidi.py:486 Re-established stream
INFO 2022-03-31 23:26:03 bidi.py:676 Thread-ConsumeBidirectionalStream exiting

_Originally posted by @sls-cat in https://github.com/googleapis/python-pubsub/issues/504#issuecomment-1090767051_
acocuzzo commented 2 years ago

Followup q from: @gzamb

I'm seeing the same thing as @sls-cat. Here are my dependencies:

Package                           Version
--------------------------------- ---------
astroid                           2.8.2
attrs                             21.2.0
backports.entry-points-selectable 1.1.0
bitarray                          2.3.4
black                             21.9b0
CacheControl                      0.12.11
cachetools                        4.2.4
cachy                             0.3.0
certifi                           2021.10.8
cfgv                              3.3.1
charset-normalizer                2.0.7
cleo                              0.8.1
click                             8.0.3
clikit                            0.6.2
coverage                          6.0.2
crashtest                         0.3.1
datadog                           0.42.0
ddtrace                           0.54.0
dependency-injector               4.36.2
distlib                           0.3.3
dotenv-cli                        2.2.0
filelock                          3.3.0
google-api-core                   2.7.3
google-auth                       2.3.0
google-cloud-bigquery             3.0.1
google-cloud-bigquery-storage     2.13.1
google-cloud-core                 2.1.0
google-cloud-pubsub               2.12.1
google-cloud-spanner              3.11.1
google-crc32c                     1.3.0
google-resumable-media            2.0.3
googleapis-common-protos          1.53.0
grpc-google-iam-v1                0.12.3
grpcio                            1.46.3
grpcio-status                     1.41.0
html5lib                          1.1
httpretty                         1.1.4
identify                          2.3.0
idna                              3.2
importlib-metadata                4.11.4
iniconfig                         1.1.1
isort                             5.9.3
keyring                           23.5.1
lazy-object-proxy                 1.6.0
lockfile                          0.12.2
mccabe                            0.6.1
msgpack                           1.0.4
mypy-extensions                   0.4.3
nodeenv                           1.6.0
numpy                             1.22.3
packaging                         21.0
pastel                            0.2.1
pathspec                          0.9.0
pexpect                           4.8.0
pip                               22.1.2
pkginfo                           1.8.2
platformdirs                      2.4.0
pluggy                            1.0.0
poetry                            1.1.13
poetry-core                       1.0.8
powlet                            2.10.1
pre-commit                        2.15.0
proto-plus                        1.19.5
protobuf                          3.18.1
ptyprocess                        0.7.0
py                                1.10.0
pyarrow                           7.0.0
pyasn1                            0.4.8
pyasn1-modules                    0.2.8
pylev                             1.4.0
pylint                            2.11.1
pyparsing                         2.4.7
pytest                            6.2.5
pytest-cov                        2.12.1
pytest-mock                       3.6.1
python-dateutil                   2.8.2
python-dotenv                     0.20.0
python-json-logger                2.0.2
PyYAML                            5.4.1
regex                             2021.10.8
requests                          2.26.0
requests-toolbelt                 0.9.1
rsa                               4.7.2
setuptools                        62.1.0
shellingham                       1.4.0
six                               1.16.0
sqlparse                          0.4.2
tenacity                          8.0.1
toml                              0.10.2
tomli                             1.2.1
tomlkit                           0.11.0
typing-extensions                 3.10.0.2
urllib3                           1.26.7
virtualenv                        20.8.1
webencodings                      0.5.1
wheel                             0.37.1
wrapt                             1.12.1
zipp                              3.8.0

I was previously on v2.8 on python3.9 and it worked fine.

PeteRager commented 2 years ago

I am having the exact same issue. In my case, I’m working on an issue in the Home Assistant Nest Integration effecting hundreds of users.

https://www.home-assistant.io/integrations/nest/ https://github.com/home-assistant/core/issues/70479

The issue is, that when Home Assistant loses connection to the google cloud (home internet outage, LAN failure, etc), there is no notification to the client that the connection has failed AND home assistant continues to believe it is connected and displays stale data for the NEST thermostats - leading the user to believe all is well with their home heating system.

If I override this constant in streaming_manager.py then the stream closes and callback gets called.

    streaming_pull_manager._RETRYABLE_STREAM_ERRORS = ()

But overriding a private member is not a good solve. What would be best is if we could pass a “Retry” object into the client, In which we could allow x number of auto-retries (current behavior) before failing it (desired new behavior.). Similar to what is described here: https://cloud.google.com/python/docs/reference/storage/1.39.0/retry_timeout#configuring-retries

Alternatively, if there is way to accomplish this with the existing API, that would be even better!

2022-10-02 10:18:16.226 INFO (Thread-10 (_run)) [google.api_core.bidi] Re-established stream
2022-10-02 10:18:16.226 DEBUG (Thread-10 (_run)) [google.cloud.pubsub_v1.subscriber._protocol.streaming_pull_manager] Observed non-terminating stream error 503 failed to connect to all addresses; last error: UNKNOWN: Network is unreachable
2022-10-02 10:18:16.226 DEBUG (Thread-10 (_run)) [google.cloud.pubsub_v1.subscriber._protocol.streaming_pull_manager] Observed recoverable stream error 503 failed to connect to all addresses; last error: UNKNOWN: Network is unreachable
2022-10-02 10:18:16.227 DEBUG (Thread-10 (_run)) [google.api_core.bidi] Re-opening stream from gRPC callback.
2022-10-02 10:18:16.232 INFO (Thread-10 (_run)) [google.api_core.bidi] Re-established stream
2022-10-02 10:18:16.233 DEBUG (Thread-ConsumeBidirectionalStream) [google.api_core.bidi] Call to retryable <bound method ResumableBidiRpc._recv of <google.api_core.bidi.ResumableBidiRpc object at 0x7f4ba7f450c0>> caused 503 failed to connect to all addresses; last error: UNKNOWN: Network is unreachable.
2022-10-02 10:18:16.233 DEBUG (Thread-ConsumeBidirectionalStream) [google.cloud.pubsub_v1.subscriber._protocol.streaming_pull_manager] Observed non-terminating stream error 503 failed to connect to all addresses; last error: UNKNOWN: Network is unreachable
2022-10-02 10:18:16.233 DEBUG (Thread-ConsumeBidirectionalStream) [google.cloud.pubsub_v1.subscriber._protocol.streaming_pull_manager] Observed recoverable stream error 503 failed to connect to all addresses; last error: UNKNOWN: Network is unreachable
2022-10-02 10:18:16.234 DEBUG (Thread-ConsumeBidirectionalStream) [google.api_core.bidi] Re-opening stream from retryable <bound method ResumableBidiRpc._recv of <google.api_core.bidi.ResumableBidiRpc object at 0x7f4ba7f450c0>>.
2022-10-02 10:18:17.681 DEBUG (Thread-LeaseMaintainer) [google.cloud.pubsub_v1.subscriber._protocol.leaser] The current deadline value is 10 seconds.
2022-10-02 10:18:17.682 DEBUG (Thread-LeaseMaintainer) [google.cloud.pubsub_v1.subscriber._protocol.leaser] Snoozing lease management for 6.052412 seconds.
2022-10-02 10:18:23.735 DEBUG (Thread-LeaseMaintainer) [google.cloud.pubsub_v1.subscriber._protocol.leaser] The current deadline value is 10 seconds.
2022-10-02 10:18:23.735 DEBUG (Thread-LeaseMaintainer) [google.cloud.pubsub_v1.subscriber._protocol.leaser] Snoozing lease management for 8.714147 seconds.
2022-10-02 10:18:26.201 INFO (Thread-ConsumeBidirectionalStream) [google.api_core.bidi] Re-established stream
acocuzzo commented 1 year ago

For your specific use case, I'm wondering if your "staleness" issue could be solved by checking the age of the last processed message. Is this a possible fix for your specific case?

acocuzzo commented 1 year ago

Changing this to a feature request, and moving to: https://issuetracker.google.com/issues/269679604 This will ensure that the fix can happen across client libraries.