oomichi / try-kubernetes


Octavia: Could not connect to instance. Retrying.: requests.exceptions.SSLError: HTTPSConnectionPool(host='192.168.0.72', port=9443): #98

Closed oomichi closed 2 years ago

oomichi commented 4 years ago

As shown below, octavia-worker hits an error while configuring the amphora instance, and LB creation fails.

2019-09-24 20:51:30.662 2916 WARNING octavia.amphorae.drivers.haproxy.rest_api_driver
   [-] Could not connect to instance. Retrying.: requests.exceptions.ConnectionError:
   HTTPSConnectionPool(host='192.168.0.72', port=9443):
   Max retries exceeded with url: /0.5/info (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f686403a940>: Failed to establish a new connection: [Errno 113] No route to host',))
2019-09-24 20:54:03.460 2916 WARNING octavia.amphorae.drivers.haproxy.rest_api_driver
   [-] Could not connect to instance. Retrying.: requests.exceptions.SSLError: HTTPSConnectionPool(host='192.168.0.72', port=9443): 
   Max retries exceeded with url: /0.5/info (Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)'),))
oomichi commented 4 years ago

The error shows that the HTTPS connection to /0.5/info on the amphora instance fails with CERTIFICATE_VERIFY_FAILED. The HTTPS certificate configuration looks wrong, so review the settings. The latest Devstack changed how this is configured: https://review.opendev.org/#/c/678923/6/devstack/plugin.sh

iniset $OCTAVIA_CONF certificates ca_certificate ${OCTAVIA_CERTS_DIR}/server_ca.cert.pem
iniset $OCTAVIA_CONF certificates ca_private_key ${OCTAVIA_CERTS_DIR}/private/server_ca.key.pem
iniset $OCTAVIA_CONF certificates ca_private_key_passphrase not-secure-passphrase
iniset $OCTAVIA_CONF controller_worker client_ca ${OCTAVIA_CERTS_DIR}/client_ca.cert.pem
iniset $OCTAVIA_CONF haproxy_amphora client_cert ${OCTAVIA_CERTS_DIR}/private/client.cert-and-key.pem
iniset $OCTAVIA_CONF haproxy_amphora server_ca ${OCTAVIA_CERTS_DIR}/server_ca-chain.cert.pem

These are the configuration options to review.
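As a sanity check on the certificate setup, the files can be inspected directly with openssl. A minimal sketch, assuming the devstack defaults above with OCTAVIA_CERTS_DIR set to /etc/octavia/certs (adjust to the actual directory):

$ CERTS=/etc/octavia/certs
$ # Show issuer/subject/expiry of the controller's client certificate
$ openssl x509 -in $CERTS/private/client.cert-and-key.pem -noout -issuer -subject -dates
$ # Confirm the client certificate chains to the client CA that the amphora agent trusts
$ openssl x509 -in $CERTS/private/client.cert-and-key.pem | openssl verify -CAfile $CERTS/client_ca.cert.pem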

oomichi commented 4 years ago

The ca_certificate option had not been set, so I fixed it as above and tried again. → Got a little further.

However, the following error still occurs:

2019-09-24 22:58:36.676 9196 WARNING octavia.controller.worker.controller_worker [-] Task 'reload-lb-before-allocate-vip' (c7e512ee-3011-470e-be99-73786904a355) transitioned into state 'REVERTED' from state 'REVERTING' with result 'None'
2019-09-24 22:58:36.677 9196 DEBUG octavia.controller.worker.controller_worker [-] Task 'octavia.controller.worker.tasks.lifecycle_tasks.LoadBalancerIDToErrorOnRevertTask' (1e8d1e66-b7cc-4215-8a22-a68634e96339) transitioned into state 'REVERTING' from state 'SUCCESS' _task_receiver /usr/lib/python3/dist-packages/taskflow/listeners/logging.py:194
2019-09-24 22:58:36.726 9196 WARNING octavia.controller.worker.controller_worker [-] Task 'octavia.controller.worker.tasks.lifecycle_tasks.LoadBalancerIDToErrorOnRevertTask' (1e8d1e66-b7cc-4215-8a22-a68634e96339) transitioned into state 'REVERTED' from state 'REVERTING' with result 'None'
2019-09-24 22:58:36.729 9196 WARNING octavia.controller.worker.controller_worker [-] Flow 'octavia-create-loadbalancer-flow' (cac27ea7-d400-4ad7-a0cc-9634f0cd2260) transitioned into state 'REVERTED' from state 'RUNNING'
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server [-] Exception during message handling: octavia.amphorae.drivers.haproxy.exceptions.NotFound: Not Found
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/server.py", line 166, in _process_incoming
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/dispatcher.py", line 265, in dispatch
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/dispatcher.py", line 194, in _do_dispatch
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/octavia/controller/queue/endpoint.py", line 45, in create_load_balancer
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server     self.worker.create_load_balancer(load_balancer_id, flavor)
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 292, in wrapped_f
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server     return self.call(f, *args, **kw)
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 358, in call
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server     do = self.iter(retry_state=retry_state)
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 319, in iter
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server     return fut.result()
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server     return self.__get_result()
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server     raise self._exception
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/tenacity/__init__.py", line 361, in call
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server     result = fn(*args, **kwargs)
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/octavia/controller/worker/controller_worker.py", line 343, in create_load_balancer
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server     create_lb_tf.run()
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/taskflow/engines/action_engine/engine.py", line 247, in run
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server     for _state in self.run_iter(timeout=timeout):
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/taskflow/engines/action_engine/engine.py", line 340, in run_iter
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server     failure.Failure.reraise_if_any(er_failures)
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/taskflow/types/failure.py", line 339, in reraise_if_any
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server     failures[0].reraise()
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/taskflow/types/failure.py", line 346, in reraise
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server     six.reraise(*self._exc_info)
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server     raise value
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/taskflow/engines/action_engine/executor.py", line 53, in _execute_task
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server     result = task.execute(**arguments)
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/octavia/controller/worker/tasks/amphora_driver_tasks.py", line 370, in execute
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server     amp_info = self.amphora_driver.get_info(amphora)
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/octavia/amphorae/drivers/haproxy/rest_api_driver.py", line 210, in get_info
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server     return self.client.get_info(amphora)
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/octavia/amphorae/drivers/haproxy/rest_api_driver.py", line 573, in get_info
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server     if exc.check_exception(r):
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/octavia/amphorae/drivers/haproxy/exceptions.py", line 43, in check_exception
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server     raise responses[status_code]()
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server octavia.amphorae.drivers.haproxy.exceptions.NotFound: Not Found
2019-09-24 22:58:36.729 9196 ERROR oslo_messaging.rpc.server
oomichi commented 4 years ago

An error had already occurred before that:

2019-09-24 22:58:24.506 9196 DEBUG octavia.controller.worker.controller_worker
   [-] Task 'STANDALONE-octavia-create-amp-for-lb-subflow-octavia-amp-compute-connectivity-wait' (a106fb81-a8b8-4a7d-be28-ba0443a5a2fc) transitioned into state 'RUNNING' from state 'PENDING' _task_receiver /usr/lib/python3/dist-packages/taskflow/listeners/logging.py:194
2019-09-24 22:58:24.507 9196 DEBUG octavia.amphorae.drivers.haproxy.rest_api_driver
   [-] request url info request /usr/lib/python3/dist-packages/octavia/amphorae/drivers/haproxy/rest_api_driver.py:462
2019-09-24 22:58:24.508 9196 DEBUG octavia.amphorae.drivers.haproxy.rest_api_driver
   [-] request url https://192.168.0.22:9443/0.5/info request /usr/lib/python3/dist-packages/octavia/amphorae/drivers/haproxy/rest_api_driver.py:465
2019-09-24 22:58:31.742 9196 DEBUG octavia.amphorae.drivers.haproxy.rest_api_driver
   [-] Connected to amphora. Response: <Response [404]> request /usr/lib/python3/dist-packages/octavia/amphorae/drivers/haproxy/rest_api_driver.py:486
2019-09-24 22:58:31.743 9196 DEBUG octavia.amphorae.drivers.haproxy.rest_api_driver
   [-] Got a 404 (content-type: application/json) -- connection data: b'{"error":"404 Not Found: The requested URL was not found on the server. If you entered the URL manually please check your spelling and try again.","http_code":404}' request /usr/lib/python3/dist-packages/octavia/amphorae/drivers/haproxy/rest_api_driver.py:496
2019-09-24 22:58:31.744 9196 ERROR octavia.amphorae.drivers.haproxy.exceptions
   [-] Amphora agent returned unexpected result code 404 with response {'error': '404 Not Found: The requested URL was not found on the server. If you entered the URL manually please check your spelling and try again.', 'http_code': 404}
2019-09-24 22:58:31.776 9196 WARNING octavia.controller.worker.controller_worker
   [-] Task 'STANDALONE-octavia-create-amp-for-lb-subflow-octavia-amp-compute-connectivity-wait' (a106fb81-a8b8-4a7d-be28-ba0443a5a2fc) transitioned into state 'FAILURE' from state 'RUNNING'

This is because the connection to https://<instance>/0.5/info, which previously failed with the HTTPS error, now fails with NotFound (404). According to https://ask.openstack.org/en/question/123664/octavia-amphora-curl-got-a-404/, on the Train release the correct URL is apparently https://<instance>/1.0/info:

octavia/amphorae/backends/agent/api_server/server.py

PATH_PREFIX = '/' + api_server.VERSION
...
self.app.add_url_rule(rule=PATH_PREFIX + '/info',
                      view_func=self.get_info,
                      methods=['GET'])

octavia/amphorae/backends/agent/api_server/__init__.py

VERSION = '1.0'
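Since the path prefix comes from api_server.VERSION, which API version a given image actually serves can be checked by probing the agent directly from the controller. A hedged sketch, using the amphora IP from the log above and assuming the devstack cert location:

$ CERT=/etc/octavia/certs/private/client.cert-and-key.pem
$ curl -sk --cert $CERT https://192.168.0.22:9443/0.5/info    # 404 on a 1.0 (master/Train) image
$ curl -sk --cert $CERT https://192.168.0.22:9443/1.0/info    # expected to return agent info JSON on a 1.0 image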

The version was changed by the following commit:

commit 06ce4777c3bad2868e906101bcd2cd7c753b4859
Author: Michael Johnson <johnsomor@gmail.com>
Date:   Thu Jun 27 16:00:22 2019 -0700

    Fix multi-listener load balancers

    Load balancers with multiple listeners, running on an amphora image
    with HAProxy 1.8 or newer can experience excessive memory usage that
    may lead to an ERROR provisioning_status.
    This patch resolves this issue by consolidating the listeners into
    a single haproxy process inside the amphora.

    Story: 2005412
    Task: 34744
    Co-Authored-By: Adam Harwell <flux.adam@gmail.com>
    Change-Id: Idaccbcfa0126f1e26fbb3ad770c65c9266cfad5b

That commit is fairly recent. The Octavia installed here is Stein, and Train is still in development, so Train should not be in this test environment... or so I thought: the amphora image was built following the latest master, and that is where the version mismatch comes from.

I was going to build the amphora image from the stable/stein branch, but the patch above has been backported there as well, so the version is already changed to 1.0... As a workaround, try building from the commit just before the backport. → But that commit predates the fix to tox -e build, so building the image with tox no longer works... (awful). So try the script-based method instead:

$ sudo apt-get install python-pip qemu
$ sudo pip install diskimage-builder
$ sudo pip install dib-utils
$ cd
$ git clone https://github.com/openstack/diskimage-builder
$ git clone https://github.com/openstack/tripleo-image-elements
$ git clone https://github.com/openstack/octavia
$ cd octavia/
$ git checkout -b build-amphora 40440b8a0b97fc119762c37187e57f0e5b634a51
$ ./diskimage-create/diskimage-create.sh

It fails midway as follows:

2019-09-25 01:18:56.878 | (0001 / 0001)
2019-09-25 01:18:56.902 | Caching amphora-agent from https://review.openstack.org/stackforge/octavia in /home/ubuntu/.cache/image-create/source-repositories/amphora_agent_0b620a391d89365fe7271173f1dcc8ff300301ea
2019-09-25 01:18:57.748 | fatal: remote error: Git repository not found
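The failing URL (https://review.openstack.org/stackforge/octavia) is a long-dead repository location cached by the source-repositories element. diskimage-builder supports overriding element repo locations through DIB_REPOLOCATION_<name>/DIB_REPOREF_<name> environment variables, so something along these lines might get past the error (an untested sketch):

$ rm -rf ~/.cache/image-create/source-repositories    # clear the stale cache seen in the log
$ export DIB_REPOLOCATION_amphora_agent=https://github.com/openstack/octavia
$ export DIB_REPOREF_amphora_agent=40440b8a0b97fc119762c37187e57f0e5b634a51
$ ./diskimage-create/diskimage-create.sh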

To build with tox, I would check out stable/stein and revert only the VERSION change. But it turns out stein has no build target in tox. So instead, revert the VERSION change on master:

$ git checkout master
$ git diff
diff --git a/octavia/amphorae/backends/agent/api_server/__init__.py b/octavia/amphorae/backends/agent/api_server/__init__.py
index 5de10d17..f685c838 100644
--- a/octavia/amphorae/backends/agent/api_server/__init__.py
+++ b/octavia/amphorae/backends/agent/api_server/__init__.py
@@ -12,4 +12,4 @@
 #    License for the specific language governing permissions and limitations
 #    under the License.

-VERSION = '1.0'
+VERSION = '0.5'
$ cd diskimage-create/
$ tox -e build
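It may be worth double-checking that the tree being built really carries the reverted version (run from the repository root; the expected output follows from the diff above):

$ git grep "VERSION = " -- octavia/amphorae/backends/agent/api_server/__init__.py
$ # expected: octavia/amphorae/backends/agent/api_server/__init__.py:VERSION = '0.5'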

Deleted the existing amphora image and re-registered it in Glance. → Symptoms unchanged:

$ openstack image delete amphora-x64-haproxy
$ openstack image create --disk-format qcow2 --container-format bare --file ./amphora-x64-haproxy.qcow2 amphora-x64-haproxy
$ openstack image set --tag amphora amphora-x64-haproxy
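Since Octavia selects the amphora image by tag (amp_image_tag in octavia.conf), it is worth confirming the re-registered image actually carries the tag:

$ openstack image list --tag amphora --long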
oomichi commented 4 years ago

It is not clear why the above does not work. Where does the polling side specify the version in the first place? Looking at octavia/amphorae/drivers/haproxy/rest_api_driver.py, clients for both the old and new API versions appear to be provided. How is the switch made in the configuration file?

class AmphoraAPIClient0_5(AmphoraAPIClientBase):
    def __init__(self):
        super(AmphoraAPIClient0_5, self).__init__()
...
class AmphoraAPIClient1_0(AmphoraAPIClientBase):
    def __init__(self):
        super(AmphoraAPIClient1_0, self).__init__()

In the Stein version, the two versioned clients do not exist in the first place:

class HaproxyAmphoraLoadBalancerDriver(
    driver_base.AmphoraLoadBalancerDriver,
        vrrp_rest_driver.KeepalivedAmphoraDriverMixin):

    def __init__(self):
        super(HaproxyAmphoraLoadBalancerDriver, self).__init__()
        self.client = AmphoraAPIClient()
        self.cert_manager = stevedore_driver.DriverManager(
            namespace='octavia.cert_manager',
            name=CONF.certificates.cert_manager,
            invoke_on_load=True,
        ).driver
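For what it's worth, on master the choice does not appear to be a config-file option: the driver seems to probe each amphora and pick the matching client at runtime. If that is right, the agent's root URL should reveal the version it serves; a hedged probe, reusing the cert path assumed earlier:

$ curl -sk --cert /etc/octavia/certs/private/client.cert-and-key.pem https://192.168.0.22:9443/
$ # a 1.0 agent is expected to report its api_version here; a 0.5 agent should just return 404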
oomichi commented 4 years ago

I checked out stable/rocky. Image creation succeeds, but the connection to https://192.168.0.80:9443/0.5/info times out.

oomichi commented 4 years ago

Decided to upgrade and retry once an Ubuntu package of Octavia based on stable/stein (where the amphora API was changed to v1.0) has been released. Also filed this in a cloud-provider-openstack issue.

oomichi commented 4 years ago

Checked out the 4.0.0 (Stein) tag and built it; the build succeeded. However, the timeout persists. If the /0.5/info endpoint simply did not exist, the result would be NotFound (404); a timeout at an earlier stage means we are likely facing a different problem.

Security group review -> no problem: confirmed that 9443/TCP is allowed in the security group, and that this security group is the one specified in octavia.conf.
Ping connectivity -> no problem: ping gets through, so it is specifically the 9443/TCP connection that fails:

$ nova list
+--------------------------------------+----------------------------------------------+--------+------------+-------------+--------------------------+
| ID                                   | Name                                         | Status | Task State | Power State | Networks                 |
+--------------------------------------+----------------------------------------------+--------+------------+-------------+--------------------------+
| 08909f84-6f58-4544-8513-e124cc8a980a | amphora-d756a30b-fc0c-4667-b757-ed54dccfd513 | ACTIVE | -          | Running     | lb-mgmt-net=192.168.0.36 |
+--------------------------------------+----------------------------------------------+--------+------------+-------------+--------------------------+
$ ping 192.168.0.36
PING 192.168.0.36 (192.168.0.36) 56(84) bytes of data.
64 bytes from 192.168.0.36: icmp_seq=1 ttl=64 time=0.511 ms
64 bytes from 192.168.0.36: icmp_seq=2 ttl=64 time=0.357 ms
...
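To confirm it is the TCP handshake itself that fails (rather than something at the TLS layer), a quick check from the controller:

$ nc -zv -w 5 192.168.0.36 9443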

Checking console-log -> the log cannot be retrieved:

$ nova console-log amphora-d756a30b-fc0c-4667-b757-ed54dccfd513
ERROR (TypeError): write() argument must be str, not bytes
$
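The TypeError looks like a python3 novaclient issue rather than a problem with the instance itself; the unified openstack CLI may be able to fetch the console log instead (untested here):

$ openstack console log show amphora-d756a30b-fc0c-4667-b757-ed54dccfd513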
oomichi commented 4 years ago

Before that, Neutron-related errors had appeared in the Octavia logs:

2019-09-25 23:05:10.462 9196 DEBUG octavia.controller.worker.tasks.network_tasks [-] Deallocating a VIP 192.168.1.196 execute /usr/lib/python3/dist-packages/octavia/controller/worker/tasks/network_tasks.py:457
2019-09-25 23:05:10.590 9196 DEBUG octavia.network.drivers.neutron.base [-] Neutron extension security-group found enabled _check_extension_enabled /usr/lib/python3/dist-packages/octavia/network/drivers/neutron/base.py:66
2019-09-25 23:05:10.719 9196 DEBUG neutronclient.v2_0.client [-] Error message: {"message": "The resource could not be found.<br /><br />\nExtension with alias dns-integration does not exist\n\n", "code": "404 Not Found", "title": "Not Found"} _handle_fault_response /usr/lib/python3/dist-packages/neutronclient/v2_0/client.py:259
2019-09-25 23:05:10.719 9196 DEBUG octavia.network.drivers.neutron.base [-] Neutron extension dns-integration is not enabled _check_extension_enabled /usr/lib/python3/dist-packages/octavia/network/drivers/neutron/base.py:70
2019-09-25 23:05:10.849 9196 DEBUG neutronclient.v2_0.client [-] Error message: {"message": "The resource could not be found.<br /><br />\nExtension with alias qos does not exist\n\n", "code": "404 Not Found", "title": "Not Found"} _handle_fault_response /usr/lib/python3/dist-packages/neutronclient/v2_0/client.py:259
2019-09-25 23:05:10.849 9196 DEBUG octavia.network.drivers.neutron.base [-] Neutron extension qos is not enabled _check_extension_enabled /usr/lib/python3/dist-packages/octavia/network/drivers/neutron/base.py:70
2019-09-25 23:05:10.976 9196 DEBUG octavia.network.drivers.neutron.base [-] Neutron extension allowed-address-pairs found enabled _check_extension_enabled /usr/lib/python3/dist-packages/octavia/network/drivers/neutron/base.py:66
2019-09-25 23:05:11.220 9196 DEBUG neutronclient.v2_0.client [-] Error message: {"NeutronError": {"type": "PortNotFound", "message": "Port 2be79251-b99a-4b3b-b9e8-4397914c9e26 could not be found.", "detail": ""}} _handle_fault_response /usr/lib/python3/dist-packages/neutronclient/v2_0/client.py:259
2019-09-25 23:05:11.221 9196 WARNING octavia.network.drivers.neutron.allowed_address_pairs [-] Can't deallocate VIP because the vip port 2be79251-b99a-4b3b-b9e8-4397914c9e26 cannot be found in neutron. Continuing cleanup.: octavia.network.base.PortNotFound: port not found (port id: 2be79251-b99a-4b3b-b9e8-4397914c9e26).