istio / istio

Connect, secure, control, and observe services.
https://istio.io
Apache License 2.0

Unable to communicate with RabbitMQ outside the mesh #15896

Closed crhuber closed 4 years ago

crhuber commented 5 years ago

Bug description

We have a number of RabbitMQ consumer pods running with the Istio sidecar. These pods consume messages from RabbitMQ running as a hosted service on CloudAMQP. When a pod starts, the consumer container never receives any messages, even though the queue has messages waiting to be consumed. When we turn off sidecar injection, messages are consumed as expected. Connectivity to CloudAMQP appears to be broken when Istio is running.

Oddly, when we run a second instance of the RabbitMQ consumer process inside the original container (the one with connectivity problems) while the sidecar is running, that second process is able to connect and consume messages.

To troubleshoot, we created a ServiceEntry resource, but it did not seem to have any impact:

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: cloudamqp-external-mesh
spec:
  hosts:
  - cloudamqp.fqdn.tld
  ports:
  - name: rabbitmq
    number: 5672
    protocol: TCP
  location: MESH_EXTERNAL
  resolution: NONE
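
One way to check whether the sidecar actually received a cluster for the external host after applying the ServiceEntry is something like the following (a sketch; the pod and namespace names are placeholders):

istioctl proxy-config cluster <consumer-pod>.<namespace> | grep cloudamqp.fqdn.tld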

I confirmed the outbound traffic policy is ALLOW_ANY:

kubectl get configmap istio -n istio-system -o yaml | grep -o "mode: ALLOW_ANY"
mode: ALLOW_ANY
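
To rule out a plain connectivity problem, the AMQP port can also be probed from the consumer container itself (a sketch; assumes nc is available in the image, and the pod and container names are placeholders):

kubectl exec <consumer-pod> -c <consumer-container> -- nc -vz cloudamqp.fqdn.tld 5672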

Here are stats from the sidecar:

cluster.xds-grpc.assignment_stale: 0
cluster.xds-grpc.assignment_timeout_received: 0
cluster.xds-grpc.bind_errors: 0
cluster.xds-grpc.circuit_breakers.default.cx_open: 0
cluster.xds-grpc.circuit_breakers.default.cx_pool_open: 0
cluster.xds-grpc.circuit_breakers.default.rq_open: 0
cluster.xds-grpc.circuit_breakers.default.rq_pending_open: 0
cluster.xds-grpc.circuit_breakers.default.rq_retry_open: 0
cluster.xds-grpc.circuit_breakers.high.cx_open: 0
cluster.xds-grpc.circuit_breakers.high.cx_pool_open: 0
cluster.xds-grpc.circuit_breakers.high.rq_open: 0
cluster.xds-grpc.circuit_breakers.high.rq_pending_open: 0
cluster.xds-grpc.circuit_breakers.high.rq_retry_open: 0
cluster.xds-grpc.http2.header_overflow: 0
cluster.xds-grpc.http2.headers_cb_no_stream: 0
cluster.xds-grpc.http2.rx_messaging_error: 0
cluster.xds-grpc.http2.rx_reset: 0
cluster.xds-grpc.http2.too_many_header_frames: 0
cluster.xds-grpc.http2.trailers: 0
cluster.xds-grpc.http2.tx_reset: 0
cluster.xds-grpc.internal.upstream_rq_200: 5
cluster.xds-grpc.internal.upstream_rq_2xx: 5
cluster.xds-grpc.internal.upstream_rq_503: 1
cluster.xds-grpc.internal.upstream_rq_5xx: 1
cluster.xds-grpc.internal.upstream_rq_completed: 6
cluster.xds-grpc.lb_healthy_panic: 1
cluster.xds-grpc.lb_local_cluster_not_ok: 0
cluster.xds-grpc.lb_recalculate_zone_structures: 0
cluster.xds-grpc.lb_subsets_active: 0
cluster.xds-grpc.lb_subsets_created: 0
cluster.xds-grpc.lb_subsets_fallback: 0
cluster.xds-grpc.lb_subsets_fallback_panic: 0
cluster.xds-grpc.lb_subsets_removed: 0
cluster.xds-grpc.lb_subsets_selected: 0
cluster.xds-grpc.lb_zone_cluster_too_small: 0
cluster.xds-grpc.lb_zone_no_capacity_left: 0
cluster.xds-grpc.lb_zone_number_differs: 0
cluster.xds-grpc.lb_zone_routing_all_directly: 0
cluster.xds-grpc.lb_zone_routing_cross_zone: 0
cluster.xds-grpc.lb_zone_routing_sampled: 0
cluster.xds-grpc.max_host_weight: 1
cluster.xds-grpc.membership_change: 1
cluster.xds-grpc.membership_degraded: 0
cluster.xds-grpc.membership_excluded: 0
cluster.xds-grpc.membership_healthy: 1
cluster.xds-grpc.membership_total: 1
cluster.xds-grpc.original_dst_host_invalid: 0
cluster.xds-grpc.retry_or_shadow_abandoned: 0
cluster.xds-grpc.update_attempt: 15
cluster.xds-grpc.update_empty: 0
cluster.xds-grpc.update_failure: 0
cluster.xds-grpc.update_no_rebuild: 14
cluster.xds-grpc.update_success: 15
cluster.xds-grpc.upstream_cx_active: 1
cluster.xds-grpc.upstream_cx_close_notify: 2
cluster.xds-grpc.upstream_cx_connect_attempts_exceeded: 0
cluster.xds-grpc.upstream_cx_connect_fail: 0
cluster.xds-grpc.upstream_cx_connect_timeout: 0
cluster.xds-grpc.upstream_cx_destroy: 0
cluster.xds-grpc.upstream_cx_destroy_local: 0
cluster.xds-grpc.upstream_cx_destroy_local_with_active_rq: 0
cluster.xds-grpc.upstream_cx_destroy_remote: 0
cluster.xds-grpc.upstream_cx_destroy_remote_with_active_rq: 4
cluster.xds-grpc.upstream_cx_destroy_with_active_rq: 4
cluster.xds-grpc.upstream_cx_http1_total: 0
cluster.xds-grpc.upstream_cx_http2_total: 5
cluster.xds-grpc.upstream_cx_idle_timeout: 0
cluster.xds-grpc.upstream_cx_max_requests: 0
cluster.xds-grpc.upstream_cx_none_healthy: 1
cluster.xds-grpc.upstream_cx_overflow: 0
cluster.xds-grpc.upstream_cx_pool_overflow: 0
cluster.xds-grpc.upstream_cx_protocol_error: 0
cluster.xds-grpc.upstream_cx_rx_bytes_buffered: 69
cluster.xds-grpc.upstream_cx_rx_bytes_total: 52078227
cluster.xds-grpc.upstream_cx_total: 5
cluster.xds-grpc.upstream_cx_tx_bytes_buffered: 0
cluster.xds-grpc.upstream_cx_tx_bytes_total: 12457743
cluster.xds-grpc.upstream_flow_control_backed_up_total: 0
cluster.xds-grpc.upstream_flow_control_drained_total: 0
cluster.xds-grpc.upstream_flow_control_paused_reading_total: 0
cluster.xds-grpc.upstream_flow_control_resumed_reading_total: 0
cluster.xds-grpc.upstream_internal_redirect_failed_total: 0
cluster.xds-grpc.upstream_internal_redirect_succeeded_total: 0
cluster.xds-grpc.upstream_rq_200: 5
cluster.xds-grpc.upstream_rq_2xx: 5
cluster.xds-grpc.upstream_rq_503: 1
cluster.xds-grpc.upstream_rq_5xx: 1
cluster.xds-grpc.upstream_rq_active: 1
cluster.xds-grpc.upstream_rq_cancelled: 0
cluster.xds-grpc.upstream_rq_completed: 6
cluster.xds-grpc.upstream_rq_maintenance_mode: 0
cluster.xds-grpc.upstream_rq_pending_active: 0
cluster.xds-grpc.upstream_rq_pending_failure_eject: 4
cluster.xds-grpc.upstream_rq_pending_overflow: 0
cluster.xds-grpc.upstream_rq_pending_total: 5
cluster.xds-grpc.upstream_rq_per_try_timeout: 0
cluster.xds-grpc.upstream_rq_retry: 0
cluster.xds-grpc.upstream_rq_retry_overflow: 0
cluster.xds-grpc.upstream_rq_retry_success: 0
cluster.xds-grpc.upstream_rq_rx_reset: 0
cluster.xds-grpc.upstream_rq_timeout: 0
cluster.xds-grpc.upstream_rq_total: 5
cluster.xds-grpc.upstream_rq_tx_reset: 0
cluster.xds-grpc.version: 0
cluster_manager.active_clusters: 530
cluster_manager.cds.update_attempt: 47
cluster_manager.cds.update_failure: 4
cluster_manager.cds.update_rejected: 0
cluster_manager.cds.update_success: 42
cluster_manager.cds.version: 8476332205929295747
cluster_manager.cluster_added: 530
cluster_manager.cluster_modified: 0
cluster_manager.cluster_removed: 0
cluster_manager.cluster_updated: 1472
cluster_manager.cluster_updated_via_merge: 0
cluster_manager.update_merge_cancelled: 0
cluster_manager.update_out_of_merge_window: 0
cluster_manager.warming_clusters: 0
http_mixer_filter.total_check_cache_hit_accepts: 0
http_mixer_filter.total_check_cache_hit_denies: 0
http_mixer_filter.total_check_cache_hits: 0
http_mixer_filter.total_check_cache_misses: 0
http_mixer_filter.total_check_calls: 0
http_mixer_filter.total_quota_cache_hit_accepts: 0
http_mixer_filter.total_quota_cache_hit_denies: 0
http_mixer_filter.total_quota_cache_hits: 0
http_mixer_filter.total_quota_cache_misses: 0
http_mixer_filter.total_quota_calls: 0
http_mixer_filter.total_remote_call_cancellations: 0
http_mixer_filter.total_remote_call_other_errors: 0
http_mixer_filter.total_remote_call_retries: 0
http_mixer_filter.total_remote_call_send_errors: 0
http_mixer_filter.total_remote_call_successes: 0
http_mixer_filter.total_remote_call_timeouts: 0
http_mixer_filter.total_remote_calls: 0
http_mixer_filter.total_remote_check_accepts: 0
http_mixer_filter.total_remote_check_calls: 0
http_mixer_filter.total_remote_check_denies: 0
http_mixer_filter.total_remote_quota_accepts: 0
http_mixer_filter.total_remote_quota_calls: 0
http_mixer_filter.total_remote_quota_denies: 0
http_mixer_filter.total_remote_quota_prefetch_calls: 0
http_mixer_filter.total_remote_report_calls: 595
http_mixer_filter.total_remote_report_other_errors: 0
http_mixer_filter.total_remote_report_send_errors: 0
http_mixer_filter.total_remote_report_successes: 595
http_mixer_filter.total_remote_report_timeouts: 0
http_mixer_filter.total_report_calls: 1042
listener_manager.lds.update_attempt: 47
listener_manager.lds.update_failure: 4
listener_manager.lds.update_rejected: 0
listener_manager.lds.update_success: 42
listener_manager.lds.version: 8476332205929295747
listener_manager.listener_added: 96
listener_manager.listener_create_failure: 0
listener_manager.listener_create_success: 192
listener_manager.listener_modified: 0
listener_manager.listener_removed: 0
listener_manager.total_listeners_active: 96
listener_manager.total_listeners_draining: 0
listener_manager.total_listeners_warming: 0
server.concurrency: 2
server.days_until_first_cert_expiring: 86
server.debug_assertion_failures: 0
server.hot_restart_epoch: 0
server.live: 1
server.memory_allocated: 49567936
server.memory_heap_size: 79437824
server.parent_connections: 0
server.total_connections: 5
server.uptime: 4478
server.version: 7825363
server.watchdog_mega_miss: 0
server.watchdog_miss: 0
tcp_mixer_filter.total_check_cache_hit_accepts: 0
tcp_mixer_filter.total_check_cache_hit_denies: 0
tcp_mixer_filter.total_check_cache_hits: 0
tcp_mixer_filter.total_check_cache_misses: 0
tcp_mixer_filter.total_check_calls: 0
tcp_mixer_filter.total_quota_cache_hit_accepts: 0
tcp_mixer_filter.total_quota_cache_hit_denies: 0
tcp_mixer_filter.total_quota_cache_hits: 0
tcp_mixer_filter.total_quota_cache_misses: 0
tcp_mixer_filter.total_quota_calls: 0
tcp_mixer_filter.total_remote_call_cancellations: 0
tcp_mixer_filter.total_remote_call_other_errors: 0
tcp_mixer_filter.total_remote_call_retries: 0
tcp_mixer_filter.total_remote_call_send_errors: 0
tcp_mixer_filter.total_remote_call_successes: 0
tcp_mixer_filter.total_remote_call_timeouts: 0
tcp_mixer_filter.total_remote_calls: 0
tcp_mixer_filter.total_remote_check_accepts: 0
tcp_mixer_filter.total_remote_check_calls: 0
tcp_mixer_filter.total_remote_check_denies: 0
tcp_mixer_filter.total_remote_quota_accepts: 0
tcp_mixer_filter.total_remote_quota_calls: 0
tcp_mixer_filter.total_remote_quota_denies: 0
tcp_mixer_filter.total_remote_quota_prefetch_calls: 0
tcp_mixer_filter.total_remote_report_calls: 29
tcp_mixer_filter.total_remote_report_other_errors: 0
tcp_mixer_filter.total_remote_report_send_errors: 0
tcp_mixer_filter.total_remote_report_successes: 29
tcp_mixer_filter.total_remote_report_timeouts: 0
tcp_mixer_filter.total_report_calls: 29
cluster.xds-grpc.upstream_cx_connect_ms: P0(nan,0) P25(nan,0) P50(nan,0) P75(nan,0) P90(nan,0) P95(nan,0) P99(nan,0) P99.5(nan,0) P99.9(nan,0) P100(nan,0)
cluster.xds-grpc.upstream_cx_length_ms: P0(nan,290000) P25(nan,300000) P50(nan,540000) P75(nan,1.7e+06) P90(nan,1.76e+06) P95(nan,1.78e+06) P99(nan,1.796e+06) P99.5(nan,1.798e+06) P99.9(nan,1.7996e+06) P100(nan,1.8e+06)

Affected product area (please put an X in all that apply)

[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[x] Networking
[ ] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Expected behavior

Pods that connect to RabbitMQ with the Istio sidecar running have no connectivity issues.

Steps to reproduce the bug

Version (include the output of istioctl version --remote and kubectl version)

client version: 1.2.0
citadel version: 1.2.0
galley version: 1.2.0
ingressgateway version: 1.2.0
policy version: 1.2.0
sidecar-injector version: 1.2.0
telemetry version: 1.2.0

Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.7-eks-c57ff8", GitCommit:"c57ff8e35590932c652433fab07988da79265d5b", GitTreeState:"clean", BuildDate:"2019-06-07T20:43:03Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

How was Istio installed?

Helm chart

Environment where bug was observed (cloud vendor, OS, etc)

Amazon AMI, Amazon EKS

Additionally, please consider attaching a cluster state archive (dump file) to this issue.

crhuber commented 5 years ago

I was able to work around this issue by setting this annotation on the deployment:

traffic.sidecar.istio.io/excludeOutboundPorts: "5672"

However, this is not a permanent solution
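
For reference, the annotation is set on the pod template of the Deployment, not on the Deployment's own metadata. A minimal sketch (the name, labels, and image are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: rabbitmq-consumer
spec:
  selector:
    matchLabels:
      app: rabbitmq-consumer
  template:
    metadata:
      labels:
        app: rabbitmq-consumer
      annotations:
        # bypass the sidecar for outbound traffic to the AMQP port
        traffic.sidecar.istio.io/excludeOutboundPorts: "5672"
    spec:
      containers:
      - name: consumer
        image: example/consumer:latest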

rshriram commented 5 years ago

Oddly, when we run a second instance of the RabbitMQ consumer process inside the original container (the one with connectivity problems) while the sidecar is running, that second process is able to connect and consume messages.

Does the consumer talk to itself by any chance? The observation above indicates that connectivity is working as intended. It could also be that the consumer is racing with Pilot (the consumer calls out before Pilot sends the config, or something strange of that sort).

crhuber commented 5 years ago

@rshriram Yes, we found it to be a race condition where the consumer calls out before Pilot has sent the config. The consumer didn't have any retry logic and didn't handle the exception, so it was difficult to troubleshoot. Ultimately we found that making the application exit on the exception, so that the container gets recreated, fixed the problem.

Is there a better approach to handling these race conditions?

howardjohn commented 5 years ago

We are working on improving the startup ordering problem -- see https://github.com/istio/istio/issues/11130
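
Until that lands, a common interim workaround is to delay the application's own startup until the sidecar reports ready, instead of relying on crash-and-restart. A minimal sketch of an entrypoint wrapper, assuming a shell-capable image with curl, that the sidecar readiness endpoint is served by pilot-agent on localhost:15020/healthz/ready (adjust for your Istio version), and with a placeholder path for the real consumer binary:

#!/bin/sh
# Hypothetical entrypoint wrapper: wait until the istio-proxy sidecar
# reports ready, then start the real consumer process.
until curl -fsS http://localhost:15020/healthz/ready >/dev/null 2>&1; do
  echo "waiting for istio-proxy to become ready..."
  sleep 1
done
exec /app/consumer "$@"   # placeholder path to the actual consumer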

istio-policy-bot commented 4 years ago

🚧 This issue or pull request has been closed due to not having had activity from an Istio team member since 2019-07-31. If you feel this issue or pull request deserves attention, please reopen the issue. Please see this wiki page for more information. Thank you for your contributions.

Created by the issue and PR lifecycle manager.