infinyon / fluvio

Lean and mean distributed stream processing system written in Rust and WebAssembly. An alternative to Kafka + Flink in one.
https://www.fluvio.io/
Apache License 2.0
3.9k stars 487 forks

[Bug]: Socket io fails in kubernetes cluster `Name or service not known, can't connect to :9005` #4255

Open sagojez opened 2 weeks ago

sagojez commented 2 weeks ago

What happened

We're deploying the service with custom resource allocation. For reference, our Terraform files look something like this:

resource "kubernetes_namespace" "fluvio_sys" {
  metadata {
    name = "fluvio-sys"
  }
}

# This is a direct reference to a copy of https://github.com/infinyon/fluvio/tree/master/k8-util/helm
resource "helm_release" "fluvio_sys" {
  name       = "fluvio-sys"
  chart      = "../../../../../helm/charts/fluvio-sys"
  version    = "0.12.1"
  namespace  = kubernetes_namespace.fluvio_sys.metadata[0].name
}

# Fluvio development cluster

resource "kubernetes_namespace" "fluvio_development_group" {
  metadata {
    name = "fluvio_development_group"
  }
}

resource "helm_release" "fluvio" {
  name       = "fluvio"
  chart      = "../../../../../helm/charts/fluvio-app"
  version    = "0.12.1"
  namespace  = kubernetes_namespace.fluvio_development_group.metadata[0].name

  set {
    name = "service.type"
    value = "ClusterIP"
  }
}

resource "kubernetes_manifest" "fluvio_spugroup_main" {
  manifest = {
    apiVersion = "fluvio.infinyon.com/v1"
    kind = "SpuGroup"

    metadata = {
      name = "main"
      namespace = kubernetes_namespace.fluvio_development_group.metadata[0].name
    }

    spec = {
      replicas = 1
    }
  }
}

resource "kubernetes_manifest" "fluvio_topic_events" {
  ...
}

resource "kubernetes_manifest" "fluvio_topic_dlq" {
  ...
}

However, when running `fluvio cluster spu list`, I find that the public address is malformed: under Public Endpoint it shows only the port, `:10000`.

When using `fluvio cluster start`, I can see a proper address with a proper port (see image below). My assumption is that something is missing; however, we don't see any option to set the hostname in the templates provided for the k8s deployment.

[Image]

Expected behavior

I would expect the Public Endpoint to be either configurable or at least set correctly when customizing the resources via Terraform.

Describe the setup

Log output

SPG:

fluvio_sc::k8::controllers::spu_service: k8 config: ScK8Config {
    image: "infinyon/fluvio:0.12.1",
    pod_security_context: Some(
        PodSecurityContext {
            fs_group: None,
            run_as_group: None,
            run_as_non_root: None,
            run_as_user: None,
            sysctls: [],
        },
    ),
    lb_service_annotations: {},
    service: Some(
        ServiceSpec {
            cluster_ip: "",
            external_ips: [],
            load_balancer_ip: None,
            type: Some(
                NodePort,
            ),
            external_name: None,
            external_traffic_policy: None,
            ports: [],
            selector: None,
        },
    ),
    spu_pod_config: PodConfig {
        node_selector: {},
        resources: Some(
            ResourceRequirements {
                limits: Object {
                    "memory": String("1Gi"),
                },
                requests: Object {
                    "memory": String("256Mi"),
                },
            },
        ),
        storage_class: None,
        base_node_port: Some(
            30005,
        ),
        extra_containers: [],
        extra_env: [],
        extra_volume_mounts: [],
        extra_volumes: [],
    },
}
...

SC:


2024-11-13T19:27:48.050287Z ERROR MetadataDispatcher{spec="SpuService" namespace="development-fluvio"}:process_ws_action: fluvio_stream_dispatcher::dispatcher::metadata: error: SpuService, applying Failure (422):Service "fluvio-spu-main-0" is invalid: spec.ports[0].nodePort: Invalid value: 30004: provided port is already allocated.
2024-11-13T19:27:58.013944Z ERROR fluvio_sc::k8::controllers::spu_service: error with inner loop: Custom {
    kind: TimedOut,
    error: "store timed out: SpuService Apply: fluvio-spu-main-0 loop: 2, timer: 10000 ms",
}
...
Socket error: Socket io failed to lookup address information: Name or service not known, can't connect to :9005"
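
The SC log above reports that nodePort 30004 is already allocated. As a rough aid for tracking down which Service holds it, here is a minimal Python sketch. It assumes the output of `kubectl get svc --all-namespaces -o json` has been parsed into a dict; the function name `find_node_port_owners` is made up for illustration and is not part of Fluvio or kubectl.

```python
def find_node_port_owners(service_list: dict, node_port: int) -> list:
    """Return "namespace/name" for every Service that already uses node_port.

    service_list is assumed to be the parsed JSON from
    `kubectl get svc --all-namespaces -o json` (a v1 List of Services).
    """
    owners = []
    for svc in service_list.get("items", []):
        meta = svc.get("metadata", {})
        # spec.ports can be missing or null (for example on ExternalName Services).
        for port in svc.get("spec", {}).get("ports") or []:
            if port.get("nodePort") == node_port:
                owners.append(f'{meta.get("namespace")}/{meta.get("name")}')
                break  # one match per Service is enough
    return owners
```

One possible way to use it is to save the kubectl output to a file, load it with `json.load`, and call `find_node_port_owners(data, 30004)`. An empty result would suggest the port was released in the meantime.
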

sagojez commented 2 weeks ago

In case anyone wants to work on this, here's a repository with a minimal setup to reproduce the issue: https://github.com/sagoez/flv-scaffold/tree/main

sagojez commented 1 day ago

After further investigation, we managed to identify the root cause of the issue. It appears that the following configuration is problematic:

  set {
    name = "service.type"
    value = "ClusterIP"
  }

That said, I believe the SC should be able to connect to the SPU without relying on a LoadBalancer. Is there a way to modify this behavior, or is it an intentional design choice?

sehz commented 1 day ago

Can you link to the Helm chart values you are overriding?

The SC and SPU use an internal network configuration to talk to each other, which is different from the public network (where clients connect).

The SC never initiates communication with the SPU. It is the SPU that initiates the connection to the SC, using the private port. The SC only allows communication from the registered list. Network configuration is out of scope for this repo, since it is a deployment concern and differs for each deployment architecture (AWS, GCP, private data center). It is assumed that the deployment operator will configure the network so that the SPU can reach the SC. The configuration in this repo is meant to work only in the most simplistic scenario, and that is the only one that will be supported.

You can test SC reachability from the SPU by running the ping command within the SPU pod. There are other tools that can help diagnose network configuration.
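
Since ping only exercises ICMP, a TCP-level check against the SC's private port can complement it. The following is a minimal sketch, assuming a Python interpreter is available in the SPU pod or in a debug container attached to it (the stock image may not ship one). The helper name `can_connect` is made up for illustration, and the host and port must come from your own deployment.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    if not host:
        # An endpoint such as ":9005" has no host part, so there is nothing to resolve.
        return False
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS lookup failures (socket.gaierror), refused connections, and timeouts.
        return False
```

For example, `can_connect(SC_HOST, SC_PORT)` with `SC_HOST` and `SC_PORT` replaced by the SC's internal service name and private port. A `False` result does not say which of DNS, routing, or the listener is at fault, so it is only a first filter.
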

You can also reach out to support@infinyon.com for commercial support.