vhive-serverless / vHive

vHive: Open-source framework for serverless experimentation

Netty GRPC throws java.net.ConnectException: Address family not supported by protocol #1023

Closed jasha64 closed 1 month ago

jasha64 commented 1 month ago

Describe the question

I am trying to connect to a gRPC server on the vHive master node from within a serverless pod, using Netty's gRPC transport. The client keeps reporting "java.net.ConnectException: Address family not supported by protocol", whether I set -Djava.net.preferIPv6Stack=true or -Djava.net.preferIPv4Stack=true on both the server and the client side. I wonder if this has anything to do with vHive's network layer.

To Reproduce

I ran the following on Cloudlab's Utah cluster xl170 nodes.

  1. Set up vHive (I did it in stock-only mode)
  2. Run GRPC server on vHive master node. (I used pixelsdb/pixels)
  3. Deploy a Knative serverless service with GRPC client inside it. (I used Docker image docker.io/jasha64/pixels-worker-vhive-stream:202409251842)
  4. Trigger a gRPC call by invoking the Knative service.

Logs

Since I used stock-only mode, no logs from vhive or firecracker-containerd are available. Serverless pod logs from kubectl logs pixels-00001-deployment-bc9b5bd6-xxxxx:

Defaulted container "user-container" out of: user-container, queue-proxy
Picked up JAVA_TOOL_OPTIONS: -Djava.net.preferIPv6Stack=true
2024-09-25 19:00:50,914 [io.pixelsdb.pixels.worker.vhive.WorkerServer]-[INFO] rpc server run successfully
2024-09-25 19:24:01,160 [io.pixelsdb.pixels.worker.vhive.BasePartitionedJoinStreamWorker]-[DEBUG] register worker, local address: 192.168.137.141
2024-09-25 19:24:01,162 [io.pixelsdb.pixels.worker.vhive.BasePartitionedJoinStreamWorker]-[ERROR] error during join
io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
    at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:271) ~[pixels-worker-vhive.jar:?]
    at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:252) ~[pixels-worker-vhive.jar:?]
    at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:165) ~[pixels-worker-vhive.jar:?]
    at io.pixelsdb.pixels.turbo.WorkerCoordinateServiceGrpc$WorkerCoordinateServiceBlockingStub.registerWorker(WorkerCoordinateServiceGrpc.java:473) ~[pixels-worker-vhive.jar:?]
    at io.pixelsdb.pixels.planner.coordinate.WorkerCoordinateService.registerWorker(WorkerCoordinateService.java:72) ~[pixels-worker-vhive.jar:?]
    at io.pixelsdb.pixels.worker.vhive.BasePartitionedJoinStreamWorker.process(BasePartitionedJoinStreamWorker.java:159) [pixels-worker-vhive.jar:?]
    at io.pixelsdb.pixels.worker.vhive.PartitionedJoinStreamWorker.handleRequest(PartitionedJoinStreamWorker.java:39) [pixels-worker-vhive.jar:?]
    at io.pixelsdb.pixels.worker.vhive.PartitionedJoinStreamWorker.handleRequest(PartitionedJoinStreamWorker.java:29) [pixels-worker-vhive.jar:?]
    at io.pixelsdb.pixels.worker.vhive.utils.ServiceImpl.execute(ServiceImpl.java:72) [pixels-worker-vhive.jar:?]
    at io.pixelsdb.pixels.worker.vhive.WorkerServiceImpl.process(WorkerServiceImpl.java:82) [pixels-worker-vhive.jar:?]
    at io.pixelsdb.pixels.turbo.vHiveWorkerServiceGrpc$MethodHandlers.invoke(vHiveWorkerServiceGrpc.java:289) [pixels-worker-vhive.jar:?]
    at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182) [pixels-worker-vhive.jar:?]
    at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:354) [pixels-worker-vhive.jar:?]
    at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:866) [pixels-worker-vhive.jar:?]
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) [pixels-worker-vhive.jar:?]
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133) [pixels-worker-vhive.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
    at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: connect(..) failed: Address family not supported by protocol: /[fe80::9af2:b3ff:fec8:69a4]:18894
Caused by: java.net.ConnectException: connect(..) failed: Address family not supported by protocol
    at io.grpc.netty.shaded.io.netty.channel.unix.Errors.newConnectException0(Errors.java:155) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:128) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.channel.unix.Socket.connect(Socket.java:313) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel.doConnect0(AbstractEpollChannel.java:773) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.channel.epoll.EpollSocketChannel.doConnect0(EpollSocketChannel.java:144) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel.doConnect(AbstractEpollChannel.java:758) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.connect(AbstractEpollChannel.java:600) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1342) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:548) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:533) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.channel.ChannelDuplexHandler.connect(ChannelDuplexHandler.java:54) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.grpc.netty.WriteBufferingAndExceptionHandler.connect(WriteBufferingAndExceptionHandler.java:157) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:548) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.access$1000(AbstractChannelHandlerContext.java:61) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext$9.run(AbstractChannelHandlerContext.java:538) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:391) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:995) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[pixels-worker-vhive.jar:?]
    ... 1 more

containerd output:

time="2024-09-17T14:52:19.212297716-06:00" level=info msg="StartContainer for \"fd78bff11b2f940f8102dd582c7575692bc3d5f17a84a3175441281504e2b028\""
time="2024-09-17T14:52:19.309308442-06:00" level=info msg="StartContainer for \"fd78bff11b2f940f8102dd582c7575692bc3d5f17a84a3175441281504e2b028\" returns successfully"
time="2024-09-25T12:51:34.538431788-06:00" level=info msg="loading plugin \"io.containerd.event.v1.publisher\"..." runtime=io.containerd.runc.v2 type=io.containerd.event.v1
time="2024-09-25T12:51:34.538527143-06:00" level=info msg="loading plugin \"io.containerd.internal.v1.shutdown\"..." runtime=io.containerd.runc.v2 type=io.containerd.internal.v1
time="2024-09-25T12:51:34.538546387-06:00" level=info msg="loading plugin \"io.containerd.ttrpc.v1.task\"..." runtime=io.containerd.runc.v2 type=io.containerd.ttrpc.v1
time="2024-09-25T12:51:34.538731594-06:00" level=info msg="starting signal loop" namespace=moby path=/run/containerd/io.containerd.runtime.v2.task/moby/0d1b86515f58d6f606d9bf7f7bed5e4b2b1a213668e9117753e8582db23702a8 pid=644259 runtime=io.containerd.runc.v2
time="2024-09-25T12:51:34.816160420-06:00" level=info msg="shim disconnected" id=0d1b86515f58d6f606d9bf7f7bed5e4b2b1a213668e9117753e8582db23702a8
time="2024-09-25T12:51:34.816258135-06:00" level=warning msg="cleaning up after shim disconnected" id=0d1b86515f58d6f606d9bf7f7bed5e4b2b1a213668e9117753e8582db23702a8 namespace=moby
time="2024-09-25T12:51:34.816305386-06:00" level=info msg="cleaning up dead shim"
time="2024-09-25T12:51:34.828227143-06:00" level=warning msg="cleanup warnings time=\"2024-09-25T12:51:34-06:00\" level=info msg=\"starting signal loop\" namespace=moby pid=644348 runtime=io.containerd.runc.v2\n"

leokondrashov commented 1 month ago

Hello!

I don't think we tested anything with IPv6, which seems to be the problem in the logs. Can you try IPv4 addresses?
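
For readers following along, a minimal sketch of how to check which address family the failing target belongs to and which stack-preference properties the JVM actually received. It uses only the JDK; the class name `AddressFamilyCheck` and the helper `describe` are hypothetical names chosen for illustration. The two properties printed are the standard JDK networking properties (`java.net.preferIPv4Stack` and `java.net.preferIPv6Addresses`); whether any other flag has an effect is not something this sketch can confirm.

```java
import java.net.Inet6Address;
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper, not part of vHive, pixels, or gRPC.
class AddressFamilyCheck {
    /** Returns "IPv4" or "IPv6" for an IP literal, with a " link-local" suffix where it applies. */
    public static String describe(String literal) throws UnknownHostException {
        // For an IP literal, getByName only validates the format; no DNS lookup is made.
        InetAddress addr = InetAddress.getByName(literal);
        String family = (addr instanceof Inet6Address) ? "IPv6" : "IPv4";
        return addr.isLinkLocalAddress() ? family + " link-local" : family;
    }

    public static void main(String[] args) throws UnknownHostException {
        // Target address taken from the first stack trace in this issue.
        System.out.println(describe("fe80::9af2:b3ff:fec8:69a4"));
        // Standard JDK networking properties; null means the property was not set.
        System.out.println("preferIPv4Stack=" + System.getProperty("java.net.preferIPv4Stack"));
        System.out.println("preferIPv6Addresses=" + System.getProperty("java.net.preferIPv6Addresses"));
    }
}
```

The target in the first trace is in fe80::/10, so it is an IPv6 link-local address, which is only meaningful on the link (interface) it was assigned to.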

Can you please also specify the following:

jasha64 commented 1 month ago

Hello, the problem remains when using IPv4:

Defaulted container "user-container" out of: user-container, queue-proxy
Picked up JAVA_TOOL_OPTIONS: -Djava.net.preferIPv4Stack=true
2024-09-26 13:32:13,257 [io.pixelsdb.pixels.worker.vhive.WorkerServer]-[INFO] rpc server run successfully
2024-09-26 13:35:17,870 [io.pixelsdb.pixels.worker.vhive.BasePartitionedJoinStreamWorker]-[DEBUG] register worker, local address: 192.168.137.190
2024-09-26 13:35:17,908 [io.pixelsdb.pixels.worker.vhive.BasePartitionedJoinStreamWorker]-[ERROR] error during join
io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
    at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:271) ~[pixels-worker-vhive.jar:?]
    at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:252) ~[pixels-worker-vhive.jar:?]
    at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:165) ~[pixels-worker-vhive.jar:?]
    at io.pixelsdb.pixels.turbo.WorkerCoordinateServiceGrpc$WorkerCoordinateServiceBlockingStub.registerWorker(WorkerCoordinateServiceGrpc.java:473) ~[pixels-worker-vhive.jar:?]
    at io.pixelsdb.pixels.planner.coordinate.WorkerCoordinateService.registerWorker(WorkerCoordinateService.java:72) ~[pixels-worker-vhive.jar:?]
    at io.pixelsdb.pixels.worker.vhive.BasePartitionedJoinStreamWorker.process(BasePartitionedJoinStreamWorker.java:159) [pixels-worker-vhive.jar:?]
    at io.pixelsdb.pixels.worker.vhive.PartitionedJoinStreamWorker.handleRequest(PartitionedJoinStreamWorker.java:39) [pixels-worker-vhive.jar:?]
    at io.pixelsdb.pixels.worker.vhive.PartitionedJoinStreamWorker.handleRequest(PartitionedJoinStreamWorker.java:29) [pixels-worker-vhive.jar:?]
    at io.pixelsdb.pixels.worker.vhive.utils.ServiceImpl.execute(ServiceImpl.java:72) [pixels-worker-vhive.jar:?]
    at io.pixelsdb.pixels.worker.vhive.WorkerServiceImpl.process(WorkerServiceImpl.java:82) [pixels-worker-vhive.jar:?]
    at io.pixelsdb.pixels.turbo.vHiveWorkerServiceGrpc$MethodHandlers.invoke(vHiveWorkerServiceGrpc.java:289) [pixels-worker-vhive.jar:?]
    at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182) [pixels-worker-vhive.jar:?]
    at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:354) [pixels-worker-vhive.jar:?]
    at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:866) [pixels-worker-vhive.jar:?]
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) [pixels-worker-vhive.jar:?]
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133) [pixels-worker-vhive.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
    at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: connect(..) failed: Address family not supported by protocol: /128.110.218.225:18894
Caused by: java.net.ConnectException: connect(..) failed: Address family not supported by protocol
    at io.grpc.netty.shaded.io.netty.channel.unix.Errors.newConnectException0(Errors.java:155) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:128) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.channel.unix.Socket.connect(Socket.java:313) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel.doConnect0(AbstractEpollChannel.java:773) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.channel.epoll.EpollSocketChannel.doConnect0(EpollSocketChannel.java:144) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel.doConnect(AbstractEpollChannel.java:758) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.connect(AbstractEpollChannel.java:600) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1342) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:548) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:533) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.channel.ChannelDuplexHandler.connect(ChannelDuplexHandler.java:54) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.grpc.netty.WriteBufferingAndExceptionHandler.connect(WriteBufferingAndExceptionHandler.java:157) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:548) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.access$1000(AbstractChannelHandlerContext.java:61) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext$9.run(AbstractChannelHandlerContext.java:538) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:391) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:995) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[pixels-worker-vhive.jar:?]
    at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[pixels-worker-vhive.jar:?]
    ... 1 more

I did not deploy the gRPC server inside a service or a container; it is a database backend running as a daemon process on the same node as the vHive master, so its placement is independent of vHive. The gRPC client code lives in a query worker that runs as a serverless service.

The call path is as follows: I run a SQL query in the trino CLI, which reaches the worker's serverless service via the Knative URL http://pixels.default.192.168.1.240.sslip.io; the query worker then tries to reach the gRPC server at IP address 128.110.218.225.

If you would like to reproduce: I installed vHive and pixels on the master node, deployed docker.io/jasha64/pixels-worker-vhive-stream:202409251834 as a serverless cloud function, set up minio on the other node inside the Cloudlab cluster, and then ran queries via trino. I can try to add you to my Cloudlab cluster or send you more deployment documents.
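
One way to separate a pod-network problem from a problem in the gRPC/Netty transport is to try the same endpoint with a plain JDK socket from inside the pod. The sketch below is an assumption-laden diagnostic, not the fix from the linked commit: the class name `PlainSocketProbe` and the method `probe` are hypothetical, and the default host and port are simply the values that appear in the second stack trace.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical diagnostic, not part of vHive, pixels, or gRPC.
class PlainSocketProbe {
    /** Attempts a TCP connect with the JDK socket API and reports the outcome as text. */
    public static String probe(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return "connected";
        } catch (IOException e) {
            // Includes the exception class and message, e.g. a ConnectException.
            return "failed: " + e;
        }
    }

    public static void main(String[] args) {
        // Defaults are the endpoint from the log above; override via arguments.
        String host = args.length > 0 ? args[0] : "128.110.218.225";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 18894;
        System.out.println(host + ":" + port + " -> " + probe(host, port, 3000));
    }
}
```

If the plain socket connects while the gRPC client still fails with the same exception, the pod network is reachable and the cause is more likely in the client transport than in vHive's network layer.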

jasha64 commented 1 month ago

It turned out that this is a bug in gRPC. See https://github.com/pixelsdb/pixels/commit/9ed6776b1116822cb2c7abbf86ee65580601d2ce