juicefs tries to connect to an address that is not set in the configuration

greensea commented 5 days ago

What happened:

I have an etcd cluster with 3 nodes. All nodes are in the same VPN (100.80.x.x), and each also has a LAN address (192.168.0.x). The nodes are located in different LANs, so they can't reach each other directly and have to communicate over the VPN (100.80.x.x). I also built the cluster within the VPN.

I created a JuiceFS file system:

juicefs format --storage etcd  --bucket etcd://ss.ts.bbxy.net:2379/ etcd://ss.ts.bbxy.net:2379/myjfs-meta myjfs

Note: ss.ts.bbxy.net is resolved to 100.80.x.x

Then mount it:

juicefs mount  etcd://ss.ts.bbxy.net:2379/myjfs-meta mnt --verbose

Then I copied a large file into mnt, and the command got stuck. juicefs printed some error logs:

...
2024/06/23 17:55:25.744427 juicefs[161011] <DEBUG>: txn with 1 conds and 1 ops took 22.128292ms [tkv_etcd.go:191]
2024/06/23 17:55:25.792084 juicefs[161011] <DEBUG>: txn with 1 conds and 1 ops took 19.145083ms [tkv_etcd.go:191]
{"level":"warn","ts":"2024-06-23T17:55:33.625558+0800","logger":"etcd-client","caller":"v3@v3.5.9/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000dea380/ss.ts.bbxy.net:2380","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 192.168.0.44:2379: i/o timeout\""}
{"level":"info","ts":"2024-06-23T17:55:33.625667+0800","logger":"etcd-client","caller":"v3@v3.5.9/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}
{"level":"warn","ts":"2024-06-23T17:55:33.645211+0800","logger":"etcd-client","caller":"v3@v3.5.9/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0017fca80/ss.ts.bbxy.net:2380","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 192.168.0.33:2379: i/o timeout\""}
{"level":"info","ts":"2024-06-23T17:55:33.645272+0800","logger":"etcd-client","caller":"v3@v3.5.9/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}
2024/06/23 17:55:37.963580 juicefs[161011] <DEBUG>: txn with 1 conds and 1 ops took 20.836128ms [tkv_etcd.go:191]
2024/06/23 17:56:18.342892 juicefs[158671] <WARNING>: Upload chunks/0/4/4107_0_4194304: timeout after 1m0s: function timeout (try 7) [cached_store.go:407]
2024/06/23 17:56:18.376189 juicefs[158671] <WARNING>: Upload chunks/0/4/4107_1_4194304: timeout after 1m0s: function timeout (try 7) [cached_store.go:407]
2024/06/23 17:56:18.400583 juicefs[158671] <WARNING>: Upload chunks/0/4/4107_2_4194304: timeout after 1m0s: function timeout (try 7) [cached_store.go:407]
...

The logs show that juicefs is trying to connect to 192.168.0.44 (the LAN address of ss.ts.bbxy.net) and 192.168.0.33 (another etcd node). Both are LAN addresses, and I can't connect to them because they are in another LAN. I think this is why the file copy gets stuck.
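To make the mismatch easier to spot in a long log, the JSON lines from the etcd client can be filtered for entries where the configured endpoint and the dialed address differ. This is only a rough sketch: the helper name `configured_vs_dialed` is made up for illustration, and it assumes the log lines keep the `target` and `error` fields in the shape quoted above.

```python
import json
import re

# Matches the address the transport actually dialed, as it appears in the
# "error" field of the etcd-client log lines quoted above.
DIAL_RE = re.compile(r"dial tcp (\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def configured_vs_dialed(log_line):
    """Return (configured_endpoint, dialed_address) for one JSON log line.

    Returns None for lines that are not JSON dial-failure entries
    (hypothetical helper, not part of juicefs or etcd).
    """
    try:
        entry = json.loads(log_line)
    except ValueError:
        return None  # plain-text juicefs log lines are not JSON
    if not isinstance(entry, dict):
        return None
    match = DIAL_RE.search(entry.get("error", ""))
    target = entry.get("target", "")
    if match is None or not target:
        return None
    # target looks like "etcd-endpoints://0xc000dea380/ss.ts.bbxy.net:2380"
    configured = target.rsplit("/", 1)[-1]
    dialed = f"{match.group(1)}:{match.group(2)}"
    return configured, dialed


# First warning line from the log above, copied verbatim.
SAMPLE = (
    r'{"level":"warn","ts":"2024-06-23T17:55:33.625558+0800","logger":"etcd-client",'
    r'"caller":"v3@v3.5.9/retry_interceptor.go:62","msg":"retrying of unary invoker failed",'
    r'"target":"etcd-endpoints://0xc000dea380/ss.ts.bbxy.net:2380","attempt":0,'
    r'"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: '
    r'last connection error: connection error: desc = \"transport: Error while dialing: '
    r'dial tcp 192.168.0.44:2379: i/o timeout\""}'
)

print(configured_vs_dialed(SAMPLE))
# -> ('ss.ts.bbxy.net:2380', '192.168.0.44:2379')
```

The configured side is a host name on the VPN, while the dialed side is a 192.168.0.x address that never appears in the format or mount command.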

The weird thing is that I configured the etcd cluster and juicefs within the VPN (100.80.x.x), so juicefs should not know that any LAN address (192.168.0.x) exists and should not try to connect to one.

What you expected to happen:

cp does not get stuck, and juicefs does not try to connect to a LAN address (192.168.0.x).

How to reproduce it (as minimally and precisely as possible):

Already described above.

Anything else we need to know?

No

Environment:

JuiceFS version (use juicefs --version) or Hadoop Java SDK version: juicefs version 1.2.0+2024-06-18.873c47b
Cloud provider or hardware configuration running JuiceFS: self maintained
OS (e.g cat /etc/os-release): Debian trixie/sid
Kernel (e.g. uname -a): Linux 6.8.12-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.8.12-1 (2024-05-31) x86_64 GNU/Linux
Object storage (cloud provider and region, or self maintained): self maintained
Metadata engine info (version, cloud provider managed or self maintained): etcd Version: 3.4.33
Network connectivity (JuiceFS to metadata engine, JuiceFS to object storage): tailscaled VPN
Others:

SandyXSD commented 4 days ago

JuiceFS connects to etcd using the address resolved by the local system DNS. Maybe you can debug this function to see what really happens.

greensea commented 4 days ago

My local system DNS resolves ss.ts.bbxy.net to 100.80.x.x. There are also no entries in /etc/hosts.
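One way to double-check what the system resolver returns is a small script like the one below. It is a sketch only: `resolve_ipv4` is a hypothetical helper, it goes through the OS resolver (which also consults /etc/hosts), and a Go program such as juicefs may resolve names through a different code path, so this confirms the system view rather than exactly what juicefs sees.

```python
import socket


def resolve_ipv4(host, port=2379):
    """Return the sorted, de-duplicated IPv4 addresses the system resolver gives for host."""
    infos = socket.getaddrinfo(
        host, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, (address, port)).
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Host name taken from the report; only a 100.80.x.x address is expected here.
    print(resolve_ipv4("ss.ts.bbxy.net"))
```

If this prints only the VPN address, the 192.168.0.x addresses in the log are not coming from name resolution on the client.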

I am not familiar with etcd. Is it possible that some etcd API returns the nodes' IP addresses and juicefs just picks one of them to connect to?

The weird thing is, if it were a DNS issue, juicefs should not have been able to format the file system.

I will try to debug the function later.

greensea commented 4 days ago

I tried formatting and mounting the fs by IP address instead of domain name, and juicefs still tries to connect to a LAN address.

{"level":"warn","ts":"2024-06-25T17:17:14.245594+0800","logger":"etcd-client","caller":"v3@v3.5.9/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc001712380/100.80.11.44:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 192.168.0.44:2379: i/o timeout\""}

I formatted and mounted the fs by specifying 100.80.11.44, but juicefs still got 192.168.0.44. This is weird: only the etcd node knows that 192.168.0.44 exists. Maybe etcd transmits this 192.168.x.x IP address in some way and juicefs just picks it up?
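The "Auto sync endpoints failed." messages in the earlier log suggest the etcd client has endpoint auto sync enabled. If so, it would periodically replace the endpoints I configured with the client URLs that each member advertises in the cluster's member list, which would explain where 192.168.0.x comes from. Below is a minimal sketch of that idea, not the actual client code: the member names, the `http` scheme, the exact mix of URLs, and the 100.80.0.0/16 range are assumptions for illustration, and only the IP addresses are taken from the logs.

```python
import ipaddress
from urllib.parse import urlsplit

# Assumption: the VPN range is 100.80.0.0/16, inferred from "100.80.x.x" above.
VPN_NET = ipaddress.ip_network("100.80.0.0/16")


def synced_endpoints(members):
    """Flatten the client URLs advertised by every member into one endpoint list."""
    return [url for member in members for url in member["client_urls"]]


def unreachable_from_vpn(endpoints, vpn_net=VPN_NET):
    """Return the endpoints whose host is an IP address outside vpn_net."""
    outside = []
    for endpoint in endpoints:
        host = urlsplit(endpoint).hostname
        try:
            addr = ipaddress.ip_address(host)
        except ValueError:
            continue  # host names would need a DNS lookup; skipped in this sketch
        if addr not in vpn_net:
            outside.append(endpoint)
    return outside


# Hypothetical member list: what the cluster might advertise if some nodes
# were started with their LAN address as the advertised client URL.
members = [
    {"name": "node-a", "client_urls": ["http://192.168.0.44:2379"]},
    {"name": "node-b", "client_urls": ["http://192.168.0.33:2379"]},
    {"name": "node-c", "client_urls": ["http://100.80.11.44:2379"]},
]

print(unreachable_from_vpn(synced_endpoints(members)))
# -> ['http://192.168.0.44:2379', 'http://192.168.0.33:2379']
```

Running `etcdctl member list` against the cluster should show the client URLs each node really advertises. If they are 192.168.0.x addresses, the `--advertise-client-urls` setting on the etcd nodes is the first thing I would check.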

juicedata / juicefs

juicefs tries to connect to an address that is not set in the configuration #4970