Closed: fr-Pursuit closed this issue 1 year ago
@edwarnicke
@denis-tingaikin Could we have someone look at this? What information might they collect to help us get to the bottom of it?
@fr-Pursuit Hello!
Seems like you need to change the limits for your Docker ;)
This might be useful for you: https://github.com/kubernetes-sigs/kind/issues/2586
@denis-tingaikin Thanks! I changed the following sysctl variables on the VMs, and `nsmgr` is not crashing anymore:

```
fs.inotify.max_user_watches = 524288
fs.inotify.max_user_instances = 512
```
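For reference, a minimal sketch of how those limits can be applied and persisted on each VM (the file name under `/etc/sysctl.d/` is an arbitrary choice):

```sh
# Raise the inotify limits immediately (takes effect without a reboot)
sudo sysctl -w fs.inotify.max_user_watches=524288
sudo sysctl -w fs.inotify.max_user_instances=512

# Persist them across reboots
cat <<'EOF' | sudo tee /etc/sysctl.d/99-kind-inotify.conf
fs.inotify.max_user_watches = 524288
fs.inotify.max_user_instances = 512
EOF
sudo sysctl --system
```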
My two NSCs still can't ping each other, but that's probably a different issue... I'm still looking into it.
Could you also attach this info?
`kubectl describe nodes` for cluster1
`kubectl describe nodes` for cluster2

Here they are:
Although I have configured the VMs' routing tables to route each `192.168.x.0/24` prefix to the correct VM, the Docker subnets (`172.18.0.0/16` and `172.17.1.0/24`) are not accessible outside of the local VM. The fact that they are different is probably due to a configuration file left over from a previous test I made.
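For illustration, such per-VM routes would look roughly like this (the `10.0.0.x` addresses are placeholders for the real VM addresses):

```sh
# On VM 0: route the other clusters' MetalLB prefixes via the other VMs.
# 10.0.0.1 and 10.0.0.2 stand in for the actual addresses of VM 1 and VM 2.
sudo ip route add 192.168.1.0/24 via 10.0.0.1
sudo ip route add 192.168.2.0/24 via 10.0.0.2
```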
@denis-tingaikin Having chatted with @fr-Pursuit a bit offline, it seems that the underlying issue is that in his setup Nodes do not have an ExternalIP, and the InternalIPs are unreachable between clusters.
I think we can probably fix this by implementing https://github.com/networkservicemesh/cmd-nsmgr-proxy/issues/407.
Could we get that going quickly?
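As a quick sanity check, the node addresses each cluster exposes can be listed like this (the `kind-cluster1` context name is an assumption; substitute your own):

```sh
# Show the addresses registered for each node (the EXTERNAL-IP column is
# usually <none> on plain kind clusters)
kubectl --context kind-cluster1 get nodes -o wide

# Or print just the InternalIP of every node
kubectl --context kind-cluster1 get nodes \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}'
```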
> Having chatted with @fr-Pursuit a bit offline, it seems that the underlying issue is that in his setup Nodes do not have an ExternalIP, and the InternalIPs are unreachable between clusters.
Indeed, that was exactly the problem. I tweaked my setup to allow inter-cluster communication between nodes using their InternalIPs, and the example worked.
However, I'm not sure how standard it is to use InternalIPs for inter-cluster communication. And since kind doesn't appear to let you assign ExternalIPs to nodes, it would be great if we could use a `Service` of type `LoadBalancer` to get an ExternalIP for inter-cluster communication (which is possible on kind using MetalLB), as @edwarnicke described in his issue.
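As a rough sketch of that idea, a `LoadBalancer` Service in front of the proxy could look like this; the name, namespace, selector and port below are assumptions and would need to match the actual nsmgr-proxy deployment:

```sh
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: nsmgr-proxy-lb        # hypothetical name
  namespace: nsm-system       # assumed NSM namespace
spec:
  type: LoadBalancer
  selector:
    app: nsmgr-proxy          # assumed label on the nsmgr-proxy pods
  ports:
  - name: tcp
    port: 5004                # placeholder; use the real nsmgr-proxy port
    targetPort: 5004
EOF

# MetalLB then assigns an address from its pool; it shows up under EXTERNAL-IP:
kubectl -n nsm-system get svc nsmgr-proxy-lb
```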
Hello @fr-Pursuit
Is this issue still relevant for you?
Hi @denis-tingaikin
Everything was fixed, thanks!
@fr-Pursuit Perfect!
Feel free to open new issues if you see any problems :)
Hello!
I've tried to deploy the NSM over interdomain vL3 network example on three kind clusters running on three separate VMs.
Unfortunately, on the first two clusters (i.e. the ones containing pods that should connect to each other), one of the two replicas of `nsmgr` constantly crashes and is then stuck in the `CrashLoopBackOff` state. The logs indicate an error about too many open files (`can not create node poller: too many open files`) followed by a segfault. You can view the full logs here.

I've tried to change the system's soft `nofile` limit, but it didn't fix the problem. By inspecting the other instance of `nsmgr`, I then found out that the soft limit was already raised inside the pod to match the system's hard limit (which is `1048576`... I doubt this limit of file descriptors should be exceeded under normal working conditions).

Do you have any ideas about what may cause this issue? Thanks!
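For anyone debugging the same symptom, the limits can be inspected roughly like this (the namespace, label and availability of a shell inside the image are assumptions):

```sh
# Find the nsmgr pods (namespace and label are assumed; adjust to your deployment)
kubectl -n nsm-system get pods -l app=nsmgr -o wide

# Check the soft and hard nofile limits inside a running nsmgr container
kubectl -n nsm-system exec <nsmgr-pod> -- sh -c 'ulimit -Sn; ulimit -Hn'

# A "too many open files" error from a node poller often points at the host's
# inotify limits rather than nofile; check them on each VM:
sysctl fs.inotify.max_user_watches fs.inotify.max_user_instances
```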
PS: I'm using MetalLB as a `LoadBalancer` provider. I gave each instance of MetalLB the `192.168.x.0/24` prefix, where `x` is the VM number (0, 1 or 2). I then manually updated the VMs' routing tables to ensure packets were correctly routed (I confirmed this setup works by verifying that the SPIFFE federation was successful, and by checking that the NSE (un)registrations appeared in the registry's logs on the third cluster).
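For completeness, the per-cluster MetalLB configuration looks roughly like this, shown for `x = 1` and assuming a MetalLB version that uses the `IPAddressPool`/`L2Advertisement` CRDs in the standard `metallb-system` namespace:

```sh
kubectl apply -f - <<'EOF'
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: vm1-pool               # hypothetical name
  namespace: metallb-system
spec:
  addresses:
  - 192.168.1.0/24             # 192.168.x.0/24, with x = VM number
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: vm1-l2                 # hypothetical name
  namespace: metallb-system
spec:
  ipAddressPools:
  - vm1-pool
EOF
```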