Open richardstone opened 3 years ago
By default, we are using the name of POD: https://github.com/networkservicemesh/deployments-k8s/blob/main/apps/nsc-memif/nsc.yaml#L29
@richardstone Could you share your diff in the deployment?
Yes that is exactly how i'd like to use, but if I uncomment the mentioned part, it does not work.
Here is my deployment:
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: nsc
labels:
app: nsc
spec:
selector:
matchLabels:
app: nsc
replicas: 1
template:
metadata:
labels:
app: nsc
spec:
serviceAccountName: endpoint-nsc
containers:
- name: {{ template "endpoint.name" . }}-nsc
image: {{ template "endpoint.nsc.image" . }}
imagePullPolicy: {{ .Values.images.nsc.pullPolicy }}
env:
- name: NSM_REQUEST_TIMEOUT
value: 1m
- name: SPIFFE_ENDPOINT_SOCKET
value: unix:///run/spire/sockets/agent.sock
# - name: NSM_NAME
# valueFrom:
# fieldRef:
# fieldPath: metadata.name
- name: NSM_NETWORK_SERVICES
value: {{ .Values.type.nsc }}://icmp-responder/nsm-1
- name: NSM_DIAL_TIMEOUT
value: "60s"
- name: NSM_REQUEST_TIMEOUT
value: "300s"
volumeMounts:
- name: spire-agent-socket
mountPath: /run/spire/sockets
readOnly: true
- name: nsm-socket
mountPath: /var/lib/networkservicemesh
readOnly: true
volumes:
- name: spire-agent-socket
hostPath:
path: /run/spire/sockets
type: Directory
- name: nsm-socket
hostPath:
path: /var/lib/networkservicemesh
type: DirectoryOrCreate
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: endpoint-nsc
OK, thanks! Could you also share logs from the container?
Here is the log: cmd-nsc-vpp.txt
@d-uzlov Could you have a look at this issue ASAP?
@richardstone could you provide info about the cluster and operation system you are using?
Also, which exact names did you test besides "endpoint-nsc"? Are the errors in logs the same for all of the names you tested? If not, could you post logs for those different cases? Could you provide logs of a successful run with default settings?
Am I understanding correctly that short names also don't work? Like, for example, just "nsc" as in the deployment config you posted.
I wasn't able to reproduce the issue with the name you provided. However, I was able to get the same error when the name is too long. On my system "too long" is not applicable to names like "endpoint-nsc-7f9c9cddc9-hjsk5", I had to add ~15 symbols to it to make name too long, but maybe the limit is different for your system.
Here are the logs I get when I try to set the NSM_NAME to the value from your logs:
Hi!
Thanks for the fast response. You're right the name was much longer when it failed. Strange thing is that the kernel2kernel example works well with the longer name. Is there any chance you can make the cmd-nsc-vpp to accept a longer name just like the cmd-nsc? Or do you know where this limitation comes from in case of the vpp image?
Oh, it's good to know that we correctly identified the cause of the issue!
The limitation comes from the fact that memif connection uses unix sockets, with file name containing connection id, and cmd-nsc-vpp uses the name from its config as part of the connection id. Linux has a hard limit on max length of the unix socket path, and when name of the container is very long, name of the socket may exceed the limit.
I guess we should either change the way we generate the socket name, or find some workaround.
Yes, thanks a lot for the investigation!
Is it possible that you'll change the method of socket name generation so there won't be differences in limits for the memif and kernel NSC names?
Yeah, I believe that we will fix it to remove limitations.
Very good, Thanks in advance!
@richardstone @d-uzlov Good catch.
@d-uzlov Is there any reason to set a Connection.ID at all in the cmd-nsc rather than letting the normal connection id generation kick in?
@richardstone Looking more closely at this, while I wholeheartedly support your recommended workaround of not including the name in the connection id in cmd-nsc as a fix for NSM 1.0, we need to get a more comprehensive fix for post NSM 1.0 :)
I am seeing some reports that using relative paths can provide a partial workaround. If true, that should give us a reliable way to work around the issue. Thoughts?
@edwarnicke The best for me would be that I'd be able to use the same long name for the memif nsc that I use for the kernel one.
As a workaround I started to look for a way to be able to take a substring of the full pod name as the NSM_NAME variable but I had no luck with finding a solution for that so far. I guess the thing that many NSCs will have the same name (if i leave out the parameter from my deployment so the NSM_NAME would get it's default value for every replica of the NSC) would cause issues when I start to raise the replica number, so it would be good to keep the uniqueness in the name.
@d-uzlov Is there any reason to set a Connection.ID at all in the cmd-nsc rather than letting the normal connection id generation kick in?
I don't think there is any real benefit in manually setting the connection ID here. We already have path name for each path segment, which is usually used to identify where the connection came through.
But we probably don't want to limit the users in which connection ids they can use, so I think memif should support long ids.
I am seeing some reports that using relative paths can provide a partial workaround. If true, that should give us a reliable way to work around the issue. Thoughts?
I was thinking about keeping a map of [connectionId -> UUID], and using these uuids for socket names instead of connection ids.
Relative paths should work too, though I'm not sure if they will be convenient, since we would be changing current working directory of the program, and some clients may not expect this. Also, I didn't research this properly, but we could get issues with multithreading.
@richardstone Net-net: are you OK for NSM 1.0 of we simply don't use the pod name in the connection id, and we can revisit for a better solution post NSM 1.0?
@edwarnicke It's okay for me. Thanks!
@richardstone Temp fix is in. Lets leave this issue open to get a longer term fix.
Hi!
When I set the NSM_NAME parameter of cmd-nsc-vpp to anything other than the default, I get this error: Jun 17 16:19:42.714 [ERRO] [cmd:/bin/cmd-nsc-vpp] (19.1) proxyListener unable to listen on /tmp/memifproxy/endpoint-nsc-795886dc88-577t6-f96ec20f-ede0-499f-b0cc-819e8f735869/memif.socket: listen unixpacket /tmp/memifproxy/endpoint-nsc-795886dc88-577t6-f96ec20f-ede0-499f-b0cc-819e8f735869/memif.socket: bind: invalid argument
The interface seems to be in place for a second, but there are no neighbors and the pod keeps restarting. vppctl show interface address local0 (dn): memif1/0 (up): L3 172.16.1.96/32
I followed this guide: https://github.com/networkservicemesh/deployments-k8s/tree/main/examples/use-cases/Memif2Memif If I don't set the NSM_NAME parameter, it works correctly.
Is it possible that the given name is not handled correctly somewhere, or could you help me on where to look for issues?