srl-labs / srl-controller

k8s controller for SR Linux nodes scheduled by KNE
BSD 3-Clause "New" or "Revised" License

enabling GNMI in Nokia SRL #9

Closed Azharkuntoji closed 2 years ago

Azharkuntoji commented 2 years ago

Hi @hellt

I am trying to enable the gNMI server in a kind setup. I configured it following this document: https://infocenter.nokia.com/public/SRLINUX200R6A/index.jsp?topic=%2Fcom.srlinux.configbasics%2Fhtml%2Fconfigb-mgmt_servers.html but the gNMI server still does not start: it shows waiting-for-config even after configuration (screenshot attached). Below is my Nokia configuration for gNMI. Please help, and if there is a Nokia forum with experts who can help with such issues or queries, please provide the link.

    A:nokia# show system application gnmi_server
    +-------------+-----+--------------------+---------+-------------+
    | Name        | PID | State              | Version | Last Change |
    +=============+=====+====================+=========+=============+
    | gnmi_server |     | waiting-for-config |         |             |
    +-------------+-----+--------------------+---------+-------------+
    --{ [FACTORY] * candidate shared default }--[ ]--
    A:nokia# info system gnmi-server
        system {
            gnmi-server {
                admin-state enable
                timeout 7200
                rate-limit 60
                session-limit 20
                network-instance mgmt {
                    admin-state enable
                    use-authentication true
                    port 50052
                    tls-profile tls-profile-1
                    source-address [
                        ::
                    ]
                }
            }
        }
    --{ [FACTORY] * candidate shared default }--[ ]--
    A:nokia#

My goal is to query the BGP configuration using gNMI:

    gnmic -a 10.39.35.50:32097 --skip-verify -u admin -p admin get --path "/network-instances/network-instance/protocols/protocol/bgp/"
    target "10.39.35.50:32097" get request failed: failed to create a gRPC client for target "10.39.35.50:32097" : 10.39.35.50:32097: context deadline exceeded
    target "10.39.35.50:32097" get request failed: failed to create a gRPC client for target "10.39.35.50:32097" : 10.39.35.50:32097: context deadline exceeded
    Error: one or more requests failed

image nokia_srl.zip

ixia-nokia-topology.txt nokia-srl-controller-logs-gcp.txt

Azharkuntoji commented 2 years ago

When I tried to deploy the attached topology with the nokia_srl.json file (which contains the gNMI configuration), the gNMI configuration was not applied and I had to configure gNMI and the TLS profiles again.

hellt commented 2 years ago

Hi @Azharkuntoji make sure that you enable admin-state in the gnmi context

--{ running }--[  ]--
A:srl# info from running system gnmi-server  
    system {
        gnmi-server {
            admin-state enable
            network-instance mgmt {
                admin-state enable
                tls-profile clab-profile
            }
        }
    }
Azharkuntoji commented 2 years ago

I have admin-state enabled in gnmi-server and everywhere else applicable. The nokia_srl.json has admin-state enabled. It still doesn't work, and there are some error logs in srl-controller.

hellt commented 2 years ago

@Azharkuntoji can you show me what volume mounts your srlinux pod has?
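
One way to answer this question is to dump the pod object as JSON (for example with `kubectl get pod <podname> -o json`) and list each container's `volumeMounts`. The following is a minimal Python sketch of that idea, not part of the original thread. The helper name `list_volume_mounts` and the trimmed-down sample pod are hypothetical and only illustrate the shape of the data.

```python
import json


def list_volume_mounts(pod):
    """Return (container, volume, mountPath, subPath) tuples from a Pod object dict."""
    mounts = []
    for container in pod.get("spec", {}).get("containers", []):
        for mount in container.get("volumeMounts", []):
            mounts.append((
                container.get("name", ""),
                mount.get("name", ""),
                mount.get("mountPath", ""),
                mount.get("subPath", ""),
            ))
    return mounts


# Hypothetical, heavily trimmed pod object. Real input would be the JSON
# printed by: kubectl get pod <podname> -o json
SAMPLE_POD = json.loads("""
{
  "spec": {
    "containers": [
      {
        "name": "nokia",
        "volumeMounts": [
          {
            "name": "startup-config-volume",
            "mountPath": "/etc/opt/srlinux/config.json",
            "subPath": "config.json",
            "readOnly": true
          }
        ]
      }
    ]
  }
}
""")

if __name__ == "__main__":
    for row in list_volume_mounts(SAMPLE_POD):
        print(row)
```

To use it on a real pod, the JSON from `kubectl` can be saved to a file and loaded with `json.load` instead of the sample string.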

Azharkuntoji commented 2 years ago

@Azharkuntoji can you show me what volume mounts your srlinux pod has?

Hi @hellt,

If the attached output doesn't answer your question, let me know how to fetch the volume mounts for srlinux pods.

I am attaching kubectl describe for the nokia pod.

    ixia@ondatra:~/featureprofiles$ docker volume inspect 7d6a0200d29986e338459a778438966f3737a5665d12d6af3015b5ee4ffd39c1
    [
        {
            "CreatedAt": "2022-07-15T06:00:20Z",
            "Driver": "local",
            "Labels": null,
            "Mountpoint": "/var/lib/docker/volumes/7d6a0200d29986e338459a778438966f3737a5665d12d6af3015b5ee4ffd39c1/_data",
            "Name": "7d6a0200d29986e338459a778438966f3737a5665d12d6af3015b5ee4ffd39c1",
            "Options": null,
            "Scope": "local"
        }
    ]

nokia-kubectl-describe.txt

hellt commented 2 years ago

From what you shared, this line shows that the configmap was mounted at the expected path:

    Mounts:
      /etc/opt/srlinux/config.json from startup-config-volume (ro,path="config.json")

The config map that should contain startup config json is created by kne, and is visible in the describe output:

  startup-config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      nokia-config
    Optional:  false

Now the questions are:

  1. Can you check the contents of /etc/opt/srlinux/config.json file? You can connect to the srlinux node the same way as you would do for any other linux container - kubectl exec -it <podname> -- bash
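
Once the file is copied out of the pod, a small script can confirm whether the startup config actually carries an enabled gNMI server. The sketch below is an illustration added in editing, under stated assumptions: it searches for any JSON key ending in `gnmi-server` because the exact module-qualified key names (such as `srl_nokia-gnmi-server:gnmi-server` in the sample) are an assumption here, and the sample config is hypothetical. The helper names `find_key` and `summarize_gnmi` are illustrative.

```python
import json


def find_key(node, suffix):
    """Depth-first search for the first dict value whose key ends with `suffix`."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key.endswith(suffix):
                return value
            found = find_key(value, suffix)
            if found is not None:
                return found
    elif isinstance(node, list):
        for item in node:
            found = find_key(item, suffix)
            if found is not None:
                return found
    return None


def summarize_gnmi(config):
    """Summarize admin-state and per-network-instance settings of the gNMI server."""
    gnmi = find_key(config, "gnmi-server")
    if not isinstance(gnmi, dict):
        return None  # no gnmi-server container found in this config
    return {
        "admin-state": gnmi.get("admin-state"),
        "network-instances": [
            (ni.get("name"), ni.get("admin-state"), ni.get("tls-profile"))
            for ni in gnmi.get("network-instance", [])
        ],
    }


# Hypothetical config fragment; the key prefixes are an assumption.
SAMPLE_CONFIG = json.loads("""
{
  "srl_nokia-system:system": {
    "srl_nokia-gnmi-server:gnmi-server": {
      "admin-state": "enable",
      "network-instance": [
        {"name": "mgmt", "admin-state": "enable", "tls-profile": "clab-profile"}
      ]
    }
  }
}
""")

if __name__ == "__main__":
    print(summarize_gnmi(SAMPLE_CONFIG))
```

A result of `None`, or an `admin-state` other than `enable` at either level, would point at the config file rather than at the network path.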
Azharkuntoji commented 2 years ago

Hi @hellt

I have attached the contents of /etc/opt/srlinux/config.json as a zip file. It does contain the startup config json as provided by us.

There are 2 issues seen:

  1. While deploying the Nokia pod, I see that certificate generation fails for it:

    nternalTrafficPolicy:nil,},Status:ServiceStatus{LoadBalancer:LoadBalancerStatus{Ingress:[]LoadBalancerIngress{},},Conditions:[]Condition{},},}
    INFO[0002] Node "nokia" resource created
    INFO[0002] nokia - generating self signed certs
    INFO[0002] nokia - waiting for pod to be running
    INFO[0008] nokia - pod running.
    Error: failed to generate cert for node nokia: timed out waiting for SR Linux node nokia to boot:

  2. When I tried to load /etc/opt/srlinux/config.json from sr_cli of the Nokia pod, I see the error below. image

If I remove the gribi-server config from the startup config and deploy with the modified startup config, the certificate generation error still appears, but the Nokia pod is up and running and gnmi_server is in the running state (show system application). It still fails to reply to my gnmic query (below). Is this due to the certificate generation error?

    gnmic -a 10.39.35.50:32097 --skip-verify -u admin -p admin get --path "/network-instances/network-instance/

config.zip
hellt commented 2 years ago
  1. take this config that was created by containerlab for 22.6 version - https://pastebin.com/HZYvmzQC
  2. remove self signed cert generation from your kne topology file, as the cert is already part of the config

try to deploy and check gnmi accessibility

Azharkuntoji commented 2 years ago

Hi @hellt

I am unable to access the provided link. Do I need to log in? Or can you paste those files here?

image

hellt commented 2 years ago

take it from here https://gist.github.com/hellt/7b16697a6dc9d79f7507c62fd0b2d6eb @Azharkuntoji

Azharkuntoji commented 2 years ago

take it from here https://gist.github.com/hellt/7b16697a6dc9d79f7507c62fd0b2d6eb @Azharkuntoji

Hi @hellt

I tried the file shared by you and passed it as a start up file. There were no Certificate errors. Deployment was successful and I could see gnmi-server is in running state.

But the gnmic connection to Nokia is not getting established. As you can see, I am using gnmic to connect to machineIP:port, where the machine IP is the host on which I deployed the Nokia pods and the port is the mapped gNMI port for Nokia:

    gnmic -a 10.39.33.67:32220 --skip-verify -u admin -p admin get --path "/"

I will provide all the logs required for debugging.

image

hellt commented 2 years ago

Use port 57400

Azharkuntoji commented 2 years ago

I also tried the external IP with port 57400.

image

hellt commented 2 years ago

Are you using kind to deploy or is it a pure GCP deployment? Does ssh work using external IP?
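
A plain TCP connect can help separate "the port is not reachable at all" from a gNMI or TLS problem. The sketch below is an editorial illustration, not something used in the thread; the helper name `tcp_probe` is hypothetical. A `timeout` result, as opposed to `refused`, would be consistent with packets being dropped somewhere on the path, although it does not prove where.

```python
import socket


def tcp_probe(host, port, timeout=3.0):
    """Attempt a TCP connect and return 'open', 'refused', 'timeout' or 'error: ...'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


if __name__ == "__main__":
    # Self-contained demo against a local listener. Replace host and port
    # with the node address and gNMI port under test.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    demo_port = listener.getsockname()[1]
    print(tcp_probe("127.0.0.1", demo_port))
    listener.close()
```

Running the same probe against the SSH port and the gNMI port of the node gives a quick comparison of what is reachable from the client machine.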

Azharkuntoji commented 2 years ago

Hi @hellt

I am using a kind deployment. I am actually trying the Nokia deployment with the Ondatra test framework. Ping/SSH to Nokia's external IP is not working. The gNMI issue seems to be on the Nokia side: Ixia gNMI and Arista gNMI both reply to the gnmic get request.

I ran the same experiment with Arista and I do get a reply to the gnmic query (the screenshot below shows the result). The same query for Nokia does not work. Ping/SSH to the Arista external IP does not work either, but it still answers the gnmic query. Attaching the topologies that I am using to deploy both Nokia and Arista.

image ixia-nokia-topology.txt ixia-arista-topology.txt init_arista.txt

hellt commented 2 years ago

@Azharkuntoji Yeah, if you are using kind, then make sure that you use the latest kne build, where we added the bridge ptp plugin that enables srlinux nodes to work without installing the bridge plugin manually.

Azharkuntoji commented 2 years ago

@hellt

I used the latest kind (kind version is v14.0) and kne build (I cloned the latest kne with git clone https://github.com/openconfig/kne.git), but the Nokia pod still does not reply to the gnmi query. I noticed the error below in the Nokia sr_cli: it is unable to retrieve the TLS profile.

image

I even tried everything with a different startup config. Nokia booted up with no certificate generation error and gNMI came up without any error. Then I tried a gNMI Get query against Nokia and it failed. I even tried setting the gNMI source address to the external IP, and that didn't work either. But Arista works without any issue here. Can you connect me to an expert, or tell me what I can do to make it work?

image image

hellt commented 2 years ago

I think you will have to wait a bit. I am working on revisited support for the latest srlinux version on kne, and this PR is what we need to get over the line: https://github.com/openconfig/kne/pull/166

Once it is merged, I will brush up the examples using the 22.6.1 srl version and share them with you here.

hellt commented 2 years ago

@Azharkuntoji check the instructions and the new topo provided here: https://github.com/openconfig/kne/pull/169. You can check out that pull request and build the kne_cli binary to repeat that exercise.

Azharkuntoji commented 2 years ago

Hi @hellt,

I cloned the latest kne_cli and built the binary (as PR 169 is already merged to main). I can still see the issue: the deployment is smooth, but there is no reply to the gnmic query for the nokia pods. I followed the example steps mentioned: create the kind cluster, deploy the controller, and then deploy the r1 and r2 pods. Did I miss anything, or is it just not working for me?

image

hellt commented 2 years ago

@Azharkuntoji make sure that you removed the kind cluster first i.e. kind delete cluster --name kne

Azharkuntoji commented 2 years ago

@Azharkuntoji make sure that you removed the kind cluster first i.e. kind delete cluster --name kne

Yes. I made sure the kind cluster was deleted and then created a new kind cluster using the latest kne.

hellt commented 2 years ago

Since you don't provide the exact commands you used, it is hard to say what went wrong. The likely root cause of the issue you see is that TCP checksums are not calculated in your setup and srlinux pods discard such packets. To verify that, you can use tcpdump on the srlinux side to see whether the packets arriving from the gnmic client have correct checksums.
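
To make "correct checksum" concrete: a TCP checksum is the 16-bit ones'-complement sum (RFC 1071) over an IPv4 pseudo-header plus the TCP segment, and a receiver that validates it gets zero when the segment is intact. The following Python sketch is an editorial illustration of that arithmetic only. It is not how srlinux or tcpdump implement the check, and the helper names and the sample SYN segment are hypothetical.

```python
import socket
import struct


def internet_checksum(data: bytes) -> int:
    """RFC 1071: complemented ones'-complement sum of 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def pseudo_header(src_ip: str, dst_ip: str, tcp_length: int) -> bytes:
    """IPv4 pseudo-header: src, dst, zero byte, protocol number, TCP length."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!BBH", 0, socket.IPPROTO_TCP, tcp_length))


def fill_checksum(src_ip: str, dst_ip: str, segment: bytes) -> bytes:
    """Return the segment with its checksum field (bytes 16-17) computed."""
    zeroed = segment[:16] + b"\x00\x00" + segment[18:]
    csum = internet_checksum(pseudo_header(src_ip, dst_ip, len(zeroed)) + zeroed)
    return zeroed[:16] + struct.pack("!H", csum) + zeroed[18:]


def tcp_checksum_ok(src_ip: str, dst_ip: str, segment: bytes) -> bool:
    """True if summing pseudo-header and segment (checksum included) yields zero."""
    return internet_checksum(pseudo_header(src_ip, dst_ip, len(segment)) + segment) == 0


if __name__ == "__main__":
    # Hypothetical 20-byte SYN segment: ports, seq, ack, offset/flags, window,
    # checksum left at zero, urgent pointer.
    syn = struct.pack("!HHIIHHHH", 40000, 57400, 1, 0, (5 << 12) | 0x002, 65535, 0, 0)
    good = fill_checksum("10.0.0.1", "10.0.0.2", syn)
    print(tcp_checksum_ok("10.0.0.1", "10.0.0.2", good))
```

A segment whose checksum field was never filled in, or was filled in for different addresses, fails this check, which matches the kind of packet described above as being discarded.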

Azharkuntoji commented 2 years ago

@hellt you are probably right about tcp packets discarded by nokia pods due to tcp checksums

  1. Is there any way to check the dropped counters on Nokia for these tcp packets ?
  2. Can we disable the discard of packets by nokia ?

attaching the tcpdump on mgmt interface mgmt.zip

Azharkuntoji commented 2 years ago

since you don't provide the exact commands you used it is hard to say what did you do wrong

These are the steps I am trying:

    # delete kind cluster
    rm_kind_cluster() {
        kind delete cluster 2> /dev/null
        rm -rf $HOME/.kube
        rm -rf $HOME/go/bin/kubectl
    }

    # create kind cluster
    kind create cluster --config=resources/global/kind-config.yaml --wait 5m

    # srl controller pod creation
    kubectl apply -k https://github.com/srl-labs/srl-controller/config/default

    # create the nokia-nokia topology
    kne_cli create topologies/kne/2srl.txt

hellt commented 2 years ago

You didn't follow the steps I mentioned in the PR, hence you don't have correct checksums. Follow the guidelines at https://github.com/openconfig/kne/pull/169#issue-1315617451 and pay special attention to the way the kind cluster is created via the kne_cli tool: ./kne_cli_dev deploy deploy/kne/kind-bridge.yaml

As of today, there is no way to tell srlinux to accept packets with incorrect checksums. That was the reason for us to modify the kind cluster CNI plugin to make sure correct checksums are used.

Azharkuntoji commented 2 years ago

Hi @hellt

I am getting a lot of errors trying to deploy the kind cluster using kne_cli. Can you make it easier for me to install it without knowing much? Can you give me some steps or a document to follow?

image

hellt commented 2 years ago

@Azharkuntoji you have to follow the instructions, remember. Nowhere in the instructions did I use the ~/ondatra-tests/resources/global/kind-config.yaml file. Can you try following the instructions word for word?

Azharkuntoji commented 2 years ago

@hellt Thanks for fixing this issue and helping me follow through. I am closing it.