platinasystems / go


GoES status not OK on i-32 after adding 16K static routes #145

Open sandeep-dutta opened 5 years ago

sandeep-dutta commented 5 years ago

Goes version:
root@invader29:/home/sandeep# goes vnetd -version
fe1: v1.1.3
fe1a: v1.1.0
vnet-platina-mk1: v1.0.0

Goes build checksum- 5046b7c2cdea8604d331dd7e5dd2fb9c85fa21ff

Kernel version:
root@invader29:/home/sandeep# dpkg --list | grep kernel
ii  linux-image-4.13-platina-mk1  4.13-165-gbf3b5fef4591  amd64  Linux kernel, version 4.13-platina-mk1

Noticed that when we add 16K static routes on invader-32 (172.17.2.32) and restart goes afterwards, the vnetd service fails to come up. However, this issue has been observed only on this invader; the other invaders participating in regression have vnetd up and running after adding 16K routes and restarting goes.

Steps to reproduce

root@invader32:/home/sandeep# cp 16k_static_route_interfaces /etc/network/interfaces

root@invader32:/home/sandeep# goes status
GOES status
Mode - XETH
PCI - OK
Check daemons - OK
Check Redis - OK
Check vnet - Not OK
status: vnetd daemon not responding

sandeep-dutta commented 5 years ago

Attached is the journalctl output for i-32 for the last 5 minutes. journalctl_i32.txt

stigt commented 5 years ago

It's well known that if we go over the TCAM limit, vnet will call panic and crash. Did you see a vnet stack trace in /var/log/syslog to check whether this is the known panic? Of course we need to do something better than crashing when there are too many TCAM entries, but I don't think that's in place yet.

sandeep-dutta commented 5 years ago

We have not seen any panic trace in syslog for this issue. In this case we have not gone over the TCAM limit; we have used 16035 entries to store these routes in the TCAM.

sandeep@invader32:~$ ip route | wc -l
16035

kgkannan commented 5 years ago

A couple of inferences based on the TH spec and the current goes driver support for the L3DEFIP (TCAM) table:

  1. L3DEFIP is currently configured to support a maximum of 16K entries (each entry is actually two half-entries; a half entry is a 32-bit entry), so the physical TCAM limit is 8K rows, and each row can accommodate two 32-bit entries => 16K 32-bit entries (see the sketch after this list).
  2. The goes driver (like the SDK) internally handles LPM ordering of the various prefix lengths; there can be corner/limit cases depending on the prefix lengths of the entries being added.
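
To make the capacity arithmetic in item 1 concrete, here is a minimal, self-contained Go sketch. The row count and half-entry width are the assumed figures from this thread (not values read from hardware), and 16035 is the route count reported earlier in this issue:

```go
package main

import "fmt"

func main() {
	const (
		tcamRows          = 8 * 1024 // assumed physical L3DEFIP rows
		halfEntriesPerRow = 2        // each row holds two 32-bit half-entries
	)
	capacity := tcamRows * halfEntriesPerRow // 16K 32-bit (IPv4) entries

	routes := 16035 // /32 static routes reported by `ip route | wc -l` above
	fmt.Printf("capacity=%d routes=%d fits=%v headroom=%d\n",
		capacity, routes, routes <= capacity, capacity-routes)
}
```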

Question for the SQA team: please share the sequence and type of entries, with their prefix lengths, used in the test case.

Probable next steps for dev team:

  1. Check whether there are s/w counters at the vnet/fib level and the fe1 level that can be dumped after the add; we may need to do this with an instrumented image, as logging could be disabled at compile time.
  2. Force a core and dump these counters from the core file.

sandeep-dutta commented 5 years ago

Hi Govind, please find attached the 16K interfaces file, which contains the 16K routes. The prefix length for all static route entries is /32. 16k_static_route_interfaces.txt
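
If it helps cross-check the prefix lengths, a small Go sketch along the following lines could tally them from the attached file. It assumes the routes appear in the file as IPv4 CIDR strings (for example x.x.x.x/32), which may not match the file's actual layout:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

func main() {
	f, err := os.Open("16k_static_route_interfaces.txt")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Match IPv4 CIDR strings such as 10.1.2.3/32 and capture the prefix length.
	cidr := regexp.MustCompile(`\b\d+\.\d+\.\d+\.\d+/(\d+)\b`)
	counts := map[string]int{}

	s := bufio.NewScanner(f)
	for s.Scan() {
		for _, m := range cidr.FindAllStringSubmatch(s.Text(), -1) {
			counts["/"+m[1]]++
		}
	}
	if err := s.Err(); err != nil {
		panic(err)
	}
	fmt.Println(counts) // expect something like map[/32:16035] if all routes are /32
}
```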

sandeep-dutta commented 5 years ago

The issue is again reproducible on i-32. Attaching logs captured by the show_tech.py script.

Current goes version running on i-32:
root@invader32:/tmp/log# goes version
v1.2.0-rc.1

root@invader32:/tmp/log# dpkg --list | grep kernel
4.13.0-170-ga4eca81e3486

20190129_014719_069.zip

rondv commented 5 years ago

Govind is working on this (unable to update assignee)

kgkannan commented 5 years ago

Quick update on debugging:

  1. The problem reported is not a crash but a timeout failure reported by the script that performs the config steps to add 16K static routes followed by a goes restart. The script expects goes-vnetd status OK within 40s (10s + 30s grace time); a minimal sketch of such a poll follows this list.
  2. The problem was noticed on inv32 frequently but has not been seen on a similar node in the regression testbed.
  3. In addition, since the show-tech logs include syslog/journalctl, noticed vnet-fib errors in the adj path: adjacency.go mpAdjForAdj: index out of range adj AdjNil
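
As a point of reference for the timeout in item 1, a minimal Go sketch of the kind of poll the script performs is shown below. The 40s budget and the "Not OK" status text are taken from this thread; the poll interval, the exact string matching, and the program itself are assumptions, not the actual test script:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

func main() {
	deadline := time.Now().Add(40 * time.Second) // 10s + 30s grace time, per the script
	for {
		out, err := exec.Command("goes", "status").CombinedOutput()
		if err == nil && !strings.Contains(string(out), "Not OK") {
			fmt.Println("goes status OK")
			return
		}
		if time.Now().After(deadline) {
			fmt.Println("timed out waiting for goes status OK")
			return
		}
		time.Sleep(2 * time.Second) // assumed poll interval
	}
}
```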

Inferences:

  1. After a goes restart, in addition to other fdb events, vnetd should get 16K+ fdb (route + neighbor) events from the kernel and then program 4*16K fib entries (fib, adj for 4 pipes) in TH via DMA writes. From repro attempts, Linux top shows vnet in a tight loop as expected for I/O, but it mostly returns status OK within 10-15s; I didn't see the case where it took > 30s.
  2. Noticed adj errors too on a few attempts, but more vnet logs need to be enabled; when the adj errors happen, fe1 does not show 16K routes programmed - it's possible the hang gets avoided by vnet bailing out with the adj error (a purely illustrative sketch of this failure mode follows this list).
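
For the adj error noted in item 2 (and quoted above), the following is a purely illustrative Go sketch, not the vnet implementation; the type, sentinel value, and function name are hypothetical. It only shows why using a "nil adjacency" sentinel as an unguarded slice index produces this kind of "index out of range" failure, and the sort of guard that avoids it:

```go
package main

import "fmt"

// adj and adjNil are hypothetical stand-ins for an adjacency index type and
// its "no adjacency" sentinel; they are not the vnet definitions.
type adj uint32

const adjNil = ^adj(0)

// mpAdjFor looks up an adjacency, rejecting the sentinel and any
// out-of-range index instead of letting the slice access panic.
func mpAdjFor(a adj, table []string) (string, bool) {
	if a == adjNil || uint64(a) >= uint64(len(table)) {
		return "", false
	}
	return table[a], true
}

func main() {
	table := []string{"adj0", "adj1"}
	if _, ok := mpAdjFor(adjNil, table); !ok {
		fmt.Println("refused to index the table with the AdjNil sentinel")
	}
	// Without the guard, table[adjNil] would panic with "index out of range".
}
```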

TBD:

  1. Working with Sandeep to isolate test environment variables, if any, and recreate the problem consistently.
  2. With a baseline/consistent set of steps, try at reduced scale to debug better (with fewer logs and a smaller dataset).
  3. Get more vnet/fdb logs to debug the adj errors and triage further; get the exact goes version/tag to build an instrumented image with build flags.

sandeep-dutta commented 5 years ago

Hi Govind,

While executing regression with the following GoES & kernel versions, we noticed that the vnetd service on i-30 (172.17.2.30) was not OK.

root@invader30:/tmp/log# goes version
v1.2.0-rc0

root@invader30:/tmp/log# dpkg --list | grep kernel
ii  kmod                            18-3                      amd64  tools for managing Linux kernel modules
ii  libdrm2:amd64                   2.4.58-2                  amd64  Userspace interface to kernel DRM services -- runtime
ii  linux-image-4.13.0-platina-mk1  4.13.0-178-g13e3790c8eac  amd64  Linux kernel, version 4.13.0-platina-mk1
ii  rsyslog                         8.4.2-1+deb8u2            amd64  reliable system and kernel logging daemon

root@invader30:/tmp/log# goes status
GOES status
Mode - XETH
PCI - OK
Check daemons - OK
Check Redis - OK
Check vnet - Not OK
status: vnetd daemon not responding

The steps that caused the failure were bringing the interfaces down and back up, along with a goes restart, after execution of the 16K static route test case:

ifdown -a --allow vnet
ifup -a --allow vnet
goes restart

I manually tried running the commands but vnetd failed to come up.

Test case steps

However, the issue did not occur on any of the other invaders in setup-1 (i-29, i-31 & i-32). I have not rebooted the invader and have kept it as it is, so you could take a look at it. I will try to see if I can reproduce this on any other invader.

Please find the attached logs generated via show_tech.py script. 20190208_053642_837.zip