sonic-net / sonic-mgmt

Configuration management examples for SONiC

Arista EOS can not complete the initialization during add topology #1403

Open gord1306 opened 4 years ago

gord1306 commented 4 years ago

Description: For the t1 and t1-lag topologies, which use many more Arista EOS VMs, there is a high chance that an Arista EOS VM randomly fails to complete initialization during add topology. Has anyone else encountered this issue? It looks like the aaa service hangs.

ISOLINUX 4.05 2011-12-09  Copyright (C) 1994-2011 H. Peter Anvin et al
Loading linux.....
Loading initrd.....ready.
Aboot 8.0.0-3255441
running dosfsck
Seek to -259129856:Invalid argument
dosfsck 2.11, 12 Mar 2005, FAT32, LFN
Press Control-C now to enter Aboot shell
Booting flash:/vEOS.swi
[    7.610535] Starting new kernel
[    0.000000] Requested dmamem_size = 0x8000000 is too much,reducing to 'available_memory - 2GB' = 0x668d000
Switching rootfsWelcome to Arista Networks EOS 4.15.9M
Mounting filesystems:  [  OK  ]
Starting udev: [  OK  ]
Setting hostname localhost:  [  OK  ]
Entering non-interactive startup
Starting TimeAgent: [  OK  ]
Starting ProcMgr: [  OK  ]
Starting EOS initialization stage 1: /bin/sh: /sys/devices/system/edac/mc/mc0/sdram_scrub_rate: No such file or directory
[  OK  ]
ip6tables: Applying firewall rules: [  OK  ]
iptables: Applying firewall rules: [  OK  ]
iptables: Loading additional modules: nf_conntrack_tftp [  OK  ]
Starting system logger: [  OK  ]
Starting NorCal initialization: [  OK  ]
Retrigger failed udev events[  OK  ]
Starting mcelog daemon
Starting EOS initialization stage 2: [  OK  ]
Starting Power On Self Test (POST): [  OK  ]
Starting crond: [  OK  ]
Completing EOS initialization (press ESC to skip):

It stays in this state for more than an hour. Sometimes it eventually reaches the login prompt, sometimes not.

If I press ESC to skip, it may report that Aaa is not responding:

EOS will continue to boot without waiting for full initialization.
You may not be able to login using normal accounts, but you may be
able to login as root.
Model and Serial Number: unknown
System RAM: 2023296 kB
Flash Memory size:  3.8G

Arista Networks EOS 4.15.9M 
localhost login: admin
Arista Networks EOS 4.15.9M
localhost login: admin
[PyServer ar.Aaa not responding, still trying -- is it running?]
[PyServer ar.Aaa not responding, still trying -- is it running?]
[PyServer ar.Aaa not responding, still trying -- is it running?]

The "Completing EOS initialization" message comes from the Aaa service:

-bash-4.1# grep -r 'Completing EOS initialization' /etc
/etc/rc.d/init.d/Aaa:eosmsg="Completing EOS initialization"
/etc/rc.d/rc0.d/K01Aaa:eosmsg="Completing EOS initialization"
/etc/rc.d/rc1.d/K01Aaa:eosmsg="Completing EOS initialization"
/etc/rc.d/rc2.d/K01Aaa:eosmsg="Completing EOS initialization"
/etc/rc.d/rc3.d/S94Aaa:eosmsg="Completing EOS initialization"
/etc/rc.d/rc4.d/S94Aaa:eosmsg="Completing EOS initialization"
/etc/rc.d/rc5.d/S94Aaa:eosmsg="Completing EOS initialization"
/etc/rc.d/rc6.d/K01Aaa:eosmsg="Completing EOS initialization"
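
The two symptoms above can be told apart from a captured console log. The following is a minimal sketch, assuming the console output is available as text; the function name and the returned labels are hypothetical, and the matched strings are taken from the logs in this report.

```python
def classify_boot(console_text: str) -> str:
    """Classify a captured vEOS console log using strings seen in this report."""
    lines = [ln.strip() for ln in console_text.splitlines() if ln.strip()]
    if not lines:
        return "no_output"
    # Printed repeatedly after ESC is pressed and a login is attempted.
    if any("PyServer ar.Aaa not responding" in ln for ln in lines):
        return "aaa_unresponsive"
    last = lines[-1]
    # Last message printed by the Aaa init script before the login prompt.
    if last.startswith("Completing EOS initialization"):
        return "waiting_on_init"
    if last.endswith("login:"):
        return "login_prompt"
    return "booting"
```

A `waiting_on_init` result only means the VM has not reached the login prompt yet. A caller would still need its own timeout to decide that the VM is stuck rather than slow.
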
nazariig commented 4 years ago

@gord1306

Has anyone ever encountered this issue?

There was a similar issue: sometimes the VM can randomly hang due to a kernel crash (something related to spinlocks).

gord1306 commented 4 years ago

@nazariig, may I have your hardware info for reference (e.g. CPU model, memory, disk size) and the number of vEOS VMs you launch? Thank you. I am trying to reduce the number of VMs now, and I hope this change makes the issue less frequent.

nazariig commented 4 years ago

@gord1306 we are aligned with the official technical requirements: https://github.com/Azure/sonic-mgmt/blob/master/ansible/doc/README.testbed.Overview.md#testbed-server

lguohan commented 4 years ago

Maybe try cEOS, which is much more lightweight than vEOS. Caution: it is still experimental, and not all tests have been run on the cEOS platform.

https://github.com/Azure/sonic-mgmt/blob/master/ansible/doc/README.testbed.VsSetup.md#use-ceos-image-experimental

gord1306 commented 4 years ago

@lguohan, may I ask you about the recovery mechanism? Have you encountered this issue? If so, how do you recover while running tests? The only recovery I have found so far is to reset the VM with virsh reset and then perform add topology again.

lguohan commented 4 years ago

Yes, that is also my recovery method.
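
The reset step of this recovery can be scripted. The following is a minimal sketch, assuming `virsh` is on the PATH of the testbed server; the function name and the `run` parameter are hypothetical, and the domain name in the usage line is a placeholder. Re-running add topology afterwards is left to the caller.

```python
import subprocess


def reset_vm(domain: str, run=subprocess.run) -> bool:
    """Hard-reset a libvirt domain with `virsh reset`; True if virsh exited 0.

    `run` is injectable so the function can be exercised without libvirt.
    """
    result = run(["virsh", "reset", domain], capture_output=True, text=True)
    return result.returncode == 0


if __name__ == "__main__":
    # "VM0100" is a placeholder; use the name shown by `virsh list`.
    print(reset_vm("VM0100"))
```

`virsh reset` behaves like pressing the reset button, so the guest gets no chance to shut down cleanly.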

gord1306 commented 4 years ago

@lguohan, I checked arista.xml.j2 and have a question about the disk cache mode, which is set to writeback and never changed. The KVM default seems to be writethrough. Have you tested writethrough mode before?

lguohan commented 4 years ago

No, I haven't.
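
To check which cache mode a VM actually uses, the `cache` attribute of each disk's `<driver>` element in the libvirt domain XML can be inspected. The following is a minimal sketch; the function name is hypothetical, and the sample XML is an illustrative fragment, not the contents of arista.xml.j2.

```python
import xml.etree.ElementTree as ET


def disk_cache_modes(domain_xml: str) -> dict:
    """Map each disk target device to its cache mode (None if not set)."""
    root = ET.fromstring(domain_xml)
    modes = {}
    for disk in root.findall("./devices/disk"):
        driver = disk.find("driver")
        target = disk.find("target")
        name = target.get("dev") if target is not None else "unknown"
        modes[name] = driver.get("cache") if driver is not None else None
    return modes


# Illustrative domain XML only; the path and device names are made up.
SAMPLE = """
<domain type='kvm'>
  <devices>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='writeback'/>
      <source file='/hypothetical/path/vm.img'/>
      <target dev='hda' bus='ide'/>
    </disk>
  </devices>
</domain>
"""

if __name__ == "__main__":
    print(disk_cache_modes(SAMPLE))
```

For a running VM, the XML could be obtained with `virsh dumpxml <domain>` and passed to the same function.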

nazariig commented 4 years ago

@gord1306 we have found that increasing VM RAM up to 2.5 or 3 GB per instance has a positive effect on the restart/topology change sequence.
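
For reference, libvirt's `<memory>` element defaults to KiB, so the suggested sizes need converting before they go into a domain definition. The following is a small sketch of that arithmetic, assuming GB here means GiB; the function name is hypothetical, and whether arista.xml.j2 expresses memory in KiB is an assumption.

```python
def gib_to_kib(gib: float) -> int:
    """Convert GiB to KiB, the default unit of libvirt's <memory> element."""
    return int(gib * 1024 * 1024)


if __name__ == "__main__":
    for size in (2.5, 3):
        print(f"{size} GiB = {gib_to_kib(size)} KiB")
```

For comparison, the boot log earlier in this thread reports System RAM: 2023296 kB, which is slightly under 2 GiB (2097152 KiB).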