oracle / vagrant-projects

Vagrant projects for Oracle products and other examples
Universal Permissive License v1.0

OCNE 1.6 - trouble with Calico readiness #481

Closed hussam-qasem closed 1 year ago

hussam-qasem commented 1 year ago

I've submitted a PR to enable OCNE 1.6. However, I'm having trouble with Calico readiness. Any clues would be greatly appreciated.

Screen Shot 2023-05-01 at 5 17 36 PM
/var/log/messages
```
May 1 14:58:38 master1 NetworkManager[9147]: [1682953118.8515] manager: (calico_tmp_B): new Veth device (/org/freedesktop/NetworkManager/Devices/641)
May 1 14:58:38 master1 NetworkManager[9147]: [1682953118.8529] manager: (calico_tmp_A): new Veth device (/org/freedesktop/NetworkManager/Devices/642)
May 1 14:58:38 master1 systemd-udevd[71677]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
May 1 14:58:38 master1 systemd-udevd[71677]: Could not generate persistent MAC address for calico_tmp_B: No such file or directory
May 1 14:58:38 master1 systemd-udevd[71678]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
May 1 14:58:38 master1 systemd-udevd[71678]: Could not generate persistent MAC address for calico_tmp_A: No such file or directory
```
k -n calico-system logs calico-node-*
```
2023-05-01 14:22:30.722 [INFO][20520] felix/ipsets.go 965: Current state of IP sets family="inet" output="Name: cali40this-host\nType: hash:ip\nRevision: 4\nHeader: family inet hashsize 1024 maxelem 1048576\nSize in memory: 496\nReferences: 0\nNumber of entries: 5\nMembers:\n127.0.0.1\n10.0.2.15\n192.168.56.111\n127.0.0.0\n10.244.200.192\n"
2023-05-01 14:22:30.722 [PANIC][20520] felix/ipsets.go 352: Failed to update IP sets after multiple retries. family="inet"
panic: (*logrus.Entry) 0xc0008e2e00
goroutine 153 [running]:
github.com/sirupsen/logrus.(*Entry).log(0xc00017aaf0, 0x0, {0xc0005d05a0, 0x30})
	/go/pkg/mod/github.com/sirupsen/logrus@v1.9.0/entry.go:260 +0x47e
github.com/sirupsen/logrus.(*Entry).Log(0xc00017aaf0, 0x0, {0xc000597b58?, 0x5?, 0x0?})
	/go/pkg/mod/github.com/sirupsen/logrus@v1.9.0/entry.go:304 +0x4f
github.com/sirupsen/logrus.(*Entry).Panic(...)
	/go/pkg/mod/github.com/sirupsen/logrus@v1.9.0/entry.go:342
github.com/projectcalico/calico/felix/ipsets.(*IPSets).ApplyUpdates(0xc0003fadc0)
	/go/src/github.com/projectcalico/calico/felix/ipsets/ipsets.go:352 +0x75d
github.com/projectcalico/calico/felix/dataplane/linux.(*InternalDataplane).apply.func1({0x34b3c90?, 0xc0003fadc0?})
	/go/src/github.com/projectcalico/calico/felix/dataplane/linux/int_dataplane.go:1995 +0x3d
created by github.com/projectcalico/calico/felix/dataplane/linux.(*InternalDataplane).apply
	/go/src/github.com/projectcalico/calico/felix/dataplane/linux/int_dataplane.go:1994 +0x125f
2023-05-01 14:22:30.791 [INFO][20591] felix/daemon.go 378: Successfully loaded configuration.
GOMAXPROCS=1 builddate="2023-03-06T11:01:12+0000" config=&config.Config{UseInternalDataplaneDriver:true, DataplaneDriver:"calico-iptables-plugin", DataplaneWatchdogTimeout:90000000000, WireguardEnabled:false, WireguardEnabledV6:false, WireguardListeningPort:51820, WireguardListeningPortV6:51821, WireguardRoutingRulePriority:99, WireguardInterfaceName:"wireguard.cali", WireguardInterfaceNameV6:"wg-v6.cali", WireguardMTU:0, WireguardMTUV6:0, WireguardHostEncryptionEnabled:false, WireguardPersistentKeepAlive:0, BPFEnabled:false, BPFDisableUnprivileged:true, BPFLogLevel:"off", BPFDataIfacePattern:(*regexp.Regexp)(0xc0008dac80), BPFL3IfacePattern:(*regexp.Regexp)(nil), BPFConnectTimeLoadBalancingEnabled:true, BPFExternalServiceMode:"tunnel", BPFKubeProxyIptablesCleanupEnabled:true, BPFKubeProxyMinSyncPeriod:1000000000, BPFKubeProxyEndpointSlicesEnabled:true, BPFExtToServiceConnmark:0, BPFPSNATPorts:numorstring.Port{MinPort:0x4e20, MaxPort:0x752f, PortName:""}, BPFMapSizeNATFrontend:65536, BPFMapSizeNATBackend:262144, BPFMapSizeNATAffinity:65536, BPFMapSizeRoute:262144, BPFMapSizeConntrack:512000, BPFMapSizeIPSets:1048576, BPFMapSizeIfState:1000, BPFHostConntrackBypass:true, BPFEnforceRPF:"Strict", BPFPolicyDebugEnabled:true, DebugBPFCgroupV2:"", DebugBPFMapRepinEnabled:false, DatastoreType:"kubernetes", FelixHostname:"worker1.vagrant.vm", EtcdAddr:"127.0.0.1:2379", EtcdScheme:"http", EtcdKeyFile:"", EtcdCertFile:"", EtcdCaFile:"", EtcdEndpoints:[]string(nil), TyphaAddr:"", TyphaK8sServiceName:"calico-typha", TyphaK8sNamespace:"calico-system", TyphaReadTimeout:30000000000, TyphaWriteTimeout:10000000000, TyphaKeyFile:"/node-certs/tls.key", TyphaCertFile:"/node-certs/tls.crt", TyphaCAFile:"/etc/pki/tls/certs/tigera-ca-bundle.crt", TyphaCN:"typha-server", TyphaURISAN:"", Ipv6Support:false, BpfIpv6Support:false, IptablesBackend:"auto", RouteRefreshInterval:90000000000, InterfaceRefreshInterval:90000000000, DeviceRouteSourceAddress:net.IP(nil), 
DeviceRouteSourceAddressIPv6:net.IP(nil), DeviceRouteProtocol:3, RemoveExternalRoutes:true, IptablesRefreshInterval:90000000000, IptablesPostWriteCheckIntervalSecs:1000000000, IptablesLockFilePath:"/run/xtables.lock", IptablesLockTimeoutSecs:0, IptablesLockProbeIntervalMillis:50000000, FeatureDetectOverride:map[string]string(nil), FeatureGates:map[string]string(nil), IpsetsRefreshInterval:10000000000, MaxIpsetSize:1048576, XDPRefreshInterval:90000000000, PolicySyncPathPrefix:"", NetlinkTimeoutSecs:10000000000, MetadataAddr:"", MetadataPort:8775, OpenstackRegion:"", InterfacePrefix:"cali", InterfaceExclude:[]*regexp.Regexp{(*regexp.Regexp)(0xc0008dadc0)}, ChainInsertMode:"insert", DefaultEndpointToHostAction:"ACCEPT", IptablesFilterAllowAction:"ACCEPT", IptablesMangleAllowAction:"ACCEPT", LogPrefix:"calico-packet", LogFilePath:"", LogSeverityFile:"", LogSeverityScreen:"INFO", LogSeveritySys:"", LogDebugFilenameRegex:(*regexp.Regexp)(nil), VXLANEnabled:(*bool)(nil), VXLANPort:4789, VXLANVNI:4096, VXLANMTU:0, VXLANMTUV6:0, IPv4VXLANTunnelAddr:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xa, 0xf4, 0xc8, 0xc0}, IPv6VXLANTunnelAddr:net.IP(nil), VXLANTunnelMACAddr:"", VXLANTunnelMACAddrV6:"", IpInIpEnabled:(*bool)(nil), IpInIpMtu:0, IpInIpTunnelAddr:net.IP(nil), FloatingIPs:"Disabled", AllowVXLANPacketsFromWorkloads:false, AllowIPIPPacketsFromWorkloads:false, AWSSrcDstCheck:"DoNothing", ServiceLoopPrevention:"Drop", WorkloadSourceSpoofing:"Disabled", ReportingIntervalSecs:0, ReportingTTLSecs:90000000000, EndpointReportingEnabled:false, EndpointReportingDelaySecs:1000000000, IptablesMarkMask:0xffff0000, DisableConntrackInvalidCheck:false, HealthEnabled:true, HealthPort:9099, HealthHost:"localhost", HealthTimeoutOverrides:map[string]time.Duration(nil), PrometheusMetricsEnabled:false, PrometheusMetricsHost:"", PrometheusMetricsPort:9091, PrometheusGoMetricsEnabled:true, PrometheusProcessMetricsEnabled:true, PrometheusWireGuardMetricsEnabled:true, 
FailsafeInboundHostPorts:[]config.ProtoPort{config.ProtoPort{Net:"", Protocol:"tcp", Port:0x16}, config.ProtoPort{Net:"", Protocol:"udp", Port:0x44}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0xb3}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0x94b}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0x94c}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0x1561}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0x192b}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0x1a0a}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0x1a0b}}, FailsafeOutboundHostPorts:[]config.ProtoPort{config.ProtoPort{Net:"", Protocol:"udp", Port:0x35}, config.ProtoPort{Net:"", Protocol:"udp", Port:0x43}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0xb3}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0x94b}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0x94c}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0x1561}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0x192b}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0x1a0a}, config.ProtoPort{Net:"", Protocol:"tcp", Port:0x1a0b}}, KubeNodePortRanges:[]numorstring.Port{numorstring.Port{MinPort:0x7530, MaxPort:0x7fff, PortName:""}}, NATPortRange:numorstring.Port{MinPort:0x0, MaxPort:0x0, PortName:""}, NATOutgoingAddress:net.IP(nil), UsageReportingEnabled:true, UsageReportingInitialDelaySecs:300000000000, UsageReportingIntervalSecs:86400000000000, ClusterGUID:"99e7c0c9d4774e1ab828f89985519c4d", ClusterType:"k8s,operator,kubeadm,kdd,typha", CalicoVersion:"v3.25.0", ExternalNodesCIDRList:[]string(nil), DebugMemoryProfilePath:"", DebugCPUProfilePath:"/tmp/felix-cpu-.pprof", DebugDisableLogDropping:false, DebugSimulateCalcGraphHangAfter:0, DebugSimulateDataplaneHangAfter:0, DebugPanicAfter:0, DebugSimulateDataRace:false, RouteSource:"CalicoIPAM", RouteTableRange:idalloc.IndexRange{Min:0, Max:0}, RouteTableRanges:[]idalloc.IndexRange(nil), RouteSyncDisabled:false, IptablesNATOutgoingInterfaceFilter:"", SidecarAccelerationEnabled:false, XDPEnabled:true, 
GenericXDPEnabled:false, Variant:"Calico", MTUIfacePattern:(*regexp.Regexp)(0xc0008db040), Encapsulation:config.Encapsulation{IPIPEnabled:false, VXLANEnabled:true, VXLANEnabledV6:false}, internalOverrides:map[string]string{}, sourceToRawConfig:map[config.Source]map[string]string{0x1:map[string]string{"CalicoVersion":"v3.25.0", "ClusterGUID":"99e7c0c9d4774e1ab828f89985519c4d", "ClusterType":"k8s,operator,kubeadm,kdd,typha", "FloatingIPs":"Disabled", "HealthPort":"9099", "LogSeverityScreen":"Info", "ReportingIntervalSecs":"0"}, 0x2:map[string]string{"IPv4VXLANTunnelAddr":"10.244.200.192"}, 0x3:map[string]string{"LogFilePath":"None", "LogSeverityFile":"None", "LogSeveritySys":"None", "MetadataAddr":"None"}, 0x4:map[string]string{"datastoretype":"kubernetes", "defaultendpointtohostaction":"ACCEPT", "felixhostname":"worker1.vagrant.vm", "healthenabled":"true", "healthport":"9099", "ipv6support":"false", "typhacafile":"/etc/pki/tls/certs/tigera-ca-bundle.crt", "typhacertfile":"/node-certs/tls.crt", "typhacn":"typha-server", "typhak8snamespace":"calico-system", "typhak8sservicename":"calico-typha", "typhakeyfile":"/node-certs/tls.key"}}, rawValues:map[string]string{"CalicoVersion":"v3.25.0", "ClusterGUID":"99e7c0c9d4774e1ab828f89985519c4d", "ClusterType":"k8s,operator,kubeadm,kdd,typha", "DatastoreType":"kubernetes", "DefaultEndpointToHostAction":"ACCEPT", "FelixHostname":"worker1.vagrant.vm", "FloatingIPs":"Disabled", "HealthEnabled":"true", "HealthPort":"9099", "IPv4VXLANTunnelAddr":"10.244.200.192", "Ipv6Support":"false", "LogFilePath":"None", "LogSeverityFile":"None", "LogSeverityScreen":"Info", "LogSeveritySys":"None", "MetadataAddr":"None", "ReportingIntervalSecs":"0", "TyphaCAFile":"/etc/pki/tls/certs/tigera-ca-bundle.crt", "TyphaCN":"typha-server", "TyphaCertFile":"/node-certs/tls.crt", "TyphaK8sNamespace":"calico-system", "TyphaK8sServiceName":"calico-typha", "TyphaKeyFile":"/node-certs/tls.key"}, Err:error(nil), loadClientConfigFromEnvironment:(func() 
(*apiconfig.CalicoAPIConfig, error))(0x14562e0), useNodeResourceUpdates:false} gitcommit="d86c70b2d883cdc9cc08a385bfeba2b0e7b18de8" version="d86c70b2d883"
2023-05-01 14:22:30.793 [INFO][20591] felix/bootstrap.go 209: Wireguard is not enabled - ensure no wireguard config iface="wireguard.cali" ipVersion=0x4 nodeName="worker1.vagrant.vm"
2023-05-01 14:22:30.797 [INFO][20591] felix/bootstrap.go 624: Wireguard public key not set in datastore ipVersion=0x4 nodeName="worker1.vagrant.vm"
2023-05-01 14:22:30.797 [INFO][20591] felix/bootstrap.go 209: Wireguard is not enabled - ensure no wireguard config iface="wg-v6.cali" ipVersion=0x6 nodeName="worker1.vagrant.vm"
2023-05-01 14:22:30.800 [INFO][20591] felix/bootstrap.go 624: Wireguard public key not set in datastore ipVersion=0x6 nodeName="worker1.vagrant.vm"
2023-05-01 14:22:30.800 [INFO][20591] felix/driver.go 72: Using internal (linux) dataplane driver.
...
2023-05-01 14:59:44.662 [WARNING][24389] felix/ipsets.go 340: Failed to update IP sets. Marking dataplane for resync. error=exit status 1 family="inet"
2023-05-01 14:59:44.732 [WARNING][24389] felix/ipsets.go 712: Failed to complete ipset restore, IP sets may be out-of-sync.
closeErr= commitErr= family="inet" flushErr= input="create cali40all-ipam-pools hash:net family inet maxelem 1048576\ncreate cali4t28 hash:net family inet maxelem 1048576\nadd cali4t28 10.244.0.0/16\nswap cali40all-ipam-pools cali4t28\ndestroy cali4t28\ncreate cali40masq-ipam-pools hash:net family inet maxelem 1048576\ncreate cali4t29 hash:net family inet maxelem 1048576\nadd cali4t29 10.244.0.0/16\nswap cali40masq-ipam-pools cali4t29\ndestroy cali4t29\ncreate cali4t30 hash:ip family inet maxelem 1048576\nadd cali4t30 10.0.2.15\nadd cali4t30 192.168.56.111\nadd cali4t30 10.244.200.192\nadd cali4t30 127.0.0.0\nadd cali4t30 127.0.0.1\nswap cali40this-host cali4t30\ndestroy cali4t30\ncreate cali40all-vxlan-net hash:net family inet maxelem 1048576\ncreate cali4t31 hash:net family inet maxelem 1048576\nadd cali4t31 192.168.56.101/32\nadd cali4t31 192.168.56.112/32\nswap cali40all-vxlan-net cali4t31\ndestroy cali4t31\nCOMMIT\n" processErr=exit status 1 stderr="ipset v7.1: Error in line 1: Kernel error received: set type not supported\n" stdout="" writeErr=
```
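
The decisive detail in the log above is the `stderr=` field of the last warning, where `ipset` reports `set type not supported`. As a minimal sketch for pulling the distinct ipset errors out of a long Felix log: the helper name `felix_ipset_errors` is made up for illustration, and the `sed` pattern assumes Felix keeps logging the field as `stderr="..."`.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Calico or vagrant-projects.
# Reads Felix log lines on stdin and prints each distinct stderr="..." value.
# Lines without a stderr="..." field are silently skipped.
felix_ipset_errors() {
  sed -n 's/.*stderr="\([^"]*\)".*/\1/p' | sort -u
}

# Usage (the pod name is a placeholder):
#   kubectl -n calico-system logs <calico-node-pod> | felix_ipset_errors
```

If the only message printed is the `set type not supported` one, the failure is probably in the kernel's ipset support rather than in the Calico configuration.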

I also attempted to install the Calico networking module, but with similar results:

installation:
  cni:
    type: Calico
  # Configures Calico networking.
  calicoNetwork:
    bgp: Disabled
    # Note: The ipPools section cannot be modified post-install.
    ipPools:
    - cidr: 10.244.0.0/16
      encapsulation: VXLAN
    # IPV4 for now
    nodeAddressAutodetectionV4:
      interface: eth1
      # natOutgoing: Enabled
      # nodeSelector: all()
  registry: 10.0.2.2:5000
  imagePath: olcne

thtanaka commented 1 year ago

Did you have a firewall running?

hussam-qasem commented 1 year ago

> Did you have a firewall running?

I disabled firewalld.service ...

thtanaka commented 1 year ago

I guess the logs you provided above are for calico-node-g5tt6. Why is it only happening on one node? Could you share the output of:

  1. kubectl get po -A -o wide
  2. kubectl get nodes -o wide
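
When the full listing is long, the pods that are not ready can be filtered out of the first command's output. The sketch below is only an illustration: `not_ready_pods` is a made-up helper name, and it assumes the default `kubectl get po -A` column order (NAMESPACE, NAME, READY, STATUS).

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get po -A` output on stdin and prints
# namespace, name, READY and STATUS for pods whose READY count is not n/n.
not_ready_pods() {
  awk 'NR > 1 {
    split($3, ready, "/")          # READY looks like "0/1"
    if (ready[1] != ready[2]) print $1, $2, $3, $4
  }'
}

# Usage:
#   kubectl get po -A -o wide | not_ready_pods
```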
hussam-qasem commented 1 year ago

Thanks @thtanaka for your reply. Please find the requested output below:

Screen Shot 2023-05-03 at 6 56 17 AM Screen Shot 2023-05-03 at 6 53 02 AM

I'm curious, were you unable to replicate the problem? If time permits, please try:

git clone https://github.com/oracle/vagrant-projects
cd vagrant-projects/OCNE
VERBOSE=true vagrant up

(for the screenshots, I also set NB_WORKERS=1)

(some time later, both pods are not ready)

Screen Shot 2023-05-03 at 12 02 15 PM

jromers commented 1 year ago

Does the issue also happen with the UEK6 kernel (you are using UEK7 at the moment)?

hussam-qasem commented 1 year ago

Hi @jromers

> Does the issue also happen with the UEK6 kernel (you are using UEK7 at the moment)?

Using config.vm.box_version 8.6.359 (UEKR6), it seems to work!!

Screen Shot 2023-05-03 at 3 18 01 PM

How do I make it work on UEKR7?

AmedeeBulle commented 1 year ago

@hussam-qasem

I think the culprit is here:

https://github.com/oracle/vagrant-projects/blob/f46c527a7fc711a63975ec1eb15faf8b1658db7e/OCNE/scripts/provision.sh#L365-L368

Even when using Calico, you still need to masquerade here:

https://github.com/oracle/vagrant-projects/blob/f46c527a7fc711a63975ec1eb15faf8b1658db7e/OCNE/scripts/provision.sh#L396

I did a quick test and:

(I don't use K8s these days, so I haven't done thorough testing)

hussam-qasem commented 1 year ago

Thank you @AmedeeBulle. That was exactly my problem. Thank you @jromers for the tip. I have submitted a new PR ~#482~ #483 to re-install kernel-uek-modules for UEKR7.
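
For anyone hitting the same `set type not supported` error, a rough way to confirm that the ipset kernel modules are missing before re-installing `kernel-uek-modules` is sketched below. The function names are invented for this example, and the mapping from set type to module name (`hash:net` to `ip_set_hash_net`) follows the upstream Linux naming, which is assumed to hold for UEK as well.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of provision.sh.

# Map an ipset set type (e.g. "hash:net", "hash:ip,port") to the name of the
# kernel module that implements it (e.g. ip_set_hash_net, ip_set_hash_ipport).
ipset_module_name() {
  local name="${1//:/_}"   # hash:net  -> hash_net
  name="${name//,/}"       # hash_ip,port -> hash_ipport
  printf 'ip_set_%s\n' "$name"
}

# Report whether modinfo can find the module for each given set type.
# Returns non-zero if at least one module is missing for the running kernel.
check_ipset_modules() {
  local set_type mod missing=0
  for set_type in "$@"; do
    mod="$(ipset_module_name "$set_type")"
    if modinfo "$mod" >/dev/null 2>&1; then
      echo "ok: $mod"
    else
      echo "missing: $mod"
      missing=1
    fi
  done
  return "$missing"
}

# Usage: the two set types that appear in the Felix log above.
#   check_ipset_modules hash:ip hash:net || echo "consider re-installing kernel-uek-modules"
```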