projectcalico / canal

Policy based networking for cloud native applications
Canal on GCE using CoreOS does not work #107

Closed: mattymo closed this issue 6 years ago

mattymo commented 7 years ago

Expected Behavior

Pods should be able to ping each other.

Current Behavior

All services report healthy, but the flannel container in the canal pod repeatedly logs errors of the type quoted under Context below.

Possible Solution

Steps to Reproduce (for bugs)

  1. Create CoreOS instances in GCE
  2. Set up an ansible inventory such as this:

    k8s-mattymo-test2-1 ansible_ssh_host=104.199.90.98
    k8s-mattymo-test2-2 ansible_ssh_host=35.195.129.28

    [kube-master]
    k8s-mattymo-test2-1

    [kube-node]
    k8s-mattymo-test2-2

    [etcd]
    k8s-mattymo-test2-1

    [k8s-cluster:children]
    kube-node
    kube-master

  3. Run ansible with -e kube_network_plugin=canal
  4. Try to ping pod IPs from any host or from any pod to another.

Context

Pod logs:
I don't have full flannel logs at the moment, but this type of message repeats constantly:
`5 vxlan_network.go:241] L3 miss but route for 10.233.95.3 not found`
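
For anyone triaging the same symptom, here is a minimal Python sketch (not part of flannel or canal) that tallies which destination IPs appear in the repeating message above. It assumes the flannel container logs have been piped in as plain text and that the message wording matches the line quoted here; the log prefix may differ between flannel versions, and the function name `count_l3_misses` is purely illustrative.

```python
import re
import sys
from collections import Counter

# Matches the message quoted above and captures the destination IPv4 address.
# Assumption: the wording is identical to the line in this report.
L3_MISS = re.compile(r"L3 miss but route for (\d{1,3}(?:\.\d{1,3}){3}) not found")


def count_l3_misses(lines):
    """Return a Counter mapping destination IP -> number of 'L3 miss' messages."""
    counts = Counter()
    for line in lines:
        match = L3_MISS.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Read log lines from stdin and print the most frequently missed IPs first.
    for ip, n in count_l3_misses(sys.stdin).most_common():
        print(f"{n:6d}  {ip}")
```

It could be fed from the pod logs, for example `kubectl logs <canal-pod> -c <flannel-container> | python l3_miss.py`, where the pod name, container name, and script name are placeholders. If the listed IPs all belong to pods on other nodes, that suggests the cross-node vxlan routes are not being installed.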

calico-node: http://paste.openstack.org/show/2R0SriTdthfMATf8t46V/
policy controller: http://paste.openstack.org/show/ndCEAIAmYFmstYx2Hkxu/
endpoints: http://paste.openstack.org/show/1im9g356CrPgFBT14f9o/
profile: http://paste.openstack.org/show/OTVomtNV8CWn1Ikcp45m/

Your Environment

* Calico version: v2.5.0
* Flannel version: v0.8.0
* Orchestrator version: Kubespray from master
* Operating System and version: CoreOS stable (latest from GCE)
* Link to your project (optional): github.com/kubernetes-incubator/kubespray

More details:
CoreOS + Canal works fine on Vagrant
CoreOS + Flannel works fine on all platforms (including GCE)
CoreOS + Calico works fine on all platforms (including GCE)
Ubuntu and CentOS + Flannel work fine on GCE

I tried changing the backend from vxlan to gce, but saw no change in behavior.

The actual canal manifest being used: https://github.com/kubernetes-incubator/kubespray/blob/master/roles/network_plugin/canal/templates/canal-node.yaml.j2

waldolf commented 6 years ago

I have the same issue.

k8s v1.8.1 + canal (1.7/canal.yaml)

ozdanborne commented 6 years ago

This sounds like the following issue: https://github.com/coreos/flannel/issues/427
It looks like some vxlan improvements have made it into Flannel v0.9.0.

I'll look at getting a canal release out ASAP that includes Flannel v0.9.0. Hopefully that resolves this problem.

ozdanborne commented 6 years ago

The new canal manifests (which have been moved to https://docs.projectcalico.org/v2.6/getting-started/kubernetes/installation/hosted/canal/) now include Flannel v0.9.1, so I'm going to close this issue.

I see that kubespray has bumped to v0.9 as well, so if anyone is still hitting this there, make sure you've updated.

If you are still hitting this issue on Flannel v0.9+, please open a new issue instead of commenting here.

Thanks!

mattymo commented 6 years ago

Confirmed: Flannel v0.9 fixes our issues.