bysnupy commented 4 years ago

Description: As changing dnsPolicy to ClusterFirstWithHostNet, EFS Operator can refer the custom EFS DNS records without much modifications through DNS Operator. To implements hostAliases on the DaemonSet, it would modify much changes at the reconciling loop logic, but it's simple and we can get the same result through a little bit changes.
More Reference: https://github.com/kubernetes-sigs/aws-efs-csi-driver/issues/240
More informations:
- Pod's DNS Policy
- In my short test, I could not find out odd behaviours.

Before changing dnsPolicy(ClusterFirst), DaemonSet refer host /etc/resolv.conf. It will be causes not to use custom DNS records using DNS Operator. And EFS DNS name cannot resolve under VPC peering.

$ oc rsh efs-csi-node-66hnr cat /etc/resolv.conf
Defaulting container name to efs-plugin.
Use 'oc describe pod/efs-csi-node-66hnr -n openshift-operators' to see all of the containers in this pod.
search ap-northeast-1.compute.internal
nameserver 10.0.0.2

After changing dnsPolicy(ClusterFirstWithHostNet), the EFS DNS name can resolve and pod can mount SharedVolume under VPC peering using custom records through DNS Operator


// DNS operator
$ oc edit dns.operator/default
:
spec: 
servers:
- name: efs-nameserver
zones:
- efs.ap-northeast-1.amazonaws.com
forwardPlugin:
  upstreams:
    - 10.0.124.102

// Operator's SCC is chosen correctly even though dnsPolicy was changed. $ oc get pod efs-csi-node-pm2m7 -o yaml | grep scc openshift.io/scc: efs-csi-scc

// the resolv.conf are changed pod namespace ones. $ oc rsh efs-csi-node-pm2m7 cat /etc/resolv.conf Defaulting container name to efs-plugin. Use 'oc describe pod/efs-csi-node-pm2m7 -n openshift-operators' to see all of the containers in this pod. search openshift-operators.svc.cluster.local svc.cluster.local cluster.local ap-northeast-1.compute.internal nameserver 172.30.0.10 options ndots:5

// Yeah, we can get the custom record through DNS Operator $ oc rsh efs-csi-node-fn756 curl -v fs-ec8395cd.efs.ap-northeast-1.amazonaws.com Defaulting container name to efs-plugin. Use 'oc describe pod/efs-csi-node-fn756 -n openshift-operators' to see all of the containers in this pod.

Rebuilt URL to: fs-ec8395cd.efs.ap-northeast-1.amazonaws.com/
Trying 10.24.10.173... <--- EFS DNS name can be resolved through DNS Operator.
TCP_NODELAY set

codecov-commenter commented 4 years ago

Codecov Report

Merging #27 into master will not change coverage. The diff coverage is n/a.

@@           Coverage Diff           @@
##           master      #27   +/-   ##
=======================================
  Coverage   75.98%   75.98%           
=======================================
  Files          20       20           
  Lines         683      683           
=======================================
  Hits          519      519           
  Misses        157      157           
  Partials        7        7

bysnupy commented 4 years ago

/assign @2uasimojo

2uasimojo commented 4 years ago

Thanks for this @bysnupy! I'll take a closer look next week with @wgordon17.

2uasimojo commented 4 years ago

@bysnupy We're testing this today and tomorrow.

The daemonset.yaml file isn't actually consumed directly by the operator -- it needs to get generated into bindata. Would you please run make gogenerate and commit the updates?

Meanwhile I've done that locally and built the image for @wgordon17 to play with.

wgordon17 commented 4 years ago

@bysnupy Unfortunately, I'm unable to replicate your success. Were you testing with EFS DNS resolution across VPCs?? Because it's expected that the EFS URL will resolve in the same VPC.

Testing this, it looks like CoreDNS attempts to query upstream server, it doesn't seem to treat it as a lookup replacement.

spec:
  servers:
  - forwardPlugin:
      upstreams:
      - 10.10.29.63
    name: efs-lookup
    zones:
    - efs.us-east-1.amazonaws.com

From CoreDNS logs:

[INFO] 10.128.2.24:49594 - 33253 "A IN fs-34fe53b6.efs.us-east-1.amazonaws.com. udp 68 false 4096" NOERROR - 0 6.000966423s
[ERROR] plugin/errors: 2 fs-34fe53b6.efs.us-east-1.amazonaws.com. A: read udp 10.130.0.80:53087->10.10.29.63:53: i/o timeout
[INFO] 10.128.2.24:49594 - 33253 "A IN fs-34fe53b6.efs.us-east-1.amazonaws.com. udp 68 false 4096" NOERROR - 0 4.000555139s
[ERROR] plugin/errors: 2 fs-34fe53b6.efs.us-east-1.amazonaws.com. A: read udp 10.130.0.80:54223->10.10.29.63:53: i/o timeout

wgordon17 commented 4 years ago

I'm able to hack around this by deploying an in-cluster, custom CoreDNS server that handles the static IP lookup: https://github.com/openshift-cs/OpenShift-Troubleshooting-Templates/tree/master/efs-dns-config

So this is still a viable, albeit slightly hackish solution. But, at least once deployed, will allow non-cluster-admin and non-CCS customers to access EFS volumes over a VPC peering connection

bysnupy commented 4 years ago

@2uasimojo Thank you for your pointing, I've committed it with the generated files again. Some generated files were not changed except putting a period, but I also added all the generated files as those are, because it's real result.

@wgordon17 Thank you for your confirmation and test around this.

Were you testing with EFS DNS resolution across VPCs?? Because it's expected that the EFS URL will resolve in the same VPC.

I've tested through VPC peering with different VPC and subnet IP range between OCP4 cluster and EFS using same AWS account. And I've added custom upstream DNS server which have the custom record of the EFS DNS name for mounting point, after checking not resolved the EFS DNS name and failed to mount EFS efs-csi-node due to DNS name resolution at the pod logs.

Testing this, it looks like CoreDNS attempts to query upstream server, it doesn't seem to treat it as a lookup replacement.

In my test using "dig" command, it could be immediately resolved custom EFS DNS name through configured upstream DNS server, and then the efs-csi-node worked well at that time. upstream DNS server configuration may take a little bit time for takging effects.

If you have any question around my test let me know that. :)

2uasimojo commented 4 years ago

Thanks for regenerating @bysnupy. I agree the changes in punctuation in the comments are annoying and irrelevant, but appropriate to include.

I'm going to /lgtm /approve this because, at worst, it's not hurting anything. But I'd like to /hold it until we get a bit more feedback from @wgordon17 in case there's something else we can add to make it work for his use case.

openshift-ci-robot commented 4 years ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 2uasimojo, bysnupy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/openshift/aws-efs-operator/blob/master/OWNERS)~~ [2uasimojo] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment

wgordon17 commented 4 years ago

Thanks @2uasimojo 👍🏻 According to Kubernetes documentation

"ClusterFirstWithHostNet": For Pods running with hostNetwork, you should explicitly set its DNS policy "ClusterFirstWithHostNet".

From my perspective, the ClusterFirstWithHostNet should actually be the correct value for dnsPolicy.

2uasimojo commented 4 years ago

Agree. Thanks for your work on this @bysnupy and @wgordon17.

/hold cancel

openshift / aws-efs-operator

Change dnsPolicy to ClusterFirstWithHostNet for EFS DNS resolution under VPC peering #27

Codecov Report