Requests return Bad Request 400 when going through a L7 load balancer

an0nz commented 2 years ago

Describe the bug

Requests return Bad Request 400 when going through a Cloudflare load balancer, someone else has it with AVI also but I do not know their config, it is in the bug report on cloud-provider-for-cloud-director

I am getting this issue in 2 downstream projects used for the vcd container service extension (cloud-provider-for-cloud-director and cloud-director-named-disk-csi-driver)

It looks like requests being made by this piece of code get a 400 bad request when multiple vcd cells are behind a Cloudflare L7 Load Balancer or an AVI Load Balancer (Unknown if its L4 or L7)

I have not submitted a report to cloud-director-named-disk-csi-driver as I just discovered that and the common component is go-vcloud-director. If I add hostaliases to go direct to a particular vcd cell bypassing the load balancer everything works correctly. This also works using the same headers via curl within the containers that have this issue.

I0117 07:03:07.817805       1 serving.go:319] Generated self-signed cert in-memory
W0117 07:03:08.330725       1 client_config.go:541] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0117 07:03:08.333182       1 controllermanager.go:117] Version: v0.0.0-master+$Format:%h$
I0117 07:03:08.333479       1 auth.go:49] Using VCD OpenAPI version [36.0]
** REQUEST **
time:    2022-01-17T07:03:09.037Z
method:  GET
host:    vcd.domain.com
length:  0
URL:     https://vcd.domain.com/api/versions
header:  [
    Authorization => [********]
    Accept => [application/*+xml;version=36.0]
    X-Vmware-Vcloud-Token-Type => [Bearer]
    User-Agent => [go-vcloud-director]
]

payload: 

## RESPONSE ##
time:   2022-01-17T07:03:09.048Z
status: 400 - 400 Bad Request 
length: 155
header: map[Cf-Ray:[-] Content-Length:[155] Content-Type:[text/html] Date:[Mon, 17 Jan 2022 07:03:09 GMT] Server:[cloudflare]]
body:   <html>
<head><title>400 Bad Request</title></head>
<body>
<center><h1>400 Bad Request</h1></center>
<hr><center>cloudflare</center>
</body>
</html>

I0117 07:03:09.048936       1 cloud.go:92] Error initializing client from secrets: [unable to get swagger client from secrets: [unable to get bearer token from serets: [failed to set authorization header: [error finding LoginUrl: could not find valid version for login: could not retrieve supported versions: error fetching versions: [ParseErr]: error parsing error body for non-200 request: XML syntax error on line 6: element \<hr> closed by </body> (&{Status:400 Bad Request StatusCode:400 Proto:HTTP/1.1 ProtoMajor:1 ProtoMinor:1 Header:map[Cf-Ray:[-] Content-Length:[155] Content-Type:[text/html] Date:[Mon, 17 Jan 2022 07:03:09 GMT] Server:[cloudflare]] Body:0xc000291860 ContentLength:155 TransferEncoding:[] Close:true Uncompressed:false Trailer:map[] Request:0xc000032500 TLS:0xc0006208f0})]]]]```

### Reproduction steps

```bash
1. Install vcd 10.3.1+ with multiple cells
2. Expose an FQDN for vcd via a load balancer (Might have to be layer 7)
3. Attempt making requests to /api/versions using govcd

Expected behavior

Requests succeed with 200 OK

Additional context

No response

mrmassis commented 2 years ago

Hi everyone, I have the same problem. Requests return Bad Request 400 when going through a NSX/AVI load balancer. In my pod logs the return is very similar:

7400 TLS:0xc00324b8c0})]]]]
I0119 21:20:37.201341       1 auth.go:49] Using VCD OpenAPI version [36.0]
** REQUEST **
time:    2022-01-19T21:20:37.808Z
method:  GET
host:    dcvirtual-bsa.hom.serpro
length:  0
URL:     https://dcvirtual-bsa.hom.serpro/api/versions
header:  [
    User-Agent => [go-vcloud-director]
    Authorization => [********]
    Accept => [application/*+xml;version=36.0]
    X-Vmware-Vcloud-Token-Type => [Bearer]
]

payload: 

## RESPONSE ##
time:   2022-01-19T21:20:37.938Z
status: 400 - 400 Bad Request 
length: 167
header: map[Content-Length:[167] Content-Type:[text/html] Date:[Wed, 19 Jan 2022 21:20:39 GMT] Server:[NSX LB]]
body:   <html>
<head><title>400 Bad Request</title></head>
<body bgcolor="white">
<center><h1>400 Bad Request</h1></center>
<hr><center>NSX LB</center>
</body>
</html>

I0119 21:20:37.938734       1 cloud.go:92] Error initializing client from secrets: [unable to get swagger client from secrets: [unable to get bearer token from serets: [failed to set authorization header: [error finding LoginUrl: could not find valid version for login: could not retrieve supported versions: error fetching versions: [ParseErr]: error parsing error body for non-200 request: XML syntax error on line 6: element <hr> closed by </body> (&{Status:400 Bad Request StatusCode:400 Proto:HTTP/1.1 ProtoMajor:1 ProtoMinor:1 Header:map[Content-Length:[167] Content-Type:[text/html] Date:[Wed, 19 Jan 2022 21:20:39 GMT] Server:[NSX LB]] Body:0xc001fe2c00 ContentLength:167 TransferEncoding:[] Close:true Uncompressed:false Trailer:map[] Request:0xc002ef2a00 TLS:0xc00322dad0})]]]]

In the NSX/AVI logs I get the error:

022/01/19 15:00:16 [info] 1247#0: *32731028 client sent duplicate header line: "Authorization: bearer Bearer (...)"

What we infer here in the lab is that the request is arriving with a duplicate header. That is, two authentication headers. This must be checked and dealt with, we were able to simulate the problem by sending a curl with the same duplicated field. What it seems to me is that when we send it directly to the VCD this duplication is not verified, because it is not necessary to have a token to search for the versions. Now when you have an NSX/AVI in between, a more complete parsing is performed on the message and this problem is identified. Among the possible solutions is the correction of the sent fields.

When we change the load balancer from L7 to L4 the process is fine. The problem is L7, that check headers.

an0nz commented 2 years ago

Thanks @mrmassis that makes sense, the Cloudflare logs I have do not show the request at all but I can reproduce the issue with curl and 2 auth headers.

This could be an issue with the cloud-provider-for-cloud-director and cloud-director-named-disk-csi-driver projects defining the authorization header when not needed, but validation in this project to ensure only 1 is ever sent would solve it and make sense from a validation standpoint.

The issue isn't just with the /api/versions request, it basically occurs with all requests in those other 2 projects as I managed to get it past the version check only to hit it again on the next request. Only solution for me was add hostAliases to the deployment/replicaset/statefulset for those projects so the load balancer domain pointed directly one of the cell's IP's

mrmassis commented 2 years ago

As I use CSE with TKG ova and VCD, the cluster kubernetes are provisioned automatically and it would be unfeasible to have to insert hostAliase for each new cluster.

mrmassis commented 2 years ago

It is a problem to change the load balancer to L4 too. L4 dont give me all protect and diagnostics that i need.

an0nz commented 2 years ago

We have the same issue, currently it is a manual fix we are applying when needed.

Didainius commented 2 years ago

Thank you for these details. Just to check - ~~is the same still application with 3.5.1 (ignore this, I thought this is a Terraform issue)? I doubt we have changed anything in that regard, but verifying.~~ Ignore the above. I am looking at SDK and looking for when it can send a duplicate header.

arunmk commented 2 years ago

@Anirudh9794 could you take a look at this

Didainius commented 2 years ago

We were looking at this and it looks that the issue was more related to the usage of this SDK (setting additional headers) rather than the problem within SDK in this case. If we still find something wrong with SDK - we will reopen.

References:

https://github.com/vmware/cloud-provider-for-cloud-director/issues/37 issue in cloud-provider-for-cloud-director
https://github.com/vmware/cloud-provider-for-cloud-director/pull/47 PR in cloud-provider-for-cloud-director

an0nz commented 2 years ago

@Didainius as this is the SDK would it not be good to ensure that a maximum of 1 authorization header is sent by the SDK so that it conforms with RFC7230?

The same issue exists in cloud-director-named-disk-csi-driver and I dont see anything being worked on there so have raised a bug.

Didainius commented 2 years ago

We tend to leave the SDK more of a "passthrough" than enforcing a lot in here. The RFC is a good one, but it immediately negates this being as a rule with the Set-Cookieexample. Making enforcing rules for separate headers doesn't sound like it. The best we could do in such scenarios is log additional line to logs, but I doubt that gives a lot.

As for the cpi/cni/csi guys are making the PR for one places (https://github.com/vmware/cloud-provider-for-cloud-director/pull/47) and I believe they will fix it in other places as well.

mrmassis commented 2 years ago

Hi, the image look well to vmware-cloud-director-ccm. However, i belive that de code is used in more than one PODs (in all PODs that interact with VCD via API). E found similar error in:

kube-system   csi-vcd-controllerplugin-0                   2/3     CrashLoopBackOff   4          2m29s
kube-system   csi-vcd-nodeplugin-ldpfn                    1/2     CrashLoopBackOff   4          2m29s

Looking in the logs:

$ kubectl logs -f csi-vcd-nodeplugin-ldpfn  -n kube-system  -c vcd-csi-plugin -n kube-system
(...)
1 auth.go:49] Using VCD OpenAPI version [36.0]
panic: unable to initiate vcd client: [unable to get swagger client from secrets: [unable to get bearer token from serets: [failed to set authorization header: [error finding LoginUrl: could not find valid version for login: could not retrieve supported versions: error fetching versions: [ParseErr]: error parsing error body for non-200 request: XML syntax error on line 6: element <hr> closed by </body> (&{Status:400 Bad Request StatusCode:400 Proto:HTTP/1.1 ProtoMajor:1 ProtoMinor:1 Header:map[Content-Length:[167] Content-Type:[text/html] Date:[Thu, 17 Feb 2022 12:05:58 GMT] Server:[NSX LB]] Body:0xc00006f160 ContentLength:167 TransferEncoding:[] Close:true Uncompressed:false Trailer:map[] Request:0xc0000e0800 TLS:0xc000240840})]]]]

Bingo. Maybe is necessary create a new image to these two Pods too.

PS: when the code fix will be avaliable to production?

vmware / go-vcloud-director