Open callumdmay opened 2 months ago
Hi @callumdmay,
Thanks for your report. Could you clarify a few things to help us understand the issue? Notably:
Thank you
What DSS version are you using? v0.16.0-rc1-ce9e97d
What's the overall setup of your DSS pool?
Setup of the pool is a single cluster on GKE, using the default Tanka config (i.e. 3 pods of each of the core-service and cockroachdb).
Any chance the operational intent has expired between the time it was created and the time it was tentatively deleted?
No these actions were only 2 seconds apart
Are you able to reliably reproduce the issue, does it appear randomly, or did it happen only once?
I was able to reproduce it with a parallel load qps above 3 QPS
Do you have any logs at hand relevant to the issue?
Are you able to view https://cloudlogging.app.goo.gl/9jStWPozE5c7vHkC8?
Do you have a dump of the culprit request and response? Among other, the message in the 404 response could be interesting.
Request:
{
"insertId": "funxwmf5jv4vj",
"jsonPayload": {
"statusDetails": "response_sent_by_backend",
"@type": "type.googleapis.com/google.cloud.loadbalancing.type.LoadBalancerLogEntry",
"cacheDecision": [
"RESPONSE_HAS_CONTENT_TYPE",
"REQUEST_HAS_AUTHORIZATION",
"CACHE_MODE_USE_ORIGIN_HEADERS"
],
"remoteIp": "74.125.210.2",
"backendTargetProjectNumber": "projects/229325454109"
},
"httpRequest": {
"requestMethod": "PUT",
"requestUrl": "https://dss.wing-internal.interuss.opensky.dev/dss/v1/operational_intent_references/0a5d60ef-cbd3-46a5-9241-200ded546fba",
"requestSize": "99894",
"status": 201,
"responseSize": "1948",
"userAgent": "Wing-DSS",
"remoteIp": "74.125.210.2",
"serverIp": "10.64.1.4",
"latency": "1.176242s"
},
"resource": {
"type": "http_load_balancer",
"labels": {
"url_map_name": "k8s2-um-ecv8cqrz-dss-main-https-ingress-wtlkqi84",
"project_id": "wing-utm-external-staging",
"forwarding_rule_name": "k8s2-fs-ecv8cqrz-dss-main-https-ingress-wtlkqi84",
"target_proxy_name": "k8s2-ts-ecv8cqrz-dss-main-https-ingress-wtlkqi84",
"backend_service_name": "k8s1-fd370d60-dss-main-core-service-8080-ff9079c6",
"zone": "global"
}
},
"timestamp": "2024-08-08T19:44:27.286542Z",
"severity": "INFO",
"logName": "projects/wing-utm-external-staging/logs/requests",
"trace": "projects/wing-utm-external-staging/traces/0a277f79d85bc9aabeb827f26e9931eb",
"receiveTimestamp": "2024-08-08T19:44:30.770288499Z",
"spanId": "3c8b2db3a4a424e8"
}
Response:
{
"insertId": "8musn9f5y0xt9",
"jsonPayload": {
"remoteIp": "74.125.210.1",
"@type": "type.googleapis.com/google.cloud.loadbalancing.type.LoadBalancerLogEntry",
"backendTargetProjectNumber": "projects/229325454109",
"statusDetails": "response_sent_by_backend",
"cacheDecision": [
"RESPONSE_HAS_CONTENT_TYPE",
"REQUEST_HAS_AUTHORIZATION",
"CACHE_MODE_USE_ORIGIN_HEADERS"
]
},
"httpRequest": {
"requestMethod": "DELETE",
"requestUrl": "https://dss.wing-internal.interuss.opensky.dev/dss/v1/operational_intent_references/0a5d60ef-cbd3-46a5-9241-200ded546fba/huHqtAVBkqIgErF2peFF-uhGsCvLatVyybgfKs8dgz0_",
"requestSize": "1460",
"status": 404,
"responseSize": "325",
"userAgent": "Wing-DSS",
"remoteIp": "74.125.210.1",
"serverIp": "10.64.2.3",
"latency": "19.837230s"
},
"resource": {
"type": "http_load_balancer",
"labels": {
"zone": "global",
"forwarding_rule_name": "k8s2-fs-ecv8cqrz-dss-main-https-ingress-wtlkqi84",
"url_map_name": "k8s2-um-ecv8cqrz-dss-main-https-ingress-wtlkqi84",
"backend_service_name": "k8s1-fd370d60-dss-main-core-service-8080-ff9079c6",
"target_proxy_name": "k8s2-ts-ecv8cqrz-dss-main-https-ingress-wtlkqi84",
"project_id": "wing-utm-external-staging"
}
},
"timestamp": "2024-08-08T19:44:29.333991Z",
"severity": "WARNING",
"logName": "projects/wing-utm-external-staging/logs/requests",
"trace": "projects/wing-utm-external-staging/traces/231d30f78a0e53caba232f61d166213b",
"receiveTimestamp": "2024-08-08T19:44:50.565327121Z",
"spanId": "3826e6628999bceb"
}
Thank you for the details, @callumdmay.
Regarding the request and response you provided, they don't seem to contain the payload that was sent and received from the DSS, but instead contain what seems to be metadata pertaining to the system that is issuing the requests, or through which the requests flow?
The only directly relevant details they contain are the requestMethod
, requestUrl
and status
in the httpRequest
object.
Would it be possible for you to obtain the json payload that is sent in the first PUT, as well as the json response that is received by the DELETE
that ends in a 404?
I was going to say the issue could be that the op intent might not have been at the specified OVN, but actually I think we have a bug where the specified OVN is not checked when deleting an operational intent reference; @Shastick could you track this as a separate bug fix? We should return 404 (or possibly 409) with a helpful message if the op intent exists but the OVN doesn't match.
So, it seems like the issue is definitely that op intent 0a5d60ef-cbd3-46a5-9241-200ded546fba didn't exist at the time of the second request. @callumdmay are you sure there was no previous attempt to delete this operational intent? It would be extremely useful to search the DSS logs for instances of 0a5d60ef-cbd3-46a5-9241-200ded546fba between roughly 2024-08-08T19:44:27.286542Z and 2024-08-08T19:44:29.333991Z -- in addition to the two requests you've indicated above, I expect there is likely a third request to delete the operational intent reference between those two requests.
I only see one delete:
The log search is by OID, so all requests should appear
Are those the DSS logs or the client USS logs? I'm expecting log messages formatted like this.
Describe the bug During load testing of our system simulating simultaneous intent creation originating at the same point at ~ 3QPS, we observed that subsequent deletion of that intent returned 404 errors from the DSS. On observation of outbound requests from our system, we only made the Delete request once. After confirming with @BradNicolle it appears that the delete was successful inside the DSS, even though a 404 was returned.
To Reproduce There was nothing particularly special about our load testing configuration, it was 3 worker threads make Create() and then Delete() requests at random points in 1 second intervals, simulating ~3QPS. Each intent had an initial volume overlapping at the same geographical point, but at non-overlapping times.
Expected behavior The DSS should return a 2XX response if a deletion was successful.
DSS URL: https://dss.us.pre-qual.interuss.opensky.dev Operator: Wing Intent ID: "0a5d60ef-cbd3-46a5-9241-200ded546fba" OVN: "huHqtAVBkqIgErF2peFF-uhGsCvLatVyybgfKs8dgz0_" Outbound request timestamp: 2024-08-08 12:44:29.974325-0700