interuss / dss

InterUSS Platform's implementation of the ASTM DSS concept for RID and flight coordination.
Apache License 2.0
122 stars 86 forks source link

[scd] Simultaneous creation of temporally proximate intents seems to cause DSS to return 404 on DELETE even when intent is present and successfully deleted #1069

Open callumdmay opened 1 month ago

callumdmay commented 1 month ago

Describe the bug During load testing of our system simulating simultaneous intent creation originating at the same point at ~ 3QPS, we observed that subsequent deletion of that intent returned 404 errors from the DSS. On observation of outbound requests from our system, we only made the Delete request once. After confirming with @BradNicolle it appears that the delete was successful inside the DSS, even though a 404 was returned.

To Reproduce There was nothing particularly special about our load testing configuration, it was 3 worker threads make Create() and then Delete() requests at random points in 1 second intervals, simulating ~3QPS. Each intent had an initial volume overlapping at the same geographical point, but at non-overlapping times.

Expected behavior The DSS should return a 2XX response if a deletion was successful.

DSS URL: https://dss.us.pre-qual.interuss.opensky.dev Operator: Wing Intent ID: "0a5d60ef-cbd3-46a5-9241-200ded546fba" OVN: "huHqtAVBkqIgErF2peFF-uhGsCvLatVyybgfKs8dgz0_" Outbound request timestamp: 2024-08-08 12:44:29.974325-0700

mickmis commented 3 weeks ago

Hi @callumdmay,

Thanks for your report. Could you clarify a few things to help us understand the issue? Notably:

Thank you

callumdmay commented 2 weeks ago

What DSS version are you using? v0.16.0-rc1-ce9e97d

What's the overall setup of your DSS pool?

Setup of the pool is a single cluster on GKE, using the default Tanka config (i.e. 3 pods of each of the core-service and cockroachdb).

Any chance the operational intent has expired between the time it was created and the time it was tentatively deleted?

No these actions were only 2 seconds apart

Are you able to reliably reproduce the issue, does it appear randomly, or did it happen only once?

I was able to reproduce it with a parallel load qps above 3 QPS

Do you have any logs at hand relevant to the issue?

Are you able to view https://cloudlogging.app.goo.gl/9jStWPozE5c7vHkC8?

image

Do you have a dump of the culprit request and response? Among other, the message in the 404 response could be interesting.

Request:

{
  "insertId": "funxwmf5jv4vj",
  "jsonPayload": {
    "statusDetails": "response_sent_by_backend",
    "@type": "type.googleapis.com/google.cloud.loadbalancing.type.LoadBalancerLogEntry",
    "cacheDecision": [
      "RESPONSE_HAS_CONTENT_TYPE",
      "REQUEST_HAS_AUTHORIZATION",
      "CACHE_MODE_USE_ORIGIN_HEADERS"
    ],
    "remoteIp": "74.125.210.2",
    "backendTargetProjectNumber": "projects/229325454109"
  },
  "httpRequest": {
    "requestMethod": "PUT",
    "requestUrl": "https://dss.wing-internal.interuss.opensky.dev/dss/v1/operational_intent_references/0a5d60ef-cbd3-46a5-9241-200ded546fba",
    "requestSize": "99894",
    "status": 201,
    "responseSize": "1948",
    "userAgent": "Wing-DSS",
    "remoteIp": "74.125.210.2",
    "serverIp": "10.64.1.4",
    "latency": "1.176242s"
  },
  "resource": {
    "type": "http_load_balancer",
    "labels": {
      "url_map_name": "k8s2-um-ecv8cqrz-dss-main-https-ingress-wtlkqi84",
      "project_id": "wing-utm-external-staging",
      "forwarding_rule_name": "k8s2-fs-ecv8cqrz-dss-main-https-ingress-wtlkqi84",
      "target_proxy_name": "k8s2-ts-ecv8cqrz-dss-main-https-ingress-wtlkqi84",
      "backend_service_name": "k8s1-fd370d60-dss-main-core-service-8080-ff9079c6",
      "zone": "global"
    }
  },
  "timestamp": "2024-08-08T19:44:27.286542Z",
  "severity": "INFO",
  "logName": "projects/wing-utm-external-staging/logs/requests",
  "trace": "projects/wing-utm-external-staging/traces/0a277f79d85bc9aabeb827f26e9931eb",
  "receiveTimestamp": "2024-08-08T19:44:30.770288499Z",
  "spanId": "3c8b2db3a4a424e8"
}

Response:

{
  "insertId": "8musn9f5y0xt9",
  "jsonPayload": {
    "remoteIp": "74.125.210.1",
    "@type": "type.googleapis.com/google.cloud.loadbalancing.type.LoadBalancerLogEntry",
    "backendTargetProjectNumber": "projects/229325454109",
    "statusDetails": "response_sent_by_backend",
    "cacheDecision": [
      "RESPONSE_HAS_CONTENT_TYPE",
      "REQUEST_HAS_AUTHORIZATION",
      "CACHE_MODE_USE_ORIGIN_HEADERS"
    ]
  },
  "httpRequest": {
    "requestMethod": "DELETE",
    "requestUrl": "https://dss.wing-internal.interuss.opensky.dev/dss/v1/operational_intent_references/0a5d60ef-cbd3-46a5-9241-200ded546fba/huHqtAVBkqIgErF2peFF-uhGsCvLatVyybgfKs8dgz0_",
    "requestSize": "1460",
    "status": 404,
    "responseSize": "325",
    "userAgent": "Wing-DSS",
    "remoteIp": "74.125.210.1",
    "serverIp": "10.64.2.3",
    "latency": "19.837230s"
  },
  "resource": {
    "type": "http_load_balancer",
    "labels": {
      "zone": "global",
      "forwarding_rule_name": "k8s2-fs-ecv8cqrz-dss-main-https-ingress-wtlkqi84",
      "url_map_name": "k8s2-um-ecv8cqrz-dss-main-https-ingress-wtlkqi84",
      "backend_service_name": "k8s1-fd370d60-dss-main-core-service-8080-ff9079c6",
      "target_proxy_name": "k8s2-ts-ecv8cqrz-dss-main-https-ingress-wtlkqi84",
      "project_id": "wing-utm-external-staging"
    }
  },
  "timestamp": "2024-08-08T19:44:29.333991Z",
  "severity": "WARNING",
  "logName": "projects/wing-utm-external-staging/logs/requests",
  "trace": "projects/wing-utm-external-staging/traces/231d30f78a0e53caba232f61d166213b",
  "receiveTimestamp": "2024-08-08T19:44:50.565327121Z",
  "spanId": "3826e6628999bceb"
}
Shastick commented 2 weeks ago

Thank you for the details, @callumdmay.

Regarding the request and response you provided, they don't seem to contain the payload that was sent and received from the DSS, but instead contain what seems to be metadata pertaining to the system that is issuing the requests, or through which the requests flow?

The only directly relevant details they contain are the requestMethod, requestUrl and status in the httpRequest object.

Would it be possible for you to obtain the json payload that is sent in the first PUT, as well as the json response that is received by the DELETE that ends in a 404?

BenjaminPelletier commented 2 weeks ago

I was going to say the issue could be that the op intent might not have been at the specified OVN, but actually I think we have a bug where the specified OVN is not checked when deleting an operational intent reference; @Shastick could you track this as a separate bug fix? We should return 404 (or possibly 409) with a helpful message if the op intent exists but the OVN doesn't match.

So, it seems like the issue is definitely that op intent 0a5d60ef-cbd3-46a5-9241-200ded546fba didn't exist at the time of the second request. @callumdmay are you sure there was no previous attempt to delete this operational intent? It would be extremely useful to search the DSS logs for instances of 0a5d60ef-cbd3-46a5-9241-200ded546fba between roughly 2024-08-08T19:44:27.286542Z and 2024-08-08T19:44:29.333991Z -- in addition to the two requests you've indicated above, I expect there is likely a third request to delete the operational intent reference between those two requests.

callumdmay commented 1 week ago

I only see one delete:

image
callumdmay commented 1 week ago

The log search is by OID, so all requests should appear

BenjaminPelletier commented 1 week ago

Are those the DSS logs or the client USS logs? I'm expecting log messages formatted like this.