interuss / dss

InterUSS Platform's implementation of the ASTM DSS concept for RID and flight coordination.
Apache License 2.0

[SCD] Enable USS to propose OVN to increase parallelization #1078

Open LeslieW opened 3 weeks ago

LeslieW commented 3 weeks ago

Implementation tasks

Original issue

Is your feature request related to a problem? Please describe.

Recent load testing for SCD revealed a limitation: a USS may need its OVN in internal storage before the OVN has been received from the DSS and processed. The following scenarios come to mind:

Describe the solution you'd like

It would be helpful if each USS could optionally provide an OVN to the DSS. Perhaps something similar to below:

Each USS would have a current and proposed version of an operational intent. By default the {operational-intent-id}/ endpoint would return the current version. A proposed version becomes the current version when the response is received from the DSS, or if another USS specifically requests that version (meaning the DSS response is still pending or was lost, but the DSS accepted the request). Each USS would additionally have a {operational-intent-id}/{version} endpoint that would return either the current or proposed version. A USS is not required to store previous versions beyond current and proposed.
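A minimal sketch of the current/proposed bookkeeping described above, as it might look on the USS side. The class and method names (`OperationalIntentRecord`, `propose`, `confirm_proposed`, `get`) and the string version identifier are assumptions for illustration only; nothing here is part of F3548-21 or the DSS codebase.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IntentVersion:
    version: str   # hypothetical version identifier (e.g. the expected OVN)
    details: dict  # operational intent details served to other USSs


class OperationalIntentRecord:
    """Holds at most a current and a proposed version of one operational intent."""

    def __init__(self) -> None:
        self.current: Optional[IntentVersion] = None
        self.proposed: Optional[IntentVersion] = None

    def propose(self, version: IntentVersion) -> None:
        # Stored before (or while) the DSS request is in flight.
        self.proposed = version

    def confirm_proposed(self) -> None:
        # Called when the DSS response arrives: proposed becomes current.
        if self.proposed is not None:
            self.current, self.proposed = self.proposed, None

    def get(self, version: Optional[str] = None) -> Optional[IntentVersion]:
        # Default endpoint: return the current version.
        if version is None:
            return self.current
        # Another USS asked for the proposed version by name, which implies
        # the DSS accepted the request, so promote it.
        if self.proposed is not None and self.proposed.version == version:
            self.confirm_proposed()
            return self.current
        if self.current is not None and self.current.version == version:
            return self.current
        # Older versions are not retained.
        return None
```

In this sketch a request for the proposed version is treated as evidence that the DSS accepted it, matching the promotion rule described above.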

Visual resources

[Image: ovn_proposal]

Describe alternatives you've considered

Alternatives seem limited as this is a distributed system -- without this distributed transaction a USS can only do a best-effort reconciliation, which increases load on the DSS or latency when responding to other USSs.

Additional context

@BenjaminPelletier and @Callumdmay may have additional context (much of the above was pulled from a conversation we had).

BenjaminPelletier commented 2 weeks ago

It seems like there are two features mentioned here:

  1. Allow a USS to request a specific OVN (such that the USS knows what the OVN is likely to be and can begin effectively serving operational intent details before a response is received from the DSS)
  2. Enable USSs to get the details for the correct operational intent version even in the presence of a DSS request that is pending from the perspective of the serving USS

Number 2 is outside the scope of the DSS (the DSS doesn't do anything differently; only the USSs), so I'll assume it's outside the scope of this request. Number 1 seems like something we can do.

I'd expect we would go about this by doing something like:

  1. Modify the SCD database schema to include a nullable ovn column for each operational intent reference (populated with null for all existing operational intent references)
    1. Also add another past_ovns column containing an array of strings, initially empty
  2. Update retrieval handlers to populate the OVN from the ovn column when present, and otherwise default to the current approach when not present/null
  3. Fork the standard F3548 API repo under InterUSS
  4. Make an F3548-21-InterUSS branch
  5. Commit a change on that branch adding an optional requested_ovn_suffix string field to PutOperationalIntentReferenceParameters
    1. Specify that the suffix must be in the form of YYYYMMddhhmmss + unique random string (see below for reasoning)
  6. Update this dss repo to point at the new branch of the InterUSS fork as its API subrepo, and re-auto-generate Go objects from that updated API
  7. Update mutation handlers to populate the ovn column as "op intent ID + requested_ovn_suffix" when requested_ovn_suffix is present and matches the specified format (date-prefixed), as long as past_ovns doesn't contain that OVN (otherwise fall back to current method of computing OVN)
    1. Also add the new OVN to the past_ovns column
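As a rough illustration of step 7, the sketch below decides what to write to the ovn column during an upsert. The function name `ovn_for_upsert`, the regular expression, and the convention of returning `None` to mean "leave the column null and fall back to the current OVN computation" are all assumptions; the real handlers are written in Go and may be structured quite differently. Whether a fallback OVN should also be recorded in past_ovns is not specified above, so this sketch only records requested OVNs.

```python
import re
from typing import List, Optional

# Assumed reading of "YYYYMMddhhmmss + unique random string":
# 14 digits followed by at least one more character.
SUFFIX_PATTERN = re.compile(r"[0-9]{14}.+")


def ovn_for_upsert(
    op_intent_id: str,
    requested_ovn_suffix: Optional[str],
    past_ovns: List[str],
) -> Optional[str]:
    """Return the value for the ovn column, or None to fall back."""
    if requested_ovn_suffix is None:
        return None
    if SUFFIX_PATTERN.fullmatch(requested_ovn_suffix) is None:
        return None
    candidate = op_intent_id + requested_ovn_suffix
    # Collision check is scoped to this operational intent only.
    if candidate in past_ovns:
        return None
    past_ovns.append(candidate)
    return candidate
```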

After these changes, a USS could include a requested_ovn_suffix in their request to create an operational intent reference. When creating that reference, the USS could immediately publish the operational intent details with expected OVN (op intent ID + requested suffix), and this would allow USSs to obtain correct op intent details even if the response from the DSS took a long time to arrive and/or be processed by the USS creating the operational intent.

After these changes, a USS could also include requested_ovn_suffix in their request to modify an operational intent reference. When updating that reference, the USS would need to follow the additional flow recommended above and have other USSs follow the additional flow as well -- the USS would still publish the old op intent details until DSS receipt was confirmed, either via DSS response or via explicit USS request for a newer version (to maintain backwards compatibility and standard compliance), but it would respond with the new details if asked for a specific version using the additional endpoint (outside the F3548-21 API).

@LeslieW do you think those changes would satisfy this request?

The reason to specify a suffix rather than the full OVN is that this allows the DSS to ensure global uniqueness without maintaining a global OVN database, while still allowing the USS to know what the OVN will be. If the USS specified the entire OVN, the DSS would have to verify the OVN didn't collide with any other OVNs for any operational intent, past or present, and that would be a pretty major change. By prefixing the OVN with the operational intent ID, we can scope the collision check to just that operational intent ID and therefore use the past_ovns column of the operational intent reference.

But, there is no guarantee that an operational intent ID will not be reused in the future (after the first operational intent is out of the system), so including the timestamp of the request as part of the suffix ensures global uniqueness of the OVN even if the operational intent ID is reused in the future. Requiring the USS to specify the timestamp as part of the suffix ensures that clock skew between USS and DSS and/or request latency will not cause problems -- the DSS can simply verify that the requested timestamp is within the accepted clock skew, which could perhaps be tens of seconds.
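The clock-skew check could be sketched roughly as follows. The function name `suffix_time_within_skew`, the 30-second default, and the assumption that the timestamp is expressed in UTC are illustrative choices, not something settled in this thread.

```python
import re
from datetime import datetime, timedelta, timezone


def suffix_time_within_skew(
    suffix: str,
    now: datetime,
    max_skew: timedelta = timedelta(seconds=30),  # assumed tolerance
) -> bool:
    """Check that the suffix's leading YYYYMMddhhmmss is close to `now`."""
    if re.match(r"[0-9]{14}", suffix) is None:
        return False
    try:
        requested = datetime.strptime(suffix[:14], "%Y%m%d%H%M%S")
    except ValueError:
        # 14 digits that do not form a valid date, e.g. month 13.
        return False
    # Assumption: the USS expresses the timestamp in UTC.
    requested = requested.replace(tzinfo=timezone.utc)
    return abs(now - requested) <= max_skew
```

Because the USS supplies the timestamp itself, the DSS only compares it against its own clock; it never needs to guess when the request was sent.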

@barroco and @mickmis , do you see any issues with this approach, see any better ways to accomplish the goal, or have any other thoughts regarding these changes? It seems to me like this would make the system more resilient to DSS latency, and given customer observations of the impact of that, it seems like that additional resiliency would be worth having in the near term.

mickmis commented 2 weeks ago

About addressing the planning of many flights in succession, maybe batching of operational intent upserts could help? A new endpoint for this could relax the key checks to ignore the OVNs of the operational intents that are part of the batch.

About a USS requesting op intent details before the DSS response has been received/processed by the owning USS, how likely is that in practice? I see that happening only within this timeframe, which should depend only on network latency:

Have you seen that happen a significant number of times?

About your proposal @BenjaminPelletier, I don't spot any obvious issue. It does seem to be backward compatible with the standard and to have no impact on USSs that do not use this feature. Just some minor feedback:

callumdmay commented 2 weeks ago

Unfortunately, our traffic shape does not allow for batching, so a batch endpoint would not be particularly useful to us here. It's worth noting, however, that our intents are usually not geospatially proximate, so the S2 cell sharding used by the DSS should be sufficient for our throughput needs.

> About a USS requesting op intent details before the DSS response has been received/processed by the owning USS, how likely is that in practice? I see that happening only within this timeframe, which should depend only on network latency:

This was revealed to us by the latency issue we saw in the DSS with the CRDB indexes not functioning correctly; nevertheless, it is a correctness issue that was merely exposed by a latency issue. It could also occur if there was a network partition, server crash, etc., where the response is never received by the USS after it has been accepted by the DSS. Our hope is to make the standard more resilient to the wide variety of failure conditions that any distributed system encounters.

BenjaminPelletier commented 2 weeks ago

> Should the addition of the '/version' endpoint for the USS be added as well? It's not included in your list of changes.

Ah, that's probably fair -- while there is no functional change to the DSS to support the second feature, defining how it should work for USSs would be valuable. So yes, in step 5, I think we'd make that API addition to the USS-USS endpoints as well.

> About the suffix: it looks like this could use a UUIDv7, which would be a standardized approach.

This sounds like a good idea to me. @callumdmay or @LeslieW any concerns about having the suffix be a UUIDv7 specifically? The exchange would then look like this:

> Have you seen that happen a significant number of times?

Another time this would happen is if a USS client gives up due to response time when attempting to perform a DSS operation. For instance, if the USS has a 5-second operation deadline and the DSS takes 6 seconds to respond, the USS (in the 5-second timeframe) cannot distinguish the behavior of the DSS from an unsuccessful request, so it just assumes the operational intent reference doesn't exist (which leaves the operational intent reference in the DSS without the USS acknowledging its existence). With this proposed-OVN change, the USS could instead assume the operational intent reference does exist and put it in the queue for later positive cleanup, in the meantime serving the valid details to anyone who might happen to ask.
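The timeout case above could be handled along these lines. This is only a sketch: `put_reference` stands in for whatever client call a USS uses to reach the DSS (assumed here to raise `TimeoutError` when the deadline passes and to return the assigned OVN otherwise), and `create_reference` and `cleanup_queue` are hypothetical names.

```python
from collections import deque
from typing import Callable, Deque, Tuple


def create_reference(
    put_reference: Callable[[str, str], str],  # hypothetical DSS client call
    op_intent_id: str,
    suffix: str,
    cleanup_queue: Deque[str],
) -> Tuple[str, bool]:
    """Return (OVN to serve, whether the DSS confirmed it)."""
    expected_ovn = op_intent_id + suffix
    try:
        assigned_ovn = put_reference(op_intent_id, suffix)
    except TimeoutError:
        # Cannot tell a slow success from a failure: assume the reference
        # exists, keep serving details under the expected OVN, and queue
        # the intent for later positive cleanup.
        cleanup_queue.append(op_intent_id)
        return expected_ovn, False
    # The DSS may have fallen back to its own OVN, so trust its answer.
    return assigned_ovn, True
```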

I think the big value here is correctness and robustness as @callumdmay mentions. Proposed OVNs allow a USS to begin serving usually-correct operational intent details prior to making the DSS call, eliminating any problems in the operational intent reference creation process as reasons that might block another USS from planning in the airspace.

@mickmis assuming no issues with UUIDv7 from @callumdmay or @LeslieW, do you think you might have time to work on this in the near term?

callumdmay commented 2 weeks ago

I think UUIDv7 is a great choice, +1 to that
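For context on why UUIDv7 fits here: its first 48 bits are a Unix timestamp in milliseconds, so a UUIDv7 suffix carries the request time just like the date prefix proposed earlier. Below is a minimal sketch that builds a UUIDv7 by hand following the RFC 9562 bit layout and reads the timestamp back out. The helper names `make_uuid7` and `uuid7_unix_ms` are made up for illustration, and a real implementation would more likely use an existing library.

```python
import os
import time
import uuid
from typing import Optional


def make_uuid7(unix_ms: Optional[int] = None) -> uuid.UUID:
    """Build a UUIDv7: 48-bit ms timestamp, version, 74 random bits."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    rand_a = rand >> 68                # top 12 bits
    rand_b = rand & ((1 << 62) - 1)    # low 62 bits
    value = (unix_ms & ((1 << 48) - 1)) << 80
    value |= 0x7 << 76                 # version 7
    value |= rand_a << 64
    value |= 0b10 << 62                # RFC variant
    value |= rand_b
    return uuid.UUID(int=value)


def uuid7_unix_ms(u: uuid.UUID) -> int:
    """Recover the embedded millisecond timestamp."""
    return u.int >> 80
```

With this, the DSS-side skew check reduces to comparing `uuid7_unix_ms(...)` against the DSS clock.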

mickmis commented 2 weeks ago

> This was revealed to us by the latency issue we saw in the DSS with the CRDB indexes not functioning correctly; nevertheless, it is a correctness issue that was merely exposed by a latency issue. It could also occur if there was a network partition, server crash, etc., where the response is never received by the USS after it has been accepted by the DSS. Our hope is to make the standard more resilient to the wide variety of failure conditions that any distributed system encounters.

> Another time this would happen is if a USS client gives up due to response time when attempting to perform a DSS operation. For instance, if the USS has a 5-second operation deadline and the DSS takes 6 seconds to respond, the USS (in the 5-second timeframe) cannot distinguish the behavior of the DSS from an unsuccessful request, so it just assumes the operational intent reference doesn't exist (which leaves the operational intent reference in the DSS without the USS acknowledging its existence). With this proposed-OVN change, the USS could instead assume the operational intent reference does exist and put it in the queue for later positive cleanup, in the meantime serving the valid details to anyone who might happen to ask.

Indeed, that did not come to my mind!

> @mickmis assuming no issues with UUIDv7 from @callumdmay or @LeslieW, do you think you might have time to work on this in the near term?

Yes I will start on it after I'm done with a first draft for #1074 which shouldn't take too long, unless you'd rather prioritize differently.