Open LeslieW opened 3 months ago
It seems like there are two features mentioned here:
Number 2 is outside the scope of the DSS (the DSS doesn't do anything differently; only the USSs), so I'll assume it's outside the scope of this request. Number 1 seems like something we can do.
I'd expect we would go about this by doing something like:
- Add an ovn column for each operational intent reference (populated with null for all existing operational intent references)
- Add a past_ovns column containing an array of strings, initially empty
- Use the ovn column when present, and otherwise default to the current approach when not present/null
- Add a requested_ovn_suffix string field to PutOperationalIntentReferenceParameters
- Update the dss repo to point at the new branch of the InterUSS fork as its API subrepo, and re-auto-generate Go objects from that updated API
- Populate the ovn column as "op intent ID + requested_ovn_suffix" when requested_ovn_suffix is present and matches the specified format (date-prefixed), as long as past_ovns doesn't contain that OVN (otherwise fall back to current method of computing OVN)
- past_ovns column

After these changes, a USS could include a requested_ovn_suffix in their request to create an operational intent reference. When creating that reference, the USS could immediately publish the operational intent details with the expected OVN (op intent ID + requested suffix). This would allow other USSs to obtain correct op intent details even if the response from the DSS took a long time to arrive and/or be processed by the USS creating the operational intent.
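To make the OVN selection rule from the list above concrete, here is a minimal sketch in Python (the DSS itself is written in Go, so this is illustration only). `OpIntentRow`, `legacy_ovn`, `suffix_format_ok`, and the `_` separator are hypothetical names and choices, and the step that records the previous OVN in past_ovns is an assumption about what the final list item intends.

```python
import secrets
from dataclasses import dataclass, field
from typing import List, Optional

SEPARATOR = "_"  # assumption: the thread only says "op intent ID + suffix"


@dataclass
class OpIntentRow:
    """Hypothetical row shape carrying the two proposed columns."""
    id: str
    ovn: Optional[str] = None  # null for pre-existing references
    past_ovns: List[str] = field(default_factory=list)


def legacy_ovn() -> str:
    # Stand-in for the DSS's current OVN computation, which is not shown here.
    return secrets.token_urlsafe(16)


def suffix_format_ok(suffix: str) -> bool:
    # Placeholder for the real format check (date-prefixed); assumption.
    return 0 < len(suffix) <= 64


def assign_ovn(row: OpIntentRow, requested_ovn_suffix: Optional[str]) -> str:
    """Choose the OVN for a create/update and remember the previous one."""
    new_ovn = None
    if requested_ovn_suffix and suffix_format_ok(requested_ovn_suffix):
        candidate = f"{row.id}{SEPARATOR}{requested_ovn_suffix}"
        # The collision check is scoped to this operational intent only.
        if candidate != row.ovn and candidate not in row.past_ovns:
            new_ovn = candidate
    if new_ovn is None:
        # Suffix absent, malformed, or already used: current method.
        new_ovn = legacy_ovn()
    if row.ovn is not None:
        row.past_ovns.append(row.ovn)  # assumption: old OVNs are kept here
    row.ovn = new_ovn
    return new_ovn
```

A USS that sent a valid, unused suffix can predict the result of `assign_ovn` before the response arrives; any other request gets an OVN it has to read from the response, exactly as today.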
After these changes, a USS could also include requested_ovn_suffix in their request to modify an operational intent reference. When updating that reference, the USS would need to follow the additional flow recommended above, and other USSs would need to follow it as well. To maintain backwards compatibility and standard compliance, the USS would still publish the old op intent details until DSS receipt was confirmed, either via the DSS response or via an explicit USS request for a newer version. It would, however, respond with the new details if asked for a specific version using the additional endpoint (outside the F3548-21 API).
@LeslieW do you think those changes would satisfy this request?
The reason to specify a suffix rather than the full OVN is that this allows the DSS to ensure global uniqueness without maintaining a global OVN database, while still allowing the USS to know what the OVN will be. If the USS specified the entire OVN, the DSS would have to verify the OVN didn't collide with any other OVNs for any operational intent, past or present, and that would be a pretty major change. By prefixing the OVN with the operational intent ID, we can scope the collision check to just that operational intent ID and therefore use the past_ovns column of the operational intent reference.
But, there is no guarantee that an operational intent ID will not be reused in the future (after the first operational intent is out of the system), so including the timestamp of the request as part of the suffix ensures global uniqueness of the OVN even if the operational intent ID is reused in the future. Requiring the USS to specify the timestamp as part of the suffix ensures that clock skew between USS and DSS and/or request latency will not cause problems -- the DSS can simply verify that the requested timestamp is within the accepted clock skew, which could perhaps be tens of seconds.
@barroco and @mickmis , do you see any issues with this approach, see any better ways to accomplish the goal, or have any other thoughts regarding these changes? It seems to me like this would make the system more resilient to DSS latency, and given customer observations of the impact of that, it seems like that additional resiliency would be worth having in the near term.
About addressing the planning of many flights in succession, maybe batching operational intent upserts could help? A new endpoint doing so could relax the key checks to ignore the OVNs of the operational intents that are part of the batch.
About a USS requesting op intent details before the DSS response has been received/processed by the owning USS, how likely is that in practice? I see that happening only within this timeframe, which should depend only on network latency:
Have you seen that happen a significant amount of time?
About your proposal @BenjaminPelletier, I don't spot any obvious issue. It does seem to be backwards compatible with the standard and to have no impact on USSs that do not use this feature. Just some minor feedback:
Our traffic shape does not allow for batching, unfortunately, so a batch endpoint would not be particularly useful to us here. However, it's worth noting that our intents are usually not geospatially proximate, so the S2 cell sharding used by the DSS should be sufficient for our throughput needs.
> About a USS requesting op intent details before the DSS response has been received/processed by the owning USS, how likely is that in practice? I see that happening only within this timeframe, which should depend only on network latency:

This was revealed to us by the latency issue we saw in the DSS with the CRDB indexes not functioning correctly. Nevertheless, it is a correctness issue that was only revealed by a latency issue. It could also occur after a network partition, server crash, etc., where the response is never received by the USS after it has been accepted by the DSS. Our hope is to make the standard more resilient to the wide variety of failure conditions that any distributed system encounters.
> Should the addition of the '/version' endpoint for the USS be added as well? It's not included in your list of changes.
Ah, that's probably fair -- while there is no functional change to the DSS to support the second feature, defining how it should work for USSs would be valuable. So yes, in step 5, I think we'd make that API addition to the USS-USS endpoints as well.
> About the suffix: it looks like this could use a UUIDv7, which would be a standardized approach.
This sounds like a good idea to me. @callumdmay or @LeslieW any concerns about having the suffix be a UUIDv7 specifically? The exchange would then look like this:
- USS includes requested_ovn_suffix=019194b6-148e-7b10-a502-ceb104f8c83a in its request
- DSS checks that the requested_ovn_suffix UUIDv7 is within acceptable clock skew/latency

> Have you seen that happen a significant amount of time?
Another time this would happen is if a USS client gives up due to response time when attempting to perform a DSS operation. For instance, if the USS has a 5-second operation deadline and the DSS takes 6 seconds to respond, the USS (in the 5-second timeframe) cannot distinguish the behavior of the DSS from an unsuccessful request, so it just assumes the operational intent reference doesn't exist (which leaves the operational intent reference in the DSS without the USS acknowledging its existence). With this proposed-OVN change, the USS could instead assume the operational intent reference does exist and put it in the queue for later positive cleanup, in the meantime serving the valid details to anyone who might happen to ask.
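The timeout case described here could be handled on the USS side roughly as follows. This is a sketch under stated assumptions: `put_ref` stands in for whatever DSS client call the USS uses, `served` and `cleanup_queue` are hypothetical in-memory structures, and the `_` separator between op intent ID and suffix is assumed.

```python
from typing import Callable, Dict, List


def create_with_proposed_ovn(
    put_ref: Callable[[str, str], str],
    op_intent_id: str,
    suffix: str,
    served: Dict[str, str],
    cleanup_queue: List[str],
) -> str:
    """Create a reference, serving details under the expected OVN meanwhile.

    put_ref(op_intent_id, suffix) is a hypothetical DSS client call that
    returns the OVN the DSS assigned, or raises TimeoutError at the deadline.
    """
    expected_ovn = f"{op_intent_id}_{suffix}"  # assumed separator
    # Start serving details under the expected OVN before calling the DSS.
    served[op_intent_id] = expected_ovn
    try:
        confirmed_ovn = put_ref(op_intent_id, suffix)
    except TimeoutError:
        # Outcome unknown: assume the reference exists, keep serving the
        # details, and queue it for later positive cleanup.
        cleanup_queue.append(op_intent_id)
        return expected_ovn
    # The DSS may have fallen back to its own OVN, so trust the response.
    served[op_intent_id] = confirmed_ovn
    return confirmed_ovn
```

The point of the design is that a timeout no longer forces the USS to guess that the reference does not exist; it keeps answering requests for the expected OVN until cleanup settles the true state.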
I think the big value here is correctness and robustness, as @callumdmay mentions. Proposed OVNs allow a USS to begin serving usually-correct operational intent details prior to making the DSS call, eliminating any problems in the operational intent reference creation process as reasons that might block another USS from planning in the airspace.
@mickmis assuming no issues with UUIDv7 from @callumdmay or @LeslieW, do you think you might have time to work on this in the near term?
I think UUIDv7 is a great choice, +1 to that
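For reference, the clock-skew check on a UUIDv7 suffix could look something like the sketch below. A UUIDv7 carries a Unix timestamp in milliseconds in its top 48 bits, so the check needs no extra field. `MAX_SKEW`, `uuid7_timestamp`, and `suffix_within_skew` are hypothetical names, and 30 seconds is only a placeholder for the "tens of seconds" mentioned earlier.

```python
import uuid
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MAX_SKEW = timedelta(seconds=30)  # assumption: "tens of seconds"


def uuid7_timestamp(suffix: str) -> datetime:
    """Return the timestamp embedded in a UUIDv7 string.

    Raises ValueError if the string is not a well-formed UUIDv7.
    """
    parsed = uuid.UUID(suffix)  # raises ValueError on malformed input
    if parsed.version != 7:
        raise ValueError("requested_ovn_suffix is not a UUIDv7")
    unix_ms = parsed.int >> 80  # top 48 of 128 bits: Unix time in ms
    return EPOCH + timedelta(milliseconds=unix_ms)


def suffix_within_skew(
    suffix: str, now: datetime, max_skew: timedelta = MAX_SKEW
) -> bool:
    """True if the suffix is a UUIDv7 whose timestamp is close to `now`."""
    try:
        embedded = uuid7_timestamp(suffix)
    except ValueError:
        return False
    return abs(now - embedded) <= max_skew
```

With the example suffix from this thread, `uuid7_timestamp("019194b6-148e-7b10-a502-ceb104f8c83a")` yields a time in late August 2024. A DSS that rejects the check would presumably fall back to its current OVN computation rather than fail the request, in line with the fallback described in the proposal.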
> This was revealed to us by the latency issue we saw in the DSS with the CRDB indexes not functioning correctly. Nevertheless, it is a correctness issue that was only revealed by a latency issue. It could also occur after a network partition, server crash, etc., where the response is never received by the USS after it has been accepted by the DSS. Our hope is to make the standard more resilient to the wide variety of failure conditions that any distributed system encounters.

> Another time this would happen is if a USS client gives up due to response time when attempting to perform a DSS operation. For instance, if the USS has a 5-second operation deadline and the DSS takes 6 seconds to respond, the USS (in the 5-second timeframe) cannot distinguish the behavior of the DSS from an unsuccessful request, so it just assumes the operational intent reference doesn't exist (which leaves the operational intent reference in the DSS without the USS acknowledging its existence). With this proposed-OVN change, the USS could instead assume the operational intent reference does exist and put it in the queue for later positive cleanup, in the meantime serving the valid details to anyone who might happen to ask.
Indeed, that did not come to my mind!
> @mickmis assuming no issues with UUIDv7 from @callumdmay or @LeslieW, do you think you might have time to work on this in the near term?
Yes, I will start on it after I'm done with a first draft for #1074, which shouldn't take too long, unless you'd rather prioritize differently.
Implementation tasks
Original issue
Is your feature request related to a problem? Please describe.
Recent load testing for SCD revealed a limitation: a USS may need their OVN in internal storage before it is received and processed from the DSS. The following scenarios come to mind:
Describe the solution you'd like
It would be helpful if each USS could optionally provide an OVN to the DSS. Perhaps something similar to below:
Each USS would have a current and proposed version of an operational intent. By default the {operational-intent-id}/ endpoint would return the current version. A proposed version becomes the current version when the response is received from the DSS, or if another USS specifically requests that version (meaning the DSS response is still pending or was lost, but the DSS accepted the request). Each USS would additionally have a {operational-intent-id}/{version} endpoint that would return either the current or proposed version. A USS is not required to store previous versions beyond current and proposed.
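The current/proposed behavior described above could be sketched on the USS side as follows. `OpIntentStore` and its methods are hypothetical, details are modeled as plain dicts, and treating the {version} path segment as the OVN is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Versions:
    """At most two versions are retained per operational intent."""
    current: Optional[dict] = None
    proposed: Optional[dict] = None


class OpIntentStore:
    def __init__(self) -> None:
        self._by_id: Dict[str, Versions] = {}

    def propose(self, op_intent_id: str, details: dict) -> None:
        """Stage details before the DSS request is sent."""
        self._by_id.setdefault(op_intent_id, Versions()).proposed = details

    def confirm(self, op_intent_id: str) -> None:
        """DSS response received: the proposed version becomes current."""
        versions = self._by_id[op_intent_id]
        if versions.proposed is not None:
            versions.current, versions.proposed = versions.proposed, None

    def get(self, op_intent_id: str, version: Optional[str] = None):
        """Model {operational-intent-id}/ and {operational-intent-id}/{version}."""
        versions = self._by_id.get(op_intent_id)
        if versions is None:
            return None
        if version is None:
            return versions.current  # default endpoint: current only
        if versions.current and versions.current["ovn"] == version:
            return versions.current
        if versions.proposed and versions.proposed["ovn"] == version:
            # Another USS knows this OVN, so the DSS accepted the request
            # even though our response is pending or lost: promote it.
            self.confirm(op_intent_id)
            return versions.current
        return None  # older versions are not retained
```

A request for a specific proposed version doubles as confirmation, which is why `get` promotes it instead of only returning it.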
Visual resources
Describe alternatives you've considered
Alternatives seem limited as this is a distributed system -- without this distributed transaction a USS can only do a best effort reconciliation which increases load on the DSS or latency when responding to other USSs.
Additional context
@BenjaminPelletier and @Callumdmay may have additional context (much of the above was pulled from a conversation we had).