interuss / dss

InterUSS Platform's implementation of the ASTM DSS concept for RID and flight coordination.

Implement TTLs on ended SCD subscriptions to prevent unbounded growth and latency increases #1074

Open · jctrouble opened this issue 3 weeks ago

jctrouble commented 3 weeks ago

After applying the changes in #1070, this query:

https://github.com/interuss/dss/blob/915d45b444e1723e93ada503b9b0a340b619c591/pkg/scd/store/cockroach/subscriptions.go#L286-L297

still needs to read a lot of rows to do its job, as the following EXPLAIN ANALYZE output shows:

planning time: 654µs
execution time: 121ms
distribution: local
vectorized: true
rows decoded from KV: 2,056 (1.5 MiB, 3 gRPC calls)
cumulative time spent in KV: 119ms
maximum memory usage: 3.7 MiB
network usage: 0 B (0 messages)
sql cpu time: 10ms
client time: 18µs
isolation level: serializable
priority: normal
quality of service: regular

• limit
│ count: 10000
│
└── • filter
    │ nodes: n3
    │ actual row count: 2
    │ sql cpu time: 82µs
    │ estimated row count: 507
    │ filter: COALESCE(starts_at <= '2024-08-20 23:28:08.865691+00', true) AND COALESCE(ends_at >= '2024-08-20 23:21:43.000993+00', true)
    │
    └── • index join (streamer)
        │ nodes: n3
        │ actual row count: 736
        │ KV time: 115ms
        │ KV contention time: 0µs
        │ KV rows decoded: 736
        │ KV bytes read: 1.4 MiB
        │ KV gRPC calls: 2
        │ estimated max memory allocated: 3.4 MiB
        │ estimated max sql temp disk usage: 0 B
        │ sql cpu time: 9ms
        │ estimated row count: 641
        │ table: scd_subscriptions@primary
        │
        └── • inverted filter
            │ nodes: n3
            │ actual row count: 736
            │ estimated max memory allocated: 60 KiB
            │ estimated max sql temp disk usage: 0 B
            │ sql cpu time: 1ms
            │ estimated row count: 641
            │ inverted column: cells_inverted_key
            │ num spans: 4
            │
            └── • scan
                  nodes: n3
                  actual row count: 1,320
                  KV time: 3ms
                  KV contention time: 0µs
                  KV rows decoded: 1,320
                  KV bytes read: 70 KiB
                  KV gRPC calls: 1
                  estimated max memory allocated: 130 KiB
                  sql cpu time: 262µs
                  estimated row count: 641 (4.7% of the table; stats collected 2 hours ago)
                  table: scd_subscriptions@cell_idx
                  spans: 4 spans

While it uses cell_idx for the spatial part of the query, it must still scan every subscription that has ever existed for those S2 cells, only to filter most of them out by start and end date. As time passes (and subscriptions are not cleaned up), this query will keep getting slower as it has to filter through more and more likely-irrelevant rows.

The obvious approach would be to set up an index for this case. Unfortunately, in a cursory investigation I was unable to get an inverted index working that contained {starts_at, ends_at, cells}, and I'm not sure CRDB supports the kind of index we'd need to search on all three columns.
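
For reference, a multi-column inverted index covering those three columns would look roughly like the sketch below (hypothetical index name). My understanding is that CockroachDB only uses the non-inverted prefix columns of such an index when they are constrained by equality, so the range predicates on starts_at and ends_at would not be able to take advantage of it:

-- Sketch only: prefix columns of a multi-column inverted index must be
-- matched with '=' (or IN) for the index to be used, so '<='/'>=' ranges
-- on the timestamps would still fall back to filtering.
CREATE INVERTED INDEX subs_time_cells_idx
    ON scd_subscriptions (starts_at, ends_at, cells);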

The next best thing would be to prune expired subscriptions so that the table doesn't experience unbounded growth. CockroachDB supports row-level TTLs, which we could use to purge any subscription whose end date is sufficiently far in the past. In a local analysis of Wing's internal DSS, I found the following:

scd> SELECT COUNT(*) FROM scd_subscriptions;
  count
---------
  17404

scd> SELECT COUNT(*) FROM scd_subscriptions WHERE ends_at <= now() - INTERVAL '7 days';
  count
---------
  10774

Over 60% of subscriptions had an end date more than a week in the past. By choosing an interval and pruning, we can ensure that the subscriptions table only grows as large as the set of active subscriptions, and queries don't face an unbounded data set.
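
For illustration, a row-level TTL keyed off ends_at could look roughly like the following. This is a sketch only: the 7-day retention and the cron schedule are arbitrary examples, and it assumes a CockroachDB version that supports ttl_expiration_expression:

-- Sketch: mark rows as expired 7 days after their end time and let the
-- background TTL job delete them. Interval and schedule are illustrative.
ALTER TABLE scd_subscriptions SET (
    ttl_expiration_expression = $$(ends_at + INTERVAL '7 days')$$,
    ttl_job_cron = '@daily'
);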

mickmis commented 3 weeks ago

Thanks for the report @jctrouble. Some clarifying questions that will help:

Also, here is a discussion relevant to this issue: interuss/tsc#12. However, it seems the performance impact was not anticipated there.

Maybe we need to start removing subscriptions and operational intents that have been expired for a certain amount of time. Or maybe we can find a way to set them aside in store queries so that they don't impact performance. Thoughts @BenjaminPelletier @barroco ?

jctrouble commented 3 weeks ago

@mickmis I'm still wrapping my head around the system, so forgive the dumb question: Is there an easy way to tell if a subscription is implicit / attached to an intent?

BenjaminPelletier commented 3 weeks ago

At a basic level, the discovery and synchronization service is closer to a real-time system (reflects current state of the world) than it is to a long-term storage system -- discovering flight information is generally only needed a limited amount of time in advance (30 or 56 days depending on context from F3548-21), and only has value in very narrow cases past the time of flight. So, I would expect us to want effective TTLs on pretty much all data -- there is nothing in the DSS we would want to keep for a year, for instance. So, the question would only be how long the TTLs should be. There's probably some discussion there, but I think it would be hard to argue anything longer than 56*2 days, and we would probably want shorter. Since DSS deployments last longer than 112 days, TTLs (even very loose ones) seem valuable for the long-term health of the system.

I believe the implicitness of the Subscription ("implicit" means the USS created or updated an operational intent and, instead of providing a pre-existing subscription to maintain airspace awareness for that operational intent, asked the DSS to maintain a subscription on the USS's behalf -- IMO this is an unnecessary complication, but it is in the standard) is recorded as an explicit column on the Subscription row. ASTM F3548-21 SCD0080 requires USSs to "maintain awareness" when they have an active operational intent. The meaning and interpretation of this gets murky once the wall time is past the end of the operational intent, and this was some of the motivation for the discussion @mickmis linked. Whether any operational intents are using a subscription can be determined by checking the operational intent rows for their subscription_id. Again, the "correct" treatment of subscriptions attached to operational intents that ended in the past is murky; see the linked discussion for more.
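
A sketch of such a check, using the table and column names that appear in the queries later in this thread (the UUID is just a placeholder):

-- Is this subscription still referenced by any operational intent?
SELECT EXISTS (
    SELECT 1
    FROM scd_operations
    WHERE subscription_id = '00000000-0000-0000-0000-000000000000'
);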

As an additional note, I believe the resolution of #1056 (mostly pending in #1057) should likely go a long way to reducing the build up of subscriptions in the database.

jctrouble commented 3 weeks ago

Ran a bunch of queries to categorize implicit vs explicit subscriptions, whether or not they're associated with an existing operation, and whether or not they expired over a week ago:

Implicit?   Operation?   Recent     Old (end date more than 7 days ago)
Yes         Yes              67        63
Yes         No            10995     10861
No          Yes               0       243
No          No               33       457

The queries were variations of:

SELECT
    COUNT(sub.id)
FROM scd_subscriptions AS sub
LEFT OUTER JOIN scd_operations AS op
    ON sub.id = op.subscription_id
WHERE
    op.id IS NOT NULL 
    AND sub.implicit = TRUE
    AND sub.ends_at <= now() - INTERVAL '7 days'
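
For example, the implicit-without-operational-intent buckets in the table above would come from a variation along these lines (a sketch; flipping the op.id and implicit predicates and the direction of the ends_at comparison yields the other cells):

SELECT
    COUNT(sub.id)
FROM scd_subscriptions AS sub
LEFT OUTER JOIN scd_operations AS op
    ON sub.id = op.subscription_id
WHERE
    op.id IS NULL
    AND sub.implicit = TRUE
    AND sub.ends_at <= now() - INTERVAL '7 days'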

mickmis commented 3 weeks ago

Thanks a lot for the data @jctrouble, that's very useful insight and answers all of my previous questions.

> As an additional note, I believe the resolution of https://github.com/interuss/dss/issues/1056 (mostly pending in https://github.com/interuss/dss/pull/1057) should likely go a long way to reducing the build up of subscriptions in the database.

Indeed, resolution of #1056 will address the issue that led to the accrual of the 10995 implicit subscriptions without operational intent reported above.

However, it will not remove all the existing ones. From your feedback @BenjaminPelletier it seems clear that we will need some way to clean up the database, but also that defining what to clean up is not trivial. The different ways forward I see are:

  1. add to the DSS a parametrizable CLI tool (used manually at first and possibly set up as cron job in cluster)
  2. add to the DSS runtime a configurable cleanup routine
  3. add TTL at CRDB level (as suggested in this issue in the first place)

Given the federated nature of the DSS, and the approximate definition of what a cleanup should look like, I do feel that option 1 enables the most flexibility for operations of a DSS pool. WDYT?
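
As a rough sketch, the kind of statement a parametrizable cleanup tool (option 1) might run could look like the following (hypothetical policy and retention interval, reusing the table and column names from the queries above):

-- Sketch: remove implicit subscriptions that ended more than 7 days ago
-- and are no longer referenced by any operational intent.
DELETE FROM scd_subscriptions AS sub
WHERE sub.implicit = TRUE
  AND sub.ends_at <= now() - INTERVAL '7 days'
  AND NOT EXISTS (
      SELECT 1 FROM scd_operations AS op WHERE op.subscription_id = sub.id
  );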

BenjaminPelletier commented 2 weeks ago

I don't think 1-3 are mutually exclusive, so it's probably most valuable to figure out if/when we want to do each thing separately. It seems like we need option 1 to migrate existing deployments in any case (even if we had 3 already), so that does seem like the best first step. I would try to stay away from option 2 if possible as I think that adds more moving parts than are necessary. After completing the manual version of option 1, the next question would probably be whether to make a cron job for option 1 or pursue option 3. Given the likely CRDB transition, I think we should probably defer that decision until after the manual version of option 1 to give more time to see what the CRDB plan will be (and whether row-level TTLs will be available in the new solution).

@mickmis, do you think you would have time to work on this in the near term? Note that deployment_manager exists, though it may or may not be helpful for this project.

mickmis commented 2 weeks ago

@BenjaminPelletier yes, I will start working on it.

mickmis commented 2 weeks ago

> Note that deployment_manager exists, though it may or may not be helpful for this project.

@BenjaminPelletier IIUC, doing so through deployment_manager implies being able to perform those operations only if the DSS is deployed on a Kubernetes cluster. That would exclude any local deployment using simple docker-compose files. Is that desirable?

mickmis commented 1 day ago

FYI in PR #1116 there is an implementation (still awaiting review) of a CLI tool that enables deletion of old operational intents and subscriptions.