
Unified Resource Cache contains expired items #37527

Closed rosstimothy closed 4 months ago

rosstimothy commented 5 months ago

I added several thousand items to my cluster, kept them around for several hours, and then removed them all. When I logged in to the cluster the next day it had 0 resources. I then added one SSH service, logged in to the web UI, and noticed that it was showing more than one node. I confirmed that both tctl get and tsh ls found only a single resource, and that DynamoDB contained only a single item. Both tctl and tsh still use ListResources, while the web UI uses ListUnifiedResources.

./releases/15.0.0-beta.1/tctl get nodes --format=json | jq -r '.[].spec.hostname' | wc -l
       1

./releases/15.0.0-beta.1/tsh ls
Node Name                       Address    Labels
------------------------------- ---------- ----------------------------------------
node-agents-696999bc55-qtjw8-00 ⟵ Tunnel   fullname=node-agents-696999bc55-qtjw8-00

[screenshot: web UI resource list showing more than one node]
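
The same divergence can be checked programmatically by issuing both RPCs from the Go api client. The sketch below is only illustrative: the proxy address and identity file are placeholders, and the exact method and field shapes for ListResources / ListUnifiedResources on client.Client are assumptions based on the RPC names above rather than verified signatures.

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/gravitational/teleport/api/client"
	"github.com/gravitational/teleport/api/client/proto"
	apidefaults "github.com/gravitational/teleport/api/defaults"
	"github.com/gravitational/teleport/api/types"
)

func main() {
	ctx := context.Background()

	// Placeholder connection details: substitute a real proxy address and identity file.
	clt, err := client.New(ctx, client.Config{
		Addrs:       []string{"teleport.example.com:443"},
		Credentials: []client.Credentials{client.LoadIdentityFile("identity.pem")},
	})
	if err != nil {
		log.Fatal(err)
	}
	defer clt.Close()

	// The path tctl/tsh take: ListResources is served from the backend-backed cache.
	lr, err := clt.ListResources(ctx, proto.ListResourcesRequest{
		ResourceType: types.KindNode,
		Namespace:    apidefaults.Namespace,
		Limit:        100,
	})
	if err != nil {
		log.Fatal(err)
	}

	// The path the web UI takes: ListUnifiedResources is served from the unified resource cache.
	ur, err := clt.ListUnifiedResources(ctx, &proto.ListUnifiedResourcesRequest{
		Kinds: []string{types.KindNode},
		Limit: 100,
	})
	if err != nil {
		log.Fatal(err)
	}

	// On an affected cluster the counts disagree: the unified resource cache
	// still holds nodes that relative expiry removed from the backend cache.
	fmt.Printf("ListResources: %d nodes, ListUnifiedResources: %d nodes\n",
		len(lr.Resources), len(ur.Resources))
}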

rosstimothy commented 5 months ago

This appears to be a side effect of relative node expiry not emitting delete events. I was able to reproduce the behavior by adding a unified resource cache to the relative expiry test and checking how many items it still held after relative expiry ran.

diff --git a/lib/cache/cache_test.go b/lib/cache/cache_test.go
index 52491acf90..bc52e70484 100644
--- a/lib/cache/cache_test.go
+++ b/lib/cache/cache_test.go
@@ -36,6 +36,7 @@ import (
        "github.com/stretchr/testify/assert"
        "github.com/stretchr/testify/require"

+       "github.com/gravitational/teleport"
        "github.com/gravitational/teleport/api/client"
        "github.com/gravitational/teleport/api/client/proto"
        apidefaults "github.com/gravitational/teleport/api/defaults"
@@ -2496,6 +2497,17 @@ func TestRelativeExpiry(t *testing.T) {
        })
        t.Cleanup(p.Close)

+       urc, err := services.NewUnifiedResourceCache(ctx, services.UnifiedResourceCacheConfig{
+               ResourceWatcherConfig: services.ResourceWatcherConfig{
+                       QueueSize:    defaults.UnifiedResourcesQueueSize,
+                       Component:    teleport.ComponentUnifiedResource,
+                       Client:       p.cache,
+                       MaxStaleness: time.Minute,
+               },
+               ResourceGetter: p.cache,
+       })
+       require.NoError(t, err)
+
        // add servers that expire at a range of times
        now := clock.Now()
        for i := int64(0); i < nodeCount; i++ {
@@ -2515,6 +2527,10 @@ func TestRelativeExpiry(t *testing.T) {
        require.NoError(t, err)
        require.Len(t, nodes, 100)

+       res, err := urc.GetUnifiedResources(ctx)
+       require.NoError(t, err)
+       require.Len(t, res, 100)
+
        clock.Advance(time.Minute * 25)
        // get rid of events that were emitted before clock advanced
        drainEvents(p.eventsC)
@@ -2549,6 +2565,10 @@ func TestRelativeExpiry(t *testing.T) {
        nodes, err = p.cache.GetNodes(ctx, apidefaults.Namespace)
        require.NoError(t, err)
        require.NotEmpty(t, nodes, "node_count=%d", len(nodes))
+
+       res, err = urc.GetUnifiedResources(ctx)
+       require.NoError(t, err)
+       require.NotEqual(t, 100, len(res))
 }

 func TestRelativeExpiryLimit(t *testing.T) {

The test fails without the patch below because the unified resource cache is never told that anything was removed.

diff --git a/lib/cache/cache.go b/lib/cache/cache.go
index 9c2d04694c..d861d653bd 100644
--- a/lib/cache/cache.go
+++ b/lib/cache/cache.go
@@ -1440,7 +1440,7 @@ func (c *Cache) performRelativeNodeExpiry(ctx context.Context) error {
                                Kind:     types.KindNode,
                                Metadata: node.GetMetadata(),
                        },
-               }, false); err != nil {
+               }, true); err != nil {
                        return trace.Wrap(err)
                }

When relative expiry was initially implemented, it only ran on the auth servers. However, that caused frequent buffer overflows for downstream watchers, because a batch of N delete events could be emitted faster than they could be consumed downstream. To solve this, we stopped emitting delete events and had each cache instance run relative expiry itself. We may be able to revert to the initial implementation, since FanoutV2 should solve the buffer overflow problems we originally ran into.
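
For context, the overflow failure mode looked roughly like this (a toy model, not Teleport's fanout code): a producer that must not block has to reset any watcher whose buffer fills up, and a burst of thousands of delete events fills a buffer long before a slow consumer gets scheduled.

package main

import "fmt"

// A toy, lossy fanout: each watcher has a fixed-size buffer, and a producer
// that cannot block marks any watcher whose buffer is full for reset.
type watcher struct {
	buf   chan string
	reset bool
}

func emit(watchers []*watcher, event string) {
	for _, w := range watchers {
		select {
		case w.buf <- event:
		default:
			w.reset = true // buffer overflow: the watcher must re-initialize from scratch
		}
	}
}

func main() {
	slow := &watcher{buf: make(chan string, 64)}

	// One delete per expired node, emitted in a tight loop: with thousands of
	// nodes this overruns the buffer before the consumer drains anything.
	for i := 0; i < 5000; i++ {
		emit([]*watcher{slow}, fmt.Sprintf("delete node-%d", i))
	}

	fmt.Println("watcher reset due to overflow:", slow.reset) // true
}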

It should also be noted that the unified resource cache is not the only thing impacted by this. Any watcher backed by a local cache may experience the same behavior described above, though the impact is limited to nodes, since relative expiry is not enabled for any other resource.
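
As a rough illustration of how such a watcher goes stale, the sketch below maintains a node map purely from watch events via the public NewWatcher API (assuming an already-authenticated *client.Client; error handling and reconnects are elided). Because relative expiry never emits OpDelete for the nodes it removes, the map keeps those entries indefinitely.

package nodewatch

// A downstream view of nodes maintained purely from watch events. The
// event-loop shape follows the public NewWatcher pattern; treat the details
// as illustrative rather than a definitive implementation.

import (
	"context"

	"github.com/gravitational/teleport/api/client"
	"github.com/gravitational/teleport/api/types"
	"github.com/gravitational/trace"
)

func watchNodes(ctx context.Context, clt *client.Client) error {
	w, err := clt.NewWatcher(ctx, types.Watch{
		Kinds: []types.WatchKind{{Kind: types.KindNode}},
	})
	if err != nil {
		return trace.Wrap(err)
	}
	defer w.Close()

	nodes := make(map[string]types.Resource)
	for {
		select {
		case ev := <-w.Events():
			switch ev.Type {
			case types.OpPut:
				nodes[ev.Resource.GetName()] = ev.Resource
			case types.OpDelete:
				// Relative expiry never emits this for the nodes it removes,
				// so entries it expired stay in this map indefinitely.
				delete(nodes, ev.Resource.GetName())
			}
		case <-w.Done():
			return trace.Wrap(w.Error())
		}
	}
}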