solo-io / gloo

The Feature-rich, Kubernetes-native, Next-Generation API Gateway Built on Envoy
https://docs.solo.io/
Apache License 2.0

gateway-proxy memory increased when making any change on any configuration #8900

Open edubonifs opened 9 months ago

edubonifs commented 9 months ago

Gloo Edge Product

Enterprise

Gloo Edge Version

1.14.0

Kubernetes Version

1.25

Describe the bug

We have a customer who is seeing their gateway-proxy memory increase whenever they make any change to any configuration, for example modifying a VirtualService. When they make a modification, memory increases and is not released, going for example from 8GB to 11GB after a single change. They see spikes of up to 10.8 GB on envoy_server_memory_physical_size, 11.5 GB on envoy_server_memory_heap_size, and 10.2 GB at the pod level.

(Screenshots attached: gateway-proxy pod metrics and Envoy memory metrics.)

I have captured one of their heap profiles before making the modification (envoy.heap) and after the modification (envoy-after.heap), both in the attached heaps.zip file:

heaps.zip

It is worth noting that they are using WAF heavily in most of their VirtualServices; a fragment of the waf options follows, and a sketch of where this block sits in a VirtualService appears after it:

      waf:
        auditLogging:
          action: RELEVANT_ONLY
          location: FILTER_STATE
        coreRuleSet:
          customSettingsString: |
            SecAction "id:900200,phase:1,nolog,pass,t:none,setvar:'tx.allowed_methods=GET HEAD POST PUT PATCH DELETE OPTIONS'"
            SecAction "id:900230,phase:1,nolog,pass,t:none,setvar:'tx.allowed_http_versions=HTTP/2 HTTP/2.0 HTTP/1.1'"
            SecAction "id:900280,phase:1,nolog,pass,t:none,setvar:'tx.allowed_request_content_type_charset=utf-8'"
            SecAction "id:900350,phase:1,nolog,pass,t:none,setvar:'tx.combined_file_sizes=20971520'"
            SecAction "id:900990,phase:1,nolog,pass,t:none,setvar:'tx.crs_setup_version=320'"
            SecAuditLogFormat JSON
            SecAuditLogRelevantStatus "^(?:5|4(?!04))"
            SecDefaultAction "phase:1,log,auditlog,deny,status:403"
            SecDefaultAction "phase:2,log,auditlog,deny,status:403"
            SecRequestBodyAccess On
            SecRuleEngine On
            SecRuleUpdateTargetById 921130 "!REQUEST_COOKIES"
        customInterventionMessage: ModSecurity Intervention! APIGateway WAF has detected
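For context, here is a minimal sketch of where such a waf block lives in a Gloo Edge VirtualService. The VirtualService name, domain, and upstream are illustrative placeholders, not taken from the customer's configuration; the point is that the waf options (and therefore the core rule set settings) are carried per virtual host (or per route), so every VirtualService repeats them:

# Illustrative sketch only; names below are placeholders
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: example-vs
  namespace: gloo-unstable-fbk
spec:
  virtualHost:
    domains:
      - "example.com"
    options:
      waf:
        auditLogging:
          action: RELEVANT_ONLY
          location: FILTER_STATE
        coreRuleSet:
          customSettingsString: |
            SecRuleEngine On
            # ...remaining SecActions/SecDefaultActions as in the fragment above
    routes:
      - matchers:
          - prefix: /
        routeAction:
          single:
            upstream:
              name: example-upstream
              namespace: gloo-unstable-fbk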

And they have validation disabled at the gateway-proxy level; you can check settings.yaml below (a note on how validation is typically disabled via Helm follows the settings):

apiVersion: gloo.solo.io/v1
kind: Settings
metadata:
  labels:
    app: gloo
    gloo: settings
  name: default
  namespace: gloo-unstable-fbk
spec:
  gloo:
    xdsBindAddr: "0.0.0.0:9977"
    restXdsBindAddr: "0.0.0.0:9976"
    proxyDebugBindAddr: "0.0.0.0:9966"
    enableRestEds: false
    invalidConfigPolicy:
      invalidRouteResponseBody: Gloo Gateway has invalid configuration. Please contact Crealogix
        support for troubleshooting assistance.
      invalidRouteResponseCode: 500
      replaceInvalidRoutes: true
    disableKubernetesDestinations: true
    disableProxyGarbageCollection: false
  discoveryNamespace: gloo-unstable-fbk
  kubernetesArtifactSource: {}
  kubernetesConfigSource: {}
  kubernetesSecretSource: {}
  refreshRate: 60s

  gateway:
    isolateVirtualHostsBySslConfig: false
    readGatewaysFromAllNamespaces: false
    persistProxySpec: true
    compressedProxySpec: true
    enableGatewayController: true
  discovery:
    fdsMode: WHITELIST

  extauth:
    transportApiVersion: V3
    extauthzServerRef:
      # arbitrarily default to the standalone deployment name even if we're using both
      name: extauth
      namespace: gloo-unstable-fbk
    requestTimeout: "10s"
    userIdHeader: "x-user-id"
  ratelimitServer:
    rateLimitBeforeAuth: false
    ratelimitServerRef:
      name: rate-limit
      namespace: gloo-unstable-fbk  
  consoleOptions:
    readOnly: false
    apiExplorerEnabled: true
  graphqlOptions:
    schemaChangeValidationOptions:
      rejectBreakingChanges: false
      processingRules: []
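As an aside, the settings above do not include a validation stanza. In Gloo Edge, resource validation is usually relaxed or turned off through Helm values roughly like the sketch below (illustrative only; not taken from the customer's values, and exact keys may differ between chart versions):

gloo-ee:
  gloo:
    gateway:
      validation:
        # either remove the validating webhook entirely...
        enabled: false
        # ...or keep it but always accept resources even if validation reports errors
        alwaysAcceptResources: true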

When they make any kind of change, the WAF rules are logged in the gateway-proxy:

[2023-11-13 16:14:59.802][7][debug][filter] [source/extensions/filters/http/modsecurity/config.cc:75] loading directory /modsecurity/rules/crs:
[2023-11-13 16:14:59.803][7][debug][filter] [source/extensions/filters/http/modsecurity/config.cc:84] -> loaded file /modsecurity/rules/crs/REQUEST-903.9001-DRUPAL-EXCLUSION-RULES.conf
[2023-11-13 16:14:59.804][7][debug][filter] [source/extensions/filters/http/modsecurity/config.cc:84] -> loaded file /modsecurity/rules/crs/REQUEST-901-INITIALIZATION.conf

[2023-11-13 16:14:59.859][7][debug][filter] [source/extensions/filters/http/modsecurity/config.cc:93] loaded string rule:
SecRuleEngine On
SecAuditLogFormat JSON
SecAuditLogRelevantStatus "^(?:5|4(?!04))"
SecRequestBodyAccess On
SecDefaultAction "phase:1,log,auditlog,deny,status:403"
SecDefaultAction "phase:2,log,auditlog,deny,status:403"
SecAction "id:900200,phase:1,nolog,pass,t:none,setvar:'tx.allowed_methods=GET HEAD POST PUT PATCH DELETE OPTIONS'"
SecAction "id:900230,phase:1,nolog,pass,t:none,setvar:'tx.allowed_http_versions=HTTP/2 HTTP/2.0 HTTP/1.1'"
SecAction "id:900280,phase:1,nolog,pass,t:none,setvar:'tx.allowed_request_content_type_charset=utf-8
SecAction "id:900350,phase:1,nolog,pass,t:none,setvar:'tx.combined_file_sizes=20971520'"
SecAction "id:900990,phase:1,nolog,pass,t:none,setvar:'tx.crs_setup_version=320'"
SecRuleUpdateTargetById 921130 "!REQUEST_COOKIES"

These are the only errors they see, but I would say the memory issue is more related to ModSecurity:

{"level":"error","ts":"2023-11-13T16:04:08.454Z","logger":"gloo-ee.v1.event_loop.setup.v1.event_loop.syncer.kubernetes_eds","caller":"kubernetes/eds.go:208","msg":"upstream gloo: port 8080 not found for service upstream-svc","version":"1.14.0","stacktrace":"github.com/solo-io/gloo/projects/gloo/pkg/plugins/kubernetes.(*edsWatcher).List\n\t/go/pkg/mod/github.com/solo-io/gloo@v1.14.0/projects/gloo/pkg/plugins/kubernetes/eds.go:208\ngithub.com/solo-io/gloo/projects/gloo/pkg/plugins/kubernetes.(*edsWatcher).watch.func1\n\t/go/pkg/mod/github.com/solo-io/gloo@v1.14.0/projects/gloo/pkg/plugins/kubernetes/eds.go:229\ngithub.com/solo-io/gloo/projects/gloo/pkg/plugins/kubernetes.(*edsWatcher).watch.func2\n\t/go/pkg/mod/github.com/solo-io/gloo@v1.14.0/projects/gloo/pkg/plugins/kubernetes/eds.go:256"}

{"level":"error","ts":1699885178.0106442,"logger":"rate-limiter","msg":"finished unary call with code Unknown","version":"undefined","grpc.start_time":"2023-11-13T14:19:38Z","grpc.request.deadline":"2023-11-13T14:19:38Z","system":"grpc","span.kind":"server","grpc.service":"envoy.service.ratelimit.v3.RateLimitService","grpc.method":"ShouldRateLimit","error":"Could not execute Redis pipeline: context canceled","errorVerbose":"Could not execute Redis pipeline\n\tgrpc.(*Server).handleStream:/go/pkg/mod/google.golang.org/grpc@v1.52.0/server.go:1704\n\tgrpc.(*Server).processUnaryRPC:/go/pkg/mod/google.golang.org/grpc@v1.52.0/server.go:1336\n\tv3._RateLimitService_ShouldRateLimit_Handler:/go/pkg/mod/github.com/envoyproxy/go-control-plane@v0.11.0/envoy/service/ratelimit/v3/rls.pb.go:996\n\tgo-grpc-middleware.ChainUnaryServer.func1:/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:34\n\tgo-grpc-middleware.ChainUnaryServer.func1.1.1:/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\n\tzap.UnaryServerInterceptor.func1:/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/logging/zap/server_interceptors.go:31\n\tgo-grpc-middleware.ChainUnaryServer.func1.1.1:/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\n\thealthchecker.GrpcUnaryServerHealthCheckerInterceptor.func1:/go/pkg/mod/github.com/solo-io/go-utils@v0.24.0/healthchecker/grpc.go:69\n\tv3._RateLimitService_ShouldRateLimit_Handler.func1:/go/pkg/mod/github.com/envoyproxy/go-control-plane@v0.11.0/envoy/service/ratelimit/v3/rls.pb.go:994\n\tservice.(*service).ShouldRateLimit:/go/pkg/mod/github.com/solo-io/rate-limiter@v0.8.0/pkg/service/ratelimit.go:223\n\tservice.(*service).shouldRateLimitWorker:/go/pkg/mod/github.com/solo-io/rate-limiter@v0.8.0/pkg/service/ratelimit.go:171\n\tredis.(*rateLimitCacheImpl).DoLimit:/go/pkg/mod/github.com/solo-io/rate-limiter@v0.8.0/pkg/cache/redis/cache_impl.go:149\n\tredis.pipelineFetch:/go/pkg/mod/github.com/solo-io/rate-limiter@v0.8.0/pkg/cache/redis/cache_impl.go:48\n\tredis.(*connectionImpl).Pop:/go/pkg/mod/github.com/solo-io/rate-limiter@v0.8.0/pkg/cache/redis/driver_impl.go:146\ncontext canceled","grpc.code":"Unknown","grpc.time_ms":0.445,"stacktrace":"github.com/grpc-ecosystem/go-grpc-middleware/logging/zap.DefaultMessageProducer\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/logging/zap/options.go:212\ngithub.com/grpc-ecosystem/go-grpc-middleware/logging/zap.UnaryServerInterceptor.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/logging/zap/server_interceptors.go:39\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:34\ngithub.com/envoyproxy/go-control-plane/envoy/service/ratelimit/v3._RateLimitService_ShouldRateLimit_Handler\n\t/go/pkg/mod/github.com/envoyproxy/go-control-plane@v0.11.0/envoy/service/ratelimit/v3/rls.pb.go:996\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.52.0/server.go:1336\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.52.0/server.go:1704\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.52.0/server.go:965"}
{"level":"dpanic","ts":1699885303.8779607,"logger":"rate-limiter","msg":"Could not execute Redis pipeline: context canceled","version":"undefined","stacktrace":"github.com/solo-io/rate-limiter/pkg/cache/redis.(*connectionImpl).Pop\n\t/go/pkg/mod/github.com/solo-io/rate-limiter@v0.8.0/pkg/cache/redis/driver_impl.go:145\ngithub.com/solo-io/rate-limiter/pkg/cache/redis.pipelineFetch\n\t/go/pkg/mod/github.com/solo-io/rate-limiter@v0.8.0/pkg/cache/redis/cache_impl.go:48\ngithub.com/solo-io/rate-limiter/pkg/cache/redis.(*rateLimitCacheImpl).DoLimit\n\t/go/pkg/mod/github.com/solo-io/rate-limiter@v0.8.0/pkg/cache/redis/cache_impl.go:149\ngithub.com/solo-io/rate-limiter/pkg/service.(*service).shouldRateLimitWorker\n\t/go/pkg/mod/github.com/solo-io/rate-limiter@v0.8.0/pkg/service/ratelimit.go:171\ngithub.com/solo-io/rate-limiter/pkg/service.(*service).ShouldRateLimit\n\t/go/pkg/mod/github.com/solo-io/rate-limiter@v0.8.0/pkg/service/ratelimit.go:223\ngithub.com/envoyproxy/go-control-plane/envoy/service/ratelimit/v3._RateLimitService_ShouldRateLimit_Handler.func1\n\t/go/pkg/mod/github.com/envoyproxy/go-control-plane@v0.11.0/envoy/service/ratelimit/v3/rls.pb.go:994\ngithub.com/solo-io/go-utils/healthchecker.GrpcUnaryServerHealthCheckerInterceptor.func1\n\t/go/pkg/mod/github.com/solo-io/go-utils@v0.24.0/healthchecker/grpc.go:69\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\ngithub.com/grpc-ecosystem/go-grpc-middleware/logging/zap.UnaryServerInterceptor.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/logging/zap/server_interceptors.go:31\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1\n\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:34\ngithub.com/envoyproxy/go-control-plane/envoy/service/ratelimit/v3._RateLimitService_ShouldRateLimit_Handler\n\t/go/pkg/mod/github.com/envoyproxy/go-control-plane@v0.11.0/envoy/service/ratelimit/v3/rls.pb.go:996\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/grpc@v1.52.0/server.go:1336\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/grpc@v1.52.0/server.go:1704\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/go/pkg/mod/google.golang.org/grpc@v1.52.0/server.go:965"}

They are running GE 1.14.0. My guess is that every time a VirtualService is modified, the WAF rules from every VirtualService are loaded again in the gateway-proxy. I am also asking myself how this can happen at all, since memory grows whenever they change anything, even though validation is disabled.

They also have discovery disabled:

gloo-ee:
  gloo:
    discovery:
      enabled: false

And

disableKubernetesDestinations: true

Expected Behavior

Making config changes shouldn't cause this kind of memory growth; perhaps the WAF rules are being reloaded every time the config changes.

Additional Environment Detail

No response

Additional Context

No response

github-actions[bot] commented 2 months ago

This issue has been marked as stale because of no activity in the last 180 days. It will be closed in the next 180 days unless it is tagged "no stalebot" or other activity occurs.