jcchavezs / coraza-http-wasm-traefik

High memory usage when using coraza plugin #9

Open mwantia opened 5 months ago

mwantia commented 5 months ago

I am currently trying to integrate the Coraza plugin into Traefik, which sits behind a Cloudflare tunnel for external access.

As soon as I activate the middleware for the services, Traefik starts using a lot of memory. I increased Traefik's memory limit to 4 GB, which was consumed immediately after navigating to a page twice. After the third time, Traefik fails with an OOM error and restarts.

I can't imagine that this kind of memory usage is expected. There also seems to be an ongoing discussion about the same topic here, where Traefik even seems to consume about 32 GB of memory.

I removed most of the other configuration, since it shouldn't be relevant, but this is the config Traefik is running with:

# /secrets/traefik.yaml
experimental:
  plugins:
    coraza:
      moduleName: github.com/jcchavezs/coraza-http-wasm-traefik
      version: v0.2.1
providers:
  file:
    directory: /local/config
# /local/config/waf.yaml
http:
  middlewares:
    waf:
      plugin:
        coraza:
          directives:
            - SecRuleEngine On
            - SecDebugLog /dev/stdout
#           - Include @owasp_crs/**.conf
#           - Include /directives/*.conf

I intentionally commented out the other two Include directives, but even with such a bare-bones setup I get an OOM after a handful of requests.

mwantia commented 5 months ago

Some updates:

I tinkered with my setup, mostly adjusting the LogLevel, AccessLog and directives, and it now seems to run smoothly at between 2.5 and 2.8 GB. I'm currently unsure why it behaves like this, so I will have to experiment with my settings again later to see if there are any noticeable changes.

I also set the LogLevel to debug, since none of the other Coraza options seem to change anything, and noticed that the following output is repeated every few requests.

2024-05-31T08:26:19Z DBG github.com/traefik/traefik/v3/pkg/logs/wasm.go:31 > Initializing WAF with directives:
SecRuleEngine On
SecDebugLog /dev/stdout
SecDebugLogLevel 2
Include @crs-setup.conf.example
SecRule REQUEST_URI "@streq /xyz" "id:101,phase:1,log,deny,status:403"

Isn't this part of the main function and only used during initialization? I'm not that knowledgeable in Go programming, and even less so when it comes to Traefik plugins, but I would expect to see this log only at startup, not during every request.

Additionally, this is the configuration Traefik is currently running with. I will try to see if there are any noticeable changes or spikes in usage over the weekend.

experimental:
  plugins:
    traefik-real-ip:
      modulename: github.com/soulbalz/traefik-real-ip
      version: v1.0.3
    geoblock:
      moduleName: github.com/PascalMinder/GeoBlock
      version: v0.2.2
    coraza:
      moduleName: github.com/jcchavezs/coraza-http-wasm-traefik
      version: v0.2.1

entrypoints:
  websecure:
    address: ':443'
    forwardedHeaders:
      insecure: true
    http:
      tls: true
      middlewares:
        - 'realip@file'
        - 'geoblock-de@file'
        - 'waf@file'

global:
  sendAnonymousUsage: false
  checkNewVersion: false

api:
  dashboard: true
  insecure: true

metrics:
  prometheus:
    addRoutersLabels: true
    addServicesLabels: true

ping: {}
log:
  level: DEBUG
accessLog: {}

providers:
  file:
    directory: /local/config
  consulcatalog:
    endpoint:
      address: 'consul.service.consul:8501'
      scheme: https
      token: '${CONSUL_TOKEN}'
      tls:
        insecureSkipVerify: false
    connectAware: true
    connectByDefault: true
    exposedByDefault: false
    defaultRule: 'Host(`{{ .Name }}.${DOMAIN}`)'
    constraints: 'TagRegex(`cloudflare.enable=true`)'
http:
 middlewares:
   waf:
     plugin:
       coraza:
         directives:
           - SecRuleEngine On
           - SecDebugLog /dev/stdout
           - SecDebugLogLevel 9
           - Include @crs-setup.conf.example
           - SecRule REQUEST_URI "@streq /xyz" "id:101,phase:1,log,deny,status:403" # Testing
#          - Include @owasp_crs/**.conf
           - Include /directives/*.conf

attrib commented 5 months ago

Can confirm this issue.

With

    waf:
      plugin:
        coraza:
          directives:
            - SecRuleEngine On
            - SecDebugLog /dev/stdout
            - SecDebugLogLevel 9
            - SecRule REQUEST_URI "@streq /wp-admin" "id:101,phase:1,log,deny,status:403"   

the Traefik process uses a bit under 2 GB; without it, 100 MB.

elkinaguas commented 5 months ago

Hello, I noticed a similar behavior using Traefik in binary mode and the Coraza middleware. Here is my experience in case it can be useful to someone.

The system's RAM usage is 2.4G before starting Traefik with Coraza and 2.7G after starting it.

First test: I set up a Python server behind Traefik and, using another Python script, sent 100 requests (all of which reach the Python server) to the Traefik entrypoint, with a 100 ms sleep between requests. After this test the RAM usage had increased to 4.8G. What is interesting, in my opinion, is that the RAM usage doesn't go back down to 2.7G: I waited 10 minutes without sending any traffic and it only came down to 3.9G. I ran another test with 200 requests instead of 100, and this didn't seem to affect the RAM usage; it went up to 4.8G again.

Second test: I changed my Python script to send 100 requests to 5 different URLs (1 URL that reaches the Python server and 4 URLs that are filtered out by the Coraza middleware) one after the other, which will make a total of 500 requests, with a 100ms sleep time between requests. After running the script three times this is what I got:

First time: RAM 6.0G
Second time: RAM 7.3G
Third time: RAM 8.3G

After waiting 10 minutes without sending traffic the RAM came down to 6.0G.

I ran the same tests without the Coraza middleware and the RAM usage didn't even budge: it stayed at 2.4G before starting Traefik, after starting Traefik, and during the traffic tests.
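For anyone who wants to reproduce this kind of test, here is a minimal Python sketch of a sequential request loop. It is not the original script. The function names (`fetch_status`, `run_load_test`) are made up for illustration, and it assumes "one after the other" means all requests to the first URL are sent before moving on to the next.

```python
import time
import urllib.error
import urllib.request
from collections import Counter
from typing import Callable, Iterable


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request to `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Responses such as a 403 from the WAF raise HTTPError; count them too.
        return err.code


def run_load_test(
    urls: Iterable[str],
    requests_per_url: int = 100,
    delay_s: float = 0.1,
    fetch: Callable[[str], int] = fetch_status,
) -> Counter:
    """Send `requests_per_url` sequential requests to each URL in turn.

    Returns a Counter mapping status code -> number of responses.
    """
    statuses: Counter = Counter()
    for url in urls:
        for _ in range(requests_per_url):
            statuses[fetch(url)] += 1
            time.sleep(delay_s)  # 100 ms pause between requests by default
    return statuses
```

Calling `run_load_test([...])` with one URL that reaches the backend and four URLs that the WAF rules deny would correspond to the second test above. The URLs depend entirely on your own entrypoint and rules. Memory usage has to be observed separately, for example with `top` or `docker stats`.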

Here is my config:

waf:
  plugin:
    coraza:
      directives:
        - SecRuleEngine On
        - SecDebugLog /dev/stdout
        - SecDebugLogLevel 9
        - Include @crs-setup.conf.example
        - Include @owasp_crs/**.conf

jcchavezs commented 5 months ago

Hi everyone, thanks for coming by this repository.

The problem seems to be very similar to what we experienced in https://github.com/corazawaf/coraza-proxy-wasm/issues/249. Although the code is different, what the two have in common is the GC, and that could be the issue.

One way to narrow this down is, first, to compare requests with and without a payload and, second, to set SecRequestBodyAccess Off in the directives before Include @owasp_crs/**.conf, because the main hunch here is that the space we allocate for request bodies is the source of the problem.
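As a rough illustration of the first suggestion, the following Python sketch builds two otherwise identical requests, one without a body and one with a body. The helper name `build_probe_requests`, the POST method and the content type are assumptions made for this example and are not part of the plugin or of Coraza.

```python
import urllib.request
from typing import Dict


def build_probe_requests(url: str, payload: bytes) -> Dict[str, urllib.request.Request]:
    """Return two requests to the same URL: one without and one with a body."""
    return {
        # No payload: request body handling should not come into play.
        "no_body": urllib.request.Request(url, method="GET"),
        # With payload: exercises request body buffering when
        # SecRequestBodyAccess is On.
        "with_body": urllib.request.Request(
            url,
            data=payload,
            method="POST",
            headers={"Content-Type": "application/octet-stream"},
        ),
    }
```

Sending each variant in a loop with `urllib.request.urlopen` while watching Traefik's memory usage would show whether growth is tied to requests that carry a body.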

jcchavezs commented 5 months ago

In the meantime, I released https://github.com/jcchavezs/coraza-http-wasm-traefik/releases/tag/v0.2.2 which attempts to introduce minor performance improvements. It would be amazing if any of you could test it.

markuskirch commented 5 months ago

We're currently facing the same issue while testing the coraza Traefik plugin.

Problem Description

The Traefik Coraza Plugin leads to very high memory usage on our servers.

The memory used by the Traefik container grows with the container lifetime until the server is out of memory (16GB), and docker restarts the container. This currently happens roughly every hour.

Configuration

Traefik v3.0
coraza-http-wasm-traefik v0.2.2

We run the following directives:

- SecRuleEngine On
- SecDebugLog /dev/stdout
- SecDebugLogLevel 3

- SecRequestBodyAccess On
- SecResponseBodyAccess Off

# set default error handling
- SecDefaultAction "phase:1,log,auditlog,deny,status:403"
- SecDefaultAction "phase:2,log,auditlog,deny,status:403"

# whitelist a trusted server used for end-to-end testing
- SecRule REMOTE_ADDR "@ipMatch 100.100.100.100" "id:1237,phase:1,allow"

# block access to specific paths
- SecRule REQUEST_URI "@rx \/web\/database\/.*" "id:1239,phase:1,log,deny,status:403,msg:'Access Denied'"

# Limit the size of the request body
- SecRequestBodyLimit 5242880 #5M
- SecRequestBodyNoFilesLimit 1048576 #1M
- SecRequestBodyInMemoryLimit 1048576 #1M

# Block SQL injection and XSS attacks
- SecRule ARGS "@detectSQLi" "id:1234,phase:2,log,deny,status:403,msg:'SQL Injection Detected'"
- SecRule ARGS "@detectXSS" "id:1235,phase:2,log,deny,status:403,msg:'XSS Attack Detected'"

# Block upload of files with dangerous extensions
- SecRule FILES_TMPNAMES "@rx \.(exe|bat|cmd|sh|php|pl|py)$" "id:1236,phase:2,log,deny,status:403,msg:'File Type Denied'"

Troubleshooting

We tried the following measures:

None of the measures above led to Traefik leveling off below 16 GB of memory usage, although disabling request body access and all phase 2 rules made the container's memory grow less quickly (about 1 hour between container restarts, compared to about 30 minutes with request body access).

/proc/meminfo indicates that lots of memory is reserved but inactive. We're wondering if there's a connection between the max body size and the reserved memory.
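To put a number on the /proc/meminfo observation, a small Python sketch like the one below can parse the file and report how much of total memory the kernel lists as inactive. It assumes the "inactive" memory mentioned above corresponds to the `Inactive` field. The function names are hypothetical.

```python
from pathlib import Path
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: numeric value}.

    Most values are in kB; a few fields (e.g. HugePages_Total) are plain counts.
    """
    fields: Dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or not parts:
            continue  # skip blank or malformed lines
        fields[name.strip()] = int(parts[0])
    return fields


def inactive_share(fields: Dict[str, int]) -> float:
    """Fraction of MemTotal that is reported as Inactive."""
    return fields["Inactive"] / fields["MemTotal"]


def read_meminfo(path: str = "/proc/meminfo") -> Dict[str, int]:
    """Read and parse the meminfo file (Linux only)."""
    return parse_meminfo(Path(path).read_text())
```

Logging `inactive_share(read_meminfo())` periodically, once with and once without request body access, would show whether that share tracks the body size limits.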

Any thoughts on the issue?

Thanks for your dedication to the project!

david-garcia-garcia commented 4 months ago

Same issue here: memory usage grows even with minimal usage until the pod is killed.

v0.2.2 has the same issue, tested.

Mike-the-one commented 3 months ago

Plan to use this plugin... is the memory leak still an issue?

markuskirch commented 3 months ago

Yes, unfortunately the problem hasn't been found or fixed.

There is an interesting idea for a workaround here: https://github.com/traefik/yaegi/issues/1590#issuecomment-2270703913

Mike-the-one commented 3 months ago

Thanks, this works:

https://github.com/madebymode/traefik-modsecurity-plugin?tab=readme-ov-file

jcchavezs commented 1 month ago

This PR is up: https://github.com/http-wasm/http-wasm-host-go/pull/86. Hopefully it will help here.

sourabh-agrawal commented 1 month ago

> This PR is up http-wasm/http-wasm-host-go#86 and hopefully it will help in here.

Hey @jcchavezs how can I get the fix running? Do you have a timeline for releasing the new version (0.2.3) of the coraza waf plugin with this fix?

ravenolf commented 2 weeks ago

We tested the newest version v0.3.0 on a small infrastructure. Typically, Traefik uses less than 100 MB of RAM in this setup. Once we enabled the plugin and configured the CRS and OWASP rules, it seemed to exhibit the same behaviour with significantly higher memory usage, going up to 2.5 GB with only a few HTTP requests reaching the reverse proxy. I assume this means the memory issue still persists?

Adding some details for reference:

lva-itscope commented 2 weeks ago

I am not quite sure if it is related, but when I tested v3.0.0 with Traefik v3.2.0, the Coraza plugin significantly increased response time and CPU usage. Without the plugin, a normal request takes about 10 ms (95th percentile). With the plugin, requests take about 1000 ms (95th percentile). CPU usage is around 10% without the plugin, but with the plugin it spikes to around 70% while requests are being processed. (I tested on a virtual machine with 12 cores.)

Memory usage is also increasing, which is why I think it might all be related in some way. Interestingly, when requesting the same URL several times in a row, the response time decreases (is there some sort of caching?), and when trying again a few minutes later it is back to 1000 ms.
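For comparing latency with and without the plugin, a 95th percentile can be computed from timed requests with a short Python sketch like this. It is only one possible way to measure. The helper names are made up, and the "inclusive" quantile method is an arbitrary choice that may differ slightly from what a monitoring stack reports.

```python
import statistics
import time
from typing import Callable, List


def p95(samples_ms: List[float]) -> float:
    """95th percentile of the samples (requires at least two data points)."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(samples_ms, n=100, method="inclusive")[94]


def time_calls(call: Callable[[], object], count: int) -> List[float]:
    """Run `call` `count` times and return each duration in milliseconds."""
    durations: List[float] = []
    for _ in range(count):
        start = time.perf_counter()
        call()
        durations.append((time.perf_counter() - start) * 1000.0)
    return durations
```

Passing a function that performs one HTTP request to `time_calls`, once with the middleware enabled and once without, and then comparing `p95` of the two runs gives figures comparable to the ones quoted above. Repeating the run after a few idle minutes would also show the warm-up effect.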

Details:

      plugin:
        coraza-waf:
          directives:
          # - SecDebugLog /dev/stdout
          # - SecDebugLogLevel 9
          - SecRule REQUEST_URI "@streq /admin" "id:101,phase:1,log,deny,status:403" 
          #  Allow some additional HTTP methods:
          # - SecAction "id:900200,phase:1,pass,t:none,nolog,setvar:'tx.allowed_methods=GET HEAD POST OPTIONS PUT PATCH DELETE CHECKOUT COPY LOCK MERGE MKACTIVITY MKCOL MOVE PROPFIND PROPPATCH UNLOCK REPORT'"
          # Allow some additional request content-types:
          - SecAction "id:900220,phase:1,pass,t:none,nolog,setvar:'tx.allowed_request_content_type=|application/x-www-form-urlencoded| |multipart/form-data| |multipart/related| |text/xml| |application/xml| |application/soap+xml| |application/json| |application/cloudevents+json| |application/cloudevents-batch+json| |text/plain| |application/proto|'"
          - SecRequestBodyAccess Off #Fix according to https://github.com/jcchavezs/coraza-http-wasm-traefik/issues/9#issuecomment-2146919384
          - Include @coraza.conf-recommended
          - Include @crs-setup.conf.example
          - Include @owasp_crs/**.conf
          - SecRuleEngine On