[FEATURE] Allow users to specify plugin behavior on cache refreshing failure

darkweaver87 commented 7 months ago

Is your feature request related to a problem? Please describe. 🐛

We implemented a PoC with your wonderful plugin and would like to put it in production, but one issue remains when using stream mode (switching to another mode doesn't change anything).

A free CrowdSec deployment relies on agents sending their decisions to a local API (LAPI). By design, this LAPI can't be scaled, because agents could then try to send their data to an LAPI instance they are not registered with.

Consequently, we can technically "lose" the LAPI for some amount of time, and it may be unavailable during a cache refresh. When that happens, Traefik returns a 403.

Even though I tend to agree that blocking when there is doubt about a service is good security practice, it's not really ideal here. In my case, I need users to keep access to the service during such a failure.

Describe the solution you'd like ✨

Thus, I was thinking about either:

allow users to specify the behavior they want when the refresh fails
if the refresh fails, keep the last known state for a given grace period

I will be happy to contribute, just let me know your thoughts on this :-)

Additional context

crowdsec version: v1.6.0
crowdsec plugin version: v1.2.1
traefik version: v2.11.2

mathieuHa commented 7 months ago

Hi,

Thanks for your interest in the plugin. @maxlerebourg and I are discussing the issue you encountered.

bouncer.go, line 280:

    // Right here, if we cannot join the stream, we forbid the request from going on.
    if bouncer.crowdsecMode == configuration.StreamMode || bouncer.crowdsecMode == configuration.AloneMode {
        if isCrowdsecStreamHealthy {
            handleNextServeHTTP(bouncer, remoteIP, rw, req)
        } else {
            bouncer.log.Debug(fmt.Sprintf("ServeHTTP isCrowdsecStreamHealthy:false ip:%s", remoteIP))
            handleBanServeHTTP(bouncer, rw)
        }
    }

I'm thinking about an internal counter that allows the stream to be unhealthy X times in a row before requests start getting a 403.
The updateInterval multiplied by that limit would then give the grace period.

With some default values, for example:
streamUnhealthyMaxTime=3
UpdateIntervalSeconds=60

So instead of blocking after 1 min if the LAPI is unreachable, requests would be blocked after 3 min.
A successful sync with the LAPI would reset that counter.

darkweaver87 commented 6 months ago

Hello,

Thank you for your feedback :-) Looks good to me :-)

Thanks :+1:

Rémi

mathieuHa commented 6 months ago

Hi,

We're almost done implementing it. I tested the basic behavior yesterday:

never block if UpdateMaxFailure=-1
block after first fail if UpdateMaxFailure=0 (default)
block after 10 failed attempts if UpdateMaxFailure=10
unblock on a successful attempt and reset the counter
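
For readers who want the tested cases in one place, the decision can be written as a small pure function. This is a sketch, not the merged implementation: `shouldBlock` is a hypothetical name, and the exact boundary for positive values is an assumption. The comparison is strict here so that 0 blocks on the first failure; under that reading, a value of 10 tolerates 10 failures and blocks on the next one.

```go
package main

import "fmt"

// shouldBlock decides whether requests are blocked, given the number of
// consecutive failed refreshes and the UpdateMaxFailure setting.
// Hypothetical helper; the boundary for positive values is an assumption.
func shouldBlock(failures, maxFailure int) bool {
	if maxFailure == -1 {
		return false // -1: never block, whatever the failure count
	}
	// Strict comparison: 0 blocks on the first failure,
	// N tolerates N failures and blocks on the next one.
	return failures > maxFailure
}

func main() {
	cases := []struct{ failures, maxFailure int }{
		{100, -1}, // never block
		{1, 0},    // default: block after the first failure
		{10, 10},  // still tolerated under the assumption above
		{11, 10},  // blocked
		{0, 10},   // counter reset after a successful attempt
	}
	for _, c := range cases {
		fmt.Printf("failures=%d UpdateMaxFailure=%d block=%v\n",
			c.failures, c.maxFailure, shouldBlock(c.failures, c.maxFailure))
	}
}
```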

We should merge and release a beta version very soon.

maxlerebourg / crowdsec-bouncer-traefik-plugin

[FEATURE] Allow users to specify plugin behavior on cache refreshing failure #152