ory / hydra


Extend janitor's cleanup procedures #2514

Closed flavioleggio closed 1 year ago

flavioleggio commented 3 years ago

Problem description

The new hydra janitor command only takes the token and request tables into account. For the former, cleanup is as simple as considering that old tokens (i.e. those whose TTL has expired) are of no use. For the latter, cleanup is far more complicated, as it involves considering that expired records can still be of some use (for example, remembering a subject's last scope grants to a certain client so that the user is not asked again). This scenario results in the request tables growing indefinitely, as request records are not purged unless those requests are unhandled or have been rejected (which is the minority of records).

Moreover, related tables like hydra_oauth2_code and hydra_oauth2_oidc grow at their own rates, as they are not explicitly purged by the janitor. However, those tables would also benefit from cleaning up the request tables more effectively - still taking care not to break any flows - because of the ON DELETE CASCADE clause on their foreign keys. For example, deleting a row in hydra_oauth2_consent_request would delete the related row in hydra_oauth2_consent_request_handled, which in turn would delete related rows in hydra_oauth2_{access|code|oidc|pkce|refresh}.
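To make the cascade concrete (a minimal sketch assuming the foreign-key layout described above; the challenge value is made up):

-- Deleting a single consent request...
DELETE FROM hydra_oauth2_consent_request WHERE challenge = 'some-challenge';
-- ...also removes the matching row in hydra_oauth2_consent_request_handled,
-- which in turn removes the related rows in hydra_oauth2_access, hydra_oauth2_code,
-- hydra_oauth2_oidc, hydra_oauth2_pkce and hydra_oauth2_refresh via ON DELETE CASCADE.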

Proposed solution

As far as I understand, the need to keep expired login and consent requests is constrained to flows that ask for the last state of a particular subject (i.e. records from their last login and consent flows, whether their TTLs have expired or not). So if we could query all the records related to previous states, we could purge them, since they are of no use.
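For illustration, the purge candidates could be selected with something like the following (a sketch only, assuming hydra_oauth2_consent_request exposes subject, client_id and requested_at columns as in Hydra v1.x; not a tested janitor query):

-- Consent requests that are NOT the most recent one for their
-- (subject, client_id) pair are candidates for purging.
SELECT cr.challenge
FROM hydra_oauth2_consent_request cr
JOIN (
    SELECT subject, client_id, MAX(requested_at) AS last_requested_at
    FROM hydra_oauth2_consent_request
    GROUP BY subject, client_id
) latest
  ON cr.subject = latest.subject
 AND cr.client_id = latest.client_id
WHERE cr.requested_at < latest.last_requested_at;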

Alternatives

Another way to deal with the need to keep a subject's state is to store it in a dedicated table and purge only the transient data (requests, tokens, and everything that has a predefined lifetime).
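For illustration, such a dedicated table could look roughly like this (a hypothetical sketch, not an existing Hydra table; the name and columns are made up):

CREATE TABLE subject_client_state (
    subject       VARCHAR(255) NOT NULL,
    client_id     VARCHAR(255) NOT NULL,
    granted_scope TEXT,                      -- last scopes granted by this subject to this client
    remember      TINYINT(1) NOT NULL DEFAULT 0,
    updated_at    TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    PRIMARY KEY (subject, client_id)
);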

Additional context

This table is a snapshot of my production database after two weeks of running hydra janitor on both tokens and requests every night, with no keep-if-younger option. Note that the database was completely purged two weeks ago, as it was too large to guarantee a small enough maintenance window for the database migrations needed to update to hydra v1.10.1.

| Table Name | Data Length | Row Count |
|---|---|---|
| hydra_oauth2_access | 5,6G | 686.896 |
| hydra_oauth2_authentication_request | 5,4G | 4.496.378 |
| hydra_oauth2_authentication_request_handled | 2,2G | 3.373.407 |
| hydra_oauth2_code | 15G | 1.966.991 |
| hydra_oauth2_consent_request | 2,3G | 1.223.914 |
| hydra_oauth2_consent_request_handled | 4,4G | 1.756.063 |
| hydra_oauth2_oidc | 13G | 1.525.255 |
| hydra_oauth2_pkce | 1,3G | 2.311 |
| hydra_oauth2_refresh | 4,3G | 516.967 |

This image summarizes the growing execution time, from an empty database up to the situation today. Take into account that the janitor job starts every night/morning at 3:10 AM (UTC+02:00).

[image: janitor execution time per run]

Here we go from a handful of seconds up to almost an hour... and rising!

flavioleggio commented 3 years ago

Related to #1574. Could be examined in conjunction with #2513.

flavioleggio commented 3 years ago

To give some more context on this, I ran ANALYZE on the hydra_oauth2_authentication_request table on two consecutive days (same hour of the day, both before the janitor job) and found out that we get 1.4 million new records every day, while the current cleanup procedure only deletes 240k records. This results in a net increase of 1.2 million records every day, just for that table.
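For reference, this kind of measurement can be reproduced like so (a sketch; ANALYZE TABLE refreshes the statistics, and the row count read from information_schema is an estimate, not an exact count):

ANALYZE TABLE hydra_oauth2_authentication_request;

SELECT TABLE_ROWS
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'hydra_oauth2_authentication_request';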

aeneasr commented 3 years ago

Only keeping the most recent state is probably an option we could enable in the janitor. It would need to be set explicitly, but it's definitely an option and I think it would make tons of sense to have it as part of the system!

roko99 commented 3 years ago

Hello. In our case those tables are growing fast:

- public.hydra_oauth2_refresh
- public.hydra_oauth2_oidc
- public.hydra_oauth2_code
- public.hydra_oauth2_authentication_request
- public.hydra_oauth2_consent_request

And it would be great if the janitor were able to clean them up.

aeneasr commented 3 years ago

@flavioleggio just checking if this issue has been resolved by your recent work on the janitor?

flavioleggio commented 3 years ago

> @flavioleggio just checking if this issue has been resolved by your recent work on the janitor?

Actually no, this would be step two of my original idea. In the current version the janitor just executes the same queries faster, but the set of deleted records has not changed.

I just need to find time to work on this :)

aeneasr commented 3 years ago

I see, thank you for the quick response :)

javiercri commented 3 years ago

Hello @flavioleggio @aeneasr. I was trying to work on this issue, because our prod database is growing really fast and the upgrade to the next version is really slow (it takes more than an hour to complete), so we are trying to clean up all the tables a bit. I was reading the queries and I see that we delete the challenge in hydra_oauth2_consent_request or hydra_oauth2_authentication_request only if the handled record is null or has an error, and requested_at < 'X'. This isn't going to delete anything if the challenge was handled successfully, so since we do not need the old challenges, I think we also need something to clean up the old ones that have no errors or nulls, that is:

    -- %[1]s: the request table (hydra_oauth2_consent_request or hydra_oauth2_authentication_request)
    -- %[2]s: the corresponding *_handled table, %[3]d: the batch limit
    SELECT %[1]s.challenge
    FROM %[1]s
    LEFT JOIN %[2]s ON %[1]s.challenge = %[2]s.challenge
    WHERE (
        (%[2]s.challenge IS NULL)
        OR (%[2]s.error IS NOT NULL AND %[2]s.error <> '{}' AND %[2]s.error <> '')
    )
    AND %[1]s.requested_at < ?
    ORDER BY %[1]s.challenge
    LIMIT %[3]d

i.e., use an OR instead of an AND for the AND %[1]s.requested_at < ? condition. This would also clean up hydra_oauth2_oidc and hydra_oauth2_code, which are also growing a lot in our prod environment. What do you think about this? I could prepare a PR for this if you don't mind.
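In other words, the suggested variant would look roughly like this (a sketch of the proposal above, keeping the original placeholders; not merged code):

    SELECT %[1]s.challenge
    FROM %[1]s
    LEFT JOIN %[2]s ON %[1]s.challenge = %[2]s.challenge
    WHERE (
        (%[2]s.challenge IS NULL)
        OR (%[2]s.error IS NOT NULL AND %[2]s.error <> '{}' AND %[2]s.error <> '')
    )
    OR %[1]s.requested_at < ?    -- OR instead of AND: also selects old, successfully handled requests
    ORDER BY %[1]s.challenge
    LIMIT %[3]d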

aeneasr commented 3 years ago

The old challenge is needed as it might still be linked to an oauth2 token chain, so unfortunately we can't delete those :/

pharman commented 3 years ago

We have a very similar problem with our system. Data size is up to >1TB. Access and refresh tokens are now under control with the latest janitor, but the requests and related tables are growing as described above. I think some options in the janitor to be a little more aggressive in selecting request records may be needed.

aeneasr commented 3 years ago

1TB? Damn! What system are you running? :D @grantzvolsky is refactoring the internals over at https://github.com/ory/hydra/pull/2796 and a big part of that is changing the data structure of consent/login/session. So we are doing an architectural fix rather than trying to come up with the smartest cleanup routines. It will take some time, though, for this to get released.

pharman commented 3 years ago

Ah that is good to know. I guess we will need to run some manual queries to clean up stuff we don't want in the meantime.

abregar commented 2 years ago

Reading through multiple issues on the topic of cleaning up obsolete entities and/or improving the data model, it seems this #2514 is the closest to the general question and is also included in the v2.0.x milestone.

So, while highly anticipating the data model modifications in v2.0.x which will address the easier cleanups - the question is - are there any 'more-or-less proven' direct SQL statements which may be used to delete unwanted entities in the latest v1.x? Proven in the sense of being confirmed by the original contributors - to guarantee that Hydra will not break in case some data is deleted.

Referring, for example, to the previous comment by @pharman mentioning:

> I guess we will need to run some manual queries to clean up stuff we don't want in the meantime

Also, my case is similar to some others - i.e. the accumulation of a large number of tokens with definitely expired TTLs.

Maybe we should put those statements, some more context, and possible instructions (caveats) into the documentation.

flavioleggio commented 2 years ago

I can share my company's solution, which does not break any of Hydra's functionality in our production environment, while keeping the DB size small enough and guaranteeing no performance loss. I have to point out that we don't need auditing on Hydra's data.

1. We run a cleanup on refresh tokens.

We select delete candidates searching for inactive tokens.

SELECT signature FROM hydra_oauth2_refresh WHERE active=0 limit 1000000;

We group the signatures of the delete candidates into small enough blocks (100 records in our case) and delete them in a loop (see the sketch after step 6 below).

DELETE FROM hydra_oauth2_refresh WHERE signature IN ( $signature_block );

2. We run a cleanup on authentication requests.

We select delete candidates searching for old requests (we keep 1 day in our case).

TIME=$(date -u -d '-1 day' +'%Y-%m-%d %H:%M:%S')

SELECT login_session_id FROM HYDRA.hydra_oauth2_authentication_request WHERE requested_at < '${TIME}' limit 2000000;

We delete using session IDs in a loop.

DELETE FROM hydra_oauth2_authentication_session WHERE id IN ( $sessions_block );

3. We run a cleanup on consent requests.

We select delete candidates searching for requests associated with old active refresh tokens or with no associated active refresh tokens. We keep refresh tokens for 15 days, so a user will have no access if the RT is not renewed within 15 days: this equals 15 days of user inactivity, which is enough for our use case. If a consent request has no active RT associated, it is considered old after 1 hour.

SELECT
    cr.challenge
FROM
    hydra_oauth2_consent_request cr
WHERE
    (
        SELECT r.requested_at FROM hydra_oauth2_refresh r
        WHERE cr.challenge = r.challenge_id
        AND r.active = 1
        LIMIT 1
    ) < UTC_TIMESTAMP() - INTERVAL 15 DAY
    OR
    (
        NOT EXISTS
        (
            SELECT NULL FROM hydra_oauth2_refresh r
            WHERE cr.challenge = r.challenge_id
            AND r.active = 1
            LIMIT 1
        )
        AND cr.requested_at < UTC_TIMESTAMP() - INTERVAL 1 HOUR
    )
LIMIT 1000000;

We delete using challenges in a loop.

DELETE FROM hydra_oauth2_consent_request WHERE challenge IN ( $challenge_block );

4. We run a cleanup on access tokens.

We select delete candidates searching for old tokens (we keep 1 hour in our case).

TIME=$(date -u -d '-1 hour' +'%Y-%m-%d %H:%M:%S')

SELECT signature FROM hydra_oauth2_access WHERE requested_at < '${TIME}';

We delete using signatures in a loop.

DELETE FROM hydra_oauth2_access WHERE signature IN ( $signature_block );

5. We run a cleanup on logout requests.

We select delete candidates searching for used requests (in this case it doesn't matter for us if they are recent).

SELECT challenge FROM HYDRA.hydra_oauth2_logout_request where was_used=1 limit 1000000;

We delete using challenges in a loop.

DELETE FROM HYDRA.hydra_oauth2_logout_request where challenge IN ( $logout_requests_block );

6. We run a cleanup on auth codes.

We select delete candidates searching for inactive codes (also here we don't mind if they are recent).

SELECT signature FROM hydra_oauth2_code WHERE active=0 limit 1000000;

We delete using signatures in a loop.

DELETE FROM hydra_oauth2_code WHERE signature IN ( $signature_block );
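As a rough illustration of the pattern used in all the steps above (a simplified bash/MySQL sketch, not our actual script; the database name, credentials and temp-file handling are placeholders to adjust for your environment):

# HYDRA_DB is a placeholder for the schema holding the hydra tables.
HYDRA_DB=hydra

# Step 1 as an example: collect delete candidates (inactive refresh tokens).
mysql -N -B -e "SELECT signature FROM hydra_oauth2_refresh WHERE active=0 LIMIT 1000000;" "$HYDRA_DB" > /tmp/refresh_candidates.txt

# Split the candidates into blocks of 100 signatures and delete block by block.
split -l 100 /tmp/refresh_candidates.txt /tmp/refresh_block_
for block in /tmp/refresh_block_*; do
    # Build a quoted, comma-separated list for the IN (...) clause.
    signature_block=$(sed "s/.*/'&'/" "$block" | paste -sd, -)
    mysql -e "DELETE FROM hydra_oauth2_refresh WHERE signature IN ( $signature_block );" "$HYDRA_DB"
    rm "$block"
done

The same pattern (select candidates, chunk them, delete by primary key) applies to steps 2-6 with their respective SELECT and DELETE statements.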

abregar commented 2 years ago

I can't express my gratitude enough, @flavioleggio, for the extensive post above. This will certainly come in handy for storage size reduction. Thank you.

github-actions[bot] commented 1 year ago

Hello contributors!

I am marking this issue as stale as it has not received any engagement from the community or maintainers for a year. That does not imply that the issue has no merit! If you feel strongly about this issue, please do participate in the conversation!

Throughout its lifetime, Ory has received over 10.000 issues and PRs. To sustain that growth, we need to prioritize and focus on issues that are important to the community. A good indication of importance, and thus priority, is activity on a topic.

Unfortunately, burnout has become a topic of concern amongst open-source projects.

It can lead to severe personal and health issues as well as opening catastrophic attack vectors.

The motivation for this automation is to help prioritize issues in the backlog and not ignore, reject, or belittle anyone.

If this issue was marked as stale erroneously you can exempt it by adding the backlog label, assigning someone, or setting a milestone for it.

Thank you for your understanding and to anyone who participated in the conversation! And as written above, please do participate in the conversation if this topic is important to you!

Thank you 🙏✌️

PeteMac88 commented 8 months ago

@flavioleggio Hey, we are currently running an older version of hydra and facing problems with migrating because the hydra_oauth2_refresh table is growing quite fast in size. I tested your solution of removing entries with active=false and it does not seem to invalidate the newer refresh tokens. Did you encounter any problems with the deletion procedure you explained above?

flavioleggio commented 8 months ago

Hi @PeteMac88,

As you can read in this comment, we did not experience any issues, and I can tell you that we still use this procedure in production. Removing entries with active=false guarantees that you keep the currently used refresh tokens. From my experience with Hydra v1.x, I understand that at most one refresh token with active=true can exist for a specific consent session, and you don't need the inactive ones; as long as you don't need to keep them for auditing purposes, you can safely delete them :wink:

If you look at the consent requests cleanup above, it is clear that we look for active refresh tokens to exclude the currently used consent sessions that we want to keep.
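For anyone who wants to double-check that observation on their own database before deleting anything, a quick sanity check could look like this (an illustrative query, not from the original procedure):

-- Consent sessions with more than one active refresh token; an empty result
-- supports the "at most one active RT per consent session" observation.
SELECT challenge_id, COUNT(*) AS active_tokens
FROM hydra_oauth2_refresh
WHERE active = 1
GROUP BY challenge_id
HAVING COUNT(*) > 1;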