Closed flavioleggio closed 1 year ago
Related to #1574 Could be examined in conjunction with #2513
To give some more context on this, I ran ANALYZE on the hydra_oauth2_authentication_request table on two consecutive days (same hour of the day, both before the janitor job) and found out that we get 1.4 million new records every day while the current cleanup procedure only deletes 240k records. This results in a net increase of 1.2 million records per day, just for that table.
Only keeping the most recent state is probably an option we could enable in the janitor. It would need to be set explicitly but it's definitely an option and I think it would make tons of sense to have it as part of the system!
Hello. In our case these tables are growing fast: "public.hydra_oauth2_refresh", "public.hydra_oauth2_oidc", "public.hydra_oauth2_code", "public.hydra_oauth2_authentication_request", "public.hydra_oauth2_consent_request".
It would be great if the janitor were able to clean them up.
@flavioleggio just checking if this issue has been resolved by your recent works on the janitor?
Actually no, this would be step two of my original idea. In the current version the janitor just executes the same queries faster; the set of deleted records did not change.
I just need to find time to work on this :)
I see, thank you for the quick response :)
Hello @flavioleggio @aeneasr. I was trying to work on this issue because our prod database is growing really fast and the upgrade to the next version is really slow (it takes more than an hour to complete), so we are trying to trim down all the tables a bit. Reading the queries, I see that we delete a challenge in hydra_oauth2_consent_request or hydra_oauth2_authentication_request if it is null or has an error and requested_at < 'X', but this will not delete anything if the challenge completed successfully. Since we do not need the old challenges, I think we also need something to clean up the old ones that have no errors or nulls, that is:
SELECT %[1]s.challenge
FROM %[1]s
LEFT JOIN %[2]s ON %[1]s.challenge = %[2]s.challenge
WHERE (
(%[2]s.challenge IS NULL)
OR (%[2]s.error IS NOT NULL AND %[2]s.error <> '{}' AND %[2]s.error <> '')
)
AND %[1]s.requested_at < ?
ORDER BY %[1]s.challenge
LIMIT %[3]d
i.e., use an OR instead of an AND in the AND %[1]s.requested_at < ? condition. This would also clean hydra_oauth2_oidc and hydra_oauth2_code, which are likewise growing a lot in our prod environment. What do you think about this? I could prepare a PR for it if you don't mind.
The old challenge is needed as it might still be linked to an oauth2 token chain, so unfortunately we can't delete those :/
We have a very similar problem with our system. Data size is up to >1TB. Access & refresh tokens are now under control with the latest janitor but the requests and related tables are growing as described above. I think having some options in Janitor to be a little more aggressive in selecting request records may be needed.
1TB? Damn! What system are you running? :D @grantzvolsky is refactoring the internals over at https://github.com/ory/hydra/pull/2796 and a big part of that is changing the data structure of consent/login/session. So we are doing an architectural fix rather than trying to come up with the smartest clean-up routines. It will take some time for this to get released, though.
Ah that is good to know. I guess we will need to run some manual queries to clean up stuff we don't want in the meantime.
Reading through multiple issues on the topic of cleaning up obsolete entities and/or improving the data model, it seems this #2514 is the closest to the general question and is also included in the v2.0.x milestone.
So, while highly anticipating the data model modifications in v2.0.x which will address the easier cleanups, the question is: are there any more-or-less proven direct SQL statements which may be used to delete unwanted entities in the latest v1.x? Proven in the sense of being confirmed by the original contributors, to guarantee that Hydra will not break if some data is deleted.
Referring, for example, to the previous comment by @pharman:
I guess we will need to run some manual queries to clean up stuff we don't want in the meantime
Also, my case is similar to some others: accumulation of a high number of tokens whose TTLs have definitely expired.
Maybe we should put those statements, some more context, and possible instructions (caveats) into the documentation.
I can share my company's solution, which does not break any of Hydra's functionality in our production environment, while keeping the db size small enough and guaranteeing no performance loss. I have to point out that we don't need auditing on Hydra's data.
We select delete candidates searching for inactive tokens.
SELECT signature FROM hydra_oauth2_refresh WHERE active=0 limit 1000000;
We group signatures for delete candidates in small enough blocks (100 records in our case) and we delete them in a loop.
DELETE FROM hydra_oauth2_refresh WHERE signature IN ( $signature_block );
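The select-then-delete-in-blocks pattern above can be sketched like this. This is a minimal illustration against an in-memory SQLite table with a heavily simplified, partly hypothetical schema (the real hydra_oauth2_refresh table has many more columns); it only demonstrates the batching mechanics, not the production setup:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for hydra_oauth2_refresh (the real table has more columns).
conn.execute("CREATE TABLE hydra_oauth2_refresh (signature TEXT PRIMARY KEY, active INTEGER)")
conn.executemany(
    "INSERT INTO hydra_oauth2_refresh VALUES (?, ?)",
    [(f"sig-{i}", i % 2) for i in range(1000)],  # half inactive (active=0), half active
)

# Step 1: collect delete candidates (inactive tokens), capped like the LIMIT above.
candidates = [row[0] for row in conn.execute(
    "SELECT signature FROM hydra_oauth2_refresh WHERE active=0 LIMIT 1000000"
)]

# Step 2: delete in small blocks (100 per statement) to keep each transaction short.
BLOCK = 100
for i in range(0, len(candidates), BLOCK):
    block = candidates[i:i + BLOCK]
    placeholders = ",".join("?" * len(block))
    conn.execute(
        f"DELETE FROM hydra_oauth2_refresh WHERE signature IN ({placeholders})",
        block,
    )
    conn.commit()

remaining = conn.execute("SELECT COUNT(*) FROM hydra_oauth2_refresh").fetchone()[0]
```

Keeping the IN-list small bounds lock time per statement, which matters on a busy production database.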
We select delete candidates searching for old requests (we keep 1 day in our case).
TIME=$(date -u -d '-1 day' +'%Y-%m-%d %H:%M:%S')
SELECT login_session_id FROM HYDRA.hydra_oauth2_authentication_request WHERE requested_at < '${TIME}' limit 2000000;
We delete using session IDs in a loop.
DELETE FROM hydra_oauth2_authentication_session WHERE id IN ( $sessions_block );
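The time-cutoff step above can be sketched the same way, computing the cutoff with Python's datetime instead of the shell date command. Again, the tables here are reduced, partly hypothetical stand-ins (only the columns used in the queries), just to show the select-by-age / delete-by-id mechanics:

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
# Reduced stand-ins for the two tables involved (real schemas have more columns).
conn.execute("CREATE TABLE hydra_oauth2_authentication_request (login_session_id TEXT, requested_at TEXT)")
conn.execute("CREATE TABLE hydra_oauth2_authentication_session (id TEXT PRIMARY KEY)")

now = datetime(2021, 9, 1, 12, 0, 0)
old = (now - timedelta(days=2)).strftime("%Y-%m-%d %H:%M:%S")
recent = now.strftime("%Y-%m-%d %H:%M:%S")
conn.execute("INSERT INTO hydra_oauth2_authentication_request VALUES ('old-session', ?)", (old,))
conn.execute("INSERT INTO hydra_oauth2_authentication_request VALUES ('new-session', ?)", (recent,))
conn.executemany("INSERT INTO hydra_oauth2_authentication_session VALUES (?)",
                 [("old-session",), ("new-session",)])

# Cutoff: keep 1 day of requests, mirroring TIME=$(date -u -d '-1 day' ...).
cutoff = (now - timedelta(days=1)).strftime("%Y-%m-%d %H:%M:%S")
sessions = [r[0] for r in conn.execute(
    "SELECT login_session_id FROM hydra_oauth2_authentication_request "
    "WHERE requested_at < ? LIMIT 2000000", (cutoff,)
)]

# Delete the matching authentication sessions (in production, loop over blocks as above).
placeholders = ",".join("?" * len(sessions))
conn.execute(f"DELETE FROM hydra_oauth2_authentication_session WHERE id IN ({placeholders})", sessions)
```

The 'YYYY-MM-DD HH:MM:SS' format compares correctly as a string, which is why the textual requested_at < cutoff comparison works.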
We select delete candidates searching for requests associated with old active refresh tokens, or with no associated active refresh token. We keep refresh tokens for 15 days, so a user loses access if the RT is not renewed within 15 days: this equals 15 days of user inactivity, which is enough for our use case. If a consent request has no active RT associated, it is considered old after 1 hour.
SELECT
cr.challenge
FROM
hydra_oauth2_consent_request cr
WHERE
(
SELECT r.requested_at FROM hydra_oauth2_refresh r
WHERE cr.challenge = r.challenge_id
AND r.active = 1
LIMIT 1
) < UTC_TIMESTAMP() - INTERVAL 15 DAY
OR
(
NOT EXISTS
(
SELECT NULL FROM hydra_oauth2_refresh r
WHERE cr.challenge = r.challenge_id
AND r.active = 1
LIMIT 1
)
AND cr.requested_at < UTC_TIMESTAMP() - INTERVAL 1 HOUR
)
LIMIT 1000000;
We delete using challenges in a loop.
DELETE FROM hydra_oauth2_consent_request WHERE challenge IN ( $challenge_block );
We select delete candidates searching for old tokens (we keep 1 hour in our case).
TIME=$(date -u -d '-1 hour' +'%Y-%m-%d %H:%M:%S')
SELECT signature FROM hydra_oauth2_access WHERE requested_at < '${TIME}';
We delete using signatures in a loop.
DELETE FROM hydra_oauth2_access WHERE signature IN ( $signature_block );
We select delete candidates searching for used requests (in this case it doesn't matter for us if they are recent).
SELECT challenge FROM HYDRA.hydra_oauth2_logout_request where was_used=1 limit 1000000;
We delete using challenges in a loop.
DELETE FROM HYDRA.hydra_oauth2_logout_request where challenge IN ( $logout_requests_block );
We select delete candidates searching for inactive codes (also here we don't mind if they are recent).
SELECT signature FROM hydra_oauth2_code WHERE active=0 limit 1000000;
We delete using signatures in a loop.
DELETE FROM hydra_oauth2_code WHERE signature IN ( $signature_block );
I can't express my gratitude @flavioleggio towards the extensive post above. This will certainly come handy in storage size reduction. Thank you.
Hello contributors!
I am marking this issue as stale as it has not received any engagement from the community or maintainers for a year. That does not imply that the issue has no merit! If you feel strongly about this issue, please comment below to keep the conversation going.
Throughout its lifetime, Ory has received over 10.000 issues and PRs. To sustain that growth, we need to prioritize and focus on issues that are important to the community. A good indication of importance, and thus priority, is activity on a topic.
Unfortunately, burnout has become a topic of concern amongst open-source projects.
It can lead to severe personal and health issues as well as opening catastrophic attack vectors.
The motivation for this automation is to help prioritize issues in the backlog and not ignore, reject, or belittle anyone.
If this issue was marked as stale erroneously you can exempt it by adding the backlog
label, assigning someone, or setting a milestone for it.
Thank you for your understanding and to anyone who participated in the conversation! And as written above, please do participate in the conversation if this topic is important to you!
Thank you 🙏✌️
@flavioleggio Hey, we are currently running an older version of Hydra and facing problems migrating because hydra_oauth2_refresh is growing quite fast in size. I tested your solution of removing entries with active=false and it does not seem to invalidate the newer refresh tokens. Did you encounter any problems with the deletion procedure you explained above?
Hi @PeteMac88,
As you can read in this comment, we did not experience any issues, and I can tell you that we still use this procedure in production. Removing entries with active=false guarantees that you keep the currently used refresh tokens. From my experience with Hydra v1.x, I understand that at most one refresh token with active=true can exist for a specific consent session, and you don't need the inactive ones; as long as you don't need to keep them for auditing purposes, you can safely delete them :wink:
If you look at the consent requests cleanup, it is clear that we look for active refresh tokens in order to exclude currently used consent sessions that we want to keep.
Problem description
The new hydra janitor command only takes tokens and requests tables into account. For the former, cleaning up is as simple as considering that old tokens (i.e. those whose TTL has expired) are of no use. For the latter, cleaning up is far more complicated, as expired records could still be of some use (for example, remembering a subject's last scope grants to a certain client so that the user is not asked again). This scenario results in requests tables growing indefinitely, as request records are not purged unless those requests are unhandled or have been rejected (which is the minority of records). Moreover, related tables like hydra_oauth2_code and hydra_oauth2_oidc grow at their own rates, as they are not explicitly purged by the janitor. However, those tables would also benefit from cleaning up the requests tables more effectively (while still taking care not to break any flows) because of the ON DELETE CASCADE reference in their foreign keys. For example, deleting a row in hydra_oauth2_consent_request would delete the related row in hydra_oauth2_consent_request_handled, which in turn would delete related rows in hydra_oauth2_{access|code|oidc|pkce|refresh}.
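This cascade chain can be illustrated with a minimal sketch. Note that the table definitions below are heavily simplified, partly hypothetical stand-ins (single-column keys only; the real Hydra tables have many more columns) built just to show ON DELETE CASCADE propagating through two levels:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when enabled

# Heavily simplified stand-ins for the real Hydra tables.
conn.execute("CREATE TABLE hydra_oauth2_consent_request (challenge TEXT PRIMARY KEY)")
conn.execute("""
    CREATE TABLE hydra_oauth2_consent_request_handled (
        challenge TEXT PRIMARY KEY
            REFERENCES hydra_oauth2_consent_request(challenge) ON DELETE CASCADE
    )
""")
conn.execute("""
    CREATE TABLE hydra_oauth2_refresh (
        signature TEXT PRIMARY KEY,
        challenge_id TEXT
            REFERENCES hydra_oauth2_consent_request_handled(challenge) ON DELETE CASCADE
    )
""")

conn.execute("INSERT INTO hydra_oauth2_consent_request VALUES ('ch-1')")
conn.execute("INSERT INTO hydra_oauth2_consent_request_handled VALUES ('ch-1')")
conn.execute("INSERT INTO hydra_oauth2_refresh VALUES ('sig-1', 'ch-1')")

# Deleting the consent request cascades through the handled row to the token row.
conn.execute("DELETE FROM hydra_oauth2_consent_request WHERE challenge = 'ch-1'")
```

After the single DELETE on the parent table, all three tables are empty: the database engine does the transitive cleanup.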
Proposed solution
As far as I understand, the need to keep expired login and consent requests is restricted to flows that ask for the last state of a particular subject (i.e. the records from its most recent login and consent flows, whether their TTLs have expired or not). If we could query all the records related to previous states, we could purge them, since they are of no use.
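A sketch of that idea, under the assumption (hypothetical, not the real Hydra schema) that a consent request can be keyed by subject and client: keep only the most recent request per (subject, client) pair and purge everything older.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, reduced schema: just enough columns to express
# "keep only the latest consent request per (subject, client)".
conn.execute("""
    CREATE TABLE hydra_oauth2_consent_request (
        challenge TEXT PRIMARY KEY,
        subject TEXT,
        client_id TEXT,
        requested_at TEXT
    )
""")
rows = [
    ("c1", "alice", "app", "2021-01-01 10:00:00"),
    ("c2", "alice", "app", "2021-02-01 10:00:00"),  # alice's latest -> keep
    ("c3", "bob",   "app", "2021-01-15 10:00:00"),  # bob's only one -> keep
]
conn.executemany("INSERT INTO hydra_oauth2_consent_request VALUES (?, ?, ?, ?)", rows)

# Purge every request that is NOT the most recent one for its (subject, client).
conn.execute("""
    DELETE FROM hydra_oauth2_consent_request
    WHERE challenge NOT IN (
        SELECT challenge FROM hydra_oauth2_consent_request r
        WHERE requested_at = (
            SELECT MAX(requested_at) FROM hydra_oauth2_consent_request
            WHERE subject = r.subject AND client_id = r.client_id
        )
    )
""")
kept = sorted(r[0] for r in conn.execute("SELECT challenge FROM hydra_oauth2_consent_request"))
```

In a real deployment this would have to respect live token chains (see the earlier comment about old challenges still being linked to OAuth2 tokens), so it is only an illustration of the "keep the last state" selection logic.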
Alternatives
Another way to deal with the need to keep subject's state is to store it in a dedicated table and just purge transient data (requests, tokens and everything that has a predefined lifetime).
Additional context
This table is a snapshot of my production database after two weeks of running hydra janitor on both tokens and requests every night, with no keep-if-younger option. Note that the database was completely purged two weeks ago, as it had grown too large to guarantee a small enough maintenance window for the database migrations needed to update to Hydra v1.10.1.
This image summarizes the growing execution time, from an empty database up to the situation today. Take into account that the janitor job starts every night/morning at 3:10 AM (UTC+02:00). Here we go from a handful of seconds up to almost an hour... and rising!