indigo-iam / iam

INDIGO Identity and Access Management Service
https://indigo-iam.github.io/
Other
101 stars 43 forks source link

HR DB check needs a configurable grace period #805

Open maarten-litmaath opened 2 months ago

maarten-litmaath commented 2 months ago

It has been observed that even when a user's new contract starts the next day after the previous contract ended, and the LHC experiment affiliation has remained the same, the HR DB check can decide to put the user account into the PENDING_REMOVAL state, from which it currently cannot recover automatically (see #768). To avoid glitches due to the exact meaning and implementation of the relevant HR DB API time stamps, which might even change, possibly accidentally, the HR DB check needs to be made more robust: a configurable grace period of at least 1 day should be applied before concluding a user is no longer affiliated with a given experiment.

rmiccoli commented 2 months ago

I think we should consider a few things, first of all the end time. If the user's new affiliation starts the day after the previous one expires and the end time expires with the old affiliation, then the IAM internal lifecycle job adds the label PENDING_SUSPENSION to the account, but the user remains active for a certain grace period. If we extend the CERN HR DB job, this doesn't change.

I'd like to understand if the end time always coincides with the expiry of the CERN affiliation. Is it always set? and how is it set?

In this specific case, my guess is that the CERN job runs when the new affiliation has already started, otherwise the user would be disabled and then restored at the next run.

Also, does the membership expire at 23:59 or 00:00 and does the new one, the day after, start at exactly what time?

deesto commented 2 months ago

Someone with access to the CERN HR DB should confirm, but from what we have been shown without direct access, records seem to have a start and end date, with no time specified, e.g.:

"experiment": "ATLAS",
"startDate": "2024-07-01",
"endDate": "2026-06-30"`

If that is indeed the case, the time is likely the same for start and end.

maarten-litmaath commented 2 months ago

Hi all, we may well need to have a chat with an HR DB expert about these matters. That may take some time due to the summer holiday season...

However, my point is that we should ensure we have enough flexibility on the side of IAM to counteract any unwanted behavior prompted by results from HR DB queries. That implies applying grace periods where needed and being able to correct a person's status from the GUI. I suspect we are close already, but with v1.9.0 we observed at least the case of an ATLAS user that had to be manually restored, while the HR DB suggested his next contract started on the next day after his previous contract finished --> we need to avoid that an admin intervention is needed for such cases...

Even if we get an exact specification of the current behavior of the API, there may well come a code change (on purpose or accidental) that adjusts the exact meaning of a timestamp: normally the users of such info do not care much, while we do!

For example, we just add 1 day to the contract end date and only flag the account if on the second day the user still does not have a new contract. We probably would like to have that grace period configurable.

maarten-litmaath commented 1 month ago

Hi again, we ran into yet another example yesterday: delayed updates of the HR DB caused an LHCb account to be blocked, while everything looked OK on the LHCb side...

vokac commented 1 month ago

We discussed in today WLCG AuthZ meeting that while CERN HR API returns just

"startDate": "2024-07-01"
"endDate": "2026-06-30"

we should threat it as

"startDate": "2024-07-01 00:00:00"
"endDate": "2026-06-30 23:59:59"
deesto commented 1 month ago

we should threat it as

"startDate": "2024-07-01 00:00:00"
"endDate": "2026-06-30 23:59:59"

Is the 1s gap below threshold or enough to continue to trigger suspension?

maarten-litmaath commented 1 month ago

Good point. If we still see unwarranted suspensions, we could add that last second. Adjusted example:

[...]
# previous contract
"endDate": "2024-07-01 00:00:00",
[...]
# current contract
"startDate": "2024-07-01 00:00:00",
[...]

As the HR DB is sometimes seen to get updated belatedly, we should try to avoid immediately suffering from that; hopefully we already have enough grace period flexibility...

belforte commented 5 days ago

why do you guys discuss "hours" here ?? Let's have a 7-day grace period. End of story. I AM REALLY REALLY REALLY SICK AND TIRED OF RESTORING USERS MANUALLY

I have seen cases where it took days for User's Office to get records straight when people simply change institution, there is no reason they should be affected by that.

Unless user association was terminated because of evil acts, there is no possible harm by having a decent grace period. BUT YOU MUST WARN THE USER, so that they can talk to User Office and get things corrected before voms-proxy-init stops working.

belforte commented 5 days ago

maybe you guys should do the work to deal with users, it will make it easier for you to get it clear where priority lies. I do not give a damn for a configurable grace period nest year, I want a sensible grace period yesterday !! How could this tool ever got to be deployed without it ? And, by the way, what do all those fields in user status page EXACTLY mean ? Where is this documented ? image

maarten-litmaath commented 5 days ago

Ciao Stefano, we will look into having this issue worked on with a higher priority. More news hopefully soon...

rmiccoli commented 5 days ago

maybe you guys should do the work to deal with users, it will make it easier for you to get it clear where priority lies. I do not give a damn for a configurable grace period nest year, I want a sensible grace period yesterday !! How could this tool ever got to be deployed without it ? And, by the way, what do all those fields in user status page EXACTLY mean ? Where is this documented ? image

Hi, unfortunately the meaning of all those labels (referring to the account lifecycle) were not documented in Indigo IAM website, but there is already a PR in review almost ready to be merged https://github.com/indigo-iam/iam-website/pull/108.

vokac commented 5 days ago

Is not a bit weird to require grace period for a service that should enforce security policies? That's why I think it make sense to fix how we use CERN HR infrastructure data instead of applying grace period workarounds in IAM...

belforte commented 5 days ago

thank you all for listening. @rmiccoli nice diagram !

@vokac, I appreciate the philosophical point, but we need to have something that can work. One thing is that we must be able to stop evil-doers asap (and being able to search users by DN is important there! ). A different thing is how to deal gracefully with the fact that humans at CERN may not always do every thing in a matter of hours. When a student moves to another institution we do not kick them out of the experiment, nor does CERN disable their computer account. So we do not take their proxy or token permissions out either. If that can be achieved by smart interpretation of HR records, fine. But the end result must be that I do not get bugged because a User Office employee took a day off !

belforte commented 5 days ago

nor because they delete old institution on Friday and add new one on Monday !