ministryofjustice / analytical-platform

Analytical Platform • This repository is defined and managed in Terraform
https://docs.analytical-platform.service.justice.gov.uk
MIT License

✨ Feature Request: Access to datasets can be granted for a defined period of time #2136

Open pbbgss opened 12 months ago

pbbgss commented 12 months ago

Which tool do you need help with?

Control Panel

What happened?

Not a bug but maybe a feature request - apologies if this is the wrong place. I would like to periodically review who has access to a particular S3 bucket, ideally via some sort of opt-in process where a user loses access if they do not respond. Is there a better way to do this than getting their email addresses from the bucket access lists and emailing them? I had a look through the guidance etc. and wasn't able to find anything - apologies if I missed it!

Relevant log output

No response

pbbgss commented 11 months ago

It would also be very useful to be able to programmatically set which paths within a bucket people can access if that was possible too? (Appreciate these may be fairly large asks - apologies!)

pbbgss commented 2 months ago

@darren1988 Thanks for looking into this - Is there any documentation relating to it?

simon-pope commented 1 month ago

@pbbgss this feature request was reviewed this morning and the team require some further information to proceed:

  1. Why is this information important to you? What does it allow you to achieve?
  2. Could you define "periodically"? Is this need ad-hoc or driven by a schedule/event?
  3. Is there any other information about the user, besides access to the S3 bucket and their email, that is required?

Could you also create a new feature request for the requirement "set which paths within a bucket people can access" so that it can be tracked separately. Thanks

pbbgss commented 1 month ago

@simon-pope Both feature requests relate to the same problem so I'll put the background here and then you can decide whether they should still be separate.

Background

We (prisons statistics) create several data series that are the basis of our Offender Management Statistics Quarterly (OMSQ) publication and some others that relate to internal management information etc. Currently all our data lives in sub-folders of a single "internal" (internal to prisons statistics) bucket.

Other users sometimes request access to our data for specific projects. Assuming there is a legitimate need etc. we will grant access to a "shared" bucket which contains the same folder structure as our "internal" one. We will usually grant access to all datasets in a series, i.e. one subfolder in the bucket. Datasets in a series only get copied over to the "shared" bucket once the edition of OMSQ that they were first used in has been published, i.e. we don't give access to data we haven't yet used in a publication.

Granting and reviewing access

We don't want to give indefinite access, but it is often hard to predict how long a user will need access (and this can vary greatly depending on the type of work). Currently we manually email everyone with access to the "shared" bucket every two months, and if they do not respond we manually remove their access. Being able to grant temporary access that expires after a certain amount of time, and ideally also being able to give them an opt-in to maintain access once the time is up, would be useful. I can see this being useful in other circumstances; for example, I am a coding mentor and occasionally I need to be given temporary access to another team's bucket when helping a mentee with a project.
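To make the ask concrete, the two-monthly review cycle above could be expressed as a simple expiry check. This is only a sketch: the grant record, its field names, and the 14-day warning window are hypothetical, not an existing Control Panel feature.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical grant record; the Control Panel does not expose this today.
@dataclass
class Grant:
    user_email: str
    prefix: str        # path within the shared bucket, e.g. "omsq/receptions/"
    expires: date

def review(grants, today, warn_days=14):
    """Split grants into those to keep, those to warn about, and those to revoke."""
    keep, warn, revoke = [], [], []
    for g in grants:
        if g.expires < today:
            revoke.append(g)                                   # expired: remove access
        elif g.expires - today <= timedelta(days=warn_days):
            warn.append(g)                                     # expiring soon: offer opt-in
        else:
            keep.append(g)
    return keep, warn, revoke

grants = [
    Grant("a@justice.gov.uk", "omsq/receptions/", date(2024, 1, 31)),
    Grant("b@justice.gov.uk", "omsq/releases/", date(2024, 6, 30)),
]
keep, warn, revoke = review(grants, today=date(2024, 2, 10))
```

The "warn" bucket is where the requested opt-in email would hang off: users who respond get a new expiry date, users who don't eventually fall into "revoke" without anyone doing the removal by hand.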

Programmatic path access

We are currently able to give users access to specific paths in an S3 bucket. We make use of this by giving users access to specific folders in our shared bucket so they can access only the data series they need. However, it would be too much work to manually grant access to just the "published" datasets in a series if we were giving access to a bucket that contains both published and unpublished data (there are tens to hundreds of files depending on the series). This results in data being duplicated between our "internal" and "shared" buckets. They are all Parquet files, so efficiently stored, but it is still potentially unnecessary duplication. If, instead of typing the individual paths into the AP control panel, there was a config file that we could generate programmatically, it would remove the need for a two-bucket system.
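The programmatically generated config described above might look something like the following sketch. Everything here is illustrative: the "published" share list, the user-to-series mapping, and the output format are all assumptions about what such a config could contain, not an existing AP feature.

```python
import json

# Hypothetical "share" list of published datasets, regenerated once per
# quarter after the OMSQ publication.
published = ["omsq/receptions/2023q4/", "omsq/releases/2023q4/"]

# Hypothetical per-user grants: which data series each user may see.
users = {
    "a@justice.gov.uk": ["omsq/receptions/"],
    "b@justice.gov.uk": ["omsq/releases/"],
}

def build_config(users, published):
    """Expand each user's granted series into the published paths within it."""
    return {
        email: [p for p in published
                if any(p.startswith(series) for series in prefixes)]
        for email, prefixes in users.items()
    }

config = build_config(users, published)
print(json.dumps(config, indent=2))
```

Because unpublished paths never appear in the `published` list, they can never leak into a user's config, which is the property that would make the single-bucket setup safe.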

Specific answers to your questions

  1. Why is this information important to you? What does it allow you to achieve?

Time saving and reduced risk of errors by automating a manual task. I imagine removing people who no longer have a legitimate reason to access some potentially sensitive data falls under "best practice" etc as well.

  2. Could you define "periodically"? Is this need ad-hoc or driven by a schedule/event?

Currently we email everyone at the same time every two months. There is a trade-off between the admin burden of the manual process, removing people promptly, and not annoying everyone by sending loads of emails. Realistically, how long people require access varies greatly, so being able to set a custom periodicity based on when they first get access might be useful. Another potentially complicating factor is that some individuals may have access to multiple data series with different expected access durations.

  3. Is there any other information about the user, besides access to the S3 bucket and their email, that is required?

Which paths in the bucket they can access.

If you have any more questions give me a shout!

simon-pope commented 1 month ago

@pbbgss Thank you for this information, it's very helpful. Based on the above, I've attempted to extract some requirements that I can take to the team for review and refinement. Please review and let me know if they meet your ask. Happy to have a call to refine these further.

  1. Ability to grant access to a subfolder/dataset for a defined period of time
  2. Access to a subfolder/dataset is automatically removed when the granted access period expires
  3. When granting access, granter can decide whether access can be extended by the grantee before expiry
  4. User is informed when they receive access, and when their access will expire
  5. User with time dependent access is warned before their access to a subfolder/dataset is automatically removed
  6. If given permission on initial access grant, grantee can extend their own access period beyond the current expiry date
  7. Within a dataset, tables can be set to a "shareable" state. When access is granted to a dataset, these tables are accessible by the grantee by default. All other tables are not shared.

Note: point 7 updated after clarification from requestor.
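For context on requirements 1 and 7: restricting a grantee to a subfolder is a standard pattern in S3 permissions, where `s3:ListBucket` is conditioned on a key prefix and `s3:GetObject` is scoped to that prefix's ARN. A sketch of such a policy follows; the bucket name and prefix are illustrative, and this says nothing about how the Control Panel would implement the grant.

```python
import json

bucket = "prisons-stats-shared"   # illustrative bucket name
prefix = "omsq/receptions/"       # the granted subfolder/dataset

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # allow listing only within the granted prefix
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": f"arn:aws:s3:::{bucket}",
            "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
        },
        {   # allow reading objects only under the granted prefix
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
        },
    ],
}
print(json.dumps(policy, indent=2))
```

Time-limited access (requirements 2 and 6) would then amount to attaching and detaching statements like these on a schedule.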

pbbgss commented 1 month ago

@simon-pope That all sounds great thank you!

pbbgss commented 1 month ago

@simon-pope I have been thinking about point 7 a bit more and I wonder if it would be better to have a list of "share" tables rather than those "not to share".

In the interval between publications, several tables will be created at various points that we do not want to be shareable until after the next publication.

In a system with a "do not share" list I am concerned that there would be a lot of overhead and potential for mistakes updating the "do not share" list each time a new table is created (some are created weekly). In contrast a "share" list could be updated once per quarter after the publication has been released.

I think I am imagining some sort of config file that I can programmatically generate with each user and the specific paths they can access (if there are restrictions). A bit like the existing system for path access but easier to work with on a large scale. Does that sound reasonable?

simon-pope commented 1 month ago

@pbbgss thanks for the update. If I understand correctly, by using a "share" list for each user, any new tables created after they were granted access will not be available to them. Using a "don't share" list would mean having to update that list for every user when a new table is created or risk them being able to access unpublished data. Makes sense if so.
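The safety difference between the two list styles can be shown in a few lines. The list contents and the access-check functions below are purely illustrative; the point is only the default behaviour for a table created after either list was last updated.

```python
share_list = {"omsq/receptions/2023q4"}        # updated once per quarter
do_not_share = {"omsq/receptions/2024q1_wip"}  # must track every new table

# A table created *after* both lists were last maintained:
new_table = "omsq/receptions/weekly_snapshot"

def visible_with_share_list(table):
    return table in share_list          # default for new tables: hidden

def visible_with_block_list(table):
    return table not in do_not_share    # default for new tables: exposed

# The new table stays private under a share list...
assert not visible_with_share_list(new_table)
# ...but is accessible under a "don't share" list until someone updates it.
assert visible_with_block_list(new_table)
```

In other words, a share list fails closed and a "don't share" list fails open, which is why the former needs only a quarterly update while the latter must be touched for every new table.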

Just as an aside, I'm currently doing a review of the Feature Request process for AP and how things could be improved for both the team and those raising requests. If you have any feedback on this process, including any improvements, feel free to drop me a Slack message.