kartoza / WRODataPlatform

WRC Water Research Observatory Data Platform

Roles and permissions #5

Open gubuntu opened 2 years ago

gubuntu commented 2 years ago

We need suitable roles and permissions so that, for example:

Ideally this should be SSO between CKAN and GCP IAM.

gubuntu commented 2 years ago

@mikev3003 as per our meeting today, you will provide details of users, groups and permissions we need to implement, by the end of this week

gubuntu commented 2 years ago

@Mohab25 have you started implementing IAM as per WRO access management_V1_28Apr2022.pptx? (supplied by @mikev3003 via email on 28 April)

Mohab25 commented 2 years ago

@gubuntu As per the referenced email and our work with CKAN and Google Cloud IAM roles, the authentication/authorization model is based on both the CKAN auth model (https://docs.ckan.org/en/2.9/maintaining/authorization.html) and Google Cloud IAM roles. We are currently implementing the first phase of the final auth model (CKAN auth), which will be sufficient for the purposes of the specified email. We will give a gentle introduction at the upcoming sprint meeting.

gubuntu commented 2 years ago

@Mohab25 will investigate SSO options as per OP

Mohab25 commented 2 years ago

@gubuntu @mikev3003, these are my first comments on the model suggested by the WRO team, open for discussion. I will go through all categories and points except the last category (publicly accessible data), which is obvious, although the FAIR principles need more clarification. Note how each bulleted point corresponds to the WRO model: your points are highlighted in bold and the comments are beneath them:

Category 1: Shared Internal Data | Accessible to the custodian and special permission cases only

• GCP access through ‘domain’ (e.g. wrc.org.za, up.ac.za, dws.gov.za), or individual email (e.g. postgraduate student working on a specific project):
◦ Access to bucket objects for users outside Google Cloud who belong to specific domains with their own auth systems (up.ac.za, dws.gov.za, etc.) is controlled through a combination of SSO (single sign-on), Google Cloud Identity accounts and access control lists. Google Cloud defines the concept of an external identity provider (IdP), through which users can “use their existing identity and credentials to sign in to Google services”. The most basic approach is to use the authentication system of the organisation's domain (e.g. up.ac.za, dws.gov.za) as the identity provider and communicate with it through the SAML (Security Assertion Markup Language) standard, provided the existing systems support this protocol. The general workflow becomes as follows:
▪ The user tries to access the object in Cloud Storage.
▪ As the user is not authenticated, they are redirected to the Google sign-in page and prompted to enter their email address.
▪ Google sign-in then looks for the cloud identity associated with that email address.
▪ Because single sign-on is enabled, the user is redirected to the external IdP (this is where SAML kicks in, and an exchange of requests and responses happens between Google sign-in and the external IdP).
▪ If authenticated by the external IdP, the user is redirected to the bucket object.
◦ A prerequisite for accessing GCP resources is that each user at the different external providers has a cloud identity account. Provisioning these users (as opposed to adding them manually) can be automated, depending on the existing setup:
▪ If there is an LDAP (Lightweight Directory Access Protocol) system in place, Google Directory Sync can be used (https://support.google.com/a/topic/2679497).
▪ If LDAP is absent, the Google Admin SDK Directory API can be used to provision a large number of users.
▪ If the organization is not large, CSV files (which can handle up to 150,000 users at once) can be used (https://support.google.com/a/answer/179832).
▪ Third-party tools such as GAM can be used (https://support.google.com/a/answer/10014088).

• “Other users” = users who are not within these domains, must contact the custodians to be granted access:
◦ The metadata is either the dataset metadata or the resource (file) metadata. For the latter, the metadata of the resource can be seen on the CKAN site but not in GCP, because GCP doesn’t separate access to resource metadata from access to the resource itself.
◦ The custodians' decision to grant access to users is outside the scope of the automation. If custodians decide to grant access, the WRO GCP team and the CKAN site admin can give it to the user (by giving them a cloud identity and adding them to a specific access control list in GCP, and/or by adding them to a specific organization or as a collaborator in CKAN).

• In some cases the custodian should have his own cloud resources (GCP projects or buckets), “next to” WRO resources to reduce costs:
◦ The cases in which the custodian must have their own cloud service are interpreted as those where they are outside the given domains and want to upload data to the WRO CKAN site.
◦ “Next to” is interpreted as having a link to the data (which resides in GCP buckets, AWS S3 or elsewhere) from the CKAN site through resource links. Users owning and uploading data to their own cloud resources is outside the scope of the WRO project: they should be prevented from uploading files and only be able to provide links. All of this will be allowed only if they are either within an organization (in CKAN terms) or they are collaborators.
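
As a rough illustration of the access control list side of this, the sketch below builds ACL entries using the entity format of the Cloud Storage JSON API (`domain-<domain>`, `user-<email>`). The function name, the example student address and the choice of the `READER` role are assumptions made for illustration, not the WRO implementation. The code only constructs the entries and does not call GCP.

```python
from typing import Dict, Iterable, List


def build_reader_acl(domains: Iterable[str], emails: Iterable[str] = ()) -> List[Dict[str, str]]:
    """Return ACL entries granting read access to whole domains and to individual users.

    Hypothetical helper: entity strings follow the Cloud Storage ACL
    convention ("domain-<domain>", "user-<email>").
    """
    entries = []
    # Domain-wide grants, de-duplicated and sorted for a stable result.
    for domain in sorted({d.strip().lower() for d in domains}):
        entries.append({"entity": f"domain-{domain}", "role": "READER"})
    # Individual grants, e.g. a postgraduate student on a specific project.
    for email in sorted({e.strip().lower() for e in emails}):
        entries.append({"entity": f"user-{email}", "role": "READER"})
    return entries


# Example addresses below are placeholders.
acl = build_reader_acl(
    ["wrc.org.za", "up.ac.za", "dws.gov.za"],
    ["student@example.com"],
)
```

A domain entry only matches users who already have a cloud identity in that domain, which is why the provisioning step described above comes first.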

A few refs:
https://cloud.google.com/architecture/identity/single-sign-on
https://cloud.google.com/architecture/identity/overview-google-authentication#cloud_identity_or_g_suite_account
https://cloud.google.com/storage/docs/access-control/lists

Mohab25 commented 2 years ago

Category 2: Shared Community Data | Accessible to the custodian and selected other partners within the WRO

• GCP access through two or more domains (e.g. dws.gov.za + weathersa.co.za + arc.agric.za):
◦ This needs more clarification.
◦ This is controlled through the ACLs and CKAN orgs, but the indication is that not one organization but several access the same data.

• Can also create a group or project if an entire domain does not need access (e.g. specific directorate within a government department):
◦ We would exclude the intended departments from the cloud identity creation phase (e.g. from the uploaded CSV files). If someone working in one of these departments is registered with the domain but doesn’t have a cloud identity, they won’t be granted access.

• Other users can see metadata but must email custodian directly to be sent the data or be given access:
◦ The same as Category 1, point 3.

mikev3003 commented 2 years ago

@Mohab25 some clarification on how I think the system could work. I'm starting to think the bulk of the access management can be handled by CKAN, with a minority of special cases handled on a more case-by-case basis in GCP.

• Category 1: Shared Internal Data | Accessible to the custodian and special permission cases only

In CKAN, an organisation can specify datasets that are only available to members of that organisation. Their email address domain can be used to automate access to that organisation’s data. An example could be where the Department of Water and Sanitation (DWS) (dws.gov.za) want to store data that is only available to DWS employees. The organisations will need to appoint their own administrator(s) to manage any special cases.
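
A minimal sketch of how the email domain could drive organisation membership, assuming a hand-maintained mapping from domain to CKAN organisation name. The mapping, the organisation slugs and the helper name are hypothetical. The returned dictionary uses the `id`, `username` and `role` fields that CKAN's `organization_member_create` action accepts, but the sketch only builds the payload and does not call CKAN.

```python
from typing import Dict, Optional

# Hypothetical mapping from email domain to CKAN organisation name (slug).
DOMAIN_TO_ORG: Dict[str, str] = {
    "dws.gov.za": "dws",
    "wrc.org.za": "wrc",
}


def membership_for(username: str, email: str, role: str = "member") -> Optional[Dict[str, str]]:
    """Build the data_dict for organization_member_create, or None for an unknown domain."""
    # Take everything after the last "@" as the domain.
    domain = email.rsplit("@", 1)[-1].strip().lower()
    org = DOMAIN_TO_ORG.get(domain)
    if org is None:
        # Not one of the known organisations: a special case for an administrator.
        return None
    return {"id": org, "username": username, "role": role}


# Placeholder user for illustration.
payload = membership_for("jane", "jane@dws.gov.za")
```

An approach like this would only be safe once the email address has been verified, since anyone can type an address in a given domain at registration.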

A group can also be created for a subset of individuals within an organisation that have access to certain data/information. The group can be set up by WRO administrators. The group will need to appoint its own administrator(s). Individual access can be granted on a case-by-case basis.

In GCP, a user can be granted access to one or more specific folders within the WRO storage bucket. Ideally this access will only be granted for a fixed amount of time and be reviewed again after that period expires. An example is when an employee of DWS needs to do a water reconciliation study using data from a variety of sources. If the individual will consume cloud resources, they should be encouraged to set up their own GCP into which data or models can be seamlessly imported (unless the WRC later makes a facility available for such cases).
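
One possible way to express folder-scoped, time-limited access is an IAM binding with a condition. The sketch below only builds such a binding as a plain dictionary. The bucket name, folder, user address, 90-day default and helper name are all assumptions for illustration. The condition uses the IAM Conditions attributes `resource.name` and `request.time`.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict


def temporary_folder_binding(email: str, bucket: str, folder: str,
                             granted_at: datetime, days: int = 90) -> Dict:
    """Build an IAM binding giving read access to one folder until an expiry date."""
    expiry = (granted_at + timedelta(days=days)).astimezone(timezone.utc)
    expiry_text = expiry.strftime("%Y-%m-%dT%H:%M:%SZ")
    # Objects in a "folder" share a name prefix inside the bucket.
    prefix = f"projects/_/buckets/{bucket}/objects/{folder.strip('/')}/"
    expression = (
        f'resource.name.startsWith("{prefix}") && '
        f'request.time < timestamp("{expiry_text}")'
    )
    return {
        "role": "roles/storage.objectViewer",
        "members": [f"user:{email}"],
        "condition": {
            "title": "temporary-folder-access",
            "description": "Review when this access expires",
            "expression": expression,
        },
    }


# Hypothetical bucket, folder and user.
binding = temporary_folder_binding(
    "analyst@dws.gov.za", "wro-data", "reconciliation",
    datetime(2022, 6, 1, tzinfo=timezone.utc),
)
```

Conditions on bucket policies require uniform bucket-level access, which disables object ACLs, so this route would need to be weighed against an ACL-based approach.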

The metadata of all data uploaded using the CKAN platform should still be visible to all users. When data is not publicly available, the contact person for the dataset can be contacted to request access.


• Category 2: Shared Community Data | Accessible to the custodian and selected other partners within the WRO

GCP access through two or more domains (e.g. dws.gov.za + weathersa.co.za + arc.agric.za).

Can also create a group or project if an entire domain does not need access (e.g. specific directorate within a government department).

In CKAN, an organisation can specify datasets that are only available to members of one organisation or a group of organisations. Their email address domain can be used to automate access to that organisation's or group of organisations’ data. An example could be where the Agricultural Research Council (arc.agric.za), South African Weather Service (weathersa.co.za), and Department of Water and Sanitation (DWS) have shared access to weather data that is stored in the WRO. The organisation(s) will need to appoint their own administrator(s).

A group can also be created for several individuals who have access to otherwise restricted data/information. An example is the National Siltation Programme that has a few individual members from the Water Research Commission (WRC), government departments, and universities. The group will need to appoint its own administrator(s).

In GCP, one or more specific users can be granted access to one or more specific folders within the WRO storage bucket. Ideally this access will only be granted for a fixed amount of time and renewed if necessary. An example is a PhD student from a university who wants to do big data analytics using data from a variety of sources. If the individual will consume cloud resources, they should be encouraged to set up their own GCP into which data or models can be seamlessly imported (unless the WRC later makes a facility available for such cases).

The metadata of all data uploaded using the CKAN platform should still be visible to all users. When data is not publicly available, the contact person for the dataset should be contacted to request access.


• Category 3: Publicly available data

GCP entity is public and fully available.

Findable, Accessible, Interoperable, Reusable (FAIR) Principles applied.

All data available for immediate download.

WRO metadata available for cataloguing in other repositories.