jupyterlab / jupyterlab-metadata-service

Linked data exploration in JupyterLab.
BSD 3-Clause "New" or "Revised" License

Metadata Visibility #9

Closed dancastellani closed 5 years ago

dancastellani commented 5 years ago

Use Cases:

1. Open Data datasets are OK to be listed and explored without permission; both their metadata and data are public.
2. Private datasets should be listed in the catalog, and most of the time their metadata is OK to list, but the data requires access.
3. Sometimes even the metadata should not be publicly visible or listed, as with Title 13 or Title 26 datasets from the Census. Those have private metadata and data, and access must be granted before either the data or the metadata is viewed.

Note on metadata: Since metadata is very generic and could contain anything, we distinguish between listing and detailed metadata. Listing metadata is usually not sensitive and can be used for search and shown in the catalog. Detailed metadata, on the other hand, may be considered sensitive because it can contain information that should not be disclosed, such as variable names and types, categorical values, and geographical and temporal coverage.
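For concreteness, a minimal sketch of how the two categories might be typed (the field names below are purely illustrative, not part of any existing schema):

```typescript
// Illustrative only: what listing vs. detailed metadata might contain.
// None of these field names come from an existing schema.
interface ListingMetadata {
  title: string;
  description: string;
  keywords?: string[];
}

interface DetailedMetadata {
  variables?: { name: string; type: string; categoricalValues?: string[] }[];
  geographicalCoverage?: string;
  temporalCoverage?: { start: string; end: string };
}
```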

Approach 1: Use three visibility flags, listing, (detailed) metadata, and data, with pre-defined listing and detailed metadata structures. When a dataset can be listed (listing = PUBLIC), the title, description, and possibly keywords are displayed in the catalog. When listing is PRIVATE, the dataset is simply not listed in the catalog at all, and only specific groups can discover it. On ADRF the goal is to maximize the number of datasets that can be discovered. Likewise, the detailed metadata and the data can each be PUBLIC or PRIVATE: PUBLIC indicates that the information can be shown in the catalog, while PRIVATE requires permission to see the information.
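As a rough sketch of this approach, assuming a simple PUBLIC/PRIVATE value (the type and field names are hypothetical, not an existing API), a dataset record could look like:

```typescript
// Sketch of Approach 1: one record with three pre-defined visibility flags.
type Visibility = 'PUBLIC' | 'PRIVATE';

interface DatasetRecord {
  listing: Visibility;   // may the title/description/keywords appear in the catalog?
  metadata: Visibility;  // may the detailed metadata be shown without permission?
  data: Visibility;      // may the data itself be read without permission?
}

// Example: a Title 13/26-style dataset where even discovery is restricted.
const restrictedDataset: DatasetRecord = {
  listing: 'PRIVATE',
  metadata: 'PRIVATE',
  data: 'PRIVATE',
};
```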

This seems to be enough for most cases; it is probably a good simplification for the initial implementation phases, if not a good solution for most cases.

Approach 2: Flexible listing and detailed metadata. In more specific cases, such as the Title 13/26 datasets, where a more tailored permission is required for the metadata (usually access approval is required to access the data), more flexibility is necessary. In those cases, separating the metadata into listing and detailed metadata as two separate structures could be enough. For example, if the listing and detailed metadata are stored in two JSON objects, they can grow and adapt to multiple schemas and “metadata permissions”. However, if the same information appears in both, they can become out of sync. Thus, validation is required to guarantee that the same metadata is not in both collections. This trade-off in validation (no metadata key should appear in both at the same time) seems reasonable for the flexibility achieved.
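A minimal sketch of that validation, assuming the two structures are plain JSON objects (the function and type names here are assumptions for illustration only):

```typescript
// Sketch of Approach 2: listing and detailed metadata as two free-form JSON
// objects, plus a check that no key appears in both at the same time.
interface DatasetMetadata {
  listing: Record<string, unknown>;
  detailed: Record<string, unknown>;
}

// Return the keys that appear in both collections (should be empty).
function findDuplicateKeys(meta: DatasetMetadata): string[] {
  const detailedKeys = new Set(Object.keys(meta.detailed));
  return Object.keys(meta.listing).filter(key => detailedKeys.has(key));
}

const example: DatasetMetadata = {
  listing: { title: 'Example survey dataset', keywords: ['census'] },
  detailed: { variables: [{ name: 'income', type: 'number' }] },
};

const duplicates = findDuplicateKeys(example);
if (duplicates.length > 0) {
  throw new Error(`Metadata keys present in both collections: ${duplicates.join(', ')}`);
}
```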

Since it is more flexible and not much more complicated, approach 2 could be the better implementation even to start with.

Note on dataset comments and annotations: In the context of private data, comments and annotations can contain sensitive information, such as variable values, or be harmful in some way to the data provider, to individuals in the data, or to the community in general. Harmful comments are unproductive: they might use an aggressive tone against the provider, contain bad language, or disclose information about individuals in the dataset that could have a negative impact. Therefore, they should be moderated before being published, either publicly or to the private group that has access to the dataset. Among members of a project it could be acceptable to share un-moderated comments, but this still requires some discussion.

By default, comments and annotations should have the same visibility as the dataset's data, requiring access to be granted when the dataset requires it. During review, some comments/annotations may be deemed OK to share with people who do not have access to the data; in that case, they should have the same visibility as either the metadata or the listing.

Comments and annotations can be about datasets, metadata, or data. They can also be in response to another comment/annotation. In the case of Jupyter, they could also be attached to cells and visualizations, and in that case the visibility of the comments should probably follow the data restrictions.
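A hedged sketch of how a comment/annotation record could track its target and effective visibility (the names here are assumptions, not part of the service):

```typescript
// Sketch only: a comment record whose visibility defaults to the dataset's
// data visibility and can be widened by a moderator after review.
type Visibility = 'PUBLIC' | 'PRIVATE';
type CommentTarget = 'dataset' | 'metadata' | 'data' | 'cell' | 'visualization';

interface Comment {
  id: string;
  target: CommentTarget;
  inReplyTo?: string;   // id of the parent comment/annotation, if this is a reply
  body: string;
  moderated: boolean;   // should be true before the comment is published
  visibility: Visibility;
}
```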

Note on data provider (ADRF) and Jupyter integration: Most of the complexity of curating the dataset falls on the Data Provider side (e.g. ADRF), which needs to maintain an up-to-date version of the data and metadata, in addition to sending Jupyter only the datasets that the given user has access to, respecting ACLs and context (the current user and project, in the case of ADRF). The Jupyter interface will display only the datasets received from the Data Provider. In this sense, the Jupyter component dealing with metadata will need to be flexible from the beginning in order to deal with different metadata definitions. Therefore, an implementation closer to approach 2 is better suited for it.
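A small sketch of the provider-side filtering described here, with an assumed ACL representation (none of these names come from ADRF or the service):

```typescript
// Sketch: the Data Provider resolves ACLs and context, and only hands
// JupyterLab the datasets the current user/project may see.
interface AccessContext {
  user: string;
  project: string;
}

interface ProviderDataset {
  id: string;
  listing: Record<string, unknown>;
  allowedProjects: string[];   // hypothetical ACL representation
}

function datasetsForContext(all: ProviderDataset[], ctx: AccessContext): ProviderDataset[] {
  // JupyterLab never sees datasets filtered out here.
  return all.filter(ds => ds.allowedProjects.includes(ctx.project));
}
```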

saulshanabrook commented 5 years ago

We might need a way to make the comment and metadata service pluggable, i.e. if someone comes in with their own custom auth schema we don't wanna include all that logic in JupyterLab. I could see someone wanting to hook into the comments and metadata system, but storing things in a different place. Maybe they already have their own database and would like to use that.

Users might need to be able to compose these as well, e.g. comments on some things are stored in one system and other things in another, like how we have filebrowsers be pluggable in core.
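One rough way to picture that composition, loosely modeled on how file browsers are pluggable in core (the interfaces and registry below are hypothetical, not actual extension points):

```typescript
// Sketch of a pluggable/composable provider registry for comments and metadata.
interface IMetadataProvider {
  id: string;
  // True if this provider is responsible for the given resource URI.
  handles(resource: string): boolean;
  getComments(resource: string): Promise<unknown[]>;
  addComment(resource: string, body: string): Promise<void>;
}

class MetadataProviderRegistry {
  private providers: IMetadataProvider[] = [];

  register(provider: IMetadataProvider): void {
    this.providers.push(provider);
  }

  // Compose providers: the first one that claims the resource handles it.
  resolve(resource: string): IMetadataProvider | undefined {
    return this.providers.find(p => p.handles(resource));
  }
}
```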

Seems like we have two ways of supporting this: