Create a workflow for anonymous review of a dataset

kosarko commented 2 years ago

Use case: You are submitting an article describing a dataset (that you want to host in the repository). The article goes through a review process. The dataset should be available to reviewers. The authors and most other metadata should remain hidden. The reviewers might have comments (in the review platform, not in the repository) about the dataset. The author changes the dataset following the review.

Requirements:

[ ] Reviewers should get a PID - we might not need to keep the original dataset forever, but links rot quickly (repo system updates, URL changes...).
[ ] What happens to the dataset draft is unspecified (keep/delete). Whatever happens, in the end, should be a well-documented and consistent process.

Suggestions: Maybe fiddling with the private items feature and moving the preliminary dataset into a hidden bundle would do the trick.

stranak commented 1 year ago

A real-life example: colleagues are right now submitting a paper to the Glossa journal. From the guidelines:

Data Availability/ Supplementary Files (if applicable) The journal requires authors to make all data associated with their submission openly available, according to the FAIR principles (Findable, Accessible, Interoperable, Reusable). More information can be found on the Journal Policies page. If data/supplementary files are to be associated with the submission, one of the below options should be followed: 1) upload the files to your chosen open repository and make note of the DOI that they will provide (most suitable for datasets or information that act as foundations to the research being published. This option makes the files more findable and more citable). We recommend an open repository such as osf.io, which allows you to create a "project" under which you can upload relevant files (datasets, analysis scripts, experimental materials, etc.). The project will be associated with a unique DOI. You can then include in your manuscript a citation of the OSF entry and/or a link to the project page on OSF, to direct interested readers to the supplementary materials. During review, please be sure that the link to the repository is anonymized to maintain a fully double masked review process. Instructions for doing this on the OSF may be found here. If you'd like to learn more about best practices for ensuring reproducibility, see Laurinavichyute and Vasishth (2021). Please contact us if you would like more information or advice about hosting your data on an open repository.

The above text also contains a very relevant link to how the OSF lets you create "view-only links"; that dialogue also asks whether to anonymize the view.
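To make the "view-only link" idea concrete, here is a minimal sketch of how such a link could be modelled on our side. Everything in it is an assumption for illustration: the `ViewOnlyLink` class, the in-memory `_links` registry, and the function names are hypothetical and are neither OSF's nor DSpace's API.

```python
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ViewOnlyLink:
    token: str        # unguessable secret that goes into the URL
    item_id: str      # the item this link grants read access to
    anonymized: bool  # whether author-related metadata is hidden in this view


# In-memory registry standing in for whatever storage the repository would use.
_links = {}


def create_view_only_link(item_id: str, anonymized: bool = True) -> ViewOnlyLink:
    """Issue a random token that grants read-only access to a single item."""
    token = secrets.token_urlsafe(32)
    link = ViewOnlyLink(token=token, item_id=item_id, anonymized=anonymized)
    _links[token] = link
    return link


def resolve(token: str) -> Optional[ViewOnlyLink]:
    """Look up a token; unknown tokens grant nothing."""
    return _links.get(token)
```

The point of the sketch is only that access is tied to the token rather than to a user account, and that the anonymization choice is stored with the link.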

stranak commented 1 year ago

@vidiecan @kosarko Do you think it makes sense to implement it now, or shall we handle one or two records of this type manually for now (private record, manual anonymisation, some form of view permission, e.g. "reviewer/review") and postpone this for the new version with the new UI?

vidiecan commented 1 year ago

Latest use case requirements:

[ ] after submitting/(our internal) review, the item should be accessible only via link (not via search, oai-pmh, ...)
[ ] should have PID
[ ] hide specific metadata - author names, affiliations, ...
[ ] record in provenance why the changes are happening (so that we are not confused a year from now when we see a strange upload/change of data/metadata by our curators)
[ ] (optional) verify author(s) are not mentioned in the data itself
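
The optional author check in the last item could start as a simple text scan over the uploaded files. This is only a sketch under assumptions: `find_author_mentions` is a hypothetical helper, it only looks at files that decode as UTF-8 text, and it matches names literally (case-insensitively), so initials, transliterations, or names inside archives and binary formats would be missed.

```python
import re
from pathlib import Path


def find_author_mentions(root: Path, author_names: list) -> list:
    """Return (relative path, line number, matched text) for each hit under root."""
    if not author_names:
        return []
    # One alternation over all names, escaped so they are matched literally.
    pattern = re.compile("|".join(re.escape(n) for n in author_names), re.IGNORECASE)
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in pattern.finditer(line):
                hits.append((str(path.relative_to(root)), lineno, match.group(0)))
    return hits
```

A curator could run this on the unpacked submission and review the hits by hand; an empty result does not prove the data is anonymous.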

And once the item is to be published, we should show the hidden metadata, make it public, and update provenance.
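
The two transitions (start of review, publication) can be sketched as follows. This is a toy model under assumptions, not DSpace code: the `Item` class, the function names, and the `local.affiliation` field name are hypothetical. The values stay stored on the item; only their visibility changes, and every transition leaves a provenance note.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Illustrative field names; the real schema may differ.
REVIEW_HIDDEN_FIELDS = frozenset({"dc.contributor.author", "local.affiliation"})


@dataclass
class Item:
    metadata: dict
    discoverable: bool = True  # False: reachable by direct link only
    hidden_fields: set = field(default_factory=set)
    provenance: list = field(default_factory=list)


def _log(item, message):
    """Append a timestamped provenance note."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    item.provenance.append(f"{stamp}: {message}")


def start_anonymous_review(item, curator):
    """Hide author-related fields and take the item out of search and harvesting."""
    item.hidden_fields = set(REVIEW_HIDDEN_FIELDS) & set(item.metadata)
    item.discoverable = False
    _log(item, f"{curator} hid {sorted(item.hidden_fields)} for anonymous review")


def publish(item, curator):
    """Reveal the hidden fields, make the item public, and record why."""
    revealed = sorted(item.hidden_fields)
    item.hidden_fields = set()
    item.discoverable = True
    _log(item, f"{curator} published item; revealed {revealed}")


def visible_metadata(item):
    """Metadata as an anonymous visitor would see it."""
    return {k: v for k, v in item.metadata.items() if k not in item.hidden_fields}
```

Because nothing is deleted, publishing is a pure visibility change and the provenance log explains both steps to whoever looks at the record later.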

kosarko commented 1 year ago

There are some drawbacks to the current (manual) approach, where we remove the metadata (and later add them back). Namely curation and exports. Maybe we should use the word anonymized (or similar) to indicate the value is known but redacted. The workflow should take into account especially:

refbox (suggested citation)
oai-pmh exports (e.g. ELG expects provider/publisher)
curation tasks checking metadata completeness etc.
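
A rough sketch of what "known but redacted" could mean for these three consumers, assuming the values stay stored and are only masked on output. The placeholder wording, the required-field list, and all function names are assumptions for illustration, not existing repository code.

```python
REDACTED = "(anonymized)"  # placeholder wording is an assumption

# Illustrative list of fields a completeness check might require.
REQUIRED_FIELDS = ("dc.title", "dc.contributor.author", "dc.publisher")


def missing_required(metadata):
    """Completeness check on the stored values, so a redacted field never counts as missing."""
    return [f for f in REQUIRED_FIELDS if not metadata.get(f)]


def export_view(metadata, hidden_fields):
    """Metadata as exported: hidden fields collapse to a single placeholder value."""
    return {
        key: ([REDACTED] if key in hidden_fields else list(values))
        for key, values in metadata.items()
    }


def suggested_citation(metadata, hidden_fields):
    """A very simplified refbox line built from the exported view."""
    view = export_view(metadata, hidden_fields)
    authors = "; ".join(view.get("dc.contributor.author", []))
    title = "; ".join(view.get("dc.title", []))
    return f"{authors}. {title}."
```

Collapsing all hidden values to one placeholder also avoids leaking how many authors there are.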

stranak commented 1 year ago

> There are some drawbacks to the current (manual) approach, where we remove the metadata (and later add them back). Namely curation and exports.

Completely agree that it must be "there" (filled-in already), but hidden.

> Maybe we should use the word anonymized (or similar) to indicate the value is known but redacted. The workflow should take into account especially:
>
> refbox (suggested citation)
> oai-pmh exports (e.g. ELG expects provider/publisher)
> curation tasks checking metadata completeness etc.

OR ... these features should be all simply disabled. E.g. there is no good reason to cite the dataset under review, i.e. unpublished. On the contrary, we should discourage it. The same goes for any harvesting, imho.

stranak commented 4 months ago

I have added High Priority because we have seen this use case several times lately, and there is also a request from CU for it.

ufal / clarin-dspace

Create a workflow for anonymous review of a dataset #1020