Retrieve datasets to merlin

sbliven commented 1 year ago

Feature Request

We would like to add an option to retrieve datasets to Merlin. Currently there is a 'PSI-ra' option when retrieving from SciCat. We would like to support similar functionality for Merlin and other central archiving locations.

Ra implementation

(Please edit if any of this information is incorrect)

The current PSI-ra retrieval workflow is as follows:

Each ra pgroup has a 'retrieve' directory owned by the retrieval service user

SciCat creates a retrieval job:

{
  "id": "c0a7cab3-acd7-4474-be75-b81024c775c8",
  "emailJobInitiator": "spencer.bliven@psi.ch",
  "type": "retrieve",
  "jobParams": {
    "username": "oidc.bliven_s",
    "destinationPath": "/archive/retrieve",
    "option": "PSI-RA"
  },
  "jobStatusMessage": "finishedSuccessful",
  "datasetList": [
    {
      "pid": "20.500.11935/a1704aba-285b-4f95-b48d-36a10930694f",
      "files": []
    }
  ],
  "jobResultObject": {
    "result": {
      "rc": "0",
      "jobid": "76033"
    }
  }
}

Arima fetches the data from tape, places it in /das/work/<pgroup>/retrieve/<user>/<pid> and reports success
Users copy/move the data to the desired destination

Permissions rely on ACLs to allow both the service user and the pgroup members to access the directory.
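To make the ACL setup concrete, here is a minimal sketch of the commands involved. It assumes POSIX ACLs managed with `setfacl`; the filesystem on ra may use a different ACL mechanism, and the function name, service user, and group names are hypothetical placeholders rather than the real configuration. The function only builds the command lines and does not run them.

```python
from typing import List


def build_acl_commands(directory: str, service_user: str, group: str) -> List[List[str]]:
    """Build setfacl invocations granting rwx to a service user and a group.

    Assumption: POSIX ACLs. The user and group names are placeholders.
    """
    entries = f"u:{service_user}:rwx,g:{group}:rwx"
    return [
        # Access ACL on the directory itself.
        ["setfacl", "-m", entries, directory],
        # Default ACL, inherited by files and directories created inside it.
        ["setfacl", "-d", "-m", entries, directory],
    ]


if __name__ == "__main__":
    for cmd in build_acl_commands("/das/work/p12345/retrieve", "retrieve-svc", "p12345"):
        print(" ".join(cmd))
```

Setting the default ACL as well matters here: without it, files written by the service user would not automatically be accessible to the group members.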

Differences to Merlin

Merlin does not use DUO or pgroups. Most users use a-groups and may archive from user directories or project directories, which do not correspond 1:1 with a-groups. This means that a mechanism must be added to allow users to select a path when retrieving a dataset.

Implementation steps

The minimal implementation in the backend would require:

A way to grant the service user write access to the destination folder.
- At first this could be a fixed retrieve directory for each project, as on ra.
- Better would be a script that sets the appropriate permissions/ACLs on whatever directory the user specifies. This could be incorporated into the datasetRetriever tool, and could validate some permissions at run time (e.g. that the user has permission to read the dataset and permission to write to the destination folder to clean up).
Modify Job model in REST API to capture destination server and path
Modify Arima to write to the correct server and path
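As a rough illustration of the Job model change, the sketch below builds a retrieval job in the same shape as the PSI-RA example above, with the destination made explicit. The `destinationServer` field, the `"MERLIN"` option value, and the function name are assumptions made for illustration; none of them exist in SciCat today. The path check is deliberately simple and would not replace server-side validation.

```python
import posixpath
from typing import Any, Dict, List


def build_retrieve_job(
    email: str,
    username: str,
    pids: List[str],
    destination_server: str,
    destination_path: str,
    option: str = "MERLIN",  # hypothetical option name
) -> Dict[str, Any]:
    """Build a retrieval job payload with an explicit destination."""
    if not posixpath.isabs(destination_path):
        raise ValueError("destinationPath must be absolute")
    if ".." in destination_path.split("/"):
        raise ValueError("destinationPath must not contain '..'")
    return {
        "emailJobInitiator": email,
        "type": "retrieve",
        "jobParams": {
            "username": username,
            "destinationServer": destination_server,  # hypothetical new field
            "destinationPath": posixpath.normpath(destination_path),
            "option": option,
        },
        # An empty "files" list mirrors the existing example job.
        "datasetList": [{"pid": pid, "files": []} for pid in pids],
    }
```

Keeping the server and the path as separate fields would let Arima choose the target system without parsing the path.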

Front-end changes:

datasetRetriever modifications to set up the directory, validate settings, and pass the correct paths to SciCat
New SciCat retrieval option with a field for the destination
(Optional) File browser on SciCat to select the files. This would probably require a microservice running somewhere with access to all the central filesystems which would validate user permissions and return file lists.

minottic commented 1 year ago

@sbliven when would you need this to be implemented? It will likely need a meeting with Krisz, Pedro and Michael (and us). Could you please schedule it depending on its urgency? Thanks.

sbliven commented 1 year ago

Here's an initial diagram for how the microservice I mention above might work. This "storage service" would run on the storage system and provide endpoints for the following queries:

Check if a filesystem is mounted centrally from this storage
List writable filesystems for a particular user
File browser/navigation (basically wraps ls and cd for central locations, taking user permissions into account)

SciCat would also need to implement an endpoint for checking which storage systems a user has access to (looking ahead to having non-PSI users in the system).
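The file browser query could be sketched roughly as follows. This is only an illustration of the directory listing with a guard against escaping the central root; the function name and the returned fields are hypothetical. Note that `os.access` checks the permissions of the process running the service, not of the requesting user, so a real service would need a separate way to evaluate per-user permissions.

```python
import os
from pathlib import Path
from typing import Dict, List


def list_directory(root: str, relative: str = "") -> List[Dict[str, object]]:
    """List entries under root/relative, refusing paths outside root."""
    base = Path(root).resolve()
    target = (base / relative).resolve()
    # Reject "../" tricks, absolute paths, and symlinks that leave the root.
    if target != base and base not in target.parents:
        raise PermissionError(f"{relative!r} is outside the central root")
    entries = []
    for child in sorted(target.iterdir(), key=lambda p: p.name):
        entries.append(
            {
                "name": child.name,
                "is_dir": child.is_dir(),
                # Reflects the service process, not the requesting user.
                "writable": os.access(child, os.W_OK),
            }
        )
    return entries
```

Resolving both paths before comparing them means symlinks pointing outside the central root are rejected along with plain `..` components.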

[Figure: storage service flowchart]
