paulscherrerinstitute / scicat-ci

CI related information to deploy SciCat

Retrieve datasets to merlin #149

Open sbliven opened 1 year ago

sbliven commented 1 year ago

Feature Request

We would like to add an option to retrieve datasets to Merlin. Currently there is a 'PSI-ra' option when retrieving from SciCat. We would like to support similar functionality for Merlin and other central archiving locations.

Ra implementation

(Please edit if any of this information is incorrect)

The current PSI-ra retrieval workflow is as follows:

  1. Each ra pgroup has a 'retrieve' directory owned by the retrieval service user
  2. SciCat creates a retrieval job:
    {
      "id": "c0a7cab3-acd7-4474-be75-b81024c775c8",
      "emailJobInitiator": "spencer.bliven@psi.ch",
      "type": "retrieve",
      "jobParams": {
        "username": "oidc.bliven_s",
        "destinationPath": "/archive/retrieve",
        "option": "PSI-RA"
      },
      "jobStatusMessage": "finishedSuccessful",
      "datasetList": [
        {
          "pid": "20.500.11935/a1704aba-285b-4f95-b48d-36a10930694f",
          "files": []
        }
      ],
      "jobResultObject": {
        "result": {
          "rc": "0",
          "jobid": "76033"
        }
      }
    }
  3. Arima fetches the data from tape, places it in /das/work/<pgroup>/retrieve/<user>/<pid> and reports success
  4. Users copy or move the data to the desired destination

Permissions rely on ACLs to allow both the service user and the pgroup members to access the directory.
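
For reference, here is a minimal Python sketch of how a client might assemble the request body for such a job. The field names are copied from the example job above. The function name `build_retrieve_job` is made up for illustration, and I'm assuming that `id`, `jobStatusMessage` and `jobResultObject` are filled in by the server rather than by the client.

```python
import json


def build_retrieve_job(email, username, pids,
                       destination_path="/archive/retrieve",
                       option="PSI-RA"):
    """Assemble a retrieve-job body using the fields from the example job.

    Hypothetical helper: server-side fields (id, jobStatusMessage,
    jobResultObject) are deliberately left out.
    """
    if not pids:
        raise ValueError("at least one dataset pid is required")
    return {
        "emailJobInitiator": email,
        "type": "retrieve",
        "jobParams": {
            "username": username,
            "destinationPath": destination_path,
            "option": option,
        },
        # The empty "files" list mirrors the example job shown above.
        "datasetList": [{"pid": pid, "files": []} for pid in pids],
    }


job = build_retrieve_job(
    "spencer.bliven@psi.ch",
    "oidc.bliven_s",
    ["20.500.11935/a1704aba-285b-4f95-b48d-36a10930694f"],
)
print(json.dumps(job, indent=2))
```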

Differences to merlin

Merlin does not use DUO or pgroups. Most users use a-groups and may archive from user directories or project directories, which do not correspond 1:1 with a-groups. This means that a mechanism must be added to allow users to select a path when retrieving a dataset.

Implementation steps

The minimal implementation in the backend would require:

  1. A way to grant the service user write access to the destination folder.
    • At first this could be a fixed retrieve directory for each project, as on ra.
    • Better would be a script that sets the appropriate permissions/ACLs on whatever directory the user specifies. This could be incorporated into the datasetRetriever tool, and could validate some permissions at run time (e.g., that the user has permission to read the dataset and permission to write to the destination folder to clean up).
  2. Modify the Job model in the REST API to capture the destination server and path
  3. Modify Arima to write to the correct server and path
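
A rough sketch of what the permission script in step 1 could look like, in Python. This is only an illustration under assumptions: the service account name `retrieval-svc` is hypothetical, the ACL is granted through the standard `setfacl -m u:<user>:rwx <dir>` call (built here but not executed), and the run-time validation only checks the destination side with `os.access`. Checking read access to the dataset itself would need SciCat and is not shown.

```python
import os
import tempfile


def setfacl_command(service_user, directory):
    """Build (but do not run) a setfacl call granting rwx to the service user."""
    return ["setfacl", "-m", f"u:{service_user}:rwx", directory]


def check_destination(directory):
    """Return a list of problems that would block a retrieval into directory."""
    problems = []
    if not os.path.isabs(directory):
        problems.append("destination must be an absolute path")
    elif not os.path.isdir(directory):
        problems.append("destination is not an existing directory")
    elif not os.access(directory, os.W_OK | os.X_OK):
        problems.append("current user cannot write to destination")
    return problems


# Demonstrate on a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as dest:
    print(check_destination(dest))
    # "retrieval-svc" is a placeholder account name, not the real service user.
    print(setfacl_command("retrieval-svc", dest))
```

Returning a list of problems rather than raising on the first one would let datasetRetriever report everything that needs fixing in one go.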

Front-end changes:

  1. datasetRetriever modifications to set up the directory, validate settings, and pass the correct paths to SciCat
  2. New SciCat retrieval option with a field for the destination
  3. (Optional) File browser on SciCat to select the files. This would probably require a microservice running somewhere with access to all the central filesystems which would validate user permissions and return file lists.
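
To make the "destination" field a bit more concrete, here is a small Python sketch of how datasetRetriever might validate a user-chosen destination before passing it to SciCat. Everything new here is an assumption: the `destinationServer` key, the server identifiers in `KNOWN_SERVERS`, and the example path are placeholders, not existing SciCat fields or real Merlin paths.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Hypothetical identifiers for the supported central storage systems.
KNOWN_SERVERS = {"ra", "merlin"}


@dataclass(frozen=True)
class RetrieveDestination:
    server: str
    path: str

    def __post_init__(self):
        if self.server not in KNOWN_SERVERS:
            raise ValueError(f"unknown destination server: {self.server!r}")
        parsed = PurePosixPath(self.path)
        # Reject relative paths and '..' components before they reach the backend.
        if not parsed.is_absolute() or ".." in parsed.parts:
            raise ValueError(
                f"destination path must be absolute without '..': {self.path!r}"
            )

    def to_job_params(self, username):
        """Return a jobParams dict; 'destinationServer' is a proposed new key."""
        return {
            "username": username,
            "destinationServer": self.server,
            "destinationPath": self.path,
        }


# Placeholder path, purely for illustration.
dest = RetrieveDestination(server="merlin", path="/data/project/example/retrieve")
print(dest.to_job_params("oidc.bliven_s"))
```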

minottic commented 1 year ago

@sbliven when would you need this to be implemented? It will likely need a meeting with Krisz, Pedro and Michael (and us). Could you please schedule it depending on its urgency? Thanks.

sbliven commented 1 year ago

Here's an initial diagram for how the microservice I mention above might work. This "storage service" would run on the storage system and provide endpoints for the following queries:

SciCat would also need to implement an endpoint for checking which storage systems a user has access to (looking ahead to having non-PSI users in the system).
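
As a very rough illustration of the two lookups discussed here, the Python sketch below shows a per-user storage system query and a file listing that is confined to a served root directory. The function names, the `USER_SYSTEMS` table and its contents are all hypothetical. Real per-user permission validation is not shown, since it depends on how the service would map SciCat users to filesystem users.

```python
import tempfile
from pathlib import Path

# Hypothetical lookup table; a real deployment would query SciCat or a directory service.
USER_SYSTEMS = {"oidc.bliven_s": ["ra", "merlin"]}


def storage_systems_for(username):
    """Return the storage systems a user may retrieve to (empty if unknown)."""
    return USER_SYSTEMS.get(username, [])


def list_files(root, relative_path="."):
    """List entries under root/relative_path, refusing paths that escape root."""
    root_path = Path(root).resolve()
    target = (root_path / relative_path).resolve()
    if target != root_path and root_path not in target.parents:
        raise PermissionError(f"{relative_path!r} is outside the served root")
    # Mark directories with a trailing slash so a file browser can tell them apart.
    return sorted(
        entry.name + ("/" if entry.is_dir() else "")
        for entry in target.iterdir()
    )


# Demonstrate on a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    (Path(root) / "sub").mkdir()
    (Path(root) / "a.txt").write_text("example")
    print(storage_systems_for("oidc.bliven_s"))
    print(list_files(root))
```

Resolving both paths before comparing them is what blocks `..` and absolute-path tricks. A production service would additionally need to run the listing with the requesting user's own permissions.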

[Image: storage_service_flowchart]