Easily access datasets on Rucio data lake

matbun commented 5 months ago

Add a Python function capable of translating a namespaced Rucio dataset/file to the absolute path on the local filesystem of the datacenter (e.g., HPC) on which the code is currently running.

Something like namespace_to_path('jdoe:physics_dataset') returning:

'/dacache/slling.si/.../physics_dataset' when on HPC1
'/other/path/.../physics_dataset' when on HPC2
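A minimal sketch of what such a function could look like. It rests on several assumptions that are not in this issue: the mount point of the local storage is read from a hypothetical environment variable `RUCIO_LOCAL_RSE_ROOT` (or passed explicitly), and files are laid out as `<root>/<scope>/<name>`. Real storage elements may use a different path scheme, so the last line would need adapting per site.

```python
import os
import posixpath
from typing import Optional

# Hypothetical environment variable holding the mount point of the local storage.
RSE_ROOT_ENV = "RUCIO_LOCAL_RSE_ROOT"


def namespace_to_path(did: str, rse_root: Optional[str] = None) -> str:
    """Translate a 'scope:name' identifier into a path under the local mount."""
    scope, sep, name = did.partition(":")
    if not sep or not scope or not name:
        raise ValueError(f"Expected 'scope:name', got {did!r}")
    # An explicit argument wins over the environment variable.
    root = rse_root or os.environ.get(RSE_ROOT_ENV)
    if not root:
        raise RuntimeError(f"Set {RSE_ROOT_ENV} or pass rse_root explicitly")
    # Assumed layout: <root>/<scope>/<name>. The real layout may differ per site.
    return posixpath.join(root, scope, name)
```

With this design the code itself stays identical on every datacenter; only the environment variable changes between sites.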

matbun commented 5 months ago

@garciagenrique

garciagenrique commented 5 months ago

Hello @matbun,

After speaking with a few people at CERN, it seems there are two "main" ways to interact with Rucio data:

1. Download the desired dataset to the local machine.
2. Create a replication rule so that the files become available on the "local" RSE (Rucio Storage Element), i.e., the distributed storage that should exist at each data center (and that should be mounted when you are logged in).

Option 1 takes much more time than option 2. Furthermore, you would need to keep an internet connection open during the whole download.

Therefore, we should go with option 2.
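As a rough sketch, option 2 could be driven from Python along these lines. The helper name `request_local_replica` and the RSE name in the usage comment are hypothetical; the only Rucio call assumed is `add_replication_rule` on a `rucio.client.Client` instance, which is passed in so that the helper itself does not depend on a configured Rucio installation.

```python
from typing import Any


def request_local_replica(client: Any, did: str, rse_expression: str, copies: int = 1):
    """Ask Rucio to replicate `did` ('scope:name') to RSEs matching `rse_expression`."""
    scope, sep, name = did.partition(":")
    if not sep or not scope or not name:
        raise ValueError(f"Expected 'scope:name', got {did!r}")
    # `client` is expected to behave like rucio.client.Client.
    return client.add_replication_rule(
        dids=[{"scope": scope, "name": name}],
        copies=copies,
        rse_expression=rse_expression,
    )


# Usage, assuming a configured Rucio client and a hypothetical RSE name:
#   from rucio.client import Client
#   request_local_replica(Client(), "jdoe:physics_dataset", "SOME_LOCAL_RSE")
```

Once the rule is satisfied, the files should be readable from the mounted storage without any explicit download.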

I can already create a small bash script for VEGA that symlinks all the dataset files into a txt file, which we would then need to adapt for each of the data centers. Step by step ;-).
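For illustration only, here is one possible reading of that script expressed in Python rather than bash: symlink every dataset file into a working directory and record the links in a text file. The function name, its arguments, and the idea that the physical file paths are already known are all assumptions, not the actual script.

```python
import os
from pathlib import Path
from typing import Iterable, List


def link_dataset_files(sources: Iterable[str], link_dir: str, listing: str) -> List[str]:
    """Symlink each source file into `link_dir` and list the links in `listing`."""
    target = Path(link_dir)
    target.mkdir(parents=True, exist_ok=True)
    links = []
    for src in sources:
        link = target / Path(src).name
        if link.is_symlink() or link.exists():
            link.unlink()  # replace a stale entry left by a previous run
        os.symlink(src, link)
        links.append(str(link))
    # One link path per line, so other tools can read the listing easily.
    Path(listing).write_text("\n".join(links) + "\n")
    return links
```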

Let me know where I can add this script within itwinai.

matbun commented 5 months ago

I have created a new tutorial folder on a new branch: https://github.com/interTwin-eu/itwinai/tree/156-easily-access-datasets-on-rucio-data-lake/tutorials/data-lake/pull-dataset

@garciagenrique could you please add an example of "option 2" with some documentation? The goal is to give such an example to the interTwin use cases, so that they can reproduce it for their datasets. Perhaps a couple of links to the Rucio docs would help as well.

Thanks!
