usegalaxy-eu / project-ideas

A collection of project ideas suitable for Master's and Bachelor's students
MIT License

Investigating and benchmarking distributed filesystems for the European Galaxy server #21

Open bgruening opened 3 years ago

bgruening commented 3 years ago


Supervisor: Gianmauro Cuccuru, Compute Center Freiburg
For degree: Bachelor/Project/Master
Status: Open
Keywords: S3, NetApp, OneData, iRODS

Global Research context

With the Pulsar network (https://pulsar-network.readthedocs.io), we have a distributed compute infrastructure in place that can schedule compute jobs across the globe. The logical next step is to find a distributed storage component that is reliable and scalable across different clouds and HPC centers, and ultimately to integrate it into the European Galaxy infrastructure.

Project context

S3, OneData and iRODS are three candidates that should be evaluated in the context of the European Galaxy server use case. We have access to all three technologies; OneData and iRODS are even available as distributed European deployments. We aim to benchmark these solutions and evaluate which one best fits our use case.

Proposed agenda for the project

  1. Develop an automated benchmark procedure for S3, OneData and iRODS for a few typical use cases:
    • writing small files
    • writing big files
    • reading small files
    • reading big files
    • local filesystem
    • remote filesystems in other countries
    • different file formats: HDF5 vs. Zarr vs. NetCDF
  2. Check the failure tolerance by running the automated benchmark while tearing down storage locations
  3. Set up Galaxy to use those extended object stores
    • S3 natively
    • OneData natively
    • iRODS natively
  4. (integrate into the Pulsar Network)
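As a starting point for step 1, the read/write cases could be sketched roughly as below. This is only a minimal sketch under some assumptions: it presumes that each storage system under test is reachable as a mounted directory (which may not hold for every backend; S3 in particular would normally be accessed through a client library instead), and the function name `benchmark_io`, the file naming scheme, and the chosen file sizes are all hypothetical.

```python
import os
import tempfile
import time
from pathlib import Path


def benchmark_io(target_dir, file_size, file_count):
    """Write, then read back, `file_count` files of `file_size` bytes each.

    `target_dir` is assumed to be a directory on the storage under test
    (hypothetical setup: a local path or a mounted remote filesystem).
    """
    target = Path(target_dir)
    payload = os.urandom(file_size)  # random data, so compression cannot help

    # Write phase: fsync each file so the data leaves the OS write cache.
    start = time.perf_counter()
    for i in range(file_count):
        with open(target / f"bench_{i}.bin", "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())
    write_seconds = time.perf_counter() - start

    # Read phase: read every file back completely.
    start = time.perf_counter()
    bytes_read = 0
    for i in range(file_count):
        with open(target / f"bench_{i}.bin", "rb") as handle:
            bytes_read += len(handle.read())
    read_seconds = time.perf_counter() - start

    # Clean up so repeated runs start from an empty directory.
    for i in range(file_count):
        (target / f"bench_{i}.bin").unlink()

    total_mib = file_size * file_count / 2**20
    return {
        "file_size": file_size,
        "file_count": file_count,
        "bytes_read": bytes_read,
        "write_seconds": write_seconds,
        "read_seconds": read_seconds,
        "write_mib_per_s": total_mib / max(write_seconds, 1e-9),
        "read_mib_per_s": total_mib / max(read_seconds, 1e-9),
    }


if __name__ == "__main__":
    # A temporary local directory stands in for the mounted storage here.
    with tempfile.TemporaryDirectory() as scratch:
        cases = [("small files", 4 * 1024, 200), ("big files", 16 * 2**20, 2)]
        for label, size, count in cases:
            print(label, benchmark_io(scratch, size, count))
```

Note that reading files straight after writing them will mostly measure the local page cache; a more realistic run would read from a different node or after the cache has been dropped. Running the same function against mounts in other countries, or repeatedly while storage locations are torn down, would cover the remote and failure-tolerance cases in the agenda.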

Prerequisites