simonsobs-uk / data-centre

This tracks the issues in the baseline design of the SO:UK Data Centre at Blackett
https://souk-data-centre.readthedocs.io
BSD 3-Clause "New" or "Revised" License

(De)duplicating data on grid storage systems #39

Closed ickc closed 4 months ago

ickc commented 7 months ago

As users are starting to share files with each other, we are now seeing people copying files within our VO (at root://bohr3226.tier2.hep.manchester.ac.uk:1094//dpm/tier2.hep.manchester.ac.uk/home/souk.ac.uk/).

What is the best practice here? We want to avoid paying for n copies of a file when n collaborators each make their own copy.

For example, when we run gfal-copy, does the copy cost twice as much storage, or does it do COW (copy-on-write) behind the scenes? If it is not COW, is there any way to make some sort of hard link so that the copy is cheap, similar to the cp --reflink behavior?

Would the answers to these be different once we migrate from DPM to Ceph?

Thanks.

rwf14f commented 7 months ago

The current storage does not support COW. Each logical file path is a physical file on one of the storage servers and requires the full amount of disk space.

Afaik, CephFS (the mountable POSIX filesystem part of Ceph) doesn't support COW / reflinks either. There's a 12-year-old feature request to support reflinks, but it doesn't look like it's being worked on.

ickc commented 7 months ago

Thanks. How about hard links and soft links? Also, is there any way to set a per-user disk quota?

I will document this in the next release.

We (DC people) should think more about disk usage policy and how to deal with duplications.
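As a starting point for measuring how much duplication there is, here is a minimal sketch that groups files by size and then by SHA-256 digest. It assumes the data is visible as a local POSIX directory tree (for example a mounted filesystem), which is not the case for the current DPM storage; there, checksums reported by the storage (e.g. via gfal-sum) would have to be compared instead. The function names `file_digest`, `find_duplicates` and `wasted_bytes` are made up for illustration.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Map (size, digest) -> sorted paths, for content that appears more than once."""
    # First pass: group by size, which is cheap and rules out most files.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks and anything that is not a regular file
            by_size[os.path.getsize(path)].append(path)

    # Second pass: hash only the files whose size is shared with another file.
    by_digest = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        for path in paths:
            by_digest[(size, file_digest(path))].append(path)
    return {key: sorted(paths) for key, paths in by_digest.items() if len(paths) > 1}


def wasted_bytes(duplicates):
    """Bytes that would be freed by keeping a single copy per group."""
    return sum(size * (len(paths) - 1) for (size, _digest), paths in duplicates.items())
```

Note that on a filesystem with hard links, two hard-linked paths would show up here as duplicates even though they share the same data, so this sketch overestimates the waste in that case.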

rwf14f commented 7 months ago

There are no hard / soft links on the current storage either. It is possible to have per-directory quotas, but they would be cumbersome to add to the existing storage and to enforce.

CephFS supports hard and soft links. Please be aware that there might be performance issues (see app best practices). CephFS also uses directory quotas.
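For reference, a minimal sketch of what hard and soft links look like on any POSIX filesystem, using only the Python standard library. It is generic POSIX behavior and not specific to CephFS; the helper names `share_by_link` and `same_inode` are hypothetical.

```python
import os


def share_by_link(source, hard_dest, soft_dest):
    """Create a hard link and a symbolic link that both point at `source`."""
    # Hard link: a second directory entry for the same inode, so no data is copied.
    os.link(source, hard_dest)
    # Soft link: a small separate file that only stores the target path.
    os.symlink(source, soft_dest)


def same_inode(a, b):
    """True if both paths resolve (following symlinks) to the same file."""
    sa, sb = os.stat(a), os.stat(b)
    return (sa.st_dev, sa.st_ino) == (sb.st_dev, sb.st_ino)
```

A hard link only works within one filesystem and keeps the data alive until the last link is removed, whereas a soft link dangles as soon as its target is deleted or moved.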

ickc commented 7 months ago

Great. Thanks.

This implies that to enforce a per-user disk quota, we need a per-user directory convention, say data/$USER or home/$USER.
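A rough sketch of how such a convention could be combined with directory quotas, assuming a mounted CephFS on Linux. CephFS exposes directory quotas through the `ceph.quota.max_bytes` extended attribute; the `data/<user>` layout and the helper names below are only illustrative, and setting the attribute requires suitable privileges on the mount.

```python
import os
import stat

QUOTA_XATTR = "ceph.quota.max_bytes"  # CephFS directory quota attribute


def user_dir(base, user):
    """Per-user directory under the hypothetical data/<user> convention."""
    return os.path.join(base, "data", user)


def set_directory_quota(path, max_bytes):
    """Set a byte quota on a CephFS directory (Linux only, needs a CephFS mount)."""
    os.setxattr(path, QUOTA_XATTR, str(max_bytes).encode())


def directory_usage(path):
    """Total size in bytes of regular files under `path`, counting each inode once."""
    seen = set()
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            if not stat.S_ISREG(st.st_mode):
                continue  # ignore symlinks and special files
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue  # hard links to an already counted file
            seen.add(key)
            total += st.st_size
    return total
```

`directory_usage` works on any POSIX tree and can be used to report per-user usage before any quota is actually enforced.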

We need to give more thought to how to share data. Maybe we ignore hard/soft links there (because they require discipline from users anyway) and just enforce per-user quotas, then provide another namespace, such as project, for collaboration-wide sharing, with write permission given to only some maintainers? (How?)

I will probably make a proposal in the form of documentation and discuss it at one of our internal weekly meetings.