neuropoly / data-management

Repo that deals with datalad aspects for internal use

Working with some git-annexed datasets on `~/duke/temp` is extremely slow #304

Closed: valosekj closed this issue 2 months ago

valosekj commented 3 months ago

Working with some git-annexed datasets on `~/duke/temp` is extremely slow. A simple `git status` can take >12s.

canproco - super slow: >12s

p118175@joplin:~/duke/temp/janvalosek/ssl_pretraining_multiple_datasets/canproco$ time git status
On branch master
Your branch is up to date with 'origin/master'.

It took 12.01 seconds to enumerate untracked files.
See 'git help status' for information on how to improve this.

nothing to commit, working tree clean
git status  0.29s user 4.14s system 33% cpu 13.419 total

dcm-zurich - slow: >2s

p118175@joplin:~/duke/temp/janvalosek/ssl_pretraining_multiple_datasets/dcm-zurich$ time git status
On branch master
Your branch is up to date with 'origin/master'.

It took 2.27 seconds to enumerate untracked files.
See 'git help status' for information on how to improve this.

nothing to commit, working tree clean
git status  0.06s user 0.52s system 18% cpu 3.257 total

sci-colorado - fine

p118175@joplin:~/duke/temp/janvalosek/ssl_pretraining_multiple_datasets/sci-colorado$ time git status
On branch master
Your branch is up to date with 'origin/master'.

nothing to commit, working tree clean
git status  0.00s user 0.22s system 13% cpu 1.619 total
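
The "It took ... seconds to enumerate untracked files" hint in the output above points at the untracked-file scan. As a minimal sketch, assuming the scan itself is the bottleneck, these are two standard Git settings that reduce that cost. The demo runs on a throwaway repository (the `demo` directory and file name are illustrative only) and does not address the file-permission problems on duke.

```shell
#!/bin/sh
# Sketch: two standard Git knobs that reduce the cost of enumerating
# untracked files. Runs on a throwaway repository, not on the datasets
# from this issue.
set -eu
demo=$(mktemp -d)
git init -q "$demo"
cd "$demo"
touch untracked.txt

# Option 1: skip the untracked-file scan for a single call.
status_fast=$(git status --porcelain --untracked-files=no)

# Option 2: ask Git to cache untracked-file results for this repository.
git config core.untrackedCache true
cache_setting=$(git config --get core.untrackedCache)

echo "status without untracked scan: '${status_fast}'"
echo "core.untrackedCache=${cache_setting}"
```

The untracked cache depends on directory modification times behaving normally, which may not hold on a network mount; `git update-index --test-untracked-cache` checks whether the file system supports it.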

mguaypaq commented 3 months ago

Never work with git-annexed datasets on duke. The file permissions on duke are broken in a way git-annex doesn't expect, and repositories get silently corrupted.

Since Nick added a bunch of large hard drives on several servers, there should be enough space to work with large datasets inside your home directory.

I would recommend not pushing from these clones under duke/temp, because the results may be strange. Instead, make a new clone outside of duke, run `git annex dead here`, and copy the modified files over.

valosekj commented 3 months ago

Sorry, I should have mentioned that I intend to use ~/duke/temp only to read the datasets to train DL models. I do not plan to modify them or push. I chose ~/duke/temp because both @naga-karthik and I need access to the same directory with all the cloned datasets.

jcohenadad commented 3 months ago

even with read access, duke should not be used for data analysis: duke HDs are designed for storage.

for analysis, always use local HD on servers

if you’d like to work with other ppl on the same dataset, you can use a local shared folder (eg /scratch, if there is one). @mguaypaq can you pls check if there is a scratch and if it is documented on neuro.internal? if not pls tag Nathan or Emma. alternatively to scratch, have home folder be readable by other students. i let Nathan and Emma decide what’s best. thx

mguaypaq commented 3 months ago

I don't think we have any "shared temp" spaces set up right now, other than duke/temp/.

@namgo and @nullnik-0, do you have any thoughts on the logistics (physical storage space, easy to use, self-service) for shared folders? In theory that's what unix groups are supposed to be about, but I've never seen it work in practice, the details are tricky, and it involves a lot of manual intervention from sysadmins.

I've seen someone do an inter-user file transfer at the lab through /tmp folders and getfacl/setfacl, but /tmp shouldn't be used for large, long-lived folders. (And setfacl permissions are hilariously easy to not notice and/or forget, leading to much head-scratching later on.)
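
For reference, the getfacl/setfacl approach mentioned above looks roughly like this. This is only a sketch: the directory is a temporary one, and `OTHER_USER` is a hypothetical variable standing in for the colleague's login (it falls back to the current user so the demo runs anywhere).

```shell
#!/bin/sh
# Sketch: grant another user read access to a directory with POSIX ACLs,
# then inspect the result. OTHER_USER is a hypothetical placeholder.
set -eu
shared=$(mktemp -d)
other_user=${OTHER_USER:-$(id -un)}

# rX = read, plus traverse/execute only where it already makes sense.
if command -v setfacl >/dev/null 2>&1 \
   && setfacl -m "u:${other_user}:rX" "$shared" 2>/dev/null; then
    # -c omits the header comments so only the ACL entries are shown.
    acl_report=$(getfacl -c "$shared" 2>/dev/null) || acl_report="getfacl failed"
else
    acl_report="ACLs unavailable on this system or file system"
fi

echo "$acl_report"
```

The only hint in `ls -l` that an ACL is present is a trailing `+` after the permission bits, which is part of why these entries are easy to overlook.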

Would it be possible to set up something like /home/scratch/, which uses the same storage as /home, and everyone can read/write, and their files still count against their disk quota?

jcohenadad commented 3 months ago

for the sake of consistency with digital alliance's modus operandi, i suggest putting scratch/ at the root

namgo commented 3 months ago

/scratch should be doable on joplin. I agree with Mathieu in that access lists are maybe not the ideal way to go.

@valosekj would global read/write be okay for this for now? Even if yes, we'll need a better long-term solution.

valosekj commented 3 months ago

> @valosekj would global read/write be okay for this for now? Even if yes, we'll need a better long-term solution.

Yes, I guess so. The folder will contain our git-annex datasets (i.e., nothing private).

namgo commented 3 months ago

@valosekj I'm looking into this now and like Mathieu was hinting at... this is a trickier problem than it appears at first.

Would it be more sensible to collaborate within our gitea server for data sharing?

namgo commented 3 months ago

I can look into setting up access control lists to ensure that you and Naga get shared permissions on a folder. I just want to confirm that what we have already isn't going to work for your needs.

valosekj commented 3 months ago

> Would it be more sensible to collaborate within our gitea server for data sharing?

Can you please explain a little bit more what you mean by this solution?

Basically, we need a single directory that we can both access from romane.

naga-karthik commented 3 months ago

Just following-up on this -- it is starting to get more priority as we have common datasets to use across projects. Is there any update on what's the final solution we're going ahead with?

namgo commented 2 months ago

I talked to Mathieu and I was definitely wrong about the idea of using gitea for data sharing, it wouldn't fit your needs at all.

I'm assigning this high priority.

Could you let me know how much space you'll need?

valosekj commented 2 months ago

> Could you let me know how much space you'll need?

@naga-karthik and I believe that 500GB should be a good start.

namgo commented 2 months ago

Yeah! I heard from Naga about this. If in the future you need more scratch space we might do network mounts.

For now, I think I'm super close to a solution, just confirming it with @nullnik-0.

namgo commented 2 months ago

It appears to work as expected.

Tracked in https://github.com/neuropoly/computers/issues/703 and I've now added /home/scratch.img to fstab.
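
The thread does not say how /home/scratch.img was created or where it is mounted. As a rough illustration only, a fixed-size scratch area backed by an image file is often set up along the following lines. The mount point, file system type, and mount options below are assumptions; the image path and the 500G size come from this thread. The script is a dry run that prints the commands and changes nothing.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands an admin might run as root.
# Nothing is created, formatted, or mounted by this script.
set -eu
image=/home/scratch.img   # path named in this thread
mount_point=/scratch      # assumption: the mount point is not stated
size=500G                 # size requested earlier in the thread

# fstab fields: device, mount point, fs type, options, dump, fsck order.
fstab_line="${image} ${mount_point} ext4 loop,defaults 0 2"

cat <<EOF
fallocate -l ${size} ${image}
mkfs.ext4 ${image}
mkdir -p ${mount_point}
echo '${fstab_line}' >> /etc/fstab
mount ${mount_point}
chmod 1777 ${mount_point}
EOF
```

Mode 1777 gives everyone read/write access with the sticky bit set (as on /tmp), so users cannot delete each other's files. Because the image has a fixed size, the shared area cannot grow past it.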

valosekj commented 2 months ago

Thank you @namgo ! 🚀