neuropoly / data-management

Repo that deals with datalad aspects for internal use

New dataset `lumbar-vanderbilt` #284

Closed: NathanMolinier closed this issue 5 months ago

NathanMolinier commented 6 months ago

Description

I just pushed a new dataset, `lumbar-vanderbilt`, to our git-annex server (data.neuro.polymtl.ca). The new data is on a branch called `nm/first-commit`.

This dataset was shared by one collaborator in the context of gray matter segmentation for the lumbar region. This dataset contains:

Before merging, I believe we should wait for possible changes related to our data curation convention.

jcohenadad commented 5 months ago

Strangely, when I git clone the repo, it is already about 1 GB, before I even run git annex get:

julien-macbook:~/data.neuro $ git clone git@data.neuro.polymtl.ca:datasets/lumbar-vanderbilt
Cloning into 'lumbar-vanderbilt'...
remote: Enumerating objects: 1866, done.
remote: Counting objects: 100% (1866/1866), done.
remote: Compressing objects: 100% (1016/1016), done.
remote: Total 1866 (delta 379), reused 1843 (delta 374), pack-reused 0
Receiving objects: 100% (1866/1866), 1.09 GiB | 58.13 MiB/s, done.
Resolving deltas: 100% (379/379), done.

Is that normal?

NathanMolinier commented 5 months ago

> Is that normal?

With Mathieu we fixed the issue. I just forgot to call git annex before adding the data. The branch nm/first-commit was therefore deleted.

A new branch nm/first-commit-2 should now be available. The data was also updated to follow our new conventions. This branch is now ready to merge.

jcohenadad commented 5 months ago

> I just forgot to call git annex before adding the data.

It's not the first time this has happened, and it will likely happen again. I'm wondering if there is any check we can run to detect it, e.g. a cron job that runs on all the 'dataset' git repos and checks whether binaries are stored directly in .git as regular git objects instead of under .git/annex, or something like that? @mguaypaq @kousu
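A minimal sketch of such a check in Python, assuming a size heuristic: annexed files appear in git only as small symlinks or pointer files, so any blob above a threshold was probably committed without git-annex. The 1 MiB limit and the function names are hypothetical and not an existing script of ours.

```python
import subprocess

SIZE_LIMIT = 1024 * 1024  # hypothetical threshold (1 MiB), not a project setting


def parse_ls_tree(output):
    """Parse `git ls-tree -r -l -z` output into (mode, size, path) tuples."""
    entries = []
    for record in output.split("\0"):
        if not record:
            continue  # the output ends with a trailing NUL
        meta, path = record.split("\t", 1)
        mode, objtype, _sha, size = meta.split()
        if objtype != "blob":
            continue  # e.g. submodule commits, whose size is reported as "-"
        entries.append((mode, int(size), path))
    return entries


def find_large_blobs(entries, limit=SIZE_LIMIT):
    """Return (path, size) for every blob stored in git that exceeds the limit."""
    return [(path, size) for _mode, size, path in entries if size > limit]


def check_repo(repo_dir, ref="HEAD"):
    """List suspicious blobs reachable from `ref` in the repo at `repo_dir`."""
    out = subprocess.run(
        ["git", "-C", repo_dir, "ls-tree", "-r", "-l", "-z", ref],
        check=True, capture_output=True, text=True,
    ).stdout
    return find_large_blobs(parse_ls_tree(out))
```

A cron job could call `check_repo` on each dataset repo and send an alert whenever the returned list is not empty. This only inspects the tree at one ref, so a large file that was committed and later removed would still sit in the history unnoticed.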

kousu commented 5 months ago

datalad's default configuration annexes every single file; if you configure it to split files between git and the annex with .gitattributes, like we do (and that is the only setup that makes sense), datalad users hit this problem too. It's a basic result of stacking too many layers in one tool. We should reconsider #68.

In the meantime we can write some fscking scripts to catch these issues, and yes we should do that, but they would be a stopgap. Still, useful to get the alert at least! Maybe that's something @namgo has capacity for, come to think of it.

mguaypaq commented 5 months ago

Possibly I can add a git hook to the new repository template, that would refuse pushes with non-annexed files?
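A rough sketch of what that could look like, assuming a server-side pre-receive hook written in Python and the same kind of size heuristic (annexed files reach git only as small symlinks or pointer files). The 1 MiB threshold, the function names and the error message are all hypothetical.

```python
import subprocess
import sys

SIZE_LIMIT = 1024 * 1024  # hypothetical threshold (1 MiB), not a project setting


def is_null(sha):
    """True for the all-zero id git uses when a ref is created or deleted."""
    return set(sha) == {"0"}


def parse_updates(stdin_text):
    """Parse pre-receive input: one '<old> <new> <ref>' line per pushed ref."""
    updates = []
    for line in stdin_text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            updates.append(tuple(parts))
    return updates


def rev_list_args(old, new):
    """Arguments selecting only the objects introduced by this ref update."""
    if is_null(new):
        return None  # ref deletion: nothing new to inspect
    if is_null(old):
        return [new, "--not", "--all"]  # new ref: skip what the server already has
    return [f"{old}..{new}"]


def oversized_blobs(batch_check_output, limit=SIZE_LIMIT):
    """Pick large blobs out of `git cat-file --batch-check` lines ('<id> <type> <size>')."""
    bad = []
    for line in batch_check_output.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[1] == "blob" and int(parts[2]) > limit:
            bad.append((parts[0], int(parts[2])))
    return bad


def main(stdin_text):
    """Return 1 (reject the push) if any pushed ref introduces a large blob."""
    for old, new, ref in parse_updates(stdin_text):
        args = rev_list_args(old, new)
        if args is None:
            continue
        listed = subprocess.run(
            ["git", "rev-list", "--objects", *args],
            check=True, capture_output=True, text=True,
        ).stdout
        # Each line is '<id>' or '<id> <path>'; keep only the object ids.
        ids = "".join(line.split(" ", 1)[0] + "\n" for line in listed.splitlines())
        info = subprocess.run(
            ["git", "cat-file", "--batch-check"],
            input=ids, check=True, capture_output=True, text=True,
        ).stdout
        bad = oversized_blobs(info)
        if bad:
            for sha, size in bad:
                print(f"{ref}: blob {sha} is {size} bytes; was git annex skipped?",
                      file=sys.stderr)
            return 1
    return 0
```

Installed as the repository's `hooks/pre-receive`, the script would end with `sys.exit(main(sys.stdin.read()))`; a non-zero exit status makes git refuse the whole push.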

kousu commented 5 months ago

You might struggle because hooks are per instance of a git repo: they don't get cloned, and if we ever use Gitea, hooks don't get copied when a repo is forked. Would you just patch it into that shell script we copy-paste for people?

mguaypaq commented 5 months ago

I'm willing to re-fix the problem differently once Gitea is up and running. And I need to do manual intervention right now for every new gitolite repository, so if hooks work then that's fine.

Also, I think Gitea templates allow copying hooks? I haven't looked into it, but I remember that there were a bunch of checkboxes of "what to copy" when I used a new repo template on spineimage.ca.

kousu commented 5 months ago

Ah true. Okay sounds good!

mguaypaq commented 5 months ago

Back to the main topic for this issue: the branch nm/first-commit-2 looks good from my end:

So, I merged into master and deleted the branch nm/first-commit-2.