nfdi4plants / ARCCommander

Tool to manage your ARCs
MIT License

[BUG] arc sync overwrites existing .gitattributes #96

Closed Brilator closed 1 year ago

Brilator commented 2 years ago

Describe the bug
I've tracked a filetype via git lfs (added it to .gitattributes). Running arc sync overwrote this existing .gitattributes file.

To Reproduce Steps to reproduce the behavior:

  1. Use arc sync to sync with remote
  2. Find overwritten .gitattributes file.

Expected behavior

HLWeil commented 2 years ago

Hey, sorry for the confusion. Actually this works as expected for me. But as you said, the description is missing. Maybe we could use this issue here to discuss whether my first implementation suffices or not.

As it is now, the .gitattributes file gets newly created when using arc sync. The rules written into it stem from two sources:

  1. Files with a bigger size than the lfs byte threshold. This threshold is set to 150MB by default and can be edited using arc config set -k general.gitlfsbytethreshold -v <NumberOfBytes>
  2. More general git lfs rules. By default, only the **/dataset/** rule you mentioned is set. This can be extended by using arc config set -k general.gitlfsrules -v <NewRuleSet>
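For illustration only (not the tool's verbatim output), a .gitattributes generated from these two sources might look like the following; the concrete file path is hypothetical, while the filter=lfs diff=lfs merge=lfs -text attributes are standard git-lfs tracking attributes:

```
# from general.gitlfsrules (default rule set)
**/dataset/** filter=lfs diff=lfs merge=lfs -text
# per-file rule, added because the file exceeds general.gitlfsbytethreshold
assays/myassay/dataset/bigfile.raw filter=lfs diff=lfs merge=lfs -text
```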

I think 1) is fine as is, but 2) has the problem of being a list of rules while being handled as a string by the set function. So in principle, in order to add a rule without removing the existing rules, you would need to provide all of them as a concatenated string, e.g.: arc config set -k general.gitlfsrules -v "**/dataset/**;**/externals/**". Maybe this could be alleviated by a config command that handles lists, e.g. arc config add or arc config listset.
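To make the string-vs-list issue concrete, here is a small shell sketch of the concatenation workaround described above. The rule values are taken from the example in this comment; the variable names are just for illustration:

```shell
# The rule already present in general.gitlfsrules:
existing='**/dataset/**'
# The rule we would like to add without losing the existing one:
new_rule='**/externals/**'

# Because the set function treats the value as one string, both rules
# must be joined into a single semicolon-separated value:
combined="$existing;$new_rule"
echo "$combined"

# The same rule set expanded one rule per line, for readability:
printf '%s\n' "$combined" | tr ';' '\n'
```

The combined value is what you would pass to arc config set -k general.gitlfsrules -v "...". A dedicated list-aware command (arc config add) would make this join step unnecessary.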

I'm not sure whether I'd allow the .gitattributes to be directly editable by the user, as this would again create multiple points of overlapping information that need to be kept in sync somehow. Rather, what I explained here could be relayed to the user at different points.

Some further suggestions, opinions or general input would be appreciated =)

Brilator commented 2 years ago

The point "A clear and concise description of what you expected to happen." was a leftover of the GitHub issue template...

Brilator commented 2 years ago

To 1): yes, that's good. I'm glad to have an easy solution with a threshold no matter the data format. However, this is currently not working on my end (at least not within https://gitlab.nfdi4plants.de/). There are files ≤ 200 MB being synced.

Brilator commented 2 years ago

Another problem (and I might have just found a corner case): some large machine vendors' "raw data files" are actually directories with subdirectories collecting e.g. small text files, XMLs and larger binaries. And I wouldn't want to sync only (the smaller) half of such a raw-file directory.

I'll invite you to the sample ARC to show the example (in this case *.D/ coming from Agilent mass specs): https://gitlab.nfdi4plants.de/brilator/samplearc_metabolomics/-/tree/main/assays/Talinum_GCMS_minimal/dataset/
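Assuming standard git-lfs glob semantics, one way to cover such vendor directories wholesale would be to append a matching pattern to the rule set via the documented command; the *.D pattern below is just the Agilent example from this thread, not a built-in rule:

```
arc config set -k general.gitlfsrules -v "**/dataset/**;**/*.D/**"
```

This would route every file inside a *.D directory through LFS regardless of its individual size.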

HLWeil commented 2 years ago

Just to rule out some misunderstanding (based on your wording): The rules written in the .gitattributes file will not stop the files from being synced. Rather they will be handled by git lfs and synced to the associated lfs storage.

So files >= 150MB will still be synced, but should be flagged with the LFS tag online.
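The threshold check itself can be sketched locally. The script below is an illustration, not ARC Commander's implementation: it assumes the 150MB default means 150 * 1024 * 1024 bytes, and the sample file is created just to demonstrate the comparison:

```shell
# Default general.gitlfsbytethreshold, assuming MB = 1024 * 1024 bytes:
threshold=$((150 * 1024 * 1024))

# Create a small sample file to demonstrate the size check:
tmpfile="$(mktemp)"
printf 'small test payload\n' > "$tmpfile"

size=$(wc -c < "$tmpfile" | tr -d ' ')
if [ "$size" -ge "$threshold" ]; then
    # Such a file would be handled by git lfs (still synced, but
    # stored in the associated LFS storage and flagged online).
    echo "would be tracked via git lfs"
else
    echo "stays in plain git"
fi
rm -f "$tmpfile"
```

Either way the file ends up on the remote; the threshold only decides whether it goes through LFS storage or plain git.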

Brilator commented 2 years ago

Sorry, ignore my last comments... The syncing went better with ARC Commander's arc sync than what I managed some time ago with bare git lfs, and that confused me.

HLWeil commented 2 years ago

No problem, my initial intuition of how lfs is supposed to work was also quite different than what it actually does 😄