Hey, sorry for the confusion. Actually, this works as expected for me. But as you said, the description is missing. Maybe we could use this issue to discuss whether my first implementation suffices or not.
As it is now, the `.gitattributes` file gets newly created when using `arc sync`. The rules written into it stem from two sources:
1) If a file is larger than the lfs byte threshold. This threshold defaults to 150 MB and can be changed using `arc config set -k general.gitlfsbytethreshold -v <NumberOfBytes>` (see the example right after this list).
2) More general git lfs rules. By default, only the `**/dataset/**` rule you mentioned is set. This can be extended by using `arc config set -k general.gitlfsrules -v <NewRuleSet>`.
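For example, lowering the threshold to 100 MB (i.e. 104857600 bytes, assuming the value is interpreted as a plain byte count, as the placeholder suggests) would look like this:

```bash
# lower the lfs byte threshold to 100 MB (100 * 1024 * 1024 bytes)
arc config set -k general.gitlfsbytethreshold -v 104857600
```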
I think 1) is fine as is, but 2) has the problem of being a list of rules that is handled as a single string by the `set` function. So in principle, in order to add a rule without removing the existing rules, you would need to provide all of them as one concatenated string, e.g. `arc config set -k general.gitlfsrules -v "**/dataset/**;**/externals/**"`.
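Just as a rough sketch (the exact entries of course depend on the ARC; `assays/MyAssay/dataset/bigfile.raw` is a made-up example of a file exceeding the byte threshold), the resulting `.gitattributes` could then contain something like:

```
**/dataset/** filter=lfs diff=lfs merge=lfs -text
**/externals/** filter=lfs diff=lfs merge=lfs -text
assays/MyAssay/dataset/bigfile.raw filter=lfs diff=lfs merge=lfs -text
```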
Maybe this could be alleviated by a config command that handles lists, e.g. `arc config add` or `arc config listset`.
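Usage could then look something like this (purely hypothetical, neither subcommand exists yet):

```bash
# hypothetical: append a single rule to the existing list instead of overwriting it
arc config add -k general.gitlfsrules -v "**/externals/**"
```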
I'm not sure whether I'd allow the `.gitattributes` file to be directly editable by the user, as this would again create multiple points of overlapping information that need to be kept in sync somehow. Rather, what I explained here could be relayed to the user at different points.
Some further suggestions, opinions or general input would be appreciated =)
The point "A clear and concise description of what you expected to happen." was a leftover of the GitHub issue template...
Regarding 1): yes, that's good. I'm glad to have an easy solution with a threshold regardless of the data format. However, this is currently not working on my end (at least not within https://gitlab.nfdi4plants.de/): there are files ≤ 200 MB being synced.
Another problem (and I might have just found a corner case): some large machine-vendor "raw data files" are actually directories with subdirectories collecting e.g. small texts, XMLs, and larger binaries. And I wouldn't want to sync (the smaller) half of that raw-data directory.
I'll invite you to the sample ARC to show the example (in this case *.D/ coming from Agilent mass specs): https://gitlab.nfdi4plants.de/brilator/samplearc_metabolomics/-/tree/main/assays/Talinum_GCMS_minimal/dataset/
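If the general git lfs rules accept arbitrary glob patterns, maybe such vendor directories could be covered by an explicit pattern instead of the size threshold, so that the whole *.D folder ends up in lfs together. Something along these lines (just an untested idea):

```bash
arc config set -k general.gitlfsrules -v "**/dataset/**;**/*.D/**"
```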
Just to rule out some misunderstanding (based on your wording): the rules written in the `.gitattributes` file will not stop the files from being synced. Rather, they will be handled by git lfs and synced to the associated lfs storage. So files >= 150 MB will still be synced, but should be flagged with the LFS tag online.
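By the way, if you want to double-check locally which files actually ended up in lfs, plain git lfs can list them (this is standard git-lfs, not arc-specific):

```bash
# list all files currently tracked via git lfs in the repository
git lfs ls-files
```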
Sorry, ignore my last comments... The syncing went better with arcCommander's arc sync than what I managed some time ago with bare git lfs and that confused me.
No problem, my initial intuition of how lfs is supposed to work was also quite different than what it actually does 😄
Describe the bug
I've tracked a filetype via git lfs (added to .gitattributes).

To Reproduce
Steps to reproduce the behavior:
arc sync to sync with remote

Expected behavior
**/dataset/** filter=lfs diff=lfs merge=lfs -text suffices for all required cases. Some larger files might also be produced in runs.