metoppv / improver

IMPROVER is a library of algorithms for meteorological post-processing.
http://improver.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
103 stars 85 forks source link

Add Git LFS for Binary Data Handling (acceptance test data) #1974

Closed cpelley closed 3 weeks ago

cpelley commented 9 months ago

Description

This change has been on my radar for a significant time. The test data being made public being a pre-requisite for realising this change, hence only now having this PR. If approved, we can then eventually remove https://github.com/metoppv/improver_test_data (perhaps that repo. is kept for historic purposes i.e. read-only so no new changes made).

Acceptance data comes form https://github.com/metoppv/improver_test_data/commit/44351086112b072dde0ccd377bb780280a3abf07 (to be updated to whatever is the latest commit from improver_test_data) data before merging).

Purpose

This pull request aims to enhance the repository's handling of binary data, specifically in the 'resources' directory under the 'tests' folder. Git LFS (Large File Storage) is introduced to efficiently manage large binary files, providing benefits such as faster cloning, reduced storage requirements, and improved overall repository performance.

Changes Made

Benefits of Using Git LFS

How Git LFS Works

Git LFS replaces large files with text pointers in the repository, while the actual binary content is stored externally. This allows Git to handle binary data more efficiently, without affecting the core functionality of version control. The .gitattributes file is used to specify which files should be managed by Git LFS.

End-user experience/impact

Testing

Checklist

Issues

github-actions[bot] commented 5 months ago

In order to maintain a backlog of relevant PRs, we automatically label them as stale after 60 days of inactivity.

If this PR is still important to you, then please comment on this PR and the stale label will be removed.

Otherwise this PR will be automatically closed in 30 days time.

PaulAbernethy commented 5 months ago

bump

cpelley commented 5 months ago

I have updated the README.md with details about git LFS.

image

cpelley commented 5 months ago

Ultimately, it would make sense to move away from file checksum comparisons for IMPROVER acceptance testing (but I wont suggest radically overhauling the testing in the same PR, unless we abs. need to).

I'm not familiar with the testing framework of IMPROVER, @gavinevans do you have any ideas as the the reason behind the checksum failures?

I have rebased this dev. branch against latest IMPROVER master, and pulled in git LFS test data from https://github.com/metoppv/improver_test_data/commit/44351086112b072dde0ccd377bb780280a3abf07 (i.e. latest).

bayliffe commented 4 months ago

Thanks for this Carwyn. I can see the advantages of this approach, particularly with @gavinevans suggestions allowing us to run the acceptance tests as part of the actions (though we should be mindful of the number of actions minutes this might consume).

My experience of using this is not quite as described. Checking out the branch, or cloning the repository clean in single branch mode (as if this branch were master: git clone -b git_lfs_test_data --single-branch git@github.com:metoppv/improver.git) both led to all of the acceptance test data files being retrieved immediately. The resources directory is ~150MB and I could interrogate the data immediately, without performing the suggested git lfs specific steps. Gavin found his resources directory to be somewhat smaller than the current acceptance test data repository. Perhaps I'm just special, but I'd appreciate any thoughts on why this is?

cpelley commented 4 months ago

My experience of using this is not quite as described.

😱@bayliffe, thanks for doing this experiment. Yeah, so nothing is required to get the data, just seamless without extra steps. I have no idea why I was under the impression an explicit git lfs fetch was needed now 🤷 I'll update the documentation to reflect this.

If desiring to not pull down data on clone:

GIT_LFS_SKIP_SMUDGE=1 git clone ...

Or globally:

git lfs install --skip-smudge

Gavin found his resources directory to be somewhat smaller than...

Thanks @bayliffe, @gavinevans, I'll investigate and check back in with details of my findings 👍

cpelley commented 4 months ago

The resources directory is ~150MB... Gavin found his resources directory to be somewhat smaller than...

Looks good to me. I get 76.2MB for data under this branch and for data in https://github.com/metoppv/improver_test_data/commit/44351086112b072dde0ccd377bb780280a3abf07

cpelley commented 4 months ago

@gavinevans, @bayliffe, all the acceptance tests pass for me locally. I also included an LFS caching support to mitigate against concerns with increased footprint. I see that the conda environment yaml files are pinned but don't specify exact builds. This will certainly lead to differences in environment compared to our pre-built one locally. I suspect this to be the route cause of the remaining failures.

For now, I'm going to opt for falling back to excluding the acceptance tests in actions since they weren't run before (was a bonus). I'll leave in the github action workflow refactor and LFS caching in-place for when someone comes back to enabling the acceptance tests through actions.

cpelley commented 4 months ago

Through our experiments we have utilised all the bandwidth :(

image

I don't this would be ordinarily a problem so long as we don't clone the repository all the time. Having said that, I don't think we want to make the repository problematic to clones by 3rd parties.

Next week when our quota resets, I'll have a go at trying to change default behaviour so that it doesn't fetch the lfs data on checkout/clone.

Under a .lfsconfig file of the root:

[lfs]
    skip-smudge = true

or if skip-smudge isn't supported via the lfsconfig, then:

    fetchexclude = *
cpelley commented 3 months ago

Converted this to draft until some indeterminate date where implications can be better characterised and understood.

github-actions[bot] commented 1 month ago

In order to maintain a backlog of relevant PRs, we automatically label them as stale after 60 days of inactivity.

If this PR is still important to you, then please comment on this PR and the stale label will be removed.

Otherwise this PR will be automatically closed in 30 days time.

github-actions[bot] commented 3 weeks ago

This stale PR has been automatically closed due to a lack of activity.

If you still care about this PR, then please re-open this PR.