openpipelines-bio / openpipeline

https://openpipelines.bio
MIT License

Remove big binary files #622

Open rcannood opened 5 months ago

rcannood commented 5 months ago

If we use BFG to remove all blobs larger than 1M, we can reduce the openpipeline repo from about 201 MiB to about 45 MiB (see the clone and push sizes in the log below). We can probably reduce it even further if we set the threshold lower. @DriesSchaumont WDYT?

$  git clone --mirror git@github.com:openpipelines-bio/openpipeline.git lfs_test.git
Cloning into bare repository 'lfs_test.git'...
remote: Enumerating objects: 397073, done.
remote: Counting objects: 100% (6019/6019), done.
remote: Compressing objects: 100% (2307/2307), done.
remote: Total 397073 (delta 3644), reused 5873 (delta 3512), pack-reused 391054
Receiving objects: 100% (397073/397073), 200.99 MiB | 5.97 MiB/s, done.
Resolving deltas: 100% (269042/269042), done.

$ java -jar ~/Downloads/bfg-1.14.0.jar --strip-blobs-bigger-than 1M lfs_test.git

Using repo : /home/rcannood/workspace/openpipelines-bio/lfs_test.git

This repo has been processed by The BFG before! Will prune repo before proceeding - to avoid unnecessary cleaning work on unused objects...
Completed prune of old objects - will now proceed with the main job!

Scanning packfile for large blobs: 1588292
Scanning packfile for large blobs completed in 6,443 ms.
Found 6 blob ids for large blobs - biggest=14395908 smallest=1521437
Total size (unpacked)=47515450
Found 443 objects to protect
Found 512 commit-pointing refs : HEAD, refs/heads/481-add-leiden-clustering-to-scvi-pipeline, refs/heads/590-clusterleiden-config-contains-incorrect-markdown-references, ...
Found 4 tag-pointing refs : refs/tags/0.3.0, refs/tags/0.3.1, refs/tags/0.4.0, refs/tags/0.4.1

Protected commits
-----------------

These are your protected commits, and so their contents will NOT be altered:

 * commit 5fb2a9e0 (protected by 'HEAD')

Cleaning
--------

Found 4459 commits
Cleaning commits:       100% (4459/4459)
Cleaning commits completed in 3,003 ms.

Updating 156 Refs
-----------------

    Ref                                                                          Before     After   
    ------------------------------------------------------------------------------------------------
    refs/heads/481-add-leiden-clustering-to-scvi-pipeline                      | 700bffd6 | 6d0b9eec
    refs/heads/590-clusterleiden-config-contains-incorrect-markdown-references | 772769ee | 7abac021
    refs/heads/604-use-the-viash-dependencies-config-value-for-workflows       | 843009e8 | 8b7b78ba
    refs/heads/concat_dtypes                                                   | c8f1e5f8 | e92cbea4
    refs/heads/feature/ataq-demux                                              | 5dcebba7 | 1666af0f
    refs/heads/feature/ataq-qc                                                 | dde357ff | 98d64cbd
    refs/heads/feature/scpoli_implementation                                   | b17c3a84 | 3ee6bc23
    refs/heads/increase_ci_memory                                              | 1464e7aa | 9b6af876
    refs/heads/integration_build                                               | b225d951 | d1eaab7b
    refs/heads/main                                                            | 5fb2a9e0 | 56ac0431
    refs/heads/main_build                                                      | 8a9894a6 | cc0001cd
    refs/heads/main_build_datasets_schema                                      | 5022c403 | 901839ca
    refs/heads/more_memory_tests                                               | fe5188fa | 7608da95
    refs/heads/release                                                         | 98678513 | 0594ac36
    refs/heads/review_cellxgene                                                | f881710c | 475cecfc
    ...

Updating references:    100% (156/156)
...Ref update completed in 38 ms.

Commit Tree-Dirt History
------------------------

    Earliest                                              Latest
    |                                                          |
    ..............................................DDDDDDDDDmmDmm

    D = dirty commits (file tree fixed)
    m = modified commits (commit message or parents changed)
    . = clean commits (no changes to file tree)

                            Before     After   
    -------------------------------------------
    First modified commit | 6455c1d6 | fae7b4ab
    Last dirty commit     | e27f9172 | 3ffb155c

Deleted files
-------------

    Filename                                                                  Git id            
    --------------------------------------------------------------------------------------------
    cellranger-tiny-bcl-1.2.0.tar.gz                                        | 4b3e7995 (13.4 MB)
    cl-base.obo                                                             | af96cc47 (1.5 MB) 
    matrix.mtx.gz                                                           | 9e469be2 (4.0 MB) 
    pbmc_1k_protein_v3_filtered_feature_bc_matrix.h5                        | eade8772 (5.2 MB) 
    pbmc_1k_protein_v3_filtered_feature_bc_matrix.h5ad                      | 145b611c (13.7 MB)
    pbmc_1k_protein_v3_filtered_feature_bc_matrix.norm.hvg.pca.nn.umap.h5ad | de2901dd (7.6 MB) 

In total, 22327 object ids were changed. Full details are logged here:

    /home/rcannood/workspace/openpipelines-bio/lfs_test.git.bfg-report/2023-11-24/14-40-05

BFG run is complete! When ready, run: git reflog expire --expire=now --all && git gc --prune=now --aggressive

$ cd lfs_test.git

$ git reflog expire --expire=now --all && git gc --prune=now --aggressive
Enumerating objects: 397073, done.
Counting objects: 100% (397073/397073), done.
Delta compression using up to 32 threads
Compressing objects: 100% (379869/379869), done.
Writing objects: 100% (397073/397073), done.
Selecting bitmap commits: 4368, done.
Building bitmaps: 100% (148/148), done.
Total 397073 (delta 268875), reused 124073 (delta 0), pack-reused 0

$ git push
Enumerating objects: 397073, done.
Writing objects: 100% (397073/397073), 44.70 MiB | 3.69 MiB/s, done.
Total 397073 (delta 0), reused 0 (delta 0), pack-reused 397073
remote: Resolving deltas: 100% (268875/268875), done.
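
Before picking a threshold, it can help to see which blobs dominate the history. The following is a minimal sketch using plain git plumbing, not part of the run above; the function name `largest_blobs` and its arguments are made up for illustration. It lists the largest blobs reachable from any ref, biggest first.

```shell
# Hypothetical helper (not from the original thread):
# list the N largest blobs reachable from any ref, biggest first.
# Usage: largest_blobs <repo-path> [count]
# Output columns: size in bytes, blob id, path.
largest_blobs() {
  git -C "${1:-.}" rev-list --objects --all |
    # %(rest) carries the path that rev-list prints after each object id
    git -C "${1:-.}" cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
    awk '$1 == "blob"' |
    cut -d' ' -f2- |
    sort -rn |
    head -n "${2:-20}"
}
```

For example, `largest_blobs lfs_test.git 10` would print the ten largest blobs in the mirror clone. Sizes are unpacked object sizes, so they will not add up to the packed repository size.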

rcannood commented 5 months ago

If I set the threshold to 200K, I get:

$ java -jar ~/Downloads/bfg-1.14.0.jar --strip-blobs-bigger-than 200K lfs_test.git
Using repo : /home/rcannood/workspace/openpipelines-bio/lfs_test.git

This repo has been processed by The BFG before! Will prune repo before proceeding - to avoid unnecessary cleaning work on unused objects...
Completed prune of old objects - will now proceed with the main job!

Scanning packfile for large blobs: 794146
Scanning packfile for large blobs completed in 2,581 ms.
Found 2891 blob ids for large blobs - biggest=715168 smallest=216802
Total size (unpacked)=53673113
Found 443 objects to protect
Found 512 commit-pointing refs : HEAD, refs/heads/481-add-leiden-clustering-to-scvi-pipeline, refs/heads/590-clusterleiden-config-contains-incorrect-markdown-references, ...
Found 4 tag-pointing refs : refs/tags/0.3.0, refs/tags/0.3.1, refs/tags/0.4.0, refs/tags/0.4.1

Protected commits
-----------------

These are your protected commits, and so their contents will NOT be altered:

 * commit 56ac0431 (protected by 'HEAD') - contains 3 dirty files : 
    - images/concepts/fig.svg (389.1 KB)
    - src/mapping/bd_rhapsody/rhapsody_targeted_1.10.1_nodocker.cwl (211.7 KB)
    - src/mapping/bd_rhapsody/rhapsody_wta_1.10.1_nodocker.cwl (212.8 KB)

WARNING: The dirty content above may be removed from other commits, but as
the *protected* commits still use it, it will STILL exist in your repository.

Details of protected dirty content have been recorded here :

/home/rcannood/workspace/openpipelines-bio/lfs_test.git.bfg-report/2023-11-24/14-49-03/protected-dirt/

If you *really* want this content gone, make a manual commit that removes it,
and then run the BFG on a fresh copy of your repo.

Cleaning
--------

Found 4459 commits
Cleaning commits:       100% (4459/4459)
Cleaning commits completed in 2,481 ms.

Updating 514 Refs
-----------------

    Ref                                                                          Before     After   
    ------------------------------------------------------------------------------------------------
    refs/heads/481-add-leiden-clustering-to-scvi-pipeline                      | 6d0b9eec | eb966355
    refs/heads/590-clusterleiden-config-contains-incorrect-markdown-references | 7abac021 | 56f76331
    refs/heads/604-use-the-viash-dependencies-config-value-for-workflows       | 8b7b78ba | 75caa7e9
    refs/heads/automation                                                      | 9cd06207 | b857a87a
    refs/heads/concat_dtypes                                                   | e92cbea4 | 21942a5e
    refs/heads/feature/ataq-demux                                              | 1666af0f | 3792b762
    refs/heads/feature/ataq-qc                                                 | 98d64cbd | de89e1b7
    refs/heads/feature/cellranger_convert                                      | 951b5c99 | e43c8791
    refs/heads/feature/count_demultiplexing                                    | 6461edd3 | d5e1bd2f
    refs/heads/feature/refactor_velocyto                                       | 068ed30d | 76440a30
    refs/heads/feature/scpoli_implementation                                   | 3ee6bc23 | 7a4cbf9c
    refs/heads/feature/ts                                                      | bfd45792 | ddb86b6d
    refs/heads/fix_temp_var                                                    | b52db6ef | 9fb67514
    refs/heads/increase_ci_memory                                              | 9b6af876 | 299ad45f
    refs/heads/integration_build                                               | d1eaab7b | 0373d7e3
    ...

Updating references:    100% (514/514)
...Ref update completed in 117 ms.

Commit Tree-Dirt History
------------------------

    Earliest                                              Latest
    |                                                          |
    .DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD

    D = dirty commits (file tree fixed)
    m = modified commits (commit message or parents changed)
    . = clean commits (no changes to file tree)

                            Before     After   
    -------------------------------------------
    First modified commit | a2af1a87 | 89e209f5
    Last dirty commit     | e1ddf9cd | b942e054

Deleted files
-------------

    Filename                                        Git id                                       
    ---------------------------------------------------------------------------------------------
    CS0000007_subsample_LI00080.csv.gz            | 449a7f3a (343.0 KB)                          
    features.tsv.gz                               | 1288f445 (297.6 KB)                          
    fig.svg                                       | 2a72f8e7 (389.1 KB)                          
    main.nf                                       | cfd3ebb5 (213.1 KB), 732e783e (252.9 KB), ...
    multi_star                                    | 76d7c752 (337.8 KB), b87f789d (335.4 KB), ...
    pbmc_1k_protein_v3_raw_feature_bc_matrix.h5   | 0d3a7789 (274.6 KB)                          
    pbmc_1k_protein_v3_raw_feature_bc_matrix.h5ad | 62aa4349 (698.4 KB)                          
    pipelines-target-p1.png                       | 1f658205 (292.0 KB), 5dc0174c (292.0 KB)     
    pipelines-target-p2.png                       | d9a7235a (300.7 KB), 55690133 (300.7 KB)     
    pipelines-target-p3.png                       | ec2cf53b (250.2 KB), ac65760d (245.8 KB), ...
    pipelines.svg                                 | 19ee6521 (278.9 KB), 16d12ddb (289.1 KB)     
    rhapsody_targeted_1.10.1_nodocker.cwl         | 56a6310b (211.7 KB)                          
    rhapsody_wta_1.10.1_nodocker.cwl              | 5fa9ea85 (212.8 KB)                          
    rhapsody_wta_1.10_nodocker.cwl                | c941c763 (212.3 KB)                          
    star_align                                    | 0df72d36 (308.0 KB), 4a1e589e (307.9 KB), ...
    star_align_v273a                              | e9182424 (308.3 KB), 39258580 (308.4 KB), ...

In total, 18874 object ids were changed. Full details are logged here:

    /home/rcannood/workspace/openpipelines-bio/lfs_test.git.bfg-report/2023-11-24/14-49-03
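
To compare thresholds without running BFG each time, one could count the blobs above a given size and sum their unpacked sizes. This is only a sketch with plain git commands; `blobs_over` is a hypothetical name, and the threshold is given in bytes to avoid guessing how BFG interprets `K` and `M`.

```shell
# Hypothetical helper (not from the original thread):
# count blobs larger than a byte threshold across all refs and sum their sizes.
# Usage: blobs_over <repo-path> <threshold-bytes>
# Prints: "<count> <total-bytes>"
blobs_over() {
  git -C "$1" rev-list --objects --all |
    cut -d' ' -f1 |   # keep only the object id, drop the path
    git -C "$1" cat-file --batch-check='%(objecttype) %(objectsize)' |
    awk -v t="$2" '$1 == "blob" && $2 > t { n++; total += $2 }
                   END { print n + 0, total + 0 }'
}
```

The numbers should be roughly comparable to BFG's "Found N blob ids" and "Total size (unpacked)" lines, though an exact match is not guaranteed since BFG applies its own threshold parsing and protection rules.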

DriesSchaumont commented 5 months ago

Does this edit the repository retroactively? If so, we should make sure to exclude release, main, main_build and all tags. Otherwise we could break older releases/runs. Is there a problem with having a large repo?
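
On the retroactive question: the "Updating Refs" tables in the logs above show `refs/heads/main`, `refs/heads/main_build` and `refs/heads/release` all pointing to new commit ids after the first run, so those branches were rewritten. A sketch for checking exactly which refs move in a test clone is to snapshot the refs before and after the rewrite and compare them. The function names and snapshot file names here are hypothetical.

```shell
# Hypothetical helpers (not from the original thread).

# Record every ref and the object it points to, one per line.
# Usage: snapshot_refs <repo-path> > refs.before
snapshot_refs() {
  git -C "$1" for-each-ref --format='%(refname) %(objectname)'
}

# Print refs whose target differs between two snapshot files.
# Usage: changed_refs refs.before refs.after
changed_refs() {
  awk 'NR == FNR { before[$1] = $2; next }
       ($1 in before) && before[$1] != $2 { print $1, before[$1], "->", $2 }' "$1" "$2"
}
```

Running `snapshot_refs lfs_test.git > refs.before` before the BFG step and `snapshot_refs lfs_test.git > refs.after` afterwards, then `changed_refs refs.before refs.after`, would list every branch and tag that got a new target. Any ref listed there can no longer be fast-forwarded from existing clones.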