Hey, yes, I'm aware of this problem. The project has been developed over a long time, with refactoring along the way. It previously contained large data files, which have since been removed. Old references to them are still present in the Git history, and that is why the repository is so large. Is there any option to clean up unused references?
I suggest trying this: https://docs.gitlab.com/ee/user/project/repository/reducing_the_repo_size_using_git.html#purge-files-from-repository-history
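For reference, the purge step described on that page can be sketched roughly as below. This is only a sketch, assuming `git-filter-repo` is installed and the repository is a fresh clone; the function name `purge_paths` is hypothetical, and the paths to drop should be taken from the analysis of your own history.

```shell
# Hypothetical helper: remove the given paths from every commit in the
# history of the repository at "$1", using git filter-repo.
# Assumes git-filter-repo is installed and "$1" is a fresh clone
# (filter-repo refuses to rewrite a non-fresh clone unless --force is passed).
purge_paths() {
  repo=$1
  shift
  # Turn "a b" into "--path a --path b" in a POSIX-sh compatible way:
  # append the flag pair at the end, then drop the original argument.
  for p in "$@"; do
    set -- "$@" --path "$p"
    shift
  done
  # --invert-paths keeps everything EXCEPT the listed paths.
  git -C "$repo" filter-repo --invert-paths "$@"
}
```

Something like `purge_paths ./sparrow sparrow-data/donut` would then rewrite the history without that directory (the clone location `./sparrow` is just an example). The rewritten history still has to be force-pushed afterwards, and existing clones need to be re-cloned.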
head ./.git/filter-repo/analysis/*-{all,deleted}-sizes.txt
==> ./.git/filter-repo/analysis/directories-all-sizes.txt <==
=== All directories by reverse size ===
Format: unpacked size, packed size, date deleted, directory name
6839388818 3585859232 <present> <toplevel>
5452757886 2843412749 <present> sparrow-data
3774605787 1966360452 <present> sparrow-data/docs
1676964216 876678205 2024-01-30 sparrow-data/donut
1676903914 876663922 2024-01-30 sparrow-data/donut/docs
1214002397 609610696 2024-01-30 sparrow-data/donut/docs/input
1214002397 609610696 <present> sparrow-data/docs/input
950283524 557734798 <present> sparrow-data/docs/models/donut/data
==> ./.git/filter-repo/analysis/extensions-all-sizes.txt <==
=== All extensions by reverse size ===
Format: unpacked size, packed size, date deleted, extension name
5205170676 3118906432 <present> .jpg
968886370 365093190 <present> .pdf
55894911 33104330 <present> .ipynb
550012649 27779066 2024-04-16 .json
28728183 25665387 <present> .png
13335789 11528157 2024-01-30 .PDF
4672587 2074503 2024-01-30 .csv
733577 615142 <present> .jpeg
==> ./.git/filter-repo/analysis/path-all-sizes.txt <==
=== All paths by reverse accumulated size ===
Format: unpacked size, packed size, date deleted, path name
6274484 4727634 2022-10-06 sparrow-research/app/True_Inference_with_LayoutLMv2ForTokenClassification_CORD.ipynb
6274484 4727634 2022-10-06 research/app/True_Inference_with_LayoutLMv2ForTokenClassification_CORD.ipynb
6274484 4727634 <present> app/True_Inference_with_LayoutLMv2ForTokenClassification_CORD.ipynb
7615099 4023395 <present> app/layoutlmv2_cord_prepare.ipynb
4231342 3178543 2022-10-06 sparrow-research/app/Inference_with_LayoutLMv2ForTokenClassification_CORD.ipynb
4231342 3178543 2022-10-06 research/app/Inference_with_LayoutLMv2ForTokenClassification_CORD.ipynb
4231342 3178543 <present> app/Inference_with_LayoutLMv2ForTokenClassification_CORD.ipynb
3397206 3006437 2024-01-30 sparrow-ui/assets/annotation.png
==> ./.git/filter-repo/analysis/directories-deleted-sizes.txt <==
=== Deleted directories by reverse size ===
Format: unpacked size, packed size, date deleted, directory name
1676964216 876678205 2024-01-30 sparrow-data/donut
1676903914 876663922 2024-01-30 sparrow-data/donut/docs
1214002397 609610696 2024-01-30 sparrow-data/donut/docs/input
1062412494 392868568 2023-02-25 sparrow-data/docs/invoices
356479154 288680195 2024-01-30 sparrow-data/donut/docs/input/sroie
356479154 288680195 2023-03-06 sparrow-data/docs/sroie
354806855 287915033 2024-01-30 sparrow-data/donut/docs/input/sroie/img
462901517 267053226 2024-01-30 sparrow-data/donut/docs/models/donut/data
==> ./.git/filter-repo/analysis/extensions-deleted-sizes.txt <==
=== Deleted extensions by reverse size ===
Format: unpacked size, packed size, date deleted, extension name
550012649 27779066 2024-04-16 .json
13335789 11528157 2024-01-30 .PDF
4672587 2074503 2024-01-30 .csv
3124209 331448 2024-01-30 .jsonl
2023666 312489 2022-10-06 .js
671280 46779 2024-01-30 .html
79716 9839 2024-01-30 .ico
18378 7550 2022-10-06 .svg
==> ./.git/filter-repo/analysis/path-deleted-sizes.txt <==
=== Deleted paths by reverse accumulated size ===
Format: unpacked size, packed size, date deleted, path name(s)
6274484 4727634 2022-10-06 sparrow-research/app/True_Inference_with_LayoutLMv2ForTokenClassification_CORD.ipynb
6274484 4727634 2022-10-06 research/app/True_Inference_with_LayoutLMv2ForTokenClassification_CORD.ipynb
4231342 3178543 2022-10-06 sparrow-research/app/Inference_with_LayoutLMv2ForTokenClassification_CORD.ipynb
4231342 3178543 2022-10-06 research/app/Inference_with_LayoutLMv2ForTokenClassification_CORD.ipynb
3397206 3006437 2024-01-30 sparrow-ui/assets/annotation.png
3625188 2889724 2024-01-30 sparrow-data/donut/docs/input/sroie/img/525.jpg
3625188 2889724 2024-01-30 sparrow-data/docs/sroie/img/525.jpg
3625188 2889724 2024-01-30 sparrow-data/docs/input/sroie/img/525.jpg
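The byte counts in these reports are hard to read at a glance. A small sketch for converting the packed-size column to MiB, assuming the report layout shown above (header lines followed by whitespace-separated `unpacked packed date name` rows) and paths without spaces; the function name `packed_sizes_mib` is hypothetical.

```shell
# Hypothetical helper: print the packed size in MiB and the name for each
# data row of a git filter-repo analysis report.
# Header lines are skipped because their first two fields are not numeric.
# Assumes names contain no whitespace (only the 4th field is printed).
packed_sizes_mib() {
  awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {
    printf "%.1f MiB  %s\n", $2 / 1048576, $4
  }' "$1"
}
```

For example, `packed_sizes_mib .git/filter-repo/analysis/directories-deleted-sizes.txt` would show the deleted `sparrow-data/donut` directory at about 836 MiB packed.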
This also gives a good overview:
git-sizer -v
Processing blobs: 12742
Processing trees: 2542
Processing commits: 825
Matching commits to trees: 825
Processing annotated tags: 0
Processing references: 10
| Name | Value | Level of concern |
| ---------------------------- | --------- | ------------------------------ |
| Overall repository size | | |
| * Commits | | |
| * Count | 825 | |
| * Total size | 427 KiB | |
| * Trees | | |
| * Count | 2.54 k | |
| * Total size | 1.66 MiB | |
| * Total tree entries | 41.4 k | |
| * Blobs | | |
| * Count | 12.7 k | |
| * Total size | 1.47 GiB | |
| * Annotated tags | | |
| * Count | 0 | |
| * References | | |
| * Count | 10 | |
| | | |
| Biggest objects | | |
| * Commits | | |
| * Maximum size [1] | 1.17 KiB | |
| * Maximum parents [2] | 2 | |
| * Trees | | |
| * Maximum entries [3] | 1.00 k | * |
| * Blobs | | |
| * Maximum size [4] | 5.98 MiB | |
| | | |
| History structure | | |
| * Maximum history depth | 773 | |
| * Maximum tag depth | 0 | |
| | | |
| Biggest checkouts | | |
| * Number of directories [5] | 176 | |
| * Maximum path depth [6] | 9 | |
| * Maximum path length [7] | 111 B | * |
| * Number of files [5] | 13.4 k | |
| * Total size of files [8] | 2.18 GiB | ** |
| * Number of symlinks | 0 | |
| * Number of submodules | 0 | |
[1] 07ce96c19200e193821c54841e28a8137675201e
[2] 65284a42d7667612d948cf1cf8bd471807b1d6f5
[3] d555f3d0fcec43792c512c8aa0d6864f9e2f4883 (af5628d301ea332b3db4144a4282322f553f69d3:sparrow-data/donut/docs/input/invoices/processed/images)
[4] 749bc1b2cd0956316bb1ecc54e63fab01afcf0ee (0f3862eb7d28deeb577c07afa9618f1d4fa01dc3:sparrow-research/app/True_Inference_with_LayoutLMv2ForTokenClassification_CORD.ipynb)
[5] 485c8357812eaa842d02a342ccfb46c1c99621df (2f9ed47e93268fdf51c37231ffcdb94332b90fc3^{tree})
[6] 85dd92b077376fabb27143625f748ed7ae5d0fb6 (af5628d301ea332b3db4144a4282322f553f69d3^{tree})
[7] 1bacc49592a9abe9e3d9b30fe26629cb3cde6fb4 (0f3862eb7d28deeb577c07afa9618f1d4fa01dc3^{tree})
[8] e7fcc8264f378e53a28e428b063345c3798b00df (675889bb653f7a68bbc0d7b3812e2b624278a5f3^{tree})
Thanks, I will give it a try.
The garbage files are removed, and a clone of the repo is now 7 MB. Thanks for pointing out this issue.
This looks like a very exciting project! I noticed that just cloning the repo takes minutes; the .git folder probably still contains old references/commits to huge files that were later removed from the repo.