katanaml / sparrow

Data processing with ML, LLM and Vision LLM
https://katanaml.io
GNU General Public License v3.0

Cloning the repo takes time and .git folder contains 800MB #72

Closed: telemmaiten closed this issue 1 month ago

telemmaiten commented 2 months ago

This looks like a very exciting project! I noticed that just cloning the repo takes minutes, and the .git folder probably still contains old commits referencing huge files that were later removed from the repo.

:~/dev/sparrow (main)$ du -ch
8.0K    ./sparrow-data/ocr/.idea
8.0K    ./sparrow-data/ocr/routers
72K     ./sparrow-data/ocr
8.0K    ./sparrow-data/parse/.idea
44K     ./sparrow-data/parse/sparrow_parse/processors
20K     ./sparrow-data/parse/sparrow_parse/extractors
12K     ./sparrow-data/parse/sparrow_parse/vllm/infra/qwen2_vl_7b
16K     ./sparrow-data/parse/sparrow_parse/vllm/infra
36K     ./sparrow-data/parse/sparrow_parse/vllm
8.0K    ./sparrow-data/parse/sparrow_parse/data
24K     ./sparrow-data/parse/sparrow_parse/helpers
144K    ./sparrow-data/parse/sparrow_parse
172K    ./sparrow-data/parse
248K    ./sparrow-data
799M    ./.git/objects/pack
4.0K    ./.git/objects/info
799M    ./.git/objects
4.0K    ./.git/refs/tags
8.0K    ./.git/refs/remotes/origin
12K     ./.git/refs/remotes
8.0K    ./.git/refs/heads
28K     ./.git/refs
8.0K    ./.git/info
8.0K    ./.git/logs/refs/remotes/origin
12K     ./.git/logs/refs/remotes
8.0K    ./.git/logs/refs/heads
24K     ./.git/logs/refs
32K     ./.git/logs
4.0K    ./.git/branches
68K     ./.git/hooks
799M    ./.git
8.0K    ./sparrow-ml/llm/.idea
1.3M    ./sparrow-ml/llm/data
32K     ./sparrow-ml/llm/rag/agents/unstructured
8.0K    ./sparrow-ml/llm/rag/agents/instructor/helpers
28K     ./sparrow-ml/llm/rag/agents/instructor
28K     ./sparrow-ml/llm/rag/agents/llamaindex
16K     ./sparrow-ml/llm/rag/agents/haystack
8.0K    ./sparrow-ml/llm/rag/agents/sparrow_parse
120K    ./sparrow-ml/llm/rag/agents
124K    ./sparrow-ml/llm/rag
16K     ./sparrow-ml/llm/embeddings/agents
20K     ./sparrow-ml/llm/embeddings
1.6M    ./sparrow-ml/llm
1.6M    ./sparrow-ml
2.2M    ./sparrow-ui/assets
2.2M    ./sparrow-ui
804M    .
804M    total
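
The listing above can be condensed to one line per top-level entry, which makes the share taken by .git easier to spot. This is only a minimal sketch: the function name top_level_sizes is hypothetical, and it assumes a POSIX shell with du and sort available.

```shell
# Hypothetical helper: print one size line per top-level entry of a directory,
# smallest first, so the largest entry ends up on the last line.
top_level_sizes() {
    dir="${1:-.}"
    # -s: one total per argument; -k: sizes in KiB so `sort -n` orders them.
    # The second glob picks up dot-directories such as .git.
    du -sk "$dir"/* "$dir"/.[!.]* 2>/dev/null | sort -n
}
```

Calling `top_level_sizes ~/dev/sparrow` on the checkout shown above should put ./.git on the last line. Inside a repository, `git count-objects -vH` reports the packed size directly as well.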

abaranovskis-redsamurai commented 2 months ago

Hey, yes, I'm aware of this problem. The project has been developed over a long time, with refactoring along the way. There were large data files previously; they have since been removed. The old references are still present in the Git history, which is why the repo is so large. Is there any way to clean up the unused references?
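
Before cleaning anything, it can help to see which objects in the history still take up space. A common way to do that is to list every object reachable from any ref and sort the blobs by size. The sketch below assumes it is run inside the repository; the function names list_history_objects and largest_blobs are hypothetical, while the git subcommands and format fields are standard Git.

```shell
# List every object reachable from any ref as:
#   <type> <sha> <size-in-bytes> <path>
list_history_objects() {
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)'
}

# Read such lines on stdin, keep only blobs, and print the N largest
# (default 20), biggest first.
largest_blobs() {
    n="${1:-20}"
    awk '$1 == "blob"' | sort -k3,3nr | head -n "$n"
}
```

For example, `list_history_objects | largest_blobs 10` prints the ten largest file versions, including files that were deleted in later commits.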

telemmaiten commented 2 months ago

I suggest trying this: https://docs.gitlab.com/ee/user/project/repository/reducing_the_repo_size_using_git.html#purge-files-from-repository-history

The reports below come from running git filter-repo --analyze:

head ./.git/filter-repo/analysis/*-{all,deleted}-sizes.txt
==> ./.git/filter-repo/analysis/directories-all-sizes.txt <==
=== All directories by reverse size ===
Format: unpacked size, packed size, date deleted, directory name
  6839388818 3585859232 <present>  <toplevel>
  5452757886 2843412749 <present>  sparrow-data
  3774605787 1966360452 <present>  sparrow-data/docs
  1676964216  876678205 2024-01-30 sparrow-data/donut
  1676903914  876663922 2024-01-30 sparrow-data/donut/docs
  1214002397  609610696 2024-01-30 sparrow-data/donut/docs/input
  1214002397  609610696 <present>  sparrow-data/docs/input
   950283524  557734798 <present>  sparrow-data/docs/models/donut/data

==> ./.git/filter-repo/analysis/extensions-all-sizes.txt <==
=== All extensions by reverse size ===
Format: unpacked size, packed size, date deleted, extension name
  5205170676 3118906432 <present>  .jpg
   968886370  365093190 <present>  .pdf
    55894911   33104330 <present>  .ipynb
   550012649   27779066 2024-04-16 .json
    28728183   25665387 <present>  .png
    13335789   11528157 2024-01-30 .PDF
     4672587    2074503 2024-01-30 .csv
      733577     615142 <present>  .jpeg

==> ./.git/filter-repo/analysis/path-all-sizes.txt <==
=== All paths by reverse accumulated size ===
Format: unpacked size, packed size, date deleted, path name
     6274484    4727634 2022-10-06 sparrow-research/app/True_Inference_with_LayoutLMv2ForTokenClassification_CORD.ipynb
     6274484    4727634 2022-10-06 research/app/True_Inference_with_LayoutLMv2ForTokenClassification_CORD.ipynb
     6274484    4727634 <present>  app/True_Inference_with_LayoutLMv2ForTokenClassification_CORD.ipynb
     7615099    4023395 <present>  app/layoutlmv2_cord_prepare.ipynb
     4231342    3178543 2022-10-06 sparrow-research/app/Inference_with_LayoutLMv2ForTokenClassification_CORD.ipynb
     4231342    3178543 2022-10-06 research/app/Inference_with_LayoutLMv2ForTokenClassification_CORD.ipynb
     4231342    3178543 <present>  app/Inference_with_LayoutLMv2ForTokenClassification_CORD.ipynb
     3397206    3006437 2024-01-30 sparrow-ui/assets/annotation.png

==> ./.git/filter-repo/analysis/directories-deleted-sizes.txt <==
=== Deleted directories by reverse size ===
Format: unpacked size, packed size, date deleted, directory name
  1676964216  876678205 2024-01-30 sparrow-data/donut
  1676903914  876663922 2024-01-30 sparrow-data/donut/docs
  1214002397  609610696 2024-01-30 sparrow-data/donut/docs/input
  1062412494  392868568 2023-02-25 sparrow-data/docs/invoices
   356479154  288680195 2024-01-30 sparrow-data/donut/docs/input/sroie
   356479154  288680195 2023-03-06 sparrow-data/docs/sroie
   354806855  287915033 2024-01-30 sparrow-data/donut/docs/input/sroie/img
   462901517  267053226 2024-01-30 sparrow-data/donut/docs/models/donut/data

==> ./.git/filter-repo/analysis/extensions-deleted-sizes.txt <==
=== Deleted extensions by reverse size ===
Format: unpacked size, packed size, date deleted, extension name
   550012649   27779066 2024-04-16 .json
    13335789   11528157 2024-01-30 .PDF
     4672587    2074503 2024-01-30 .csv
     3124209     331448 2024-01-30 .jsonl
     2023666     312489 2022-10-06 .js
      671280      46779 2024-01-30 .html
       79716       9839 2024-01-30 .ico
       18378       7550 2022-10-06 .svg

==> ./.git/filter-repo/analysis/path-deleted-sizes.txt <==
=== Deleted paths by reverse accumulated size ===
Format: unpacked size, packed size, date deleted, path name(s)
     6274484    4727634 2022-10-06 sparrow-research/app/True_Inference_with_LayoutLMv2ForTokenClassification_CORD.ipynb
     6274484    4727634 2022-10-06 research/app/True_Inference_with_LayoutLMv2ForTokenClassification_CORD.ipynb
     4231342    3178543 2022-10-06 sparrow-research/app/Inference_with_LayoutLMv2ForTokenClassification_CORD.ipynb
     4231342    3178543 2022-10-06 research/app/Inference_with_LayoutLMv2ForTokenClassification_CORD.ipynb
     3397206    3006437 2024-01-30 sparrow-ui/assets/annotation.png
     3625188    2889724 2024-01-30 sparrow-data/donut/docs/input/sroie/img/525.jpg
     3625188    2889724 2024-01-30 sparrow-data/docs/sroie/img/525.jpg
     3625188    2889724 2024-01-30 sparrow-data/docs/input/sroie/img/525.jpg
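
The deleted-sizes reports above can be turned into a list of paths to purge. The following is a rough sketch, not necessarily what was done in this repository: the helper name deleted_paths_over, the 100 MB threshold, and the file /tmp/purge-paths.txt are all hypothetical. The options --analyze, --invert-paths, and --paths-from-file belong to git filter-repo. Purging rewrites history, so it is best tried on a fresh clone first.

```shell
# Hypothetical helper: print the path column of a "*-deleted-sizes.txt" report
# for entries whose packed size (second column) is at least MIN bytes.
# Usage: deleted_paths_over MIN REPORT
deleted_paths_over() {
    min="$1"
    report="$2"
    # Header lines are skipped because their first field is not a number.
    awk -v min="$min" '
        $1 ~ /^[0-9]+$/ && $2 + 0 >= min {
            # Drop the three leading columns; the rest is the path,
            # which may itself contain spaces.
            sub(/^[ \t]*[0-9]+[ \t]+[0-9]+[ \t]+[^ \t]+[ \t]+/, "")
            print
        }' "$report"
}

# Usage sketch (run in a fresh clone; this rewrites history):
#   git filter-repo --analyze
#   deleted_paths_over 100000000 \
#       .git/filter-repo/analysis/directories-deleted-sizes.txt > /tmp/purge-paths.txt
#   git filter-repo --invert-paths --paths-from-file /tmp/purge-paths.txt
```

Only paths from the deleted reports are selected, so files that still exist in the current tree are left alone. After the rewrite, the cleaned history has to be force-pushed and existing clones need to be re-created.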

Also, this gives a good overview:

 git-sizer -v
Processing blobs: 12742
Processing trees: 2542
Processing commits: 825
Matching commits to trees: 825
Processing annotated tags: 0
Processing references: 10
| Name                         | Value     | Level of concern               |
| ---------------------------- | --------- | ------------------------------ |
| Overall repository size      |           |                                |
| * Commits                    |           |                                |
|   * Count                    |   825     |                                |
|   * Total size               |   427 KiB |                                |
| * Trees                      |           |                                |
|   * Count                    |  2.54 k   |                                |
|   * Total size               |  1.66 MiB |                                |
|   * Total tree entries       |  41.4 k   |                                |
| * Blobs                      |           |                                |
|   * Count                    |  12.7 k   |                                |
|   * Total size               |  1.47 GiB |                                |
| * Annotated tags             |           |                                |
|   * Count                    |     0     |                                |
| * References                 |           |                                |
|   * Count                    |    10     |                                |
|                              |           |                                |
| Biggest objects              |           |                                |
| * Commits                    |           |                                |
|   * Maximum size         [1] |  1.17 KiB |                                |
|   * Maximum parents      [2] |     2     |                                |
| * Trees                      |           |                                |
|   * Maximum entries      [3] |  1.00 k   | *                              |
| * Blobs                      |           |                                |
|   * Maximum size         [4] |  5.98 MiB |                                |
|                              |           |                                |
| History structure            |           |                                |
| * Maximum history depth      |   773     |                                |
| * Maximum tag depth          |     0     |                                |
|                              |           |                                |
| Biggest checkouts            |           |                                |
| * Number of directories  [5] |   176     |                                |
| * Maximum path depth     [6] |     9     |                                |
| * Maximum path length    [7] |   111 B   | *                              |
| * Number of files        [5] |  13.4 k   |                                |
| * Total size of files    [8] |  2.18 GiB | **                             |
| * Number of symlinks         |     0     |                                |
| * Number of submodules       |     0     |                                |

[1]  07ce96c19200e193821c54841e28a8137675201e
[2]  65284a42d7667612d948cf1cf8bd471807b1d6f5
[3]  d555f3d0fcec43792c512c8aa0d6864f9e2f4883 (af5628d301ea332b3db4144a4282322f553f69d3:sparrow-data/donut/docs/input/invoices/processed/images)
[4]  749bc1b2cd0956316bb1ecc54e63fab01afcf0ee (0f3862eb7d28deeb577c07afa9618f1d4fa01dc3:sparrow-research/app/True_Inference_with_LayoutLMv2ForTokenClassification_CORD.ipynb)
[5]  485c8357812eaa842d02a342ccfb46c1c99621df (2f9ed47e93268fdf51c37231ffcdb94332b90fc3^{tree})
[6]  85dd92b077376fabb27143625f748ed7ae5d0fb6 (af5628d301ea332b3db4144a4282322f553f69d3^{tree})
[7]  1bacc49592a9abe9e3d9b30fe26629cb3cde6fb4 (0f3862eb7d28deeb577c07afa9618f1d4fa01dc3^{tree})
[8]  e7fcc8264f378e53a28e428b063345c3798b00df (675889bb653f7a68bbc0d7b3812e2b624278a5f3^{tree})

abaranovskis-redsamurai commented 2 months ago

Thanks, I will give it a try.

abaranovskis-redsamurai commented 1 month ago

The garbage files have been removed; a clone of the repo is now 7 MB. Thanks for pointing out this issue.