Open fuse314 opened 1 year ago
I sent a note to this effect as well. reducing the repo to JUST the needful results in a collection which consumes about 26MB.
what I propose is creating a new 'blowfish core' as an isolated repo minus the exampleSite, public dirs... and the related git history, and importing that... I'd proposed it as a thing that would coincide with the breaking change I proposed here: https://github.com/nunocoracao/blowfish/discussions/936 seeing as implementing this would require a major version bump anyways, it felt like a "good" time to do something drastic like this... but yeah.... >700mb to 20mb is... substantially different performance-wise in the ci universe
+1 for reducing the repo. It's really a pain: 700 MB for a theme!
+1 for optimizing the repo's size, especially since the upstream version in Congo is <40 MB whereas Blowfish's is nearly 750MB.
Basically all of the bloat is due to this project's additions to `exampleSite` and `public`.
EDIT (2024-07): A sparse clone that excludes these extraneous files takes <2 MB of network activity, <1 second to clone/checkout, and only a few MB of disk usage.
git clone --filter=blob:none --no-checkout --depth=1 --sparse https://github.com/nunocoracao/blowfish.git
cd blowfish
printf '/*\n!exampleSite/*\n!images/*\n!assets/img/*\n!blowfish_logo.png' > .git/info/sparse-checkout
git checkout
It would be a better approach to transfer `exampleSite` and `public` to a new repo.
technically false… because the .git directory/history remains in the repo, and that’s >half the problem.
extracting the core to a new repo would be the least painful way to address that.
git repo surgery is a PITA.
it’s a breaking change almost however you cut it, as any downstream repo / clone will encounter problems with history being rewritten
Hence why I was proposing it coincide with my other breaking change of robustifying the authors construct..
Rewriting git history isn't too complicated, especially if you're removing whole directories, but yeah it would probably be better to move or break off the main theme files elsewhere to avoid breaking everyone else's copies.
You could even maintain stars and such by just keeping this one as the main repo, but with a note that you can clone another one for just the theme alone.
> keeping this one as the main repo, but with a note that you can clone another one for just the theme alone.
A 1:1 repo like blowfish-lite
that would require @nunocoracao to maintain two copies of the same codebase… more toil, greater chance for divergence.
what feels least bullshitty to me is having a core repo that has the theme content which this repo consumes…. but wth do i know? :)
> that would require @nunocoracao to maintain two copies of the same codebase… more toil, greater chance for divergence.
> what feels least bullshitty to me is having a core repo that has the theme content which this repo consumes
I was thinking something similar, but even if they were completely separate, you could lazily sync the theme into the main repo on every commit with a GitHub action.
implementation details :) lots of ways to accomplish it… each has its own bag of bullshit… i don’t wanna be too prescriptive of the how… just wanting to amplify the legitimacy of the request
Hey @fuse314 @wolfspyre @ragibson @chromer030 Thanks for all the feedback. I am definitely interested in improving this ASAP. From the thread I got a couple of actions:
Small update - the latest changes have already reduced the repo size from 736M to 546M (25% reduction). Will keep exploring the other two options a little more before committing to a solution.
@nunocoracao if you wanna take a look at what I'm doing async:
I mirror this repo:
https://gitlab.wolfspyre.io/mirrored_repos/hugo_related/nunocoracao/blowfish
The scripts here rip the history around quite a bit... might be useful to play with: https://gitlab.wolfspyre.io/mirrored_repos/hugo_related/nunocoracao/blowfish-wrangler
The output: https://gitlab.wolfspyre.io/mirrored_repos/hugo_related/nunocoracao/blowfish-thin
/tmp$ git clone https://gitlab.wolfspyre.io/mirrored_repos/hugo_related/nunocoracao/blowfish-thin
Cloning into 'blowfish-thin'...
warning: redirecting to https://gitlab.wolfspyre.io/mirrored_repos/hugo_related/nunocoracao/blowfish-thin.git/
remote: Enumerating objects: 5704, done.
remote: Counting objects: 100% (2322/2322), done.
remote: Compressing objects: 100% (1489/1489), done.
remote: Total 5704 (delta 831), reused 2322 (delta 831), pack-reused 3382
Receiving objects: 100% (5704/5704), 36.85 MiB | 38.12 MiB/s, done.
Resolving deltas: 100% (2997/2997), done.
/tmp$ du -sh blowfish-thin/
42M blowfish-thin/
/tmp$ cd blowfish-thin
/tmp/blowfish-thin (wpl_main)$ ls
CODE_OF_CONDUCT.md README.md config.toml i18n package-lock.json tailwind.config.js
CONTRIBUTING.md archetypes data layouts package.json theme.toml
FUNDING.yml assets firebase.json lighthouserc.js processUsers.js
LICENSE config go.mod netlify.toml static
/tmp/blowfish-thin (wpl_main)$ rm -rf .git
/tmp/blowfish-thin$ du -sh .
4.4M .
versus:
/tmp$ git clone https://gitlab.wolfspyre.io/mirrored_repos/hugo_related/nunocoracao/blowfish
Cloning into 'blowfish'...
warning: redirecting to https://gitlab.wolfspyre.io/mirrored_repos/hugo_related/nunocoracao/blowfish.git/
remote: Enumerating objects: 23670, done.
remote: Total 23670 (delta 0), reused 0 (delta 0), pack-reused 23670
Receiving objects: 100% (23670/23670), 358.79 MiB | 40.04 MiB/s, done.
Resolving deltas: 100% (11428/11428), done.
Updating files: 100% (1864/1864), done.
/tmp$ du -sh blowfish/
815M blowfish/
/tmp$ cd blowfish
/tmp/blowfish
/tmp/blowfish (wpl_main)$ rm -rf .git
/tmp/blowfish$ du -sh .
445M .
I'm not gonna assert this is perfect :) it's kludgey... and a bit brittle... but it works for the moment.... and it might be helpful as an exploratory POC. Will reach out in email.
@wolfspyre checked your solution, but it seems that it's not possible to change history and use the same git repo, as things become incompatible, right?
@nunocoracao it’s possible for sure, but every path comes with caveats… when you rewrite history, it invalidates others’ versions of it.. :)
so it’s something that needs to be done in a coordinated and clear fashion..
the least messy way forward (IMO) is likely a clean repo for ‘BLOWFISH CORE’ or something that blowfish imports… saves you the hassle of rewriting history … gives you a blank slate for tomorrow… keeps existing repo around… (shouldn’t ) make anyone adjust their existing tooling
but that then requires the plumbing which slurps blowfish core into blowfish anytime core changes…
alternatively,
this would mean anyone consuming the repo would have to manually twiddle git, as history changed…
something like this would coincide well with a major version change…
people that don’t want to follow along can switch their upstream from blowfish to blowfish-legacy and be insulated from any breaking changes
there’s many ways forward, each comes with some nuance and sticky spots…
there’s a few REALLY BAD IDEA ways forward, but barring those, most of the options are viable for a given set of constraints… which makes the most sense depends on your unique needs/preferences as much as the technical requirements/limitations yknow?
@wolfspyre not really comfortable with messing with the git history. Meanwhile, trimmed it down again from 553mb to 460mb by reducing image sizes
Is it worth looking at using .webp image formats over .jpg and .png? I've started implementing this on my own site and it works great.
> Is it worth looking at using .webp image formats over .jpg and .png? Ive started implementing this on my own site and works great.
I'd note that the old versions of the images will still be stored in .git's bookkeeping, but that would presumably help on shallow clones.
@fuse314 @wolfspyre Speaking of which, in a CI environment you should probably be cloning with `--depth=1` to only include history truncated to the most recent commit. That'll cut the current repo size from ~470MB to ~145MB.
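To make the effect of `--depth=1` concrete, here is a small offline sketch. It builds a throwaway stand-in repository (not Blowfish itself; the paths, file names and the `demo` identity are assumptions for illustration) and compares the commit count of a shallow clone with that of a full clone.

```shell
#!/usr/bin/env bash
set -e

# Throwaway sandbox; nothing here touches the real Blowfish repository.
sandbox=$(mktemp -d)
src="$sandbox/upstream"

# Build a stand-in upstream repo with three commits.
git init -q "$src"
for n in 1 2 3; do
  echo "revision $n" > "$src/theme.toml"
  git -C "$src" add theme.toml
  git -C "$src" -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "commit $n"
done

# A file:// URL goes through the normal transport, so --depth is honoured
# even though the source is on the local disk.
git clone -q --depth=1 "file://$src" "$sandbox/shallow"
git clone -q "file://$src" "$sandbox/full"

shallow_commits=$(git -C "$sandbox/shallow" rev-list --count HEAD)
full_commits=$(git -C "$sandbox/full" rev-list --count HEAD)
echo "shallow clone: $shallow_commits commit(s), full clone: $full_commits commit(s)"
```

The shallow clone only carries the objects reachable from the newest commit, which is where the size saving in CI comes from.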
something to keep in mind.... git history keeps EVERYTHING that EVER was in the repo.... every time you convert an asset, you're increasing the repo size by that much...
This is part of the reason why having the public version of the site in the repo is problematic; due to the asset hashing / fingerprinting, every new version has almost a full copy of the public site and all its images...
/tmp$ git clone https://github.com/nunocoracao/blowfish.git
Cloning into 'blowfish'...
remote: Enumerating objects: 24053, done.
remote: Counting objects: 100% (24053/24053), done.
remote: Compressing objects: 100% (10846/10846), done.
remote: Total 24053 (delta 11552), reused 23709 (delta 11441), pack-reused 0
Receiving objects: 100% (24053/24053), 379.26 MiB | 20.45 MiB/s, done.
Resolving deltas: 100% (11552/11552), done.
/tmp$
/tmp$ du -sh blowfish/; cd blowfish
457M blowfish/
/tmp/blowfish (main)$
/tmp/blowfish (main)$ git-filter-repo --analyze
Processed 14043 blob sizes
Processed 1505 commits
Writing reports to .git/filter-repo/analysis...done.
/tmp/blowfish (main)$
/tmp/blowfish (main)$ head -13 .git/filter-repo/analysis/directories-all-sizes.txt
=== All directories by reverse size ===
Format: unpacked size, packed size, date deleted, directory name
1109829186 860493241 <present> <toplevel>
549038949 513229890 <present> exampleSite
279498292 268756950 <present> exampleSite/content
174077004 165795885 2023-10-15 public
162893073 155468174 2023-10-15 exampleSite/resources/_gen/images
162893073 155468174 2023-10-15 exampleSite/resources/_gen
162893073 155468174 2023-10-15 exampleSite/resources
131061836 124922519 <present> exampleSite/content/docs
113300504 110829169 2023-10-15 public/docs
103408455 102654615 2023-10-15 exampleSite/resources/_gen/images/docs
100530759 58280398 2022-10-02 docs
/tmp/blowfish (main)$
/tmp/blowfish (main)$ head -10 .git/filter-repo/analysis/directories-deleted-sizes.txt
=== Deleted directories by reverse size ===
Format: unpacked size, packed size, date deleted, directory name
174077004 165795885 2023-10-15 public
162893073 155468174 2023-10-15 exampleSite/resources/_gen/images
162893073 155468174 2023-10-15 exampleSite/resources/_gen
162893073 155468174 2023-10-15 exampleSite/resources
113300504 110829169 2023-10-15 public/docs
103408455 102654615 2023-10-15 exampleSite/resources/_gen/images/docs
100530759 58280398 2022-10-02 docs
46014364 35717600 2022-09-12 exampleSite/docs
/tmp/blowfish (main)$
/tmp/blowfish (main)$ rm -rf .git
/tmp/blowfish$ du -sh .
72M .
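The byte counts in the `git-filter-repo --analyze` report are easier to compare in MiB. The following is a minimal sketch, assuming the two-line header and the four-column layout shown in the report above; `report_in_mib` is a hypothetical helper name, not part of git-filter-repo.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "packed MiB <TAB> directory" for each data row of a
# directories-*-sizes.txt report read from stdin.
report_in_mib() {
  # Skip the two header lines and anything whose second column is not a number.
  LC_ALL=C awk 'NR > 2 && $2 ~ /^[0-9]+$/ { printf "%.1f\t%s\n", $2 / 1048576, $4 }'
}

# Two rows copied from the report above, preceded by its two header lines.
sample=$(report_in_mib <<'EOF'
=== All directories by reverse size ===
Format: unpacked size, packed size, date deleted, directory name
549038949 513229890 <present> exampleSite
174077004 165795885 2023-10-15 public
EOF
)
echo "$sample"
```

Against a real checkout it could be fed with `report_in_mib < .git/filter-repo/analysis/directories-all-sizes.txt`.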
Yes, webp is marginally better compression-wise, but if stuff's already reasonably compressed /sized, the full replication of the asset in history likely outweighs any gains in asset size from compression...
now, for NEW assets, certainly worth exploring, but I leave that decision to @nunocoracao ;)
@ragibson my ci already is doing so (plus I haz a local mirror of the repo, so it's less of an issue FOR ME), but that's beside the point of curbing the bloat, which @nunocoracao 's already substantially impacted ( <3 ) rolling forward... now it's a simple question of where else the juice is worth the squeeze ;)
Dumb question probably. Is there any way you "manipulate" the git info and still use this same repo?
> Dumb question probably. Is there any way you "manipulate" the git info and still use this same repo?
So, it's really more that rewriting git history will break anyone else's checkout/clone of the repo, though it is easy enough to fix on their end for an experienced user by pruning the git repo or simply recloning.
You could mess with history all you want and still use this repo IF that were an acceptable result (it's probably not).
One other strategy is to clean up the project and then use git replace to transparently ignore older files from the git history unless they are absolutely needed. That does not rewrite history and would reduce the size of git clones tremendously, but it is definitely a more advanced operation and comes with its own series of gotchas. See something like https://stackoverflow.com/a/17622991 for more details.
Thanks @ragibson super appreciate the help. And also sorry everyone, this is mainly due to me including a bunch of unneeded folders initially in the repo which were deleted several versions ago.
@ragibson is there a safe way for me to test these solutions in a separate repo - e.g. forking Blowfish and then trying out these git operations in it? One of my concerns is the risk involved in f-ing this up for everyone.
Sure -- I don't think GitHub will let you fork your own repo, but you can try
You're right that I wouldn't recommend experimenting on the production repository itself
I think you can just make a new repo, let's call it "newtestrepo", on GitHub.
Clone this blowfish repo to a local folder "newtestrepo".
Change the git remote URL to the newtestrepo: `git remote set-url origin https://github.com/user/newtestrepo.git`
I don't know if the `.github` folder does unexpected things, or if you have to set up any processes manually...
Push the repo to the new location with `git push origin`.
Then, rewrite the history and `git push origin` (probably with the "force" option) the changes to the test repo.
Freshly clone the test repo into another local folder and check the new folder size.
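The steps above can be rehearsed entirely offline before involving GitHub. This is only a sketch: `copy_to_test_repo` is a hypothetical helper, and the local bare repository stands in for the "newtestrepo" that would normally live on GitHub. The history rewrite itself is left as a placeholder comment.

```shell
#!/usr/bin/env bash
set -e

# Hypothetical helper mirroring the steps above: clone SOURCE, repoint origin
# at TEST, and push the current branch there.
copy_to_test_repo() {
  local source_url=$1 test_url=$2 workdir=$3
  git clone -q "$source_url" "$workdir"
  git -C "$workdir" remote set-url origin "$test_url"
  git -C "$workdir" push -q origin HEAD
}

# Local stand-ins so the sketch runs without network access.
sandbox=$(mktemp -d)
git init -q "$sandbox/upstream"
echo 'name = "demo"' > "$sandbox/upstream/theme.toml"
git -C "$sandbox/upstream" add theme.toml
git -C "$sandbox/upstream" -c user.name=demo -c user.email=demo@example.com \
  commit -q -m "initial"
git init -q --bare "$sandbox/newtestrepo.git"

copy_to_test_repo "file://$sandbox/upstream" "file://$sandbox/newtestrepo.git" "$sandbox/work"

# ... rewrite history inside "$sandbox/work" here, then force-push the result:
git -C "$sandbox/work" push -q --force origin HEAD

# A fresh clone of the test repo shows what other users would download.
git clone -q "file://$sandbox/newtestrepo.git" "$sandbox/verify"
du -sh "$sandbox/verify"
pushed_url=$(git -C "$sandbox/work" config --get remote.origin.url)
```

Because `origin` no longer points at the source repository after `set-url`, a force push from the working copy cannot reach it by accident.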
While this doesn't address the root issue, my current workaround for the CI pipeline is to simply download Blowfish's latest release archive. The archive does include `exampleSite/`, but not the `public/` dir and git history.
For the latest release (v2.44.0), it's a 67mb download and 72mb unarchived. The pipeline step only takes 3s on a default GitHub runner.
curl -o blowfish.tar.gz -L "$(curl -s https://api.github.com/repos/nunocoracao/blowfish/releases/latest | jq -r '.tarball_url')"
tar --one-top-level=themes/blowfish --strip-components=1 -xzf blowfish.tar.gz
EDIT: I just realized the downloaded archive probably gets included in the deployment to e.g. Firebase Hosting (unless you ignore it in `firebase.json`). So you'll need to either delete it after unarchiving, or better yet here's a one-liner that doesn't write it to disk:
tar --one-top-level=themes/blowfish --strip-components=1 -xzf <(curl -Ls $(curl -s https://api.github.com/repos/nunocoracao/blowfish/releases/latest | jq -r '.tarball_url'))
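The commands above rely on `--one-top-level`, which as far as I know is a GNU tar option and may be missing from other tar implementations (for example on macOS runners). A possible alternative is to create the target directory first and extract into it with `-C`. In this sketch `extract_theme` is a hypothetical helper, and the archive is a locally built stand-in whose top-level directory name is made up.

```shell
#!/usr/bin/env bash
set -e

# Hypothetical helper: unpack a release tarball into DEST, dropping the
# archive's single top-level directory.
extract_theme() {
  local archive=$1 dest=$2
  mkdir -p "$dest"
  tar -xzf "$archive" -C "$dest" --strip-components=1
}

# Offline stand-in for the downloaded archive (directory name is invented).
sandbox=$(mktemp -d)
mkdir -p "$sandbox/nunocoracao-blowfish-0000000/layouts"
echo 'name = "demo"' > "$sandbox/nunocoracao-blowfish-0000000/theme.toml"
echo '<html></html>' > "$sandbox/nunocoracao-blowfish-0000000/layouts/index.html"
tar -czf "$sandbox/blowfish.tar.gz" -C "$sandbox" nunocoracao-blowfish-0000000

extract_theme "$sandbox/blowfish.tar.gz" "$sandbox/site/themes/blowfish"
ls "$sandbox/site/themes/blowfish"
```

In a pipeline, the first argument would be the file downloaded from the release's `tarball_url`, and the archive can be deleted afterwards so it is not deployed.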
I'm probably late to the discussion but I'd suggest cloning blowfish as a shallow submodule rather than directly. Consider the following repo where I also use blowfish. You can clone it with the modules using `--recurse-submodules`:
git clone https://github.com/madoke/madoke.org.git blowfish-test --recurse-submodules
Cloning into 'blowfish-test'...
remote: Enumerating objects: 1461, done.
remote: Counting objects: 100% (713/713), done.
remote: Compressing objects: 100% (370/370), done.
remote: Total 1461 (delta 283), reused 676 (delta 252), pack-reused 748
Receiving objects: 100% (1461/1461), 36.68 MiB | 10.11 MiB/s, done.
Merge branch 'main' of github.com:madoke/madoke.org
Resolving deltas: 100% (413/413), done.
Submodule 'themes/blowfish' (https://github.com/madoke/blowfish) registered for path 'themes/blowfish'
Cloning into '/Users/madoke/work/blowfish-test/themes/blowfish'...
remote: Enumerating objects: 17536, done.
remote: Counting objects: 100% (1209/1209), done.
remote: Compressing objects: 100% (547/547), done.
remote: Total 17536 (delta 676), reused 1139 (delta 634), pack-reused 16327
Receiving objects: 100% (17536/17536), 373.44 MiB | 19.53 MiB/s, done.
Resolving deltas: 100% (9459/9459), done.
Submodule path 'themes/blowfish': checked out '96cbca1d4d2ce7dddbdae5ea940d749aa16929a6'
Checking the size reveals that the latest version of blowfish takes only 73M, which I guess is already significantly small due to previous efforts:
du -sh blowfish-test/themes/
73M blowfish-test/themes/
The key thing here is that the `.git` folder containing the history is not pulled entirely, as we can see this one takes 4K while cloning blowfish directly will pull the entire history which takes 400M+
du -sh blowfish-test/themes/blowfish/.git
4.0K blowfish-test/themes/blowfish/.git
Hope this helps anyone !
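One caveat when measuring this: inside a submodule checkout, `.git` is normally a small pointer file and the objects live under the parent repository's `.git/modules/`, so the 4K figure may not reflect how much history was fetched (the clone log above reports 373.44 MiB received for the theme). To actually keep the submodule shallow, git can be told so in `.gitmodules`. A minimal sketch, writing to a scratch file rather than a real site repository:

```shell
#!/usr/bin/env bash
set -e

sandbox=$(mktemp -d)
modules_file="$sandbox/.gitmodules"

# Record the submodule the way `git submodule add` would, plus the
# `shallow` hint that asks git to clone it with a depth of 1.
git config -f "$modules_file" submodule.themes/blowfish.path themes/blowfish
git config -f "$modules_file" submodule.themes/blowfish.url https://github.com/nunocoracao/blowfish.git
git config -f "$modules_file" submodule.themes/blowfish.shallow true

cat "$modules_file"
shallow_flag=$(git config -f "$modules_file" --get submodule.themes/blowfish.shallow)

# In a real site repository (and in CI) the theme could then be fetched with:
#   git submodule update --init --depth 1 themes/blowfish
```

Checking `du -sh .git/modules` in the parent repository afterwards should give a more representative number than the submodule's own `.git` entry.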
Not sure if it's easier to just use hugo module instead so we don't need to deal with the `.git`?
Small migration work will be needed for users of course.
current submodule:
--- /private/tmp/mynewsite ------------------------------
148.4 MiB [##################################] /themes
81.2 MiB [################## ] /.git
28.0 KiB [ ] /config
4.0 KiB [ ] /archetypes
4.0 KiB [ ] .gitmodules
4.0 KiB [ ] hugo.toml
(...omitted)
hugo mod (gathered with `$ hugo config | grep cachedir`)
--- /Users/<redacted>/Library/Caches/hugo_cache/modules/filecache/modules/pkg/mod/github.com/nunocoracao/blowfish/v2@v2.72.1 -----
/..
93.3 MiB [##################################] /assets
47.2 MiB [################# ] /exampleSite
6.2 MiB [## ] /images
520.0 KiB [ ] blowfish_logo.png
480.0 KiB [ ] /layouts
(...omitted)
To reduce the size further:
1. `exampleSite`, as mentioned by many others above
2. `mermaid` in `assets/lib/` takes 89.8 MiB. After removing `*.js.map` source maps it comes down to 55.3 MiB. Haven't checked how the packages are pulled, but there should be room for improvement? (We can of course just leverage a CDN, but I personally prefer the self-contained way for assets.)
> 2. `mermaid` in the `assets/lib/` takes 89.8 MiB.
I just noticed this same thing -- `lib/mermaid` takes up ~96% of the entire assets folder. Was it bundled incorrectly?
> @fuse314 @wolfspyre Speaking of which, in a CI environment you should probably be cloning with `--depth=1` to only include history truncated to the most recent commit. That'll cut the current repo size from ~470MB to ~145MB.
Comparing to my comment last October, the full repo size has increased to ~623 MB with a `depth=1` clone being ~220 MB.
Okay yes, something's definitely not bundled right. The theme blowfish is forked from (Congo) has `lib/mermaid` only taking up ~3 MB.
Anyway, a sparse clone that excludes the major extraneous files takes <2 MB of network activity, <1 second to clone/checkout, and only a few MB of disk usage.
git clone --filter=blob:none --no-checkout --depth=1 --sparse https://github.com/nunocoracao/blowfish.git
cd blowfish
printf '/*\n!assets/lib/mermaid/*\n!exampleSite/*\n!images/*\n!assets/img/*\n!blowfish_logo.png' > .git/info/sparse-checkout
git checkout
What I'm doing here is omitting these extraneous directories:
- `assets/lib/mermaid/` (~90 MB)
- `exampleSite/` (~48 MB)
- `images/` (~6 MB)
- `assets/img` (~1 MB)
- `blowfish_logo.png` (~0.5 MB)
The final size is ~3 MB plus ~2 MB of git bookkeeping, and I was able to successfully build the author's site and my own site with this configuration, so it seems fine.
$ ncdu .
2.6 MiB [##########] /assets
2.0 MiB [####### ] /.git
544.0 KiB [## ] /layouts
168.0 KiB [ ] package-lock.json
...
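For anyone who wants to see how the exclusion patterns behave before pointing them at the real repository, here is an offline sketch against a stand-in repo. The file names are invented, and it sets the sparse-checkout options through `git config` instead of `--sparse`, so that pattern-based (non-cone) matching is explicit.

```shell
#!/usr/bin/env bash
set -e

sandbox=$(mktemp -d)
src="$sandbox/upstream"

# Stand-in repo with a theme file, a layout, and an exampleSite to exclude.
git init -q "$src"
mkdir -p "$src/layouts" "$src/exampleSite/content"
echo 'name = "demo"' > "$src/theme.toml"
echo '<html></html>' > "$src/layouts/index.html"
echo 'big page' > "$src/exampleSite/content/page.md"
git -C "$src" add .
git -C "$src" -c user.name=demo -c user.email=demo@example.com commit -q -m "initial"

# Clone without populating the working tree, then enable pattern-based
# sparse checkout before the first checkout.
git clone -q --no-checkout --depth=1 "file://$src" "$sandbox/sparse"
git -C "$sandbox/sparse" config core.sparseCheckout true
git -C "$sandbox/sparse" config core.sparseCheckoutCone false
mkdir -p "$sandbox/sparse/.git/info"
# Include everything at the top level, then exclude exampleSite's contents.
printf '/*\n!exampleSite/*\n' > "$sandbox/sparse/.git/info/sparse-checkout"
git -C "$sandbox/sparse" checkout -q

ls "$sandbox/sparse"
```

The later `!` pattern takes precedence over `/*`, which is why the order of the lines in the pattern file matters.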
To continue the discussion a little bit:
After #1588, we have the `exampleSite` folder left to tackle. I know @nunocoracao mentioned in https://github.com/nunocoracao/blowfish/issues/980#issuecomment-1763463347 that he wanted to see whether it's possible to keep the folder but make it much smaller. Considering that more and more users will add their sites to the list, the folder will still grow over time due to the thumbnails of each site.
So just like OP and others have mentioned, it might be better to create a separate repo to host `exampleSite`, like:
- `git rm` the `exampleSite` in https://github.com/nunocoracao/blowfish/
- `exampleSite` as a git submodule (not sure the use case here)
Either way needs @nunocoracao to create a repo first and then I can make some PRs for:
- `dev` branch
- https://github.com/nunocoracao/blowfish/blob/c6c2689195eb47241c613a1e4b8d0828ebf3c6e4/exampleSite/content/users/_index.md?plain=1#L22
By then we will have a hugo module at roughly 14MiB. However, people using a git submodule might still need to deal with the much larger `.git` folder with the tricks above. That's also the reason I personally recommend hugo module instead.
Hey everyone, I haven’t noticed anyone talking about the images here. I found out, that you can compress the images (at least png-images) with pngquant:
Compress a specific PNG as much as possible and write the result to a new file:
pngquant path/to/file.png
And the results are, I compressed the images directory, it went from 6.17 MB to 2.47 MB. In my projects I use almost exclusively avif (or webp sometimes), and all the same images in avif are just 349 KB. Visually, I see no difference between compressed images and their original counterparts.
Possibly, it’s reasonable for CI to exclude everything unnecessary (as said before, public, exampleSite etc.). So I’m just saying. I may PR this a bit later, but I don’t know what is better, to have the same compressed images or just another format in here (avif or webp).
> Hey everyone, I haven’t noticed anyone talking about the images here. I found out, that you can compress the images
The problem with this is it actually increases the size of the repo since you're now tracking both the compressed and original versions.
I.e., that only helps with a shallow clone, so you might as well exclude all images in a sparse checkout anyway.
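The point about both versions being tracked can be checked in a scratch repository. This sketch uses invented file contents as stand-ins for image bytes: it commits an "original" file, replaces it with a "compressed" one, and then asks git whether the original blob is still stored.

```shell
#!/usr/bin/env bash
set -e

sandbox=$(mktemp -d)
repo="$sandbox/repo"
git init -q "$repo"

# Small wrapper so the demo does not depend on a configured git identity.
demo_commit() {
  git -C "$repo" -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"
}

# Commit an "original" image (stand-in bytes) and remember its blob id.
printf 'original, uncompressed image bytes\n' > "$repo/logo.png"
git -C "$repo" add logo.png
original_blob=$(git -C "$repo" rev-parse :logo.png)
demo_commit "add logo"

# Replace it with a "compressed" version.
printf 'smaller\n' > "$repo/logo.png"
git -C "$repo" add logo.png
demo_commit "compress logo"

# The old blob is still in the object store, so a full clone carries both.
if git -C "$repo" cat-file -e "$original_blob"; then
  still_stored=yes
else
  still_stored=no
fi
echo "original blob still stored: $still_stored"
```

Only a history rewrite or a shallow/sparse clone keeps such old blobs from being downloaded.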
As someone who maintains a fork of Blowfish (due to some deep level changes I need for my website), I don't think changing the Git history would be ideal since it would be a bit of an issue for downstream forks.
Describe the bug
My CI script fetches the submodule (blowfish theme) every time the build process is run. The repo size is over 700MB, mostly due to the folders `.git`, `exampleSite` and `public`.
To Reproduce
Steps to reproduce the behavior:
1. `time git clone https://github.com/nunocoracao/blowfish blowfish-test`
   On my reasonably fast CI-VM, the process takes around 21 seconds.
2. `du -h --max-depth 1 ./blowfish-test/`
   Total size: 736MB, biggest folders: `.git` (326MB), `exampleSite` (221MB) and `public` (169MB).
3. `du -h --max-depth 1 ./blowfish-test/exampleSite/`
   Biggest folders: `content` (121M), `resources` (84MB) and `assets` (17MB).
Expected behavior
The build performance for CI builds -- and the general size of the repository -- could be improved by one of the following points:
- Remove the `public` folder from the repository; it contains the generated webpage from `exampleSite`, which could easily be regenerated by running `hugo`. (Removes 169MB. Also remove the git history of this folder to shrink the `.git` folder.)
- Put the `exampleSite` folder in its own submodule, so a shallow checkout of the theme would be around half of the current size. (Removes 221MB. Rewrite the git history to shrink the `.git` folder as well.)
- Remove `exampleSite/resources/_gen/images` from the repository; these files are generated by the hugo build process if needed (size improvement is at least 84MB).
- `exampleSite/assets/img`: converting `png` photographs to `jpg` would take up a lot less space.
Screenshots
none.
Desktop (please complete the following information):
Hugo & Blowfish versions Hugo 0.101.0, Blowfish latest commit from yesterday.
Additional context See recommended .gitignore file for hugo projects: Hugo.gitignore