reducing `nu_scripts` git repo size by removing images and some obsolete file and rewriting history

maxim-uvarov commented 1 month ago

Now the repo size seems huge to me:

> du | sort-by physical -r | first 4 | select path physical
╭─#─┬───────────────────────path────────────────────────┬─physical─╮
│ 0 │ /Users/user/git/nu_scripts_dev/.git               │ 297.5 MB │
│ 1 │ /Users/user/git/nu_scripts_dev/themes             │  58.6 MB │
│ 2 │ /Users/user/git/nu_scripts_dev/custom-completions │   5.5 MB │
│ 3 │ /Users/user/git/nu_scripts_dev/modules            │   1.1 MB │
╰─#─┴───────────────────────path────────────────────────┴─physical─╯

apparently because images are stored here.

Can we reduce the size of the repo by storing images using git-lfs?

I can do all the preparations and transfer the results to contributors with push rights.

fdncred commented 1 month ago

Sounds good to me. I don't know anything about git-lfs. Does it change the way you interact with the repo?

maxim-uvarov commented 1 month ago

@fdncred Here is the project's site: https://git-lfs.com

Users who have cloned the 'nu_scripts' repository without installing 'git-lfs' will see text files in place of binary files, like this:

version https://hawser.github.com/spec/v1
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
size 12345
(ending \n)

For those who have it installed, nothing will change.

On the GitHub site, images in markdown documents will be displayed as usual, and no changes are needed.

Here are instructions on how to move current files to 'git-lfs': https://docs.github.com/en/repositories/working-with-files/managing-large-files/moving-a-file-in-your-repository-to-git-large-file-storage

In short we will need to use 'filter-repo' to rewrite history and remove binary files. But before that we will need to accept or reject the current PRs

I've done it only once before, so I will review all the steps and let you know what's needed. I will rehearse everything on my clones to make sure that everything will go well and show you results.

Does that sound okay?

fdncred commented 1 month ago

What happens when someone doesn't have git lfs installed after we've made this change?

maxim-uvarov commented 1 month ago

@fdncred:

What happens when someone doesn't have git lfs installed after we've made this change?

In place of example_image.png, they will see a text file with content like:

version https://hawser.github.com/spec/v1
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
size 12345
(ending \n)

Those, who have git-lfs - they will see regular png image.

maxim-uvarov commented 1 month ago

This operation has benefits and considerations that I propose to discuss and communicate with the community.

However, I'm still gathering information, and these are just intermediate results.

I have tested all the operations, and the results can be found here:
https://github.com/maxim-uvarov/nu_scripts_reduced_size

# Run git-filter-repo
python3 /Users/user/.local/bin/git-filter-repo --path-glob '**/*.png' --invert-paths --force

# Initialize Git LFS
git lfs install
git lfs track "*.png"

# From the unmodified repository directory
let original_repo_pngs = glob **/*.png | wrap orig | insert curr {|i| $i.orig | path relative-to (pwd)}

# Copy PNG images to the new repository
$original_repo_pngs | each {|i| cp $i.orig $i.curr}

Benefits:

Faster Cloning and Operations: A smaller repository size leads to quicker clone times and improved overall performance.
Clean History: Eliminates unwanted files or sensitive data from the repository's history.

Considerations and Caveats:

Merged Pull Requests:
- Merged PRs will display old commits and diffs that no longer match the rewritten history, potentially causing confusion.
Open Pull Requests:
- Open PRs may become unmergeable due to history divergence.
- Contributors will need to rebase or update their branches to align with the new history.
Commit References:
- References to old commit hashes in issues or comments will become invalid, leading to broken links.
Forks:
- Forked repositories will contain the outdated history.
- Fork owners must rebase or re-clone to sync with the updated repository.

maxim-uvarov commented 1 month ago

I've just found out that bandwidth from 'git lfs' is charged, and only 1 GB is free.

https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-storage-and-bandwidth-usage#tracking-storage-and-bandwidth-use

Plus, it has other downsides like losing signatures from signed commits, and PR history will point to the wrong commits.

Even though it would be much tidier to have a smaller repo size, the downsides are quite noticeable. I will stop here.

maxim-uvarov commented 1 month ago

Though for new binary objects, it would be better to use 'git lfs' to avoid increasing the repo size.

fdncred commented 1 month ago

Another thought is to put the screenshots in another repo and have a submodule in this repo. I think then you could clone and only get the scripts unless you wanted both. I don't really know how all that works though.

maxim-uvarov commented 1 month ago

I like this variant, and it is feasible to do with the workflow that I described.

But again, it will also involve rewriting history to reduce the size of this repo, which might be a bit painful for current contributors and their forks.

I think I should ask the community when I have the energy for that.

NotTheDr01ds commented 1 month ago

But again, it will also involve rewriting history to reduce the size of this repo, which might be a bit painful for current contributors and their forks.

Personally I think that's a hit we're ultimately going to have to take, so probably the sooner the better.

I think I'd recommend keeping the issue open, since it is an issue, even if we don't have a full solution just yet.

maxim-uvarov commented 1 month ago

Here is the rationale and the actions that I propose.

# Let's check the size of the `.git` folder
> du .git/ | get 0.physical
236.2 MB

let $a = git verify-pack -v .git/objects/pack/*.idx | lines | split column -r ' +' | rename SHA-1 type size size-in-packfile offset-in-packfile depth base-SHA-1 | where type == 'blob' | select SHA-1 size | into filesize size
let $b = git rev-list --objects --all | lines | split column -r ' +' | rename SHA-1 filename
let $sizes = $a | join $b SHA-1 | sort-by size -r

# Let's examine the largest files
> $sizes | first 5 | print
╭─#─┬──────────────────SHA-1───────────────────┬───size───┬──────────────────────filename───────────────────────╮
│ 0 │ 87cd314f73c3bd004ea52b060d8a8d6bf40fcbe7 │   1.3 MB │ custom-completions/auto-generate/completions/mvn.nu │
│ 1 │ 09e461721303572ea79c390feec763c498b68335 │ 369.0 KB │ assets/gh-emoji.json                                │
│ 2 │ c2dc98f8c9aa296affb97d2dbeb84f2c6dc09adf │ 241.6 KB │ themes/screenshots/github-light-colorblind.png      │
│ 3 │ 9a258215d6435fac43dbe125b8cf2d77a6cc5b82 │ 241.3 KB │ themes/screenshots/github-light-default.png         │
│ 4 │ e8557e9b8cbb79b0705792cb316e46bfd98c5559 │ 240.6 KB │ themes/screenshots/github-light.png                 │
╰─#─┴──────────────────SHA-1───────────────────┴───size───┴──────────────────────filename───────────────────────╯

# Notice that `custom-completions/auto-generate/completions/mvn.nu` seems broken and therefore useless. So I propose to delete it as well.

 Let's check how much space non-'.png' files occupy
> $sizes | where filename !~ '.*\.png' | get size | math sum | print
7.0 MB

# Let's check the size of .png files
> $sizes | where filename =~ '\.png' | get size | math sum | print
253.4 MB

# Now we execute the command that will remove broken completions in `mvn.nu` and all binary `.png` files
python3 /Users/user/.local/bin/git-filter-repo --path custom-completions/auto-generate/completions/mvn.nu --path-glob '**/*.png' --invert-paths

> du .git/ | get 0.physical
4.4 MB

# The .git folder size was reduced from 236.2 MB to 4.4 MB.

# And now I push the results to my home repo
> git remote add mu https://github.com/maxim-uvarov/nu_scripts_reduced_size
> git push mu main --force

fdncred commented 1 month ago

Is there a way to render the themes as text/html/markdown/ansi so we could store that instead of images?

NotTheDr01ds commented 1 month ago

Possibly? I'm not a CSS wiz, so I'm not sure how difficult it would be

As for Markdown, we'd have to find a library that supports RGB colors through ANSI. I'm not sure whether Shiki does or not. The one used by Discord, for instance, only supports color names. I found that out the hard way when I tried to copy ANSI output from Nushell to Discord only to find out that my theme was the problem. I had to switch to nu -n to get the default color_config without RGB codes.

NotTheDr01ds commented 1 month ago

Some additional thoughts on this.

I've been playing with NUPM and other options for installing packages, and nu_scripts is problematic. Users will typically clone nu_scripts in order to install not just themes, but also stdlib-candidate/rfc. 250mb to install stdlib-candidate is a bit extreme. Multiply that by hundreds (or hopefully, eventually, thousands) of users and we're talking some serious bandwidth and storage consumption.

It's also slow. What should be a split-second download is ~20 seconds on a 1gbps link.

Even at 250mb, I think we need to fix it. It may not sound like much for a repo, but this is also a package that we want Nushell users (not just developers) to be able to install.

There are two stages to this:

First we need a solution to getting the images out of the repo - Whether HTML, a sub-repo, LFS, or some other solution.
Second, we need to get this repo back down to a reasonable size for user-installs. Perhaps I'm naive and oversimplifying here, but it seems to me that renaming and archiving the existing repo would allow folks to keep commit history, PR history, etc. Then we just start "fresh" in a new nu_scripts (or pick a new name) repo. But if we had to bite the bullet on a scripting-rewrite, I'd be okay with that as well.

fdncred commented 1 month ago

We'll lose all stars if we start a new repo. Not really interested in that.

maxim-uvarov commented 1 month ago

Also, there is some discussion on the topic in the general channel discord

and in this thread

maxim-uvarov commented 1 month ago

@NotTheDr01ds:

perhaps I'm naive and oversimplifying here, but it seems to me that renaming and archiving the existing repo would allow folks to keep commit history, PR history, etc.

I don't see much problem with PRs and commits history. This is history of commits of my fork, where I have run git lfs-script. As all the commit messages are preserved - it's not a problem to identify the PR and see all its discussion and comments if there were any.

For example in this commit: https://github.com/maxim-uvarov/nu_scripts_reduced_size/commit/1a5fac0050660f0779c1651bde336f7525e84a70 we see the number of the PR. Further, with this number we can find a PR and just go through all the comments there. Links to hashes will continue to work. The only thing - the opened pages with commits will tell that those commits don't belong to the current repository:

When you delete a public repository, the oldest, active public fork is chosen to be the new upstream repository. All other repositories are forked off of this new upstream and subsequent pull requests go to this new upstream repository. source

maxim-uvarov commented 1 month ago

Rewriting the history is supposed to be done by a trusted member of the core team.

This means that the only things that will change are:

Commit hashes of every commit after the first file deletion.
Blobs of images will be deleted from .git/objects.

Messages and authors of commits, along with their dates, will remain unchanged.

The consequences don’t seem too bad to me because:

We can have a backup clone of the repo that won’t sync anymore. All the hashes and commits there will remain unchanged. We can write some markdown explaining what happened and add a link to the backup clone. And we can commit this markdown immediately after the rewrite.

Additionally, we can add a comment to all completed PRs on GitHub, explaining where to find the history of changes if needed (even though all the links to hashes will still work, GitHub will indicate that the commits no longer belong to the repo).

So from my point of view, we won’t harm the repository’s archeology much.

The most problematic issue, as I see it, involves current unmerged PRs, forks that will need to be resynced or rebased, and the related inconveniences.

NotTheDr01ds commented 1 month ago

Probably worth adding this message from @devyn on Discord in here for the context/history:

I agree with you but just so you know history rewriting (of the sort that LFS does) doesn't mess up anything with commit metadata like date, author, description, etc. - it modifies the contents and therefore you end up with different commit SHA and also requires people to force their local branches to it, which is not ideal, but you don't really lose history

I guess commit signatures are also lost, since you can't forge those (obviously) that's about it

NotTheDr01ds commented 1 month ago

Related: PoC to svg command in Nushell -- Getting theme previews into SVG would be a pretty big space savings on the artifacts (regardless of where we store them):

test1

Was generated using pure-Nushell with caveats:

[[ col-names         val1        val2  ];
[foo                  1          4   ]
[bar                  5          5   ]
[baz                  7         10   ]
[cat                -12        100   ]
] | to svg

Blink and strike-through attributes were added by hand (for now)
Colors must be RGB (not named) at the moment (for now)

fdncred commented 1 month ago

it's progress 🚀

fdncred commented 1 month ago

here's a svg i created using ansi2 crate. it's not perfect but interesting. it's way too big too, needs optimization, e.g. each character has tags which bloats it. ls-xterm-no-font

NotTheDr01ds commented 1 month ago

My current progress on to svg:

The files below are almost entirely generated with to svg. The only modification I made manually was setting the background rect color and changing the width and height. It's currently handling:
1. ANSI RGB colors
2. ANSI SGR colors
3. ANSI xterm colors
4. Bold, underline, strikethrough
5. Combining any of the above into a single escape
It uses a much more optimized layout rather than specifying the exact placement of characters and text like ansi2 above. Each line is simply a tspan offset (dy) from the previous line.
That captures 95% of everything I'm throwing at it so far. TODO:
- Calculate height and width automatically
- Tighten up the line spacing slightly to remove the gaps in line-drawing
- Set the background and default foreground color -- For now, this might have to be done via a flag unless I can get a reliable way to query the terminal state. I'm still working on getting the example above to work.
- Background color and reverse -- For instance, I can't currently render the readme.md in the example above, , but I at least know a way to do it that is far more efficient than the ansi2 crate. Just need to implement.
- Italics, hidden, dimmed
- Blink (sure, why not)
- Handle ANSI codes that cross line-boundaries (already architected for it; just need to implement)
- Possibly rewrite using to xml
ls | to svg | save ...
eza -l | to svg | save ...
ls | lolcat ...

Even though this obviously has to wrap every character (since every one is a different color), it's still half the size of the ansi2 version.

fdncred commented 1 month ago

That's very cool! Can't wait to see your script.

nushell / nu_scripts

reducing `nu_scripts` git repo size by removing images and some obsolete file and rewriting history #970