git / git-scm.com

The git-scm.com website. Note that this repository is only for the website; issues with git itself should go to https://git-scm.com/community.
https://git-scm.com/
MIT License

Migrate git-scm.com to a static site, generated via Hugo, served via GitHub Pages #1804

Open dscho opened 11 months ago

dscho commented 11 months ago

Changes

This Pull Request adjusts the existing files such that the site is no longer served via a Rails App, but by GitHub Pages instead. A preview can be seen here: https://dscho.github.io/git-scm.com/ (which is generated and deployed from this Pull Request's branch, and will be updated via automation whenever that branch changes).

It is the culmination of a very long, and large, effort that started in February 2017 with the first attempt to migrate the site to Jekyll. Several years, and a substantial effort by @spraints, @vdye and myself, later, here is the result: No longer a Jekyll site but a Hugo site (because of render times: 20 minutes vs 30 seconds), search implemented using Pagefind, links verified by Lychee.

The main themes of the subsequent migration from the Rails App to a Hugo-generated static site are:

Context

Changes required to finalize the migration in addition to this Pull Request

Why make these changes?

spraints commented 11 months ago

:tada: This is great! Thank you so much for picking this up! The demo site looks great!

bglw commented 11 months ago

👋 Sneaking in here with some thoughts from the search side!

On first interactions, the search has some notable issues compared to the production rails search, for a few reasons on both sides of the fence.

  1. All tagged releases are indexed, so a search for rebase returns /docs/git-rebase/ and /docs/git-rebase/2.41.0/ and /docs/git-rebase/2.23.0/ and ...
    • The best fix here would be for you to omit the data-pagefind-body attribute from the numbered release pages, so that only /docs/git-rebase/ is indexed and returned
  2. Titles definitely need stronger affinity here. A search for list on the rails site returns rev-list-description, git-rev-list, and rev-list-options as the top results. Pagefind's search is significantly more varied, with a lot of results for mailing lists and related items.
    • CloudCannon/pagefind#437 is relevant and discussing much the same thing.
    • I don't have an immediate solution for this but I would love to find one.
  3. Typing rebase into the live search and hitting enter does not show the rebasing book result. Typing the query in does.
    • This helped narrow down a bug — filed as CloudCannon/pagefind#478
  4. The rails site live search has a nice Reference / Book split that would be great to recreate with filters, if possible.

(Amazing work migrating this to Hugo! ❤️)

dscho commented 11 months ago

Oh wow, Mr Pagefind himself! I'm honored to meet you, @bglw!

  • The best fix here would be for you to omit the data-pagefind-body attribute from the numbered release pages, so that only /docs/git-rebase/ is indexed and returned

I kind of wanted to be able to find stuff in old versions that is no longer present in current versions. That's why I added https://github.com/dscho/git-scm.com/commit/e9fa9630417b075b4a136518ea4dfbc7a1e884f4.

  • Titles definitely need stronger affinity here. A search for list on the rails site returns rev-list-description, git-rev-list, and rev-list-options as the top results. Pagefind's search is significantly more varied, with a lot of results for mailing lists and related items.

Excellent!

Heh, thank you for that!

  • The rails site live search has a nice Reference / Book split that would be great to recreate with filters, if possible.

Right, I had not worked on that because I hoped that the sorting by relevance would be "good enough"...

rimrul commented 11 months ago

About Heroku

That is true, but there has been an update since that 2022 mail.

https://lore.kernel.org/git/ZRHTWaPthX%2FTETJz@nand.local/

Heroku has a new (?) program for giving credits to open-source projects. The details are below:

https://www.heroku.com/open-source-credit-program

I applied on behalf of the Git project on 2023-09-25, and will follow-up on the list if/when we hear back from them.

It does seem like the PLC is still in favor of moving to a static solution, though.

https://lore.kernel.org/git/ZRrfAdX0eNutTSOy@nand.local/

  • Biggest expense is Heroku - Fusion has been covering the bill
  • Dan Moore from FusionAuth has been providing donations
  • Ideally we are able to move away from using Heroku, but in the meantime we'll have coverage either from (a) FusionAuth, or (b) Heroku's new open-source credit system

About the preview:

Search

All tagged releases are indexed, so a search for rebase returns /docs/git-rebase/ and /docs/git-rebase/2.41.0/ and /docs/git-rebase/2.23.0/ and ...

That is true. And in both the search results page and the little preview (<div id="search-results">) it's not visually obvious which result is the current version and which results are older versions. Maybe that could be improved by adding the version number to the page title for non-current versions? Or maybe a filter in the search results to exclude historical documentation? If we don't want to mangle the titles, Pagefind would show the version number below the result if we configured it as metadata.
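
A minimal sketch of that metadata idea (a hypothetical layout snippet; the version parameter name is an assumption, not the actual template):

{{/* hypothetical: emitted only in the layout for the older, versioned manual pages */}}
<h1 data-pagefind-meta="version:{{ .Params.version }}">{{ .Title }}</h1>

With an attribute like that on the older pages only, Pagefind records a version metadata entry for them, so the version could be shown with the result without mangling the visible title.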

Minor issues

There are some broken links in the preview on https://dscho.github.io/git-scm.com/docs/ that lead to https://dscho.github.io/docs/

There's a broken link on https://dscho.github.io/git-scm.com/about/free-and-open-source/ to https://dscho.github.io/git-scm.com/trademark. On the live site that redirects from https://git-scm.com/trademark to https://git-scm.com/about/trademark (https://github.com/dscho/git-scm.com/pull/1)

The "Setup and Config" headline on https://dscho.github.io/git-scm.com/docs/ is blue in the preview, but not in the live site. This is not happening for me in local testing.

There's some redirect that swallows anchors. https://dscho.github.io/git-scm.com/docs/ links to https://dscho.github.io/git-scm.com/docs/git#_git_commands, which redirects to https://dscho.github.io/git-scm.com/docs/git/ instead of https://dscho.github.io/git-scm.com/docs/git/#_git_commands. Looks like the slash-free version isn't possible with the GitHub Pages/Hugo combination (https://github.com/gohugoio/hugo/issues/492). We should update these links to contain the slash from the beginning to avoid the redirect. (https://github.com/dscho/git-scm.com/pull/3)

https://dscho.github.io/git-scm.com/downloads/mac/ has an odd grammar issue that https://git-scm.com/download/mac doesn't. (https://github.com/dscho/git-scm.com/pull/2) It says

which was released about 2 year, on 2021-08-30.

https://git-scm.com/download/mac correctly says

which was released about 2 years ago, on 2021-08-30.

Also note the slight URL change there from download to downloads. There is a redirect for that, though, so that should be fine.

rimrul commented 11 months ago

One additional note: There is a commit about porting the old 404 page, 18a3ac2, but I've only seen the generic GitHub pages 404 page on the preview in my testing.

rimrul commented 11 months ago

Switching to pagefind also changed search behaviour in another way.

The rails site always searches the English content. Pagefind defaults to what they call multilingual search, i.e. searching only pages in the same language as the one you're searching from. That's theoretically a usability improvement, but with the partial nature of our non-English content (availability of any given language can vary from man page to man page, the book exists in languages that don't have any man pages, everything else only exists in English), we might need a fallback to English here. Pagefind offers an option to force all pages to be indexed as English, but I think we can slightly abuse mergeIndex with language set to en for a better result.
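
A rough sketch of that mergeIndex idea using the Pagefind JS API (untested; the bundle path is an assumption):

// Keep the default behaviour (index matching the current page's language),
// then merge the English index from the same bundle so English-only pages
// act as a fallback for partially translated languages.
const pagefind = await import("/pagefind/pagefind.js");
await pagefind.mergeIndex("/pagefind", { language: "en" });
const search = await pagefind.search("rebase");
const first = await search.results[0]?.data();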

dscho commented 10 months ago

The "Setup and Config" headline on https://dscho.github.io/git-scm.com/docs/ is blue in the preview, but not in the live site. This is not happening for me in local testing.

I managed to fix it via 2d0f6c80293192f7882914e7f6a683c60afe3159

dscho commented 10 months ago

All tagged releases are indexed, so a search for rebase returns /docs/git-rebase/ and /docs/git-rebase/2.41.0/ and /docs/git-rebase/2.23.0/ and ...

That is true. And in both the search results page as well as the little preview (<div id="search-results">) it's not visually obvious which result is the current version and which results are older versions.

Hmm. The more I think about it, the more I get convinced that the older versions of the manual pages should be excluded from the search. I thought indexing them was a feature, but it looks as if it incurs more downsides than upsides.
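
A minimal sketch of what that exclusion could look like in the layout for the manual pages, following bglw's suggestion above (the latest flag is a hypothetical page parameter, not the actual front matter):

{{/* Only the current, un-versioned manual page gets data-pagefind-body;
     once any page carries the attribute, Pagefind skips pages without it. */}}
<div id="main" {{ if .Params.latest }}data-pagefind-body{{ end }}>
  {{ .Content }}
</div>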

pedrorijo91 commented 10 months ago

this was a major effort @dscho, thank you very much! sorry for the silence, but i've been busy with other stuff. in the meantime, and to ensure this effort won't be wasted, can you summarize what you need to make this merge-ready?

what do you still need to tackle? where do you need help from other people? :)

dscho commented 10 months ago

can you summarize what do you need to make this merge-ready?

@pedrorijo91 Yes.

The big blocker is the "live search" one.

dscho commented 10 months ago

Oh, and there's a ton of work still needed to address @rimrul's excellent feedback.

dscho commented 10 months ago
  • general QA:

    • ensure that current URLs would work after migration

      • e.g. /about#branching-and-merging, /about#staging-area etc

@pedrorijo91 TBH I would love to have help with that.

dscho commented 10 months ago
  • ensure that current URLs would work after migration

    • e.g. /about#branching-and-merging, /about#staging-area etc

@pedrorijo91 TBH I would love to have help with that.

I just realized that https://git-scm.com/about#branching-and-merging does not actually redirect to https://git-scm.com/about/branching-and-merging... so I guess this is a non-issue.

dscho commented 10 months ago

Typing rebase into the live search and hitting enter does not show the rebasing book result. Typing the query in does.

@bglw I just tested this at https://dscho.github.io/git-scm.com/ and it seems to work as expected. Thank you!

  • The rails site live search has a nice Reference / Book split that would be great to recreate with filters, if possible.

Right, I had not worked on that because I hoped that the sorting by relevance would be "good enough"...

I worked on this (7142149b5, ddbbe381c and 08183b0b0) and it seems to work now. Could you please test?
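
For context, the general shape of such a filter-based split (a hedged sketch, not the code from those commits): pages get tagged with a data-pagefind-filter attribute in the layouts, and the live search then queries each category separately.

const term = "rebase";
// Hypothetical: assumes pages carry data-pagefind-filter="category:Reference"
// or "category:Book" (the real filter values may differ).
const reference = await Search.pagefind.search(term, { filters: { category: "Reference" } });
const book = await Search.pagefind.search(term, { filters: { category: "Book" } });
// reference.results and book.results can then be rendered as two separate sections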

dscho commented 10 months ago

@pedrorijo91 I believe that this is now ready for wider testing. Do you have any objections against me pushing this to gh-pages and enabling the Actions to deploy to https://git.github.io/git-scm.com/?

pedrorijo91 commented 10 months ago

i agree that's likely the best way to test the new website @dscho . kinda impossible to review this huge diff manually :D

dscho commented 10 months ago

@bglw wow, the innocuous release notes item "Fixed a bug, resulting in a (very) large improvement to the NodeJS Indexing API performance (~100x)." seems to have a profound impact. While it is definitely not a scientific experiment (read: take the numbers with a grain of salt), the latest run with Pagefind v1.0.3 took 148.456s and the first run with Pagefind v1.0.4 took only 106.626s. Well done!

dscho commented 10 months ago

i agree that's likely the best way to test the new website @dscho .

Thank you @pedrorijo91. It's live! https://git.github.io/git-scm.com/

kinda impossible to review this huge diff manually :D

Right, I should have clarified that the majority of the diff is in the generated pages that do not actually need to be reviewed because they come from external sources where they are reviewed already. For example, content/book/ and content/docs/ contain only one non-generated file: content/docs/_index.html. You can see that in the tree of the commit before all the generated pages were added by automated GitHub workflow runs: https://github.com/git/git-scm.com/tree/ef17ce6ee91e30aba30e37478104b4384d9142ea/content

bglw commented 10 months ago

the latest run with Pagefind v1.0.3 took 148.456s and the first run with Pagefind v1.0.4 took only 106.626s.

Interesting! That bug fix should only be affecting this NodeJS API — not npx usage — so it's either just an outlier run, or something else in this release has an outsized performance impact 🤔 In either case, glad to hear it's running a bit faster 😅

(edit: I think you just landed a much faster machine — the Hugo build time also dropped from 24s in your first link, to 16s in the second)

dscho commented 10 months ago

the latest run with Pagefind v1.0.3 took 148.456s and the first run with Pagefind v1.0.4 took only 106.626s.

Interesting! That bug fix should only be affecting this NodeJS API — not npx usage — so it's either just an outlier run, or something else in this release has an outsized performance impact 🤔 In either case, glad to hear it's running a bit faster 😅

Huh. So it might actually be a fluke. I just thought that npx, being a node.js way to generate the search index, would internally use the node.js API ;-)

(edit: I think you just landed a much faster machine — the Hugo build time also dropped from 24s in your first link, to 16s in the second)

Possible. I experienced something like that recently in a different context, where subtle differences between the large macOS runners and the non-large ones caused git/git CI to fail (because Python2 was on the PATH in the non-large runners, but hidden in the large ones). So it's quite possible. Unfortunately, I do not see any breadcrumb in the logs to confirm or deny that the job is running on a large runner...

rybak commented 10 months ago

Bug report (has been fixed)

HTML entities are rendered verbatim in version dropdowns in documentation reference. Fixed now.

Bug report contents

Steps to reproduce

1. Go to a manual page in the documentation reference
2. Click on the dropdown "Version 2.40.0 ▾"
3. Observe the dropdown entry between the items for versions 2.40.0 and 2.39.0

Actual result

The dropdown has small text in italics on the right: `2.39.1 &rarr; 2.39.3 no changes`

Expected result

The dropdown has small text in italics on the right: `2.39.1 → 2.39.3 no changes`

ossilator commented 10 months ago

it strikes me as a bad idea to commit all the generated content to the main repo. how about a submodule?

rimrul commented 10 months ago

HTML entities are rendered verbatim in version dropdowns in documentation reference.

With some quick testing in my browser's dev tools, it seems like changing this <span> to a <div> would fix this.

https://github.com/dscho/git-scm.com/blob/475b6f1039839b790e5a9f40dda45ca431e4f9e8/layouts/partials/ref/versions.html#L31

dscho commented 10 months ago

HTML entities are rendered verbatim in version dropdowns in documentation reference.

With some quick testing in my browser's dev tools, it seems like changing this <span> to a <div> would fix this.

https://github.com/dscho/git-scm.com/blob/475b6f1039839b790e5a9f40dda45ca431e4f9e8/layouts/partials/ref/versions.html#L31

Actually, that does not fix it for me, but this here diff does:

diff --git a/layouts/partials/ref/versions.html b/layouts/partials/ref/versions.html
index 3eca7a4c5..6ca23230f 100644
--- a/layouts/partials/ref/versions.html
+++ b/layouts/partials/ref/versions.html
@@ -28,7 +28,7 @@
         </a>
         </li>
       {{ else }}
-        <li class="no-change"><span>{{ $v.name }} no changes</span></li>
+        <li class="no-change"><span>{{ safeHTML $v.name }} no changes</span></li>
       {{ end }}
     {{ end }}
     <li>&nbsp;</li>

I'll commit it as a fixup for 0501ad1ad1b94f70821ab79fcfd0365ab2e5b3ae.

dscho commented 10 months ago

HTML entities are rendered verbatim in version dropdowns in documentation reference.

@rybak thank you for the detailed bug report!

it strikes me as a bad idea to commit all the generated content to the main repo. how about a submodule?

@ossilator Friends don't let friends use submodules.

Seriously speaking, though, it won't work with submodules because those pages need to be generated in GitHub workflows with write access to the repository (so that the changes can be pushed), which is not possible (or at least not in a way that makes it easy to contribute) in a workflow that is defined in a different repository. Besides, the generated files need to live in subdirectories of content/ that are not always completely generated. For example, content/docs/_index.html is not generated. And hugo.yml is partially re-generated (download data and Git version), so that generated data has to be in the same repository.

In addition to making it harder to contribute, submodules would also make the deployment to GitHub Pages more fragile because of the need to clone multiple repositories instead of just one.

No, I fear that the submodules idea is actually the bad idea, not the one to commit generated files in well-defined places ;-)

dscho commented 10 months ago

HTML entities are rendered verbatim in version dropdowns in documentation reference.

With some quick testing in my browser's dev tools, it seems like changing this <span> to a <div> would fix this. https://github.com/dscho/git-scm.com/blob/475b6f1039839b790e5a9f40dda45ca431e4f9e8/layouts/partials/ref/versions.html#L31

Actually, that does not fix it for me, but this here diff does:

diff --git a/layouts/partials/ref/versions.html b/layouts/partials/ref/versions.html
index 3eca7a4c5..6ca23230f 100644
--- a/layouts/partials/ref/versions.html
+++ b/layouts/partials/ref/versions.html
@@ -28,7 +28,7 @@
         </a>
         </li>
       {{ else }}
-        <li class="no-change"><span>{{ $v.name }} no changes</span></li>
+        <li class="no-change"><span>{{ safeHTML $v.name }} no changes</span></li>
       {{ end }}
     {{ end }}
     <li>&nbsp;</li>

I'll commit it as a fixup for 0501ad1.

This is now fixed (in 367254d3a) and deployed. Thank you @rybak!

ossilator commented 10 months ago

Besides, the generated files need to live in subdirectories of content/ that are not always completely generated. For example, content/docs/_index.html is not generated. And hugo.yml is partially re-generated (download data and Git version), so that generated data has to be in the same repository.

to me this sounds like a complete nightmare. not cleanly separating the sources from the generated content is a recipe for undesired "special effects" of all kinds. and that's on top of the obvious issues of working with the repo itself.

why does the generated content need to be versioned in the first place? can't github just serve the build artifacts? as far as i can tell, you just need a simple configuration management system.

dscho commented 10 months ago

@ossilator I appreciate that you think about these issues.

But generating everything from scratch every time, that would be hell twice over. That's a nightmare. Too many things that could go wrong and testing locally would be another nightmare on top.

And your suggestion to use submodules actually gave me the creeps. I've been using submodules in the past and there are many good reasons why I don't do that anymore. I know many, many engineers with the same learning trajectory.

And honestly, I definitely do not understand why you're so averse to committing generated content. It makes so many things much easier, from easily being on the same page when two contributors are looking at the same generated page, to testing locally, to link checking, to running this in a GitHub workflow after a new Git version was released and expecting updates as quickly as possible.

Merely looking at how long it takes to re-generate all of the manual pages makes re-generating them every time a total non-starter. That would add over 10 minutes to every single deployment, for work that really only needs to be done once!

No, in this instance, by committing what has been generated by automation that can be trusted and verified, you basically know at all times what you've got; there are no hidden surprises. You know from which progit2/progit2 commit this and that file was generated, and you can verify that it was generated correctly by re-running the script and calling git diff.

So from a practical point of view, if you want to accept this from a person who has worked on this project for over a year and hence has gained a lot of experience in this space (i.e. me), committing the generated content in a well-defined way, to well-defined locations within the same repository, is making everything a lot less painful than it would otherwise be.

And if you're still not convinced, I would love to be presented with hard evidence (read: not just talk) that stands a chance of convincing me that your suggestion should be preferred over the current proposal.

vdye commented 10 months ago

@ossilator I appreciate that you think about these issues.

But generating everything from scratch every time, that would be hell twice over. That's a nightmare. Too many things that could go wrong and testing locally would be another nightmare on top.

And your suggestion to use submodules actually gave me the creeps. I've been using submodules in the past and there are many good reasons why I don't do that anymore. I know many, many engineers with the same learning trajectory.

I know you're passionate about this topic, but there's no need for hostility (re: "gave me the creeps"). The suggestion of submodules seemed more the result of some initial brainstorming w.r.t avoiding the storage of generated files in the repo. It's a starting point for a conversation, not a firm design proposal.

And honestly, I definitely do not understand why you're so averse to committing generated content. It makes so many things much easier, from easily being on the same page when two contributors are looking at the same generated page, to testing locally, to link checking, to running this in a GitHub workflow after a new Git version was released and expecting updates as quickly as possible.

As someone that maintained a repository in the past with a similar concentration of generated files, there are a number of things it can make harder as well:

There are probably more specific issues I'm not remembering but, overall, I can say that maintaining a ton of generated files was indeed a nightmare for myself and other developers. The only reason I didn't jettison them when I was maintainer is that I never got the time to update the tooling accordingly.

All that said, you've made some valid points as to why we should store generated files. So IMO the decision of whether or not to commit generated files is fairly nuanced, and warrants discussion & possibly further investigation before settling on an approach.

Merely looking at how long it takes to re-generate all of the manual pages makes re-generating them every time a total non-starter. That would add over 10 minutes to every single deployment, for work that really only needs to be done once!

10 minutes doesn't seem too bad, to be honest. But one way to avoid that while still keeping generated files out of the repo could be to store them as artifacts (e.g., a tarball of the generated files) tied to a given commit hash in the artifact storage of your choice, then use that as a sort of "pre-build" of the repository.

No, in this instance, by committing what has been generated by automation that can be trusted and verified, you basically know at all times what you've got; there are no hidden surprises. You know from which progit2/progit2 commit this and that file was generated, and you can verify that it was generated correctly by re-running the script and calling git diff.

So from a practical point of view, if you want to accept this from a person who has worked on this project for over a year and hence has gained a lot of experience in this space (i.e. me), committing the generated content in a well-defined way, to well-defined locations within the same repository, is making everything a lot less painful than it would otherwise be.

As someone that has also worked on this project (albeit not as extensively), I'm not convinced that committing generated content is the right way to go. Personal experience is valuable in informing your opinions, but it is not on its own a justification for the correctness of your approach, and it's definitely not cause to dismiss @ossilator's (or anyone else's) concerns out of hand.

And if you're still not convinced, I would love to be presented with hard evidence (read: not just talk) that stands a chance of convincing me that your suggestion should be preferred over the current proposal.

It's generally the job of the person developing a change to convince reviewers to accept that change, not the other way around. Reviewers can certainly help that process by providing technical justification when they disagree with an approach, but it's nevertheless important to understand & address concerns so that we reach a consensus based on technical merit. After all, wouldn't it be better for everyone if alternatives are thoroughly explored? If they don't end up better than what you have now, at least everyone will understand why we settled on a given approach. And if it is better than what you have now, then we end up with...something better!

I know that kind of exploration takes time, and you've already put a lot of time into this, so what I'm asking is probably more frustrating than not. But with such a massive change to such a valuable resource, it's critical that concerns are thoroughly addressed before moving forward on merging/deploying.

dscho commented 10 months ago

How would we make it easy to work with artifacts attached to commits, especially on PR branches?

I really like the simplicity of pushing to my fork and having a deployed site after the workflow run is done. Minimal surface for network issues because only one repository is checked out. And I can't think of any way to make it as simple without committing the generated files, I'm sorry.

dscho commented 10 months ago

I should also mention that I relied heavily on sparse checkouts (non-cone mode) to develop these changes. That gave me a very small section of the generated files to work with, accelerating the hugo/check/modify cycle. Also something I can't see being as convenient in any other setup.

I did think about separating better between generated vs non-generated files, via url entries in the front matter and then having a content/generated/ cone. It conflicted with my mental model, though.

As I rebased literally hundreds of times, submodules would have made my life so much worse, so I thanked myself for dismissing that idea (also because, as stated earlier, it would make automation more complex and fragile, not to mention development in PR branches).

Besides, the Rails App kept the generated files in a database. We're simply doing the same here, using Git as the database.

dscho commented 10 months ago

I definitely do not understand why you're so averse to committing generated content. It makes so many things much easier, from easily being on the same page when two contributors are looking at the same generated page, to testing locally, to link checking, to running this in a GitHub workflow after a new Git version was released and expecting updates as quickly as possible.

As someone that maintained a repository in the past with a similar concentration of generated files, there are a number of things it can make harder as well:

  • It's difficult to enforce "don't update the generated files" (even with "DON'T UPDATE THIS FILE" banners, READMEs, etc.), so when people inevitably try it, it leads to more time spent going back-and-forth on pull requests.

That's a valid point, and it is a social "problem" that I think requires social solutions: communication is key. If a contributor updates generated files, a kind redirect to the sources of those generated files is required (read: a friendly reply by a reviewer).

  • It's not necessarily straightforward (esp. for new contributors) to figure out which file(s) need to be changed to update something they see in a given generated file (is it the main body of the file? header or footer? etc.).

That is true.

Compared to the current situation in the Rails App, though, I would assume that everyone can agree that the process proposed in this here PR is a net improvement:

  • Changes in the generation process can result in massive diffs across the repository that don't add any real value.

I would disagree with the assessment about the value, given that I very frequently cross-validated fixes in script/*.rb by studying the diff of the generated files, something that would have been substantially harder with the current Rails App.

  • The generated files are usually not updated in the commit that prompted the change, making debugging/bisecting more difficult.

I can see your point about that.

You would be able to bisect changes in the generated output, but in the end you would be pointed to the commit that persists those changes; you would then have to find out which preceding commit changed the generated files before that, and you would end up with a commit range of potential root causes for those changes.

Or even worse, the source of the change might be outside of this repository, e.g. when a typo in a translated manual page was fixed (naturally, that would happen in https://github.com/jnavila/git-html-l10n instead of https://github.com/git/git-scm.com).

Yet again I want to point out, though, that committing the generated files is a net improvement: With the Rails App you may feel gas-lit about a problem you saw yesterday but that is no longer present today, with no record that the problem was there before and has been fixed. At least now we have a public record.

  • Subjectively, I don't see generated files as any different from other build artifacts (e.g. binaries), and I consider "storing build artifacts alongside source code" generally bad practice (muddles the concept of "source of truth").

While I agree in general with the paradigm to separate clearly between concerns (in this case, "source of truth" and "user-visible content"), I want to offer the pragmatic insight that in this scenario we basically add a caching layer.

Consider what the Rails App does with storing the generated content in the database: You could change it such that every time a user asks for, say, https://git-scm.com/docs/git-add, the App would determine the latest version from https://api.github.com/repos/git/git/tags (cannot use https://api.github.com/repos/git/git/releases/latest because Git does not publish releases, only tags), then pull the source from https://github.com/git/git/blob/$version/Documentation/git-add.txt, render it via AsciiDoctor (diligently retrieving the included files from the repository, too), and then, without ever caching, deliver this to the user. This would be arguably "more correct" than what the Rails App does right now. However, you will certainly agree that this would not only be slow due to an abundance of network handshakes, but would add a lot of fragility to the process, as the likelihood of network problems causing issues rises with the square of the number of network requests.

Now, you could argue that the caching layer shouldn't be the Git repository. But it is a database. And it is the most easily accessible caching layer in the context of this repository and a static website.

There are probably more specific issues I'm not remembering but, overall, I can say that maintaining a ton of generated files was indeed a nightmare for myself and other developers. The only reason I didn't jettison them when I was maintainer is that I never got the time to update the tooling accordingly.

All that said, you've made some valid points as to why we should store generated files. So IMO the decision of whether or not to commit generated files is fairly nuanced, and warrants discussion & possibly further investigation before settling on an approach.

Well, I can only reiterate that my work on this branch would have been substantially harder if the generated files had not been committed. From the prohibitive amount of time to generate and re-generate content, to avoiding network issues in the GitHub workflow runs, to being able to work on a sparse checkout to focus on the Hugo-related processing of a very small part of the generated content, to being able to verify fixes by looking at the diff (or better put, by letting sometimes elaborate commands do the verification for me; I cannot count the times I ran git diff @{1} -- ':(exclude)_gen*' ':(exclude)content/book' ':(exclude)content/docs' ':(exclude)data/docs.yml' ':(exclude)data/book-*' ':(exclude)_sync*' ':(exclude)static/book', for example, to verify that only the expected files were updated by a particular script invocation), to staving off unwelcome surprises stemming from changes in the progit* repositories between workflow runs, I cannot stress enough how much the simplicity and reliability of committing the generated content has helped me out.

Merely looking at how long it takes to re-generate all of the manual pages makes re-generating them every time a total non-starter. That would add over 10 minutes to every single deployment, for work that really only needs to be done once!

10 minutes doesn't seem too bad, to be honest.

10 minutes may not seem so bad, but it would be tripling the wall-clock time of each deployment.

Not to mention the overall build minutes we would essentially waste: keep in mind that we are building the ProGit book for 30 languages. Sure, we could parallelize that in a matrix job (as I have done in update-book.yml). But do keep in mind that this adds to the fragility of the workflow runs, and that the free plan allows for "only" 20 concurrent jobs, adding to the wall-clock time.

Besides, it is 100% against my values to waste build minutes, even if we do not have to pay for them. The planet is burning, and I want no part in contributing to that, not even "small" things like re-running jobs many times that could have run only once if only their output had been persisted.

But one way to avoid that while still keeping generated files out of the repo could be to store them as artifacts (e.g., a tarball of the generated files) tied to a given commit hash in the artifact storage of your choice, then use that as a sort of "pre-build" of the repository.

I actually had thought about that, and even had thought about using Git LFS.

The idea of attaching the generated content to a certain commit was enticing at first, yet to me it appears not viable:

I also dismissed the idea of using Git LFS after realizing that we're not talking about a few big files, but about many, many small files.

I know that kind of exploration takes time, and you've already put a lot of time into this, so what I'm asking is probably more frustrating than not. But with such a massive change to such a valuable resource, it's critical that concerns are thoroughly addressed before moving forward on merging/deploying.

I apologize for letting my frustration show with the suggestion to use submodules.

When I saw a mere one-liner of a suggestion to use submodules, I immediately thought that such a proposal should have been backed up by a lot more consideration, and certainly be accompanied by some sort of discussion demonstrating that the suggester thought at least a couple of minutes about the ramifications and be presented in good faith, i.e. with concrete upsides in mind. But that's just an explanation for my response, not an excuse. Again, I sincerely apologize.

Had I been a bit more level-headed, I would have realized that there are many people who haven't (yet) been exposed to really bad experiences with submodules; not everybody has to deal with the kinds of problems I, in my role as Git for Windows maintainer, am regularly exposed to. And I agree that the concept of submodules seems quite elegant, on paper. At least that was my initial thought when they were introduced into Git.

Sadly, very sadly, my initial assessment had no chance of surviving. And it is definitely not just personal experience. Yes, I used submodules extensively in 2007, that is true, and I ran into so many problems (with rebasing, the inadequacy of git status with submodules, the complete nightmare of stacked git commit calls, the fragility of submodules' commits being lost forever due to forced pushes, the joy of switching branches in particular with uncommitted changes in submodules, just to name a few, and those challenges are as unresolved today as they have been back then) that I saw myself forced to go through a painful process of replacing these submodules.

But that is nothing, really nothing compared to the experience I am exposed to by virtue of assisting users, including enterprise customers, with Git issues.

The same issues and challenges and problems are reported to me over and over and over again. In addition to the ones I experienced personally, there are oh so many issues with collaboration: The lack of a streamlined submodule experience in Git seemingly causes an endless stream of frustrations where users forget to commit in submodules, or in superprojects, forget to push submodules, or push them to repositories collaborators cannot access, or cannot push because they have write permission only for a fraction of the relevant repositories, where friction is caused by git clone being completely happy to default to non-recursive clones while the build processes totally break down except with full recursive clones, where elaborate build processes have to be built that work around the discrepancies caused by Git working non-recursively by default for pretty much all operations and those complexities naturally inviting their own set of subtle bugs.

In my experience, the only people suggesting to use submodules are either people who were forced to work with submodules for such a long time that they built muscle-memory, or tooling, or both, to work around (and probably even forget about) submodules' shortcomings, or (much more often) people who do not use submodules themselves.

There are really good reasons why monorepos are not only a thing, but why you read many, many reports where serious Git users switched from using submodules to using monorepos instead, and the accounts to do the reverse are few and far between. So: don't take this from me, take this from the majority of Git power users.

Looking concretely at the git-scm.com requirements, I see many similarities to the challenges I mentioned above. Elaborate build processes would be needed to ensure integrity (you don't want deployments to use the wrong version of generated content). That integrity would have to be broken for local development (you want to use just-generated content and not be forced back to the submodules' HEAD). Diffing between generated output would be non-trivial. Determining provenance of given content would involve not only two repositories (for example, docs/git-add/fr involves jnavila/git-html-l10n and git/git-scm.com), but with submodules there would be another one. If PRs trying to change a generated file are a challenge when the generated files are committed into the same repository, PRs trying to change a file in a submodule containing only generated files are an even bigger challenge. It would be way too easy to work with stale submodules by mistake. You get the picture.

So hopefully it is clear now why I am opposed to using submodules, and hopefully I presented the evidence well to back that stance up. There are really, really good reasons not to use submodules in this context. And I do not see any good reason in favor of using submodules. If anybody sees one good reason, please offer me the chance to refute it with evidence.

dscho commented 10 months ago

I cannot count the times I ran git diff @{1} -- ':(exclude)_gen*' ':(exclude)content/book' ':(exclude)content/docs' ':(exclude)data/docs.yml' ':(exclude)data/book-*' ':(exclude)_sync*' ':(exclude)static/book'

Hmm. I just had an idea, even if I do not know how practical it is: would it help y'all if we were to introduce a generated/ directory and put all of the above (i.e. generated book and manual pages, their YAML data, sync state, cached expanded AsciiDoc and images) in there?

Caveat: I do not know enough Hugo internals yet to assess whether that is even possible. But I'd like to know if that would alleviate some of the concerns y'all had before even looking into it.

ossilator commented 10 months ago

the qt project uses submodules, and from what i've experienced, it's not as painful as you suggest. esp. for an "artifact cache" repo with a one-way relationship to the main repo it shouldn't be that bad. also, as @vdye correctly assumed, you don't need to take the "submodule" suggestion in the literal git sense. any separate storage will do, including a git repo that isn't actually a submodule.

anyway, i find it somewhat hard to believe that you can't come up with a sensible build versioning scheme with easy diffing. what you need sounds like a bog-standard CI/CD workflow, and i'd be a tad surprised if github didn't support it adequately.

note that artifact persistence/caching doesn't have to be the same mechanism as the build snapshotting mechanism (which would serve as the basis for versioning and diffing) - you can have a content-addressable cache from which files are hard-/ref-linked into the output tree, just like ccache works.

always starting out with a clean output tree is a huge boon for reproducibility, and therefore bisect.

dscho commented 9 months ago

i find it somewhat hard to believe that you can't come up with a sensible build versioning scheme with easy diffing.

Well, I did. You're looking at it.

ossilator commented 9 months ago

Well, I did. You're looking at it.

one that also doesn't violate the "don't put build artifacts in the source repo" principle ...

dscho commented 9 months ago

Well, I did. You're looking at it.

one that also doesn't violate the "don't put build artifacts in the source repo" principle ...

In the context of GitHub Pages, which literally suggests committing and pushing the verbatim .html pages to serve, insisting on that principle would not seem to be a particularly tenable position to hold.

I could see the point of a much more pragmatic suggestion to use module mounts so that files generated by a particular script, or files generated from a particular repository, could be put into, say, generated/<name>/ and mapped back into data, content, static etc. I did verify that that would work for multiple data directories, i.e. something like this

module:
  mounts:
  - source: data
    target: data
  - source: generated/data
    target: data

would allow docs.yml to live in generated/data/ and still be accessed via $.Site.Data.docs.<whatever>. Somewhat surprisingly, this configuration follows the "first one wins" approach: if there is a docs.yml in data/ as well as in generated/data/, and both files contain the same key albeit with different values, with the above configuration Hugo would resolve to the value specified in data/docs.yml and ignore that generated/data/docs.yml defines a different value.

However, this solution is somewhat in search of a problem, and it is not particularly simple, either. Kind of a complicator's glove: a solution for a perceived problem that requires its own set of follow-up changes that introduce their own set of problems. For example, assuming that one would want to have different cones for, say, each ProGit book translation, the module: section would become quite long (and error-prone!) as there would have to be at least three entries per translation: data, content and static would need to be mounted for certain. Combined with the added complexity the module mounts would require in the scripts, I frankly do not see any benefit in pursuing that route any further.

dscho commented 9 months ago

@bglw I received a report that the search results are somewhat counter-intuitive, and I think the reason is the somewhat unique way Git's manual pages present the name of the command they are describing: Instead of having the command name in a header, the description "NAME" is in a <h2>, and that header is then followed by a paragraph that is in a <div> contained in another <div>, and that paragraph contains not only the name of the command, but also a very short description.

For example, the beginning of the original .html generated for git log's manual page looks like this (indentation added for clarity):

<div class="sect1">
 <h2 id="_name"><a class="anchor" href="#_name"></a>NAME</h2>
 <div class="sectionbody">
  <div class="paragraph">
   <p>git-log - Show commit logs</p>
  </div>
 </div>
</div>

The way I tried to help Pagefind (fa3f045b7b5c6a23a8c25bc0bdfac03a671e85d0) to report that manual page as first match for the search term "log" turns that into the following, via Hugo:

<div class="sect1">
 <h2 id="_name"><a class="anchor" href="#_name"></a>NAME</h2>
 <div class="sectionbody">
  <div class="paragraph">
   <p data-pagefind-weight="8">git-<span data-pagefind-weight="10">log</span> - Show commit logs</p>
  </div>
 </div>
</div>

However, that does not seem to accomplish what I want it to accomplish: I thought giving the term "log" a weight of 10 would force it to be the top hit, but it is not even among the first 10 hits:

(screenshot of the search results omitted)

Help?

dscho commented 9 months ago

I thought giving the term "log" a weight of 10 would force it to be the top hit, but it is not even among the first 10 hits

In the Developer Tools' Javascript console on the page https://dscho.github.io/git-scm.com/docs/git-log, when I run x = await Search.pagefind.debouncedSearch("log"), I get /downloads/logos.html as the first result, and git-log.html only as the fifteenth hit. Here is what x.results[0] looks like:

{ id: "en_6ca1bcb", score: 38.458725, words: (9) […], data: async data() }

Here is what x.results[14] looks like:

{ id: "en_bca381f", score: 0.3501713, words: (106) […], data: async data() }

And here is the beginning of (await x.results[0].data()).weighted_locations:

[
  {
    "weight": 7,
    "balanced_score": 51899.676,
    "location": 0
  },
  {
    "weight": 4,
    "balanced_score": 16946.832,
    "location": 4
  },
  {
    "weight": 4,
    "balanced_score": 16946.832,
    "location": 18
  },
  {
    "weight": 4,
    "balanced_score": 16946.832,
    "location": 33
  },
  {
    "weight": 4,
    "balanced_score": 16946.832,
    "location": 48
  },
  [...]
]

And here is the beginning of (await x.results[14].data()).weighted_locations:

[
  {
    "weight": 10,
    "balanced_score": 57600,
    "location": 2
  },
  {
    "weight": 8,
    "balanced_score": 36864,
    "location": 6
  },
  {
    "weight": 1,
    "balanced_score": 576,
    "location": 9
  },
  {
    "weight": 1,
    "balanced_score": 576,
    "location": 18
  },
  {
    "weight": 1,
    "balanced_score": 576,
    "location": 134
  },
  [...]
]

So despite the fifteenth hit's first balanced_score being higher than the first hit's, I guess the other locations' balanced_scores taper off too quickly for the former to be considered the clear winner?

ossilator commented 9 months ago

GitHub Pages [...] literally suggests committing and pushing the verbatim .html pages to serve

this is the most basic example possible. it's not useful to refer to it here.

if you still wanted to take it literally, you'd be working with two separate projects: one with the sources, with actions, and one with the generated content for serving. but you implied that github wouldn't make this easy to automate.

However, this solution is somewhat in search of a problem,

it's not, for all the reasons @vdye pointed out, and probably more.

and it is not particularly simple,

because you seem to be looking for ways to modify things just slightly, rather than rethinking the model. i can't quite believe that hugo doesn't offer an adequate solution to a problem that literally every user who tries to practice "versioning hygiene" must face.

dscho commented 9 months ago

i can't quite believe that hugo doesn't offer an adequate solution to a problem that literally every user who tries to practice "versioning hygiene" must face.

Hugo has obviously nothing to do with this. It merely expects its input in Markdown or HTML form, and processes it by interpreting the layouts.

Nothing in the standard Hugo model expects input in AsciiDoc format. Or input that is provided in a separate repository.

So we're in a somewhat rare scenario where we not only render AsciiDoc, but we also want to import the sources from 3rd-party repositories that we do not control.

For what it's worth, I have thought long and hard about ways to cache the rendered AsciiDoc better. It is true that I never even considered submodules because they're just not even a good fit for git/git itself, so I honestly can't take that suggestion seriously. But I have considered GitHub workflow artifacts (fragile, and hostile to local development), .zip files (again, not friendly to local development), Git LFS (not applicable because we don't work with just a couple huge files but instead with loads of small files), I did think about a SQLite database (but where would that be stored? And again, hostile to local development), and about GitHub Packages (strikes me as quite wasteful because no "release" is interesting after a newer one is published, and once again, hostile to local development). I won't mention some more obscure methods to avoid embarrassment. And lastly, I also considered re-generating everything from scratch, but dismissed it as laughably ridiculous that any contributor would need to generate everything from scratch for >15 minutes every time they need to test a change (even for simple typo fixes of the front page!!!).

These considerations happened almost two years ago, and after settling on the current setup (tracking generated files, much like git/git tracks files in sha1dc/ that are simple copies of files from the sha1collisiondetection submodule, in case that submodule is not checked out), the model has simply served me too well to merit a change.

I'm really sorry, but even after all of this discussion, it seriously looks to me as if there are no better ideas available, and that we're no longer discussing options with the goal of improving the design but it looks rather like insisting on "one's own" idea. Dropping a "principle" in somewhat a hand-waving way is not particularly convincing. I see many upsides to caching the generated files by way of committing them, I see many more upsides in committing them to the same repository instead of one or more separate ones, and while I see one valid argument in disfavor ("contributors might open PRs that change generated files") I see this as eminently manageable a problem to have.

So I consider this discussion to have passed the point of being helpful. While we were discussing hypotheticals here, other discussions revolving around this PR have been a lot more fruitful: discussions that pointed out not only broken links but also the tools that helped me fix all of them, discussions raising the valid concern of keeping existing links working, and discussions around improving the search feature. So maybe there is a way to bring this here discussion back to something fruitful, maybe by pointing out some other valid concern around the current design than PRs accidentally changing generated files. Something that might actually cause problems.

I am opposed to complicating the build process, the GitHub workflows, the local development by drive-by contributors. That's a total non-goal for me. The best contender for an alternative to "generate-then-commit-the manual-pages-and-the-book-translations" so far was the idea to generate everything from scratch all the time. Which is a pretty awful idea once you tally up the time this would take and how tedious it would make it to contribute.

bglw commented 9 months ago

👋 @dscho sorry just catching up here

I received a report that the search results are somewhat counter-intuitive ... I tried to help Pagefind to report that manual page as first match for the search term "log" ... I thought giving the term "log" a weight of 10 would force it to be the top hit, but it is not even among the first 10 hits

Hmm, by the time you're at data-pagefind-weight="10" it should definitely be surfaced significantly higher — but currently there are no firm guarantees on how that flows through the rankings

I guess the fact that the other locations' balanced_scores taper off too quickly with the former to be considered the clear winner?

Yeah — one thing Pagefind looks for is pages with a high density of words — so the other page has a lot of logo — and it's a short page so the density is very high. /docs/git-log is substantially larger, so log is comparatively lower density. Such are the challenges of full-text search 😓 (and that factors in logo being deranked slightly since it isn't exactly log)

A relevant issue for this is https://github.com/CloudCannon/pagefind/issues/437 — this would help boosting titles (or other pieces of metadata) that match further up results.

Another thing that would help would be if Pagefind exposed more controls on how it ranks content — for example you could opt out of the density ranking to help here.

For right now — a good approach might be to go fully manual with your rankings. If you wrap your entire body in data-pagefind-weight="0.2" (or anything down to like 0.05) then your weights will all start very low. Having this weight wrap everything will also opt-out of doing anything automatic with headings, so an h1 won't get an automatic higher weight.

With that, a weight of 10 will be substantially higher — hopefully enough to push those pages right to the top. And you might want to put explicit weights on other content/headings as required. A weight of 10 is ~100x stronger than 1 — but a weight of 10 is ~50,000x stronger than 0.05.

If that doesn't pan out nicely, or causes other issues, let me know! I can fast-track one of the two approaches on Pagefind's end to improve this kind of reference search 🙂

dscho commented 8 months ago

For right now — a good approach might be to go fully manual with your rankings. If you wrap your entire body in data-pagefind-weight="0.2" (or anything down to like 0.05) then your weights will all start very low. Having this weight wrap everything will also opt-out of doing anything automatic with headings, so an h1 won't get an automatic higher weight.

With that, a weight of 10 will be substantially higher — hopefully enough to push those pages right to the top. And you might want to put explicit weights on other content/headings as required. A weight of 10 is ~100x stronger than 1 — but a weight of 10 is ~50,000x stronger than 0.05.

@bglw unfortunately, this still does not seem to help... I've wrapped the body with a ridiculously small weight, and then wrapped the synopsis with a slightly larger, but still ridiculously small weight:

<div
 id="main"
 data-pagefind-filter="category:documentation"
 data-pagefind-meta="category:Reference"
 data-pagefind-weight="0.000001"
 data-pagefind-body="">
[...]
<p
 data-pagefind-weight="0.000002">
git-<span data-pagefind-weight="10">commit</span> - Record changes to the repository<div></div>
</p>
[...]
</div>

However, searching for commit will always bring up git-verify-commit first, then git-get-tar-commit-id, then git-commit-graph, then git-commit-tree, and only as fifth hit: git-commit. It's almost as if having an exact match is punished.

Here is a copy of a Developer Tools Console session where I obtain the search results for commit, show the 5th hit's (git-commit's) weighted locations first, then the 1st hit's (git-verify-commit's), and then the scores of the first few hits:

» x = await Search.pagefind.debouncedSearch("commit")
←⏵ Object { results: (269) […], unfilteredResultCount: 269, filters: {}, totalFilters: {}, timings: (1) […] }
» JSON.stringify((await x.results[4].data()).weighted_locations.slice(0, 2), null, 2)
← '[
      {
        "weight": 10,
        "balanced_score": 57600,
        "location": 2
      },
      {
        "weight": 0.041666666666666664,
        "balanced_score": 1,
        "location": 11
      }
    ]'
» JSON.stringify((await x.results[0].data()).weighted_locations.slice(0, 2), null, 2)
← '[
      {
        "weight": 5,
        "balanced_score": 14400,
        "location": 2
      },
      {
        "weight": 0.041666666666666664,
        "balanced_score": 1,
        "location": 9
      }
    ]'  
» x.results.map(e => e.score)
←⏵Array(269) [ 8.579167, 1.2952586, 0.68904763, 0.6164191, 0.59042245, 0.22033088, 0.0050644567, 0.0034722222, 0.0026816516, 0.0024764733, … ] 

FWIW the numbers didn't change much when I changed the two "outer" weights: verify-commit always got a score over 8 and commit always just over 0.59. I fear that the culprit is that <span>verify-commit</span> just always gets a score around 14.5k, which combined with the density makes it always win over <span>commit</span>, even though the latter's score is around 57.5k (my guess is that the density of the git-commit page with a word count of 4079 is just so unfavorable compared to the git-verify-commit page with only 70 words, amirite?)

Help?

dscho commented 8 months ago

Another thing that would help would be if Pagefind exposed more controls on how it ranks content — for example you could opt out of the density ranking to help here.

I think having that option would be awesome!

dscho commented 8 months ago

Another thing that would help would be if Pagefind exposed more controls on how it ranks content — for example you could opt out of the density ranking to help here.

I think having that option would be awesome!

@bglw I tried my hand at it (documentation is still missing, I first want to make sure that this is the right direction): https://github.com/CloudCannon/pagefind/pull/534

dscho commented 8 months ago

I tried my hand at it (documentation is still missing, I first want to make sure that this is the right direction)

There's now documentation, and a test.
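
Assuming the option lands in roughly that shape, using it from the JS API could look something like this (a hedged sketch; the exact option names and values are assumptions, not the merged API):

// Hypothetical: dial down the influence of page length / term density
// so that exact matches on long manual pages are not out-ranked by
// short pages that merely mention the term often.
const pagefind = await import("/pagefind/pagefind.js");
await pagefind.options({ ranking: { pageLength: 0.1, termFrequency: 0.2 } });
const results = await pagefind.search("commit");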

dscho commented 2 months ago

why does the generated content need to be versioned in the first place?

Efficiency and reproducibility. The same reason why there is not only https://github.com/jnavila/git-manpages-l10n/ but also https://github.com/jnavila/git-html-l10n (hint: the latter contains assets generated from the former).

to me this sounds like a complete nightmare. not cleanly separating the sources from the generated content is a recipe for undesired "special effects" of all kinds. and that's atop of obvious issues of working with the repo itself.

@ossilator unnecessarily-harsh language notwithstanding, I do see a grain of truth in that statement. Cleanly separating non-generated from generated content is clearly desirable.

I have to admit that I dreaded the huge amount of work to get this done, and the time it took did not disappoint: it took a lot of effort.

The end result is worth it, though, I believe: All generated content is cleanly separated into the external/ directory tree, to be precise:

Both subdirectories fan out into:

Also, I added instructive "DO NOT EDIT" comments to the generated content, in an attempt to avoid confusing contributors into opening PRs that modify said generated content directly.

ossilator commented 2 months ago

why does the generated content need to be versioned in the first place?

Efficiency

granted, though there is no imperative to use the same repo for that.

and reproducibility.

that's backwards. one versions things that one cannot (easily) reproduce.

in particular, you should snapshot your relevant inputs (both data and build tools) as-is, in separate repositories (of whatever kind). note that if these inputs are externally versioned and have reproducible builds, then technically the snapshot can be quite minimal, e.g. a sha1. but for efficiency, you would probably cache their output artifacts that are your direct inputs.

[...] All generated content is cleanly separated into the external/ directory tree, to be precise: [...]

that's awesome. but to me, this looks like what should be just the first step towards splitting off the build artifacts.

the way i would approach it, things would be laid out this way:

this layout enables you to see what exactly caused the output to change, because each input is versioned separately. and it is friendly to both local builds and rebuilds, and to GH workflows (the main repo's workflow would merely trigger the one in the build artifact repo; this is easy).

dscho commented 2 months ago

why does the generated content need to be versioned in the first place?

Efficiency

granted, though there is no imperative to use the same repo for that.

No. But it is much more convenient to have the same repository for that, in particular in a project that typically sees one-off contributions. No need to make the barrier to entry any higher than absolutely necessary; I hope you do agree on that point at least.

and reproducibility.

that's backwards. one versions things that one cannot (easily) reproduce.

And that's exactly the case here. When so many different components are combined, it is hard to reproduce the exact same output. Take a different Hugo version, for example. Or an update in the AsciiDoctor tooling. Or a different Ruby version that has slightly different behavior. I hope you see how many things can result in even subtle differences, which is why the best way to let one-time contributors easily reproduce locally the exact same data shape as the actual homepage has is to cache these pre-rendered pages.

that's awesome. but to me, this looks like what should be just the first step towards splitting off the build artifacts.

I don't want the complexity of even more repositories. I find it unnecessary and counterproductive.

ossilator commented 2 months ago

But it is much more convenient to have the same repository for that,

not really. it optimizes one particular aspect, at the cost of others.

in particular in a project that typically sees one-off contributions. No need to make the barrier to entry any higher than absolutely necessary; I hope you do agree on that point at least.

i guess i'm atypical, but as a potential one-off contributor, i'd look, shake my head, and walk away if i saw a repo with 7 million lines of generated text.

When so many different components are combined, it is hard to reproduce the exact same output.

but your approach - unlike mine - doesn't make things more reproducible. it merely archives results from particular build setups, basically hoping that the next contributor won't have to rebuild anything.

fwiw, the proper way to snapshot the tooling/build environment would probably involve a docker image or something like that. having everything relevant in git repos would be a tad insane ...