git / git-scm.com

The git-scm.com website. Note that this repository is only for the website; issues with git itself should go to https://git-scm.com/community.
https://git-scm.com/
MIT License

Migrate git-scm.com to a static site, generated via Hugo, served via GitHub Pages #1804

Open dscho opened 11 months ago

dscho commented 11 months ago

Changes

This Pull Request adjusts the existing files such that the site is no longer served via a Rails App, but by GitHub Pages instead. A preview can be seen here: https://dscho.github.io/git-scm.com/ (which is generated and deployed from this Pull Request's branch, and will be updated via automation whenever that branch changes).

It is the culmination of a very long, and large, effort that started in February 2017 with the first attempt to migrate the site to Jekyll. Several years, and a substantial effort by @spraints, @vdye and myself, later, here is the result: No longer a Jekyll site but a Hugo site (because of render times: 20 minutes vs 30 seconds), search implemented using Pagefind, links verified by Lychee.

The main themes of the subsequent migration from the Rails App to a Hugo-generated static site are:

Context

Changes required to finalize the migration in addition to this Pull Request

Why make these changes?

spraints commented 11 months ago

:tada: This is great! Thank you so much for picking this up! The demo site looks great!

bglw commented 11 months ago

👋 Sneaking in here with some thoughts from the search side!

On first interactions, the search has some notable issues compared to the production rails search, for a few reasons on both sides of the fence.

  1. All tagged releases are indexed, so a search for rebase returns /docs/git-rebase/ and /docs/git-rebase/2.41.0/ and /docs/git-rebase/2.23.0/ and ...
    • The best fix here would be for you to omit the data-pagefind-body attribute from the numbered release pages, so that only /docs/git-rebase/ is indexed and returned
  2. Titles definitely need stronger affinity here. A search for list on the rails site returns rev-list-description, git-rev-list, and rev-list-options as the top results. Pagefind's search is significantly more varied, with a lot of results for mailing lists and related items.
    • CloudCannon/pagefind#437 is relevant and discussing much the same thing.
    • I don't have an immediate solution for this but I would love to find one.
  3. Typing rebase into the live search and hitting enter does not show the rebasing book result. Typing the query in does.
    • This helped narrow down a bug — filed as CloudCannon/pagefind#478
  4. The rails site live search has a nice Reference / Book split that would be great to recreate with filters, if possible.

(Amazing work migrating this to Hugo! ❤️)

dscho commented 11 months ago

Oh wow, Mr Pagefind himself! I'm honored to meet you, @bglw!

  • The best fix here would be for you to omit the data-pagefind-body attribute from the numbered release pages, so that only /docs/git-rebase/ is indexed and returned

I kind of wanted to be able to find stuff in old versions that is no longer present in current versions. That's why I added https://github.com/dscho/git-scm.com/commit/e9fa9630417b075b4a136518ea4dfbc7a1e884f4.

  • Titles definitely need stronger affinity here. A search for list on the rails site returns rev-list-description, git-rev-list, and rev-list-options as the top results. Pagefind's search is significantly more varied, with a lot of results for mailing lists and related items.

Excellent!

Heh, thank you for that!

  • The rails site live search has a nice Reference / Book split that would be great to recreate with filters, if possible.

Right, I had not worked on that because I hoped that the sorting by relevance would be "good enough"...

rimrul commented 11 months ago

About Heroku

That is true, but there has been an update since that 2022 mail.

https://lore.kernel.org/git/ZRHTWaPthX%2FTETJz@nand.local/

Heroku has a new (?) program for giving credits to open-source projects. The details are below:

https://www.heroku.com/open-source-credit-program

I applied on behalf of the Git project on 2023-09-25, and will follow-up on the list if/when we hear back from them.

It does seem like the PLC is still in favor of moving to a static solution, though.

https://lore.kernel.org/git/ZRrfAdX0eNutTSOy@nand.local/

  • Biggest expense is Heroku - Fusion has been covering the bill
  • Dan Moore from FusionAuth has been providing donations
  • Ideally we are able to move away from using Heroku, but in the meantime we'll have coverage either from (a) FusionAuth, or (b) Heroku's new open-source credit system

About the preview:

Search

All tagged releases are indexed, so a search for rebase returns /docs/git-rebase/ and /docs/git-rebase/2.41.0/ and /docs/git-rebase/2.23.0/ and ...

That is true. And in both the search results page and the little preview (<div id="search-results">) it's not visually obvious which result is the current version and which results are older versions. Maybe that could be improved by adding the version number to the page title for non-current versions? Or maybe a filter in the search results to exclude historical documentation? If we don't want to mangle the titles, Pagefind would show the version number below the result if we configured it as metadata.
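
A minimal sketch of that metadata idea (a hypothetical layout snippet; the version parameter name is an assumption, not the actual template):

{{/* hypothetical: emitted only in the layout for the older, versioned manual pages */}}
<h1 data-pagefind-meta="version:{{ .Params.version }}">{{ .Title }}</h1>

With an attribute like that on the older pages only, Pagefind records a version metadata entry for them, so the version could be shown with the result without mangling the visible title.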

Minor issues

There are some broken links in the preview on https://dscho.github.io/git-scm.com/docs/ that lead to https://dscho.github.io/docs/

There's a broken link on https://dscho.github.io/git-scm.com/about/free-and-open-source/ to https://dscho.github.io/git-scm.com/trademark. On the live site that redirects from https://git-scm.com/trademark to https://git-scm.com/about/trademark (https://github.com/dscho/git-scm.com/pull/1)

The "Setup and Config" headline on https://dscho.github.io/git-scm.com/docs/ is blue in the preview, but not in the live site. This is not happening for me in local testing.

There's some redirect that swallows anchors. https://dscho.github.io/git-scm.com/docs/ links to https://dscho.github.io/git-scm.com/docs/git#_git_commands, which redirects to https://dscho.github.io/git-scm.com/docs/git/ instead of https://dscho.github.io/git-scm.com/docs/git/#_git_commands. Looks like the slash-free version isn't possible with the GitHub Pages/Hugo combination (https://github.com/gohugoio/hugo/issues/492). We should update these links to contain the slash from the beginning to avoid the redirect. (https://github.com/dscho/git-scm.com/pull/3)

https://dscho.github.io/git-scm.com/downloads/mac/ has an odd grammar issue that https://git-scm.com/download/mac doesn't. (https://github.com/dscho/git-scm.com/pull/2) It says

which was released about 2 year, on 2021-08-30.

https://git-scm.com/download/mac correctly says

which was released about 2 years ago, on 2021-08-30.

Also note the slight URL change there from download to downloads. There is a redirect for that, though, so that should be fine.

rimrul commented 11 months ago

One additional note: There is a commit about porting the old 404 page, 18a3ac2, but I've only seen the generic GitHub pages 404 page on the preview in my testing.

rimrul commented 11 months ago

Switching to pagefind also changed search behaviour in another way.

The rails site always searches the English content. Pagefind defaults to what they call multilingual search, i.e. searching only pages in the same language as the one you're searching from. That's theoretically a usability improvement, but with the partial nature of our non-English content (availability of any given language can vary from man page to man page, the book exists in languages that don't have any man pages, everything else only exists in English), we might need a fallback to English here. Pagefind offers an option to force all pages to be indexed as English, but I think we can slightly abuse mergeIndex with language set to en for a better result.
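
A rough sketch of that mergeIndex idea using the Pagefind JS API (untested; the bundle path is an assumption):

// Keep the default behaviour (index matching the current page's language),
// then merge the English index from the same bundle so English-only pages
// act as a fallback for partially translated languages.
const pagefind = await import("/pagefind/pagefind.js");
await pagefind.mergeIndex("/pagefind", { language: "en" });
const search = await pagefind.search("rebase");
const first = await search.results[0]?.data();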

dscho commented 10 months ago

The "Setup and Config" headline on https://dscho.github.io/git-scm.com/docs/ is blue in the preview, but not in the live site. This is not happening for me in local testing.

I managed to fix it via 2d0f6c80293192f7882914e7f6a683c60afe3159

dscho commented 10 months ago

All tagged releases are indexed, so a search for rebase returns /docs/git-rebase/ and /docs/git-rebase/2.41.0/ and /docs/git-rebase/2.23.0/ and ...

That is true. And in both the search results page as well as the little preview (<div id="search-results">) it's not visually obvious which result is the current version and which results are older versions.

Hmm. The more I think about it, the more I get convinced that the older versions of the manual pages should be excluded from the search. I thought indexing them was a feature, but it looks as if it incurs more downsides than upsides.
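
A minimal sketch of what that exclusion could look like in the layout for the manual pages, following bglw's suggestion above (the latest flag is a hypothetical page parameter, not the actual front matter):

{{/* Only the current, un-versioned manual page gets data-pagefind-body;
     once any page carries the attribute, Pagefind skips pages without it. */}}
<div id="main" {{ if .Params.latest }}data-pagefind-body{{ end }}>
  {{ .Content }}
</div>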

pedrorijo91 commented 10 months ago

this was a major effort @dscho, thank you very much! sorry for the silence, but i've been busy with other stuff. in the meantime, and to ensure this effort won't be wasted, can you summarize what you need to make this merge-ready?

what do you still need to tackle? where do you need help from other people? :)

dscho commented 10 months ago

can you summarize what do you need to make this merge-ready?

@pedrorijo91 Yes.

The big blocker is the "live search" one.

dscho commented 10 months ago

Oh, and there's a ton of work still needed to address @rimrul's excellent feedback.

dscho commented 10 months ago
  • general QA:

    • ensure that current URLs would work after migration

      • e.g. /about#branching-and-merging, /about#staging-area etc

@pedrorijo91 TBH I would love to have help with that.

dscho commented 10 months ago
  • ensure that current URLs would work after migration

    • e.g. /about#branching-and-merging, /about#staging-area etc

@pedrorijo91 TBH I would love to have help with that.

I just realized that https://git-scm.com/about#branching-and-merging does not actually redirect to https://git-scm.com/about/branching-and-merging... so I guess this is a non-issue.

dscho commented 10 months ago

Typing rebase into the live search and hitting enter does not show the rebasing book result. Typing the query in does.

@bglw I just tested this at https://dscho.github.io/git-scm.com/ and it seems to work as expected. Thank you!

  • The rails site live search has a nice Reference / Book split that would be great to recreate with filters, if possible.

Right, I had not worked on that because I hoped that the sorting by relevance would be "good enough"...

I worked on this (7142149b5, ddbbe381c and 08183b0b0) and it seems to work now. Could you please test?
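
For context, the general shape of such a filter-based split (a hedged sketch, not the code from those commits): pages get tagged with a data-pagefind-filter attribute in the layouts, and the live search then queries each category separately.

const term = "rebase";
// Hypothetical: assumes pages carry data-pagefind-filter="category:Reference"
// or "category:Book" (the real filter values may differ).
const reference = await Search.pagefind.search(term, { filters: { category: "Reference" } });
const book = await Search.pagefind.search(term, { filters: { category: "Book" } });
// reference.results and book.results can then be rendered as two separate sections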

dscho commented 10 months ago

@pedrorijo91 I believe that this is now ready for wider testing. Do you have any objections against me pushing this to gh-pages and enabling the Actions to deploy to https://git.github.io/git-scm.com/?

pedrorijo91 commented 10 months ago

i agree that's likely the best way to test the new website @dscho . kinda impossible to review this huge diff manually :D

dscho commented 10 months ago

@bglw wow, the innocuous release notes item "Fixed a bug, resulting in a (very) large improvement to the NodeJS Indexing API performance (~100x)." seems to have a profound impact. While it is definitely not a scientific experiment (read: take the numbers with a grain of salt), the latest run with Pagefind v1.0.3 took 148.456s and the first run with Pagefind v1.0.4 took only 106.626s. Well done!

dscho commented 10 months ago

i agree that's likely the best way to test the new website @dscho .

Thank you @pedrorijo91. It's live! https://git.github.io/git-scm.com/

kinda impossible to review this huge diff manually :D

Right, I should have clarified that the majority of the diff is in the generated pages that do not actually need to be reviewed because they come from external sources where they are reviewed already. For example, content/book/ and content/docs/ contain only one non-generated file: content/docs/_index.html. You can see that in the tree of the commit before all the generated pages were added by automated GitHub workflow runs: https://github.com/git/git-scm.com/tree/ef17ce6ee91e30aba30e37478104b4384d9142ea/content

bglw commented 10 months ago

the latest run with Pagefind v1.0.3 took 148.456s and the first run with Pagefind v1.0.4 took only 106.626s.

Interesting! That bug fix should only be affecting this NodeJS API — not npx usage — so it's either just an outlier run, or something else in this release has an outsized performance impact 🤔 In either case, glad to hear it's running a bit faster 😅

(edit: I think you just landed a much faster machine — the Hugo build time also dropped from 24s in your first link, to 16s in the second)

dscho commented 10 months ago

the latest run with Pagefind v1.0.3 took 148.456s and the first run with Pagefind v1.0.4 took only 106.626s.

Interesting! That bug fix should only be affecting this NodeJS API — not npx usage — so it's either just an outlier run, or something else in this release has an outsized performance impact 🤔 In either case, glad to hear it's running a bit faster 😅

Huh. So it might actually be a fluke. I just thought that npx, being a node.js way to generate the search index, would internally use the node.js API ;-)

(edit: I think you just landed a much faster machine — the Hugo build time also dropped from 24s in your first link, to 16s in the second)

Possible. I experienced something like that recently in a different context, where subtle differences between the large macOS runners and the non-large ones caused git/git CI to fail (because Python2 was on the PATH in the non-large runners, but hidden in the large ones). So it's quite possible. Unfortunately, I do not see any breadcrumb in the logs to confirm or deny that the job is running on a large runner...

rybak commented 10 months ago

Bug report (has been fixed)

HTML entities are rendered verbatim in version dropdowns in documentation reference. Fixed now.

Bug report contents

Steps to reproduce

1. Go to a manual page in the documentation reference
2. Click on the dropdown "Version 2.40.0 ▾"
3. Observe the dropdown entry between the items for versions 2.40.0 and 2.39.0

Actual result

The dropdown has small text in italics on the right: `2.39.1 &rarr; 2.39.3 no changes`

Expected result

The dropdown has small text in italics on the right: `2.39.1 → 2.39.3 no changes`

ossilator commented 10 months ago

it strikes me as a bad idea to commit all the generated content to the main repo. how about a submodule?

rimrul commented 10 months ago

HTML entities are rendered verbatim in version dropdowns in documentation reference.

With some quick testing in my browser's dev tools, it seems like changing this <span> to a <div> would fix this.

https://github.com/dscho/git-scm.com/blob/475b6f1039839b790e5a9f40dda45ca431e4f9e8/layouts/partials/ref/versions.html#L31

dscho commented 10 months ago

HTML entities are rendered verbatim in version dropdowns in documentation reference.

With some quick testing in my browser's dev tools, it seems like changing this <span> to a <div> would fix this.

https://github.com/dscho/git-scm.com/blob/475b6f1039839b790e5a9f40dda45ca431e4f9e8/layouts/partials/ref/versions.html#L31

Actually, that does not fix it for me, but this here diff does:

diff --git a/layouts/partials/ref/versions.html b/layouts/partials/ref/versions.html
index 3eca7a4c5..6ca23230f 100644
--- a/layouts/partials/ref/versions.html
+++ b/layouts/partials/ref/versions.html
@@ -28,7 +28,7 @@
         </a>
         </li>
       {{ else }}
-        <li class="no-change"><span>{{ $v.name }} no changes</span></li>
+        <li class="no-change"><span>{{ safeHTML $v.name }} no changes</span></li>
       {{ end }}
     {{ end }}
     <li>&nbsp;</li>

I'll commit it as a fixup for 0501ad1ad1b94f70821ab79fcfd0365ab2e5b3ae.

dscho commented 10 months ago

HTML entities are rendered verbatim in version dropdowns in documentation reference.

@rybak thank you for the detailed bug report!

it strikes me as a bad idea to commit all the generated content to the main repo. how about a submodule?

@ossilator Friends don't let friends use submodules.

Seriously speaking, though, it won't work with submodules because those pages need to be generated in GitHub workflows with write access to the repository (so that the changes can be pushed), which is not possible (or at least not in a way that makes it easy to contribute) in a workflow that is defined in a different repository. Besides, the generated files need to live in subdirectories of content/ that are not always completely generated. For example, content/docs/_index.html is not generated. And hugo.yml is partially re-generated (download data and Git version), so that generated data has to be in the same repository.

In addition to making it harder to contribute, submodules would also make the deployment to GitHub Pages more fragile because of the need to clone multiple repositories instead of just one.

No, I fear that the submodules idea is actually the bad idea, not the one to commit generated files in well-defined places ;-)

dscho commented 10 months ago

HTML entities are rendered verbatim in version dropdowns in documentation reference.

With some quick testing in my browser's dev tools, it seems like changing this <span> to a <div> would fix this. https://github.com/dscho/git-scm.com/blob/475b6f1039839b790e5a9f40dda45ca431e4f9e8/layouts/partials/ref/versions.html#L31

Actually, that does not fix it for me, but this here diff does:

diff --git a/layouts/partials/ref/versions.html b/layouts/partials/ref/versions.html
index 3eca7a4c5..6ca23230f 100644
--- a/layouts/partials/ref/versions.html
+++ b/layouts/partials/ref/versions.html
@@ -28,7 +28,7 @@
         </a>
         </li>
       {{ else }}
-        <li class="no-change"><span>{{ $v.name }} no changes</span></li>
+        <li class="no-change"><span>{{ safeHTML $v.name }} no changes</span></li>
       {{ end }}
     {{ end }}
     <li>&nbsp;</li>

I'll commit it as a fixup for 0501ad1.

This is now fixed (in 367254d3a) and deployed. Thank you @rybak!

ossilator commented 10 months ago

Besides, the generated files need to live in subdirectories of content/ that are not always completely generated. For example, content/docs/_index.html is not generated. And hugo.yml is partially re-generated (download data and Git version), so that generated data has to be in the same repository.

to me this sounds like a complete nightmare. not cleanly separating the sources from the generated content is a recipe for undesired "special effects" of all kinds. and that's on top of the obvious issues of working with the repo itself.

why does the generated content need to be versioned in the first place? can't github just serve the build artifacts? as far as i can tell, you just need a simple configuration management system.

dscho commented 10 months ago

@ossilator I appreciate that you think about these issues.

But generating everything from scratch every time, that would be hell twice over. That's a nightmare. Too many things that could go wrong and testing locally would be another nightmare on top.

And your suggestion to use submodules actually gave me the creeps. I've been using submodules in the past and there are many good reasons why I don't do that anymore. I know many, many engineers with the same learning trajectory.

And honestly, I definitely do not understand why you're so averse to committing generated content. It makes so many things much easier, from easily being on the same page when two contributors are looking at the same generated page, to testing locally, to link checking, to running this in a GitHub workflow after a new Git version was released and expecting updates as quickly as possible.

Merely looking at how long it takes to re-generate all of the manual pages makes re-generating them every time a total non-starter. That would add over 10 minutes to every single deployment, for work that really only needs to be done once!

No, in this instance, by committing what has been generated by automation that can be trusted and verified, you basically know at all times what you've got; there are no hidden surprises. You know from which progit2/progit2 commit this and that file was generated, and you can verify that it was generated correctly by re-running the script and calling git diff.

So from a practical point of view, if you want to accept this from a person who has worked on this project for over a year and hence has gained a lot of experience in this space (i.e. me), committing the generated content in a well-defined way, to well-defined locations within the same repository, is making everything a lot less painful than it would otherwise be.

And if you're still not convinced, I would love to be presented with hard evidence (read: not just talk) that stands a chance of convincing me that your suggestion should be preferred over the current proposal.

vdye commented 10 months ago

@ossilator I appreciate that you think about these issues.

But generating everything from scratch every time, that would be hell twice over. That's a nightmare. Too many things that could go wrong and testing locally would be another nightmare on top.

And your suggestion to use submodules actually gave me the creeps. I've been using submodules in the past and there are many good reasons why I don't do that anymore. I know many, many engineers with the same learning trajectory.

I know you're passionate about this topic, but there's no need for hostility (re: "gave me the creeps"). The suggestion of submodules seemed more the result of some initial brainstorming w.r.t avoiding the storage of generated files in the repo. It's a starting point for a conversation, not a firm design proposal.

And honestly, I definitely do not understand why you're so averse to committing generated content. It makes so many things much easier, from easily being on the same page when two contributors are looking at the same generated page, to testing locally, to link checking, to running this in a GitHub workflow after a new Git version was released and expecting updates as quickly as possible.

As someone that maintained a repository in the past with a similar concentration of generated files, there are a number of things it can make harder as well:

There are probably more specific issues I'm not remembering but, overall, I can say that maintaining a ton of generated files was indeed a nightmare for myself and other developers. The only reason I didn't jettison them when I was maintainer is that I never got the time to update the tooling accordingly.

All that said, you've made some valid points as to why we should store generated files. So IMO the decision of whether or not to commit generated files is fairly nuanced, and warrants discussion & possibly further investigation before settling on an approach.

Merely looking at how long it takes to re-generate all of the manual pages makes re-generating them every time a total non-starter. That would add over 10 minutes to every single deployment, for work that really only needs to be done once!

10 minutes doesn't seem too bad, to be honest. But one way to avoid that while still keeping generated files out of the repo could be to store them as artifacts (e.g., a tarball of the generated files) tied to a given commit hash in the artifact storage of your choice, then use that as a sort of "pre-build" of the repository.

No, in this instance, by committing what has been generated by automation that can be trusted and verified, you basically know at all times what you've got; there are no hidden surprises. You know from which progit2/progit2 commit this and that file was generated, and you can verify that it was generated correctly by re-running the script and calling git diff.

So from a practical point of view, if you want to accept this from a person who has worked on this project for over a year and hence has gained a lot of experience in this space (i.e. me), committing the generated content in a well-defined way, to well-defined locations within the same repository, is making everything a lot less painful than it would otherwise be.

As someone that has also worked on this project (albeit not as extensively), I'm not convinced that committing generated content is the right way to go. Personal experience is valuable in informing your opinions, but it is not on its own a justification for the correctness of your approach, and it's definitely not cause to dismiss @ossilator's (or anyone else's) concerns out of hand.

And if you're still not convinced, I would love to be presented with hard evidence (read: not just talk) that stands a chance of convincing me that your suggestion should be preferred over the current proposal.

It's generally the job of the person developing a change to convince reviewers to accept that change, not the other way around. Reviewers can certainly help that process by providing technical justification when they disagree with an approach, but it's nevertheless important to understand & address concerns so that we reach a consensus based on technical merit. After all, wouldn't it be better for everyone if alternatives are thoroughly explored? If they don't end up better than what you have now, at least everyone will understand why we settled on a given approach. And if it is better than what you have now, then we end up with...something better!

I know that kind of exploration takes time, and you've already put a lot of time into this, so what I'm asking is probably more frustrating than not. But with such a massive change to such a valuable resource, it's critical that concerns are thoroughly addressed before moving forward on merging/deploying.

dscho commented 10 months ago

How would we make it easy to work with artifacts attached to commits, especially on PR branches?

I really like the simplicity of pushing to my fork and having a deployed site after the workflow run is done. Minimal surface for network issues because only one repository is checked out. And I can't think of any way to make it as simple without committing the generated files, I'm sorry.

dscho commented 10 months ago

I should also mention that I relied heavily on sparse checkouts (non-cone mode) to develop these changes. That gave me a very small section of the generated files to work with, accelerating the hugo/check/modify cycle. Also something I can't see being as convenient in any other setup.

I did think about separating better between generated vs non-generated files, via url entries in the front matter and then having a content/generated/ cone. It conflicted with my mental model, though.

As I rebased literally hundreds of times, submodules would have made my life so much worse, so I thanked myself for dismissing that idea (also because, as stated earlier, it would make automation more complex and fragile, not to mention development in PR branches).

Besides, the Rails App kept the generated files in a database. We're simply doing the same here, using Git as the database.

dscho commented 10 months ago

I definitely do not understand why you're so averse to committing generated content. It makes so many things much easier, from easily being on the same page when two contributors are looking at the same generated page, to testing locally, to link checking, to running this in a GitHub workflow after a new Git version was released and expecting updates as quickly as possible.

As someone that maintained a repository in the past with a similar concentration of generated files, there are a number of things it can make harder as well:

  • It's difficult to enforce "don't update the generated files" (even with "DON'T UPDATE THIS FILE" banners, READMEs, etc.), so when people inevitably try it, it leads to more time spent going back-and-forth on pull requests.

That's a valid point, and it is a social "problem" that I think requires social solutions: communication is key. If a contributor updates generated files, a kind redirect to the sources of those generated files is required (read: a friendly reply by a reviewer).

  • It's not necessarily straightforward (esp. for new contributors) to figure out which file(s) need to be changed to update something they see in a given generated file (is it the main body of the file? header or footer? etc.).

That is true.

Compared to the current situation in the Rails App, though, I would assume that everyone can agree that the process proposed in this here PR is a net improvement:

  • Changes in the generation process can result in massive diffs across the repository that don't add any real value.

I would disagree with the assessment about the value, given that I very frequently cross-validated fixes in script/*.rb by studying the diff of the generated files, something that would have been substantially harder with the current Rails App.

  • The generated files are usually not updated in the commit that prompted the change, making debugging/bisecting more difficult.

I can see your point about that.

You would be able to bisect changes in the generated output, but in the end you would be pointed to the commit that persists those changes; you would then have to find out which preceding commit changed the generated files before that, and you would end up with a commit range of potential root causes for those changes.

Or even worse, the source of the change might be outside of this repository, e.g. when a typo in a translated manual page was fixed (naturally, that would happen in https://github.com/jnavila/git-html-l10n instead of https://github.com/git/git-scm.com).

Yet again I want to point out, though, that committing the generated files is a net improvement: With the Rails App you may feel gas-lit about a problem you saw yesterday but that is no longer present today, with no record that the problem was there before and has been fixed. At least now we have a public record.

  • Subjectively, I don't see generated files as any different from other build artifacts (e.g. binaries), and I consider "storing build artifacts alongside source code" generally bad practice (muddles the concept of "source of truth").

While I agree in general with the paradigm to separate clearly between concerns (in this case, "source of truth" and "user-visible content"), I want to offer the pragmatic insight that in this scenario we basically add a caching layer.

Consider what the Rails App does with storing the generated content in the database: You could change it such that every time a user asks for, say, https://git-scm.com/docs/git-add, the App would determine the latest version from https://api.github.com/repos/git/git/tags (cannot use https://api.github.com/repos/git/git/releases/latest because Git does not publish releases, only tags), then pull the source from https://github.com/git/git/blob/$version/Documentation/git-add.txt, render it via AsciiDoctor (diligently retrieving the included files from the repository, too), and then, without ever caching, deliver this to the user. This would be arguably "more correct" than what the Rails App does right now. However, you will certainly agree that this would not only be slow due to an abundance of network handshakes, but would add a lot of fragility to the process, as the likelihood of network problems causing issues rises with the square of the number of network requests.

Now, you could argue that the caching layer shouldn't be the Git repository. But it is a database. And it is the most easily accessible caching layer in the context of this repository and a static website.

There are probably more specific issues I'm not remembering but, overall, I can say that maintaining a ton of generated files was indeed a nightmare for myself and other developers. The only reason I didn't jettison them when I was maintainer is that I never got the time to update the tooling accordingly.

All that said, you've made some valid points as to why we should store generated files. So IMO the decision of whether or not to commit generated files is fairly nuanced, and warrants discussion & possibly further investigation before settling on an approach.

Well, I can only reiterate that my work on this branch would have been substantially harder if the generated files had not been committed. From the prohibitive amount of time to generate and re-generate content, to avoiding network issues in the GitHub workflow runs, to being able to work on a sparse checkout to focus on the Hugo-related processing of a very small part of the generated content, to being able to verify fixes by looking at the diff (or better put, by letting sometimes elaborate commands do the verification for me; I cannot count the times I ran git diff @{1} -- ':(exclude)_gen*' ':(exclude)content/book' ':(exclude)content/docs' ':(exclude)data/docs.yml' ':(exclude)data/book-*' ':(exclude)_sync*' ':(exclude)static/book', for example, to verify that only the expected files were updated by a particular script invocation), to staving off unwelcome surprises stemming from changes in the progit* repositories between workflow runs, I cannot stress enough how much the simplicity and reliability of committing the generated content has helped me out.

Merely looking at how long it takes to re-generate all of the manual pages makes re-generating them every time a total non-starter. That would add over 10 minutes to every single deployment, for work that really only needs to be done once!

10 minutes doesn't seem too bad, to be honest.

10 minutes may not seem so bad, but it would be tripling the wall-clock time of each deployment.

Not to mention the overall build minutes we would essentially waste: keep in mind that we are building the ProGit book for 30 languages. Sure, we could parallelize that in a matrix job (as I have done in update-book.yml). But do keep in mind that this adds to the fragility of the workflow runs, and that the free plan allows for "only" 20 concurrent jobs, adding to the wall-clock time.

Besides, it is 100% against my values to waste build minutes, even if we do not have to pay for them. The planet is burning, and I want no part in contributing to that, not even "small" things like re-running jobs many times that could have run only once if only their output had been persisted.

But one way to avoid that while still keeping generated files out of the repo could be to store them as artifacts (e.g., a tarball of the generated files) tied to a given commit hash in the artifact storage of your choice, then use that as a sort of "pre-build" of the repository.

I actually had thought about that, and even had thought about using Git LFS.

The idea of attaching the generated content to a certain commit was enticing at first, yet to me it appears not viable:

I also dismissed the idea of using Git LFS after realizing that we're not talking about a few big files, but about many, many small files.

I know that kind of exploration takes time, and you've already put a lot of time into this, so what I'm asking is probably more frustrating than not. But with such a massive change to such a valuable resource, it's critical that concerns are thoroughly addressed before moving forward on merging/deploying.

I apologize for letting my frustration show with the suggestion to use submodules.

When I saw a mere one-liner of a suggestion to use submodules, I immediately thought that such a proposal should have been backed up by a lot more consideration, and certainly be accompanied by some sort of discussion demonstrating that the suggester thought at least a couple of minutes about the ramifications and be presented in good faith, i.e. with concrete upsides in mind. But that's just an explanation for my response, not an excuse. Again, I sincerely apologize.

Had I been a bit more level-headed, I would have realized that there are many people who haven't (yet) been exposed to really bad experiences with submodules; not everybody has to deal with the kinds of problems I, in my role as Git for Windows maintainer, am regularly exposed to. And I agree that the concept of submodules seems quite elegant, on paper. At least that was my initial thought when they were introduced into Git.

Sadly, very sadly, my initial assessment had no chance of surviving. And it is definitely not just personal experience. Yes, I used submodules extensively in 2007, that is true, and I ran into so many problems (with rebasing, the inadequacy of git status with submodules, the complete nightmare of stacked git commit calls, the fragility of submodules' commits being lost forever due to forced pushes, the joy of switching branches in particular with uncommitted changes in submodules, just to name a few, and those challenges are as unresolved today as they have been back then) that I saw myself forced to go through a painful process of replacing these submodules.

But that is nothing, really nothing compared to the experience I am exposed to by virtue of assisting users, including enterprise customers, with Git issues.

The same issues and challenges and problems are reported to me over and over and over again. In addition to the ones I experienced personally, there are oh so many issues with collaboration: The lack of a streamlined submodule experience in Git seemingly causes an endless stream of frustrations where users forget to commit in submodules, or in superprojects, forget to push submodules, or push them to repositories collaborators cannot access, or cannot push because they have write permission only for a fraction of the relevant repositories, where friction is caused by git clone being completely happy to default to non-recursive clones while the build processes totally break down except with full recursive clones, where elaborate build processes have to be built that work around the discrepancies caused by Git working non-recursively by default for pretty much all operations and those complexities naturally inviting their own set of subtle bugs.

In my experience, the only people suggesting to use submodules are either people who were forced to work with submodules for such a long time that they built muscle-memory, or tooling, or both, to work around (and probably even forget about) submodules' shortcomings, or (much more often) people who do not use submodules themselves.

There are really good reasons why monorepos are not only a thing, but why you read many, many reports where serious Git users switched from using submodules to using monorepos instead, and the accounts to do the reverse are few and far between. So: don't take this from me, take this from the majority of Git power users.

Looking concretely at the git-scm.com requirements, I see many similarities to the challenges I mentioned above. Elaborate build processes would be needed to ensure integrity (you don't want deployments to use the wrong version of generated content). That integrity would have to be broken for local development (you want to use just-generated content and not be forced back to the submodules' HEAD). Diffing between generated output would be non-trivial. Determining provenance of given content would involve not only two repositories (for example, docs/git-add/fr involves jnavila/git-html-l10n and git/git-scm.com), but with submodules there would be another one. If PRs trying to change a generated file are a challenge when the generated files are committed into the same repository, PRs trying to change a file in a submodule containing only generated files are an even bigger challenge. It would be way too easy to work with stale submodules by mistake. You get the picture.

So hopefully it is clear now why I am opposed to using submodules, and hopefully I presented the evidence well to back that stance up. There are really, really good reasons not to use submodules in this context. And I do not see any good reason in favor of using submodules. If anybody sees one good reason, please offer me the chance to refute it with evidence.

dscho commented 10 months ago

I cannot count the times I ran git diff @{1} -- ':(exclude)_gen*' ':(exclude)content/book' ':(exclude)content/docs' ':(exclude)data/docs.yml' ':(exclude)data/book-*' ':(exclude)_sync*' ':(exclude)static/book'

Hmm. I just had an idea, even if I do not know how practical it is: would it help y'all if we were to introduce a generated/ directory and put all of the above (i.e. generated book and manual pages, their YAML data, sync state, cached expanded AsciiDoc and images) in there?

Caveat: I do not know enough Hugo internals yet to assess whether that is even possible. But I'd like to know if that would alleviate some of the concerns y'all had before even looking into it.

ossilator commented 10 months ago

the qt project uses submodules, and from what i've experienced, it's not as painful as you suggest. esp. for an "artifact cache" repo with a one-way relationship to the main repo it shouldn't be that bad. also, as @vdye correctly assumed, you don't need to take the "submodule" suggestion in the literal git sense. any separate storage will do, including a git repo that isn't actually a submodule.

anyway, i find it somewhat hard to believe that you can't come up with a sensible build versioning scheme with easy diffing. what you need sounds like a bog-standard CI/CD workflow, and i'd be a tad surprised if github didn't support it adequately.

note that artifact persistence/caching doesn't have to be the same mechanism as the build snapshotting mechanism (which would serve as the basis for versioning and diffing) - you can have a content-addressable cache from which files are hard-/ref-linked into the output tree, just like ccache works.

always starting out with a clean output tree is a huge boon for reproducibility, and therefore bisect.

dscho commented 9 months ago

i find it somewhat hard to believe that you can't come up with a sensible build versioning scheme with easy diffing.

Well, I did. You're looking at it.

ossilator commented 9 months ago

Well, I did. You're looking at it.

one that also doesn't violate the "don't put build artifacts in the source repo" principle ...

dscho commented 9 months ago

Well, I did. You're looking at it.

one that also doesn't violate the "don't put build artifacts in the source repo" principle ...

In the context of GitHub Pages, which literally suggests committing and pushing the verbatim .html pages to serve, insisting on that principle would not seem to be a particularly tenable position to hold.

I could see the point of a much more pragmatic suggestion to use module mounts so that files generated by a particular script, or files generated from a particular repository, could be put into, say, generated/<name>/ and mapped back into data, content, static etc. I did verify that that would work for multiple data directories, i.e. something like this

module:
  mounts:
  - source: data
    target: data
  - source: generated/data
    target: data

would allow docs.yml to live in generated/data/ and still be accessed via $.Site.Data.docs.<whatever>. Somewhat surprisingly, this configuration follows the "first one wins" approach: if there is a docs.yml in data/ as well as in generated/data/, and both files contain the same key albeit with different values, with the above configuration Hugo would resolve to the value specified in data/docs.yml and ignore that generated/data/docs.yml defines a different value.

However, this solution is somewhat in search of a problem, and it is not particularly simple, either. Kind of a complicator's glove: a solution for a perceived problem that requires its own set of follow-up changes that introduce their own set of problems. For example, assuming that one would want to have different cones for, say, each ProGit book translation, the module: section would become quite long (and error-prone!) as there would have to be at least three entries per translation: data, content and static would need to be mounted for certain. Combined with the added complexity the module mounts would require in the scripts, I frankly do not see any benefit in pursuing that route any further.

dscho commented 9 months ago

@bglw I received a report that the search results are somewhat counter-intuitive, and I think the reason is the somewhat unique way Git's manual pages present the name of the command they are describing: Instead of having the command name in a header, the description "NAME" is in a <h2>, and that header is then followed by a paragraph that is in a <div> contained in another <div>, and that paragraph contains not only the name of the command, but also a very short description.

For example, the beginning of the original .html generated for git log's manual page looks like this (indentation added for clarity):

<div class="sect1">
 <h2 id="_name"><a class="anchor" href="#_name"></a>NAME</h2>
 <div class="sectionbody">
  <div class="paragraph">
   <p>git-log - Show commit logs</p>
  </div>
 </div>
</div>

The way I tried to help Pagefind (fa3f045b7b5c6a23a8c25bc0bdfac03a671e85d0) to report that manual page as first match for the search term "log" turns that into the following, via Hugo:

<div class="sect1">
 <h2 id="_name"><a class="anchor" href="#_name"></a>NAME</h2>
 <div class="sectionbody">
  <div class="paragraph">
   <p data-pagefind-weight="8">git-<span data-pagefind-weight="10">log</span> - Show commit logs</p>
  </div>
 </div>
</div>

However, that does not seem to accomplish what I want it to accomplish: I thought giving the term "log" a weight of 10 would force it to be the top hit, but it is not even among the first 10 hits:

(screenshot of the search results omitted)

Help?

dscho commented 9 months ago

I thought giving the term "log" a weight of 10 would force it to be the top hit, but it is not even among the first 10 hits

In the Developer Tools' Javascript console on the page https://dscho.github.io/git-scm.com/docs/git-log, when I run x = await Search.pagefind.debouncedSearch("log"), I get /downloads/logos.html as the first result, and git-log.html only as the fifteenth hit. Here is what x.results[0] looks like:

{ id: "en_6ca1bcb", score: 38.458725, words: (9) […], data: async data() }

Here is what x.results[14] looks like:

{ id: "en_bca381f", score: 0.3501713, words: (106) […], data: async data() }

And here is the beginning of (await x.results[0].data()).weighted_locations:

[
  {
    "weight": 7,
    "balanced_score": 51899.676,
    "location": 0
  },
  {
    "weight": 4,
    "balanced_score": 16946.832,
    "location": 4
  },
  {
    "weight": 4,
    "balanced_score": 16946.832,
    "location": 18
  },
  {
    "weight": 4,
    "balanced_score": 16946.832,
    "location": 33
  },
  {
    "weight": 4,
    "balanced_score": 16946.832,
    "location": 48
  },
  [...]
]

And here is the beginning of (await x.results[14].data()).weighted_locations:

[
  {
    "weight": 10,
    "balanced_score": 57600,
    "location": 2
  },
  {
    "weight": 8,
    "balanced_score": 36864,
    "location": 6
  },
  {
    "weight": 1,
    "balanced_score": 576,
    "location": 9
  },
  {
    "weight": 1,
    "balanced_score": 576,
    "location": 18
  },
  {
    "weight": 1,
    "balanced_score": 576,
    "location": 134
  },
  [...]
]

So despite the fifteenth hit's first balanced_score being higher than the first hit's, I guess the other locations' balanced_scores taper off too quickly for the former to be considered the clear winner?

ossilator commented 9 months ago

GitHub Pages [...] literally suggests committing and pushing the verbatim .html pages to serve

this is the most basic example possible. it's not useful to refer to it here.

if you still wanted to take it literally, you'd be working with two separate projects: one with the sources, with actions, and one with the generated content for serving. but you implied that github wouldn't make this easy to automate.

However, this solution is somewhat in search of a problem,

it's not, for all the reasons @vdye pointed out, and probably more.

and it is not particularly simple,

because you seem to be looking for ways to modify things just slightly, rather than rethinking the model. i can't quite believe that hugo doesn't offer an adequate solution to a problem that literally every user who tries to practice "versioning hygiene" must face.

dscho commented 9 months ago

i can't quite believe that hugo doesn't offer an adequate solution to a problem that literally every user who tries to practice "versioning hygiene" must face.

Hugo has obviously nothing to do with this. It merely expects its input in Markdown or HTML form, and processes it by interpreting the layouts.

Nothing in the standard Hugo model expects input in AsciiDoc format. Or input that is provided in a separate repository.

So we're in a somewhat rare scenario where we not only render AsciiDoc, but we also want to import the sources from 3rd-party repositories that we do not control.

For what it's worth, I have thought long and hard about ways to cache the rendered AsciiDoc better. It is true that I never even considered submodules because they're just not even a good fit for git/git itself, so I honestly can't take that suggestion seriously. But I have considered GitHub workflow artifacts (fragile, and hostile to local development), .zip files (again, not friendly to local development), Git LFS (not applicable because we don't work with just a couple huge files but instead with loads of small files), I did think about a SQLite database (but where would that be stored? And again, hostile to local development), and about GitHub Packages (strikes me as quite wasteful because no "release" is interesting after a newer one is published, and once again, hostile to local development). I won't mention some more obscure methods to avoid embarrassment. And lastly, I also considered re-generating everything from scratch, but dismissed it as laughably ridiculous that any contributor would need to generate everything from scratch for >15 minutes every time they need to test a change (even for simple typo fixes of the front page!!!).

These considerations happened almost two years ago, and after settling on the current setup (tracking generated files, much like git/git tracks files in sha1dc/ that are simple copies of files from the sha1collisiondetection submodule, in case that submodule is not checked out), the model has simply served me too well to merit a change.

I'm really sorry, but even after all of this discussion, it seriously looks to me as if there are no better ideas available, and that we're no longer discussing options with the goal of improving the design but it looks rather like insisting on "one's own" idea. Dropping a "principle" in somewhat a hand-waving way is not particularly convincing. I see many upsides to caching the generated files by way of committing them, I see many more upsides in committing them to the same repository instead of one or more separate ones, and while I see one valid argument in disfavor ("contributors might open PRs that change generated files") I see this as eminently manageable a problem to have.

So I consider this discussion to have passed the point of being helpful. While we were discussing hypotheticals here, other discussions revolving around this PR have been a lot more fruitful: discussions that pointed out not only broken links but also the tools that helped me fix all of them, discussions raising the valid concern of keeping existing links working, and discussions around improving the search feature. So maybe there is a way to bring this here discussion back to something fruitful, maybe by pointing out some other valid concern around the current design than PRs accidentally changing generated files. Something that might actually cause problems.

I am opposed to complicating the build process, the GitHub workflows, the local development by drive-by contributors. That's a total non-goal for me. The best contender for an alternative to "generate-then-commit-the manual-pages-and-the-book-translations" so far was the idea to generate everything from scratch all the time. Which is a pretty awful idea once you tally up the time this would take and how tedious it would make it to contribute.

bglw commented 9 months ago

👋 @dscho sorry just catching up here

I received a report that the search results are somewhat counter-intuitive ... I tried to help Pagefind to report that manual page as first match for the search term "log" ... I thought giving the term "log" a weight of 10 would force it to be the top hit, but it is not even among the first 10 hits

Hmm, by the time you're at data-pagefind-weight="10" it should definitely be surfaced significantly higher — but currently there are no firm guarantees on how that flows through the rankings

I guess the fact that the other locations' balanced_scores taper off too quickly with the former to be considered the clear winner?

Yeah — one thing Pagefind looks for is pages with a high density of words — so the other page has a lot of logo — and it's a short page so the density is very high. /docs/git-log is substantially larger, so log is comparatively lower density. Such are the challenges of full-text search 😓 (and that factors in logo being deranked slightly since it isn't exactly log)

A relevant issue for this is https://github.com/CloudCannon/pagefind/issues/437 — this would help boosting titles (or other pieces of metadata) that match further up results.

Another thing that would help would be if Pagefind exposed more controls on how it ranks content — for example you could opt out of the density ranking to help here.

For right now — a good approach might be to go fully manual with your rankings. If you wrap your entire body in data-pagefind-weight="0.2" (or anything down to like 0.05) then your weights will all start very low. Having this weight wrap everything will also opt-out of doing anything automatic with headings, so an h1 won't get an automatic higher weight.

With that, a weight of 10 will be substantially higher — hopefully enough to push those pages right to the top. And you might want to put explicit weights on other content/headings as required. A weight of 10 is ~100x stronger than 1 — but a weight of 10 is ~50,000x stronger than 0.05.

If that doesn't pan out nicely, or causes other issues, let me know! I can fast-track one of the two approaches on Pagefind's end to improve this kind of reference search 🙂

dscho commented 8 months ago

For right now — a good approach might be to go fully manual with your rankings. If you wrap your entire body in data-pagefind-weight="0.2" (or anything down to like 0.05) then your weights will all start very low. Having this weight wrap everything will also opt-out of doing anything automatic with headings, so an h1 won't get an automatic higher weight.

With that, a weight of 10 will be substantially higher — hopefully enough to push those pages right to the top. And you might want to put explicit weights on other content/headings as required. A weight of 10 is ~100x stronger than 1 — but a weight of 10 is ~50,000x stronger than 0.05.

@bglw unfortunately, this still does not seem to help... I've wrapped the body with a ridiculously small weight, and then wrapped the synopsis with a slightly larger, but still ridiculously small weight:

<div
 id="main"
 data-pagefind-filter="category:documentation"
 data-pagefind-meta="category:Reference"
 data-pagefind-weight="0.000001"
 data-pagefind-body="">
[...]
<p
 data-pagefind-weight="0.000002">
git-<span data-pagefind-weight="10">commit</span> - Record changes to the repository<div></div>
</p>
[...]
</div>

However, searching for commit will always bring up git-verify-commit first, then git-get-tar-commit-id, then git-commit-graph, then git-commit-tree, and only as fifth hit: git-commit. It's almost as if having an exact match is punished.

Here is a copy of a Developer Tools Console session where I obtain the search results for commit, show the 5th hit's (git-commit's) weighted locations first, then the 1st hit's (git-verify-commit's), and then the scores of the first few hits:

» x = await Search.pagefind.debouncedSearch("commit")
←⏵ Object { results: (269) […], unfilteredResultCount: 269, filters: {}, totalFilters: {}, timings: (1) […] }
» JSON.stringify((await x.results[4].data()).weighted_locations.slice(0, 2), null, 2)
← '[
      {
        "weight": 10,
        "balanced_score": 57600,
        "location": 2
      },
      {
        "weight": 0.041666666666666664,
        "balanced_score": 1,
        "location": 11
      }
    ]'
» JSON.stringify((await x.results[0].data()).weighted_locations.slice(0, 2), null, 2)
← '[
      {
        "weight": 5,
        "balanced_score": 14400,
        "location": 2
      },
      {
        "weight": 0.041666666666666664,
        "balanced_score": 1,
        "location": 9
      }
    ]'  
» x.results.map(e => e.score)
←⏵Array(269) [ 8.579167, 1.2952586, 0.68904763, 0.6164191, 0.59042245, 0.22033088, 0.0050644567, 0.0034722222, 0.0026816516, 0.0024764733, … ] 

FWIW the numbers didn't change much when I changed the two "outer" weights: verify-commit always got a score over 8 and commit always just over 0.59. I fear that the culprit is that <span>verify-commit</span> just always gets a score around 14.5k, which combined with the density makes it always win over <span>commit</span>, even though the latter's score is around 57.5k (my guess is that the density of the git-commit page with a word count of 4079 is just so unfavorable compared to the git-verify-commit page with only 70 words, amirite?)

Help?

dscho commented 8 months ago

Another thing that would help would be if Pagefind exposed more controls on how it ranks content — for example you could opt out of the density ranking to help here.

I think having that option would be awesome!

dscho commented 8 months ago

Another thing that would help would be if Pagefind exposed more controls on how it ranks content — for example you could opt out of the density ranking to help here.

I think having that option would be awesome!

@bglw I tried my hand at it (documentation is still missing, I first want to make sure that this is the right direction): https://github.com/CloudCannon/pagefind/pull/534

dscho commented 8 months ago

I tried my hand at it (documentation is still missing, I first want to make sure that this is the right direction)

There's now documentation, and a test.
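
Assuming the option lands in roughly that shape, using it from the JS API could look something like this (a hedged sketch; the exact option names and values are assumptions, not the merged API):

// Hypothetical: dial down the influence of page length / term density
// so that exact matches on long manual pages are not out-ranked by
// short pages that merely mention the term often.
const pagefind = await import("/pagefind/pagefind.js");
await pagefind.options({ ranking: { pageLength: 0.1, termFrequency: 0.2 } });
const results = await pagefind.search("commit");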

dscho commented 2 months ago

why does the generated content need to be versioned in the first place?

Efficiency and reproducibility. The same reason why there is not only https://github.com/jnavila/git-manpages-l10n/ but also https://github.com/jnavila/git-html-l10n (hint: the latter contains assets generated from the former).

to me this sounds like a complete nightmare. not cleanly separating the sources from the generated content is a recipe for undesired "special effects" of all kinds. and that's atop of obvious issues of working with the repo itself.

@ossilator unnecessarily-harsh language notwithstanding, I do see a grain of truth in that statement. Cleanly separating non-generated from generated content is clearly desirable.

I have to admit that I dreaded the huge amount of work to get this done, and the time it took did not disappoint: it took a lot of effort.

The end result is worth it, though, I believe: All generated content is cleanly separated into the external/ directory tree, to be precise:

Both subdirectories fan out into:

Also, I added instructive "DO NOT EDIT" comments to the generated content, in an attempt to avoid confusing contributors into opening PRs that modify said generated content directly.

ossilator commented 2 months ago

why does the generated content need to be versioned in the first place?

Efficiency

granted, though there is no imperative to use the same repo for that.

and reproducibility.

that's backwards. one versions things that one cannot (easily) reproduce.

in particular, you should snapshot your relevant inputs (both data and build tools) as-is, in separate repositories (of whatever kind). note that if these inputs are externally versioned and have reproducible builds, then technically the snapshot can be quite minimal, e.g. a sha1. but for efficiency, you would probably cache their output artifacts that are your direct inputs.

[...] All generated content is cleanly separated into the external/ directory tree, to be precise: [...]

that's awesome. but to me, this looks like what should be just the first step towards splitting off the build artifacts.

the way i would approach it, things would be laid out this way:

this layout enables you to see what exactly caused the output to change, because each input is versioned separately. and it is friendly to both local builds and rebuilds, and to GH workflows (the main repo's workflow would merely trigger the one in the build artifact repo; this is easy).

dscho commented 2 months ago

why does the generated content need to be versioned in the first place?

Efficiency

granted, though there is no imperative to use the same repo for that.

No. But it is much more convenient to have the same repository for that, in particular in a project that typically sees one-off contributions. No need to make the barrier to entry any higher than absolutely necessary; I hope you do agree on that point at least.

and reproducibility.

that's backwards. one versions things that one cannot (easily) reproduce.

And that's exactly the case here. When so many different components are combined, it is hard to reproduce the exact same output. Take a different Hugo version, for example. Or an update in the AsciiDoctor tooling. Or a different Ruby version that has slightly different behavior. I hope you see how many things can result in even subtle differences, which is why the best way to let one-time contributors easily reproduce locally the exact same data shape as the actual homepage has is to cache these pre-rendered pages.

that's awesome. but to me, this looks like what should be just the first step towards splitting off the build artifacts.

I don't want the complexity of even more repositories. I find it unnecessary and counterproductive.

ossilator commented 2 months ago

But it is much more convenient to have the same repository for that,

not really. it optimizes one particular aspect, at the cost of others.

in particular in a project that typically sees one-off contributions. No need to make the barrier to entry any higher than absolutely necessary; I hope you do agree on that point at least.

i guess i'm atypical, but as a potential one-off contributor, i'd look, shake my head, and walk away if i saw a repo with 7 million lines of generated text.

When so many different components are combined, it is hard to reproduce the exact same output.

but your approach - unlike mine - doesn't make things more reproducible. it merely archives results from particular build setups, basically hoping that the next contributor won't have to rebuild anything.

fwiw, the proper way to snapshot the tooling/build environment would probably involve a docker image or something like that. having everything relevant in git repos would be a tad insane ...