hbstack / blog

HB universal blog module
https://hbstack.dev/
MIT License

Improve build performance for taxonomies #884

Closed razonyang closed 4 months ago

razonyang commented 5 months ago
FuadEfendi commented 5 months ago

Re: building a really large site shows a build-time difference of 5ms/page vs. 17ms/page when I disable/enable a single taxonomy.

If a different theme has the same issues, then it is not the theme but Hugo; I'll try to test; I think it is Hugo's architecture. Some other parameters play a role too: how many taxonomies to show per page, pagination size, etc.

Right now I don't worry too much: I build locally on my laptop (90 minutes), then I use the Netlify CLI to upload (3 hours); a build on Netlify will fail unless you are on the "enterprise" plan.

But even 5ms per page is too slow, in my opinion. For comparison, my Java application takes less than a minute to parse 250,000 XML files and convert them into Markdown, while Hugo needs 90 minutes to convert 250,000 Markdown files into HTML (with pagination size 1000 and taxonomy & sitemap disabled).

razonyang commented 5 months ago

If a different theme has the same issues, then it is not the theme but Hugo

It is hard (maybe impossible) to compare and determine, since every theme has its own set of features and build optimizations.

But even 5ms per page is too slow

I'd recommend switching to a faster/minimalist theme (since I didn't see many pictures) for your large site for now; I don't believe this theme's build speed will get faster than 5ms/page after tweaking templates.

razonyang commented 5 months ago

Btw, what is the output of hugo mod graph? I just wrote a script to create test content in bulk, to try to diagnose which module slows down the build.

FuadEfendi commented 5 months ago

I am running more tests now; I enabled sitemap generation and the build took over 5 hours; previously it was about 90 minutes. I cannot believe my eyes, so I am rerunning all tests, explicitly deleting the "public" folder before each run.

Here is the "baseline", with sitemap generation enabled:

                   |   EN    
-------------------+---------
  Pages            | 265764  
  Paginator pages  |  23498  
  Non-page files   |      0  
  Static files     |      8  
  Processed images |      9  
  Aliases          |  19796  
  Cleaned          |      0  

Total in 18275832 ms

Pagination is 1000.

razonyang commented 5 months ago

The build is much slower than the one you posted before; what is the difference between those two builds?

It would be helpful if you could provide the source of your site; forget it if it contains sensitive information.

FuadEfendi commented 5 months ago

Sorry Razon, I have to be very careful with my comments here; I'll rerun everything again, and I will compare with minimalistic themes too.

I analyzed what changed: I carelessly commented out list_style: minimalist in the config; maybe this was the main reason for the 3x slower build.

razonyang commented 4 months ago

Great News!!!

After spending several nights tuning its performance, I gained very little. However, when I accidentally opened the public directory, I discovered a serious bug. After testing, it turned out the poor performance was due to this bug generating unnecessary paginated pages, which has now been fixed.

You can compare the performance of those two builds here. In short, this theme is now 3-5 times faster than before: it took 7min to build 100k pages (including pagination pages), a speed of 4-5ms/page.

It would be much faster in a high-end environment.

I just tested on my laptop (32 GiB RAM, 8 cores (16 threads) CPU): it took 10min to build about 200k pages, 3.2ms/page. I believe the theme may take less than 1hr to build one million normal pages.

image

I'm looking forward to seeing how it performs on your site, if you still want to use this theme.

FuadEfendi commented 4 months ago

Hi Razon,

The theme is great, especially for me: since I am a developer learning Hugo, it has great features, including Node integration (for which other "minimalistic" themes don't have good examples); being mobile-friendly is super important too ("Responsive", Bootstrap).

I also accidentally noticed a weird issue with "file not found" in "hugo_cache" during build, when I wanted to have a "content/terms" folder for my dictionary. Some file/folder names are probably reserved; the error messages are weird and very hard to troubleshoot.

Here are the numbers "before" and "after" the upgrade for one of my sites; all settings are the same except for the module upgrade:


                   |  EN    
-------------------+--------
  Pages            | 31856  
  Paginator pages  |  3519  
  Non-page files   |     0  
  Static files     |     0  
  Processed images |     9  
  Aliases          | 12768  
  Cleaned          |     0  

Built in 422229 ms
Environment: "development"

After the upgrade:

                   |  EN    
-------------------+--------
  Pages            | 31856  
  Paginator pages  |  1932  
  Non-page files   |     0  
  Static files     |     0  
  Processed images |     9  
  Aliases          | 12764  
  Cleaned          |     0  

Built in 116927 ms
Environment: "development"
Serving pages from disk

I'll try my largest site too; it was around a 5-hour build previously, so it will take time ;)

razonyang commented 4 months ago

I also accidentally noticed a weird issue with "file not found" in "hugo_cache" during build...

This is indeed a headache, but we need reproducible steps to locate the cause; there are too many factors (disk, file permissions, a theme bug, a Hugo bug and so on). However, I haven't run into this issue on Windows, WSL or Linux, so it's hard for me to troubleshoot.

After the upgrade: ... Built in 116927 ms ...

Seems much better: the build time has dropped from 10+ms/page to 3.4ms/page. I imagine the fastest achievable build speed is about 2-3ms/page in a high-end environment. The hooks system (hugopress) brings flexible module hot-plugging (install/remove modules without changing themes), but it also comes with some performance cost, so please do not expect it to be faster than other optimized themes.
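The per-page figures quoted above can be reproduced from the two build summaries. This is a minimal sketch, assuming the rate is the total build time divided by regular pages plus paginator pages; the thread does not state the exact denominator, and aliases are left out here.

```python
def ms_per_page(total_ms: int, pages: int, paginator_pages: int) -> float:
    """Average build time per rendered page, in milliseconds.

    Assumption: "page" means regular pages plus paginator pages.
    """
    return total_ms / (pages + paginator_pages)


# Figures copied from the "before" and "after" build summaries in this thread.
before = ms_per_page(422229, 31856, 3519)
after = ms_per_page(116927, 31856, 1932)

print(f"before: {before:.2f} ms/page")
print(f"after:  {after:.2f} ms/page")
print(f"speedup: {before / after:.1f}x")
```

Under that assumption the results land close to the "10+ms/page" and "3.4ms/page" mentioned in the comment.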

razonyang commented 4 months ago

Closed per above comments.

FuadEfendi commented 4 months ago

With my super large site, I didn't notice significant improvements; also, my tests were not super clean, I have to avoid working on laptop while running tests.

Before upgrade:

                   |   EN    
-------------------+---------
  Pages            | 265764  
  Paginator pages  |  23498  
  Non-page files   |      0  
  Static files     |      8  
  Processed images |      9  
  Aliases          |  19796  
  Cleaned          |      0  

Total in 18275832 ms

After upgrade:


                   |   EN    
-------------------+---------
  Pages            | 265763  
  Paginator pages  |   6983  
  Non-page files   |      0  
  Static files     |      9  
  Processed images |      9  
  Aliases          |  19795  
  Cleaned          |      0  

Total in 19810338 ms

I also found that when I use list_style: minimalist, the build is about 4x faster, and I don't understand why: "pagination" needs "list_style", and in both cases it needs to retrieve the title and description from linked pages.

On average, the build takes 4200 seconds when I use list_style: minimalist for blogs, and 20,000 seconds when I comment it out in the config:

  terms:
    # the paginate for categories, tags, series list pages.
    paginate: 1000
    #list_style: minimalist
    profile: false
  blog:
    #list_style: minimalist
    profile: false

FuadEfendi commented 4 months ago

I am not sure, maybe "list_style" tries to generate thumbnails or graphics, and that's why it is slow; I don't like "minimalist" because it doesn't show the description, which is not good for SEO; I am checking the docs now.

FuadEfendi commented 4 months ago

UPDATE: checking https://github.com/hbstack/blog/blob/main/layouts/partials/hb/modules/blog/post/card.html

I can only guess that taxonomy calculations are taking place (instead of using cached results); plus, I am unsure how Hugo handles this: in the case of a smaller "terms.html" it can find a "term" by using a "full table scan"; but in my case, I have at least 10,000 - 100,000 terms. Do they use an "index" to scan "terms.html"? I am not sure, but it adds 4 hours of build time for a 250,000-document site ;) I am only guessing, I don't know Hugo internals.

I repeated the test with a smaller site, "minimalist":

                   |  EN    
-------------------+--------
  Pages            | 68157  
  Paginator pages  |  1533  
  Non-page files   |     0  
  Static files     |     9  
  Processed images |     9  
  Aliases          |  8797  
  Cleaned          |     0  

Total in **247066 ms**

And non-minimalist, regular card:

                   |  EN    
-------------------+--------
  Pages            | 68157  
  Paginator pages  |  1533  
  Non-page files   |     0  
  Static files     |     9  
  Processed images |     9  
  Aliases          |  8797  
  Cleaned          |     0  

Total in **705852 ms**

So, I downloaded "Card.html", removed "taxonomy" from the code, and tested the "regular card" again:

                   |  EN    
-------------------+--------
  Pages            | 68157  
  Paginator pages  |  1533  
  Non-page files   |     0  
  Static files     |     9  
  Processed images |     9  
  Aliases          |  8797  
  Cleaned          |     0  

Total in **263589 ms**
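The three runs above allow a rough estimate of how much of the regular-card build goes to the taxonomy block. This is a back-of-the-envelope sketch that assumes the three runs are otherwise comparable, which the thread itself does not guarantee.

```python
# Build times in milliseconds, copied from the three runs above.
minimalist_ms = 247066      # list_style: minimalist
regular_ms = 705852         # regular card, taxonomies rendered
regular_no_tax_ms = 263589  # regular card with the taxonomy block removed

# Time that disappears when the taxonomy block is removed from the card.
taxonomy_ms = regular_ms - regular_no_tax_ms
taxonomy_share = taxonomy_ms / regular_ms

print(f"time attributable to taxonomies: {taxonomy_ms / 1000:.0f} s")
print(f"share of the regular-card build: {taxonomy_share:.0%}")
print(f"regular vs. minimalist: {regular_ms / minimalist_ms:.1f}x")
```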

I am not sure, this is from docs: partials.IncludeCached LAYOUT CONTEXT

Could this be the issue? We cache it in a "page" context (the second $page parameter) instead of the global "site" context? So that it is never cached?

So, I tested it: I altered Card.html line 67 and put . instead of $page as the second "context" parameter:

                   |  EN    
-------------------+--------
  Pages            | 68157  
  Paginator pages  |  1533  
  Non-page files   |     0  
  Static files     |     9  
  Processed images |     9  
  Aliases          |  8797  
  Cleaned          |     0  

Total in 271600 ms

Did I find the fix? From 700 seconds to 270 seconds by just fixing line 67?

Before:

      <div class="hb-blog-post-meta d-block text-nowrap text-truncate mb-2">
        {{ partialCached "hb/modules/blog/post/meta/taxonomies" $page $page}}
      </div>

Build time: 700 seconds

After:

      <div class="hb-blog-post-meta d-block text-nowrap text-truncate mb-2">
        {{ partialCached "hb/modules/blog/post/meta/taxonomies" $page . }}
      </div>

Build time: 270 seconds

razonyang commented 4 months ago

{{ partialCached "hb/modules/blog/post/meta/taxonomies" $page . }}

I haven't read it fully, but it's wrong: the taxonomies are tied to the current card's page, not the current context (which is not a page). You can start a Hugo server with this code and check your posts.
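To see why the variant argument matters, here is a toy model of a variant-keyed partial cache. It is a simplified illustration in Python, not Hugo's implementation: `partial_cached`, `render_taxonomies`, and the page dictionaries are hypothetical names, and the only behaviour assumed from Hugo is that `partialCached` stores one result per partial name and variant.

```python
from typing import Callable, Dict, Hashable, Tuple

# Toy cache: one rendered result per (partial name, variant) pair.
_cache: Dict[Tuple[str, Hashable], str] = {}


def partial_cached(name: str, render: Callable[[dict], str],
                   context: dict, variant: Hashable) -> str:
    """Render `context` once per (name, variant) and reuse the result."""
    key = (name, variant)
    if key not in _cache:
        _cache[key] = render(context)
    return _cache[key]


def render_taxonomies(page: dict) -> str:
    """Stand-in for the taxonomies partial: join the page's tags."""
    return ", ".join(page["tags"])


# Hypothetical pages, for illustration only.
pages = [
    {"path": "/posts/a", "tags": ["hugo", "performance"]},
    {"path": "/posts/b", "tags": ["netlify"]},
]

# Variant tied to the page: every card gets its own, correct entry.
per_page = [partial_cached("taxonomies", render_taxonomies, p, p["path"])
            for p in pages]

# Variant that is identical for every card: the first result is reused
# for all later pages, so their taxonomies are wrong.
_cache.clear()
shared = [partial_cached("taxonomies", render_taxonomies, p, "list-context")
          for p in pages]

print(per_page)
print(shared)
```

In this model the shared variant is fast because the partial renders only once, which fits the shorter build time, but the output is no longer correct for each post.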

minimalist

minimalist just lists the title and date.

With my super large site, I didn't notice significant improvements; also, my tests were not super clean, I have to avoid working on laptop while running tests.

Hmm, I'm surprised the build got slower after upgrading, since I do see build performance improve on all my sites... It may be down to your site's specifics, such as network operations (calling APIs, fetching remote data), image processing, custom templates/shortcodes and so on, or there may be a potential performance issue, but I'm not able to debug this without the source code, so I can't provide help on this.

razonyang commented 4 months ago

I can only guess that taxonomy calculations are taking place (instead of using cached results); plus, I am unsure how Hugo handles this: in the case of a smaller "terms.html" it can find a "term" by using a "full table scan"

Hmm, I haven't looked into the Hugo source code; the theme uses the page's .GetTerms method to get terms. I will take a look if I have time.

razonyang commented 4 months ago

You may be right, the taxonomies may be the cause of this.

I just created a site from scratch without any theme.

// layouts/_default/single.html
{{- $page := . }}
{{ $t := debug.Timer "page-taxonomies" }}
{{- range $kind := slice "tags" "categories" }}
  {{ $t1 := printf "page-taxonomies-%s" $kind | debug.Timer }}
  {{- with $page.GetTerms $kind }}
    {{- range . }}
      <span class="blog-post-taxonomy-meta">
        <a
          class="blog-post-taxonomy blog-post-taxonomy badge bg-secondary text-decoration-none fw-normal me-1"
          href="{{ .RelPermalink }}">
          {{- .Title -}}
        </a>
      </span>
    {{- end }}
  {{- end }}
  {{ $t1.Stop }}
{{- end -}}
{{ $t.Stop }}

And then created dummy content with Lorem Ipsum Generator.

lorem-ipsum-generator -n 10000 --tag-count 20 -o content

The script generates 10k posts with 20 tags per page.
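For readers without that tool, a comparable data set can be produced with a few lines of Python. This is a hypothetical stand-in, not the Lorem Ipsum Generator used above: the file layout, the `tag-N` naming, the size of the tag pool, and the function name `generate_posts` are all assumptions made for illustration.

```python
import random
from pathlib import Path


def generate_posts(out_dir: str, count: int, tags_per_post: int,
                   tag_pool: int = 1000, seed: int = 0) -> int:
    """Write `count` Markdown posts with YAML front matter into `out_dir`."""
    rng = random.Random(seed)  # fixed seed keeps repeated runs comparable
    root = Path(out_dir)
    root.mkdir(parents=True, exist_ok=True)
    all_tags = [f"tag-{i}" for i in range(tag_pool)]
    for n in range(1, count + 1):
        # Pick distinct tags for this post (needs tags_per_post <= tag_pool).
        tags = rng.sample(all_tags, tags_per_post)
        front_matter = "\n".join(
            ["---", f'title: "Post {n}"', "tags:"]
            + [f"  - {t}" for t in tags]
            + ["---", ""]
        )
        (root / f"post-{n}.md").write_text(
            front_matter + "\nLorem ipsum dolor sit amet.\n", encoding="utf-8")
    return count


# Example at the scale described above (10k posts, 20 tags each):
# generate_posts("content/posts", 10_000, 20)
```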

I used cascade to filter out some posts, to compare performance at different content sizes.

// hugo.toml
[[cascade]]
[cascade._target]
path = "/{3001-4000,4001-5000,5001-6000,6001-7000,7001-8000,8001-9000,9001-10000}/**"
[cascade.build]
list = "never"
render = "never"

  Posts | Performance
  ------+-------------
  2000  | image
  3000  | image
  5000  | image
  10k   | image

As the images show, the average time increases as the content grows.

The issue does not seem to be related to this theme; I will try to create a repo and post a topic on the Hugo forum.

FuadEfendi commented 4 months ago

After my "patch" applied:

                   |   EN    
-------------------+---------
  Pages            | 265781  
  Paginator pages  |   6983  
  Non-page files   |      0  
  Static files     |      9  
  Processed images |      9  
  Aliases          |  19804  
  Cleaned          |      0  

Total in 6282292 ms

Before:


                   |   EN    
-------------------+---------
  Pages            | 265763  
  Paginator pages  |   6983  
  Non-page files   |      0  
  Static files     |      9  
  Processed images |      9  
  Aliases          |  19795  
  Cleaned          |      0  

Total in 19810338 ms

I believe this is THE issue,

{{ partialCached "hb/modules/blog/post/meta/taxonomies" $page $page}}

razonyang commented 4 months ago

I haven't read it fully, but it's wrong: the taxonomies are tied to the current card's page, not the current context (which is not a page).

Hmm, I think I've explained this; please check your site after applying your patch, and make sure the taxonomies are correct for each post.

Just as an example:

image

And all my posts' taxonomies disappear.

image

will be cached in the current page context and this cache won't be available in other pages context for reuse

The taxonomies meta is used on the detail page and on list pages (sections, tags, categories, archive and so on); it's not used only once.

FuadEfendi commented 4 months ago

I was wrong trying to use {{ partialCached "hb/modules/blog/post/meta/taxonomies" $page . }} in Card.html; it doesn't update "card" with proper taxonomy; it uses static constants.

But anyway, the issue is with the taxonomy calculations. Why are those not cached in some "dictionary" file, instead of always being recalculated?

razonyang commented 4 months ago

But anyway, the issue is with the taxonomy calculations. Why are those not cached in some "dictionary" file, instead of always being recalculated?

Not sure; there may be a bottleneck in Hugo when handling a large number of taxonomy terms. As the previous test (only one template, no theme) showed, the average execution time clearly increased as the number of taxonomy terms grew.

razonyang commented 4 months ago

Btw, I created a topic on the Hugo forum; please wait for them to reply/confirm whether there is something I'm doing wrong.

FuadEfendi commented 4 months ago

I did test with about 70,000 pages, similar results, 3x difference between "minimalist" (without taxonomy) and regular; 250 seconds "minimalist", 700 seconds "regular".

Yes, better to ask Hugo.

I can imagine a huge file containing precalculated taxonomies in (my best hope) alphabetically sorted order, with a log(n) (best hope) search algorithm, or even better, a separate "index" file; but I feel that "partialCached" doesn't use this. Taxonomies should be precalculated and cached. For smaller sites it is not as visible, like in this example, 250 seconds vs. 700 seconds (and thanks to you Razon, this is a huge improvement over what it was a few days ago!)
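The difference between a "full table scan" and an "index" can be sketched as follows. This illustrates the general idea only; it makes no claim about how Hugo actually stores or looks up terms, and the names `terms_by_scan`, `build_page_index`, and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical taxonomy: term -> pages that carry it.
taxonomy: Dict[str, List[str]] = {
    "hugo": ["/a", "/b"],
    "netlify": ["/b"],
    "java": ["/c"],
}


def terms_by_scan(page: str) -> List[str]:
    """Scan every term on every lookup: cost grows with the term count."""
    return sorted(t for t, pages in taxonomy.items() if page in pages)


def build_page_index() -> Dict[str, List[str]]:
    """Precompute page -> terms once, so a lookup is a dictionary access."""
    index: Dict[str, List[str]] = defaultdict(list)
    for term in sorted(taxonomy):
        for page in taxonomy[term]:
            index[page].append(term)
    return index


page_index = build_page_index()

print(terms_by_scan("/b"))
print(page_index["/b"])
```

With T terms and P pages, the scan does work at least proportional to T for each of the P lookups, while the index is built once and then answers each lookup in roughly constant time.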

razonyang commented 4 months ago

I did test with about 70,000 pages, similar results, 3x difference between "minimalist" (without taxonomy) and regular; 250 seconds "minimalist", 700 seconds "regular".

Will check whether the cache is used with the non-minimalist style.

FuadEfendi commented 4 months ago

Also, it is strange that it still takes time: I use taxonomies.count = false in config, it must be instant, no calculations required. It is like outputting just link to taxonomy page vs. outputting link (no need to calculate) and count (needs to be calculated). Maybe it should be sorted by count, then I can understand... but in the Card, probably better to sort alphabetically.

razonyang commented 4 months ago

taxonomies.count

What is this parameter used for? I couldn't recall.

in the Card, probably better to sort alphabetically.

Its order is the same as in the front matter.

FuadEfendi commented 4 months ago

      taxonomies:
        count: false # whether to show the number of posts associated to the item.
        limit: 100 # the maximum number of the item.

razonyang commented 4 months ago

I did test with about 70,000 pages, similar results, 3x difference between "minimalist" (without taxonomy) and regular; 250 seconds "minimalist", 700 seconds "regular".

image

The cache is working correctly: you can see that 100% is cached. The speed difference between the two is due to the simplicity of minimalist, which only displays the title and date. You can override layouts/partials/hb/modules/blog/posts-minimalist.html to suit your needs and gain better performance.

Also, it is strange that it still takes time: I use taxonomies.count = false in config

This won't affect performance: the sidebar's taxonomies are cached, and besides, the time they take is almost negligible (just 150ms on a 30k-page site in debug mode).

image

The time spent on a page's taxonomies increases linearly; see also https://discourse.gohugo.io/t/getterms-getting-slows-as-the-content-grows/50332/2?u=razon

Currently, you can tweak the posts-minimalist template to gain better performance.

razonyang commented 4 months ago

The Hugo team has submitted some improvements. I'm not sure whether they help in your case; you can build from source to confirm.

git clone https://github.com/gohugoio/hugo
cd hugo
go get
go build -tags extended
cd /path/to/your-site
/path/to/gohugoio/hugo/hugo

Build several times and take the average to compare with the previous build.
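One way to automate the repeated builds is a small timing script. This is a sketch, assuming the build is started with a plain `hugo` command from the site directory and that the output directory is `public`; the function name `time_builds` and its parameters are made up for this example.

```python
import shutil
import statistics
import subprocess
import time
from pathlib import Path
from typing import List, Optional, Sequence


def time_builds(cmd: Sequence[str], runs: int = 3,
                cwd: Optional[str] = None,
                clean_dir: Optional[str] = None) -> List[float]:
    """Run `cmd` `runs` times and return the wall-clock seconds of each run."""
    durations = []
    for _ in range(runs):
        if clean_dir is not None:
            # Start each run from a clean output directory.
            shutil.rmtree(Path(cwd or ".") / clean_dir, ignore_errors=True)
        start = time.perf_counter()
        subprocess.run(list(cmd), cwd=cwd, check=True,
                       stdout=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


# Example (paths are placeholders):
# times = time_builds(["/path/to/gohugoio/hugo/hugo"], runs=3,
#                     cwd="/path/to/your-site", clean_dir="public")
# print(f"mean: {statistics.mean(times):.1f} s")
```

Deleting the output directory before each run mirrors what was done earlier in the thread, so that every measurement starts from the same state.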

FuadEfendi commented 4 months ago

Thank you Razon, I appreciate it very much, trying it now...

FuadEfendi commented 4 months ago

Performance definitely improved, approx. 4x!!! I even enabled the "tags" taxonomy (it was disabled before); what was taking 5 hours now takes approx. 1:20.


                   |   EN    
-------------------+---------
  Pages            | 265669  
  Paginator pages  |   6980  
  Non-page files   |      0  
  Static files     |      9  
  Processed images |      9  
  Aliases          |  19810  
  Cleaned          |      0  

Total in 4608877 ms

razonyang commented 4 months ago

Performance definitely improved, approx. 4x!!! I even enabled the "tags" taxonomy (it was disabled before); what was taking 5 hours now takes approx. 1:20.

Was minimalist enabled? If it wasn't enabled, that is really impressive, then I'll close the forum topic later.

razonyang commented 4 months ago

BTW, how many taxonomy terms (tags + categories) do you have on this site?

FuadEfendi commented 4 months ago

I made a mistake; my "tags" had a cardinality of only 4, with very few documents having "tags"; "category" has maybe 300-1000. I tried to run with a "category x keywords" double taxonomy; my estimate is 1000 x 10,000 cardinalities, and it has already been running for 12+ hours. But the "category" taxonomy alone now works 4x faster.

razonyang commented 4 months ago

I tested and compared the performance of v0.127.0 and the next version of Hugo; as a result, GetTerms shows a marked improvement.

If your site still has performance issues, you may need to provide the site source or a reproducible repo for me to debug and locate the cause; it's very hard to debug by guessing.

FuadEfendi commented 4 months ago

My site doesn't have any issues specific to it; it is a generic Hugo design issue. I'll try to use the "Generator" tool to reproduce it in a separate repo; it is not theme related.