gohugoio / hugo

The world’s fastest framework for building websites.
https://gohugo.io
Apache License 2.0

Add support for big sitemaps (> 50K URLs) #9861

Open incognitozen opened 2 years ago

incognitozen commented 2 years ago

What version of Hugo are you using (hugo version)?

$ hugo version
hugo v0.97.3+extended linux/amd64 BuildDate=unknown

Does this issue reproduce with the latest release?

Yes.

If you create a sitemap on a site with over 50K URLs, Google complains that the file is too big.

Your Sitemap contains too many URLs. Please create multiple Sitemaps with up to 50000 URLs each and submit all Sitemaps.


I looked at the docs and noticed that there is no way to override this. Technically not a bug, but it makes it difficult to submit the sitemap to Google.

sifigi4335 commented 2 years ago

See https://discourse.gohugo.io/t/feature-request-sitemapindex-for-sitemaps-with-50k-links/33214

incognitozen commented 2 years ago

Hi @carerragt

Thanks for pointing in the right direction.

Please note davidsneighbour's response on that thread.

Often requested, but technically not possible.

This means that there is a reasonable demand and need for this feature. I have a site with a single 'type'. I don't have categories, tags or other taxonomies. There are 79K URLs on my site, all belonging to the same type.

Hence, Ju52's proposal may not work for me, as I don't have different sections on the site. I understand that this may not be the top priority, but it is a problem worth fixing. I run hundreds of sites, all on WordPress. At least 50% of them have more than 50K URLs.

sifigi4335 commented 2 years ago

Perhaps what you should be asking is how to split the sitemap list to multiple files. The 50k is a Google limitation, not Hugo's per se.

incognitozen commented 2 years ago

Hi @carerragt

Sure, I understand.

It does raise the point that the whole purpose of a sitemap is submission to search engines. Without that need, there is no requirement for a sitemap. Both Google and Bing, which provide consoles for managing sitemap submission, specifically require that a sitemap with more than 50K URLs be split into chunks.

I would open a forum thread, but if you solicit community feedback from those who have larger sites, they will tell you that this might be a very important feature for them.

midzer commented 2 years ago

I also have a site with 50k+ pages in a single sitemap.

Adopting some kind of automatic splitting in Hugo due to an external limit might break things for others. We should complain about the limit to the external search engines first. Maybe those providers can raise the limit to, let's say, 100k?

incognitozen commented 2 years ago

@midzer

You can certainly try, but there is a rationale for limiting a file to 50K URLs: the size of the file. Try downloading a file that has 50K URLs and the size will be approximately 4 MB.

Furthermore, Google and Bing certainly don't need to change their processing pipelines because a static site generator decided that sitemap.xml shouldn't be split. If Hugo wants to be adopted, then the onus of adding features or making changes in line with industry expectations lies with Hugo and not with other providers.

FuadEfendi commented 1 year ago

The "sitemaps" protocol supports a main "sitemap index" with many child "sitemaps" (up to 50k URLs each). Example:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <sitemap>
      <loc>http://www.example.com/sitemap1.xml.gz</loc>
      <lastmod>2004-10-01T18:23:17+00:00</lastmod>
   </sitemap>
   <sitemap>
      <loc>http://www.example.com/sitemap2.xml.gz</loc>
      <lastmod>2005-01-01</lastmod>
   </sitemap>
</sitemapindex>

Reference: https://www.sitemaps.org/protocol.html

For example, the Amazon.com website used this in the past, and they had millions of pages. It seems they stopped ;) perhaps because they would have billions of pages to list?

And BTW, Google understands this "index" file. You can put a link to it in robots.txt; it is not necessary to submit it explicitly.
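
As a rough illustration of the index file described above, here is a minimal Python sketch that builds a sitemap index from a list of child sitemap URLs and produces the matching robots.txt line. The function names and the example URLs are hypothetical, and this is not something Hugo does for you; it only follows the format shown in the example from the protocol page.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(child_urls):
    """Return sitemap index XML (as a str) listing the given child sitemap URLs."""
    # Setting xmlns as a plain attribute yields <sitemapindex xmlns="...">.
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in child_urls:
        sitemap = ET.SubElement(root, "sitemap")
        ET.SubElement(sitemap, "loc").text = url
    return ET.tostring(root, encoding="unicode")


def robots_txt_line(index_url):
    """Return the robots.txt directive that points crawlers at the index."""
    return f"Sitemap: {index_url}"


if __name__ == "__main__":
    # Hypothetical file names; use whatever your split files are called.
    children = [
        "https://www.example.com/sitemap-01.xml",
        "https://www.example.com/sitemap-02.xml",
    ]
    print(build_sitemap_index(children))
    print(robots_txt_line("https://www.example.com/sitemapindex.xml"))
```

The output has no XML declaration; prepend one when writing the file if you want it to match the example above exactly.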

FuadEfendi commented 1 month ago

Sitemaps are “nice to have” for SEO; there is no need to include pagination and taxonomy pages in sitemaps. In this case (if we ignore generated pages) it is quite easy to write a tool that generates the sitemaps as part of the Hugo build (some JavaScript, run as part of a Node command, etc.); it could be part of a custom theme.

FuadEfendi commented 1 month ago

Workaround:

  1. Let Hugo create default sitemap.xml
  2. Download it and split it into multiple files, up to 50,000 URLs each
  3. Follow https://www.sitemaps.org/protocol.html and create the necessary files accordingly; place them in the "/static" folder
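
The splitting step above could be sketched in Python roughly as follows. This is only an illustration under assumptions: `split_sitemap` is a hypothetical helper, the input is taken to be a standard `<urlset>` document, and any extra namespaces in the generated file (for example `xhtml:link` entries on multilingual sites) would be written back with auto-generated prefixes, which is still valid XML but looks different from the original.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file limit quoted in this thread


def split_sitemap(xml_text, max_urls=MAX_URLS):
    """Split one <urlset> document into several, each with at most max_urls entries.

    Returns a list of XML strings, one per output file.
    """
    # Serialize the sitemap namespace as the default one (no prefix).
    ET.register_namespace("", NS)
    urls = ET.fromstring(xml_text).findall(f"{{{NS}}}url")
    chunks = []
    for start in range(0, len(urls), max_urls):
        urlset = ET.Element(f"{{{NS}}}urlset")
        urlset.extend(urls[start:start + max_urls])
        chunks.append(ET.tostring(urlset, encoding="unicode"))
    return chunks


if __name__ == "__main__":
    # Tiny demonstration with 5 URLs and a limit of 2 per file.
    entries = "".join(
        f"<url><loc>https://example.com/{i}/</loc></url>" for i in range(5)
    )
    source = f'<urlset xmlns="{NS}">{entries}</urlset>'
    for number, chunk in enumerate(split_sitemap(source, max_urls=2), start=1):
        print(f"sitemap-{number:02d}.xml has {chunk.count('<loc>')} URLs")
```

Each returned string can then be written to a file such as `static/sitemap-01.xml` (the file names are an assumption) and listed in a sitemap index.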

Note: "sitemaps" are needed for documents that are not reachable from "home", or that are not easily reachable. For example, huge websites such as Amazon need sitemaps: the only other way to "reach" a product is via the search bar.

So, I don't think sitemaps are as important for static websites as for e-commerce... "categories" and "pagination" replace them.

FuadEfendi commented 1 month ago

As per documentation at https://gohugo.io/templates/sitemap-template/, we can explicitly use page front matter:

sitemap:
  changeFreq: ""
  disable: false
  filename: sitemap-01.xml
  priority: -1

Hugo also supports sitemapindex.xml generation.

A simple script can traverse your tons of documents and insert sitemap-01.xml for first 50,000, sitemap-02.xml for 2nd, and so on. This is just a workaround, but Hugo has made huge progress since this ticket was initially created.

jmooring commented 1 month ago

insert sitemap-01.xml for first 50,000, sitemap-02.xml

You are confusing site configuration with front matter override. You cannot override the filename in front matter. That's why the front matter override example in the documentation does not include filename.

FuadEfendi commented 1 month ago

OK, I didn't know that... but then, to confirm: we have the sitemap-index feature, and we still don't have multi-index support? For now, I run a local build which generates a huge sitemap, then I split it manually and disable sitemap generation, then deploy the sitemaps from the "static" folder as a workaround.

jmooring commented 1 month ago

With a multilingual project we create one sitemap index, and individual sitemaps per language (site). Regardless of whether a project is monolingual or multilingual, we don't split sitemaps based on the number of entries.

That's why this issue is open.

bep commented 1 month ago

I think it's relatively clear what this issue is about. If you want to discuss workarounds, use https://discourse.gohugo.io/

One workaround could be to add your own sitemap template to your theme/project:

https://github.com/gohugoio/hugo/blob/master/tpl/tplimpl/embedded/templates/_default/sitemap.xml

And possibly filter out your 50k most interesting URLs from an SEO perspective ...

FuadEfendi commented 1 month ago

I have a 270k-term modern terminology dictionary, all English; why should I filter the "most interesting" terms? My workaround is simple: let Hugo generate the huge XML, then take scissors and cut it into 6 pieces; or just write a Java application that generates what I need and places it in the "static" folder. I'll need an hour for that. Since it is too hard for Hugo ;)

FuadEfendi commented 1 month ago

Yes, multilingual support adds more complexity

FuadEfendi commented 1 month ago

Anyway, after some more thinking: sitemaps were invented for pages that are not reachable from the homepage. For Hugo-based sites, sitemaps are not needed at all; but that is my personal opinion.

I love the example of Amazon: they used sitemaps approx. ten years ago, but now they don't. Perhaps they prefer to upload product listings to Google and other sites in a different, specialized format.

FuadEfendi commented 1 month ago

Sorry for writing too much, but continuing logically: I had a past "price comparison" site where product pages were reachable only from search results pages; it was nonsense to have "pagination" for such a huge site. So, I used sitemaps to explicitly generate URLs where I wanted the Search Engine to land.

It's important to note that sitemaps are not necessary for typical article or blog sites with a well-structured menu/submenu/pagination. They are only required in specific cases, such as the one I encountered: a site with a few hundred thousand products, accessible solely through the search bar. In such instances, Google may not discover these pages due to the lack of a link route from Home to Child to Sub-Child, and so on. Therefore, sitemaps are particularly useful for managing large sites. For instance, I disabled pagination for my 270k dictionary site; it's not user-friendly to paginate the letter 'K' with 1000 links on a page, spread across 20 pages. In such cases, sitemaps can help to streamline the user ("robot" lol) experience.

Therefore, in Hugo, the use case for sitemaps is only for huge sites where we are forced to disable pagination.

Some other non-Hugo use cases for sitemaps: an SPA (Single-Page Application) that we want to make searchable, etc.