kjvarga / sitemap_generator

SitemapGenerator is a framework-agnostic XML Sitemap generator written in Ruby with automatic Rails integration. It supports Video, News, Image, Mobile, PageMap and Alternate Links sitemap extensions and includes Rake tasks for managing your sitemaps, as well as many other great features.
MIT License

Rails on Heroku with S3 sitemap hosting - Google didn't like it #402

Open cameronmccord2 opened 2 years ago

cameronmccord2 commented 2 years ago

We run Rails (6.0.3.7) on Heroku and host our sitemaps in an S3 bucket. The readme instructions worked, except that Google wouldn't accept our sitemaps because they were served from a different host than our website (OURBUCKET.s3.amazonaws.com vs. www.shout.app). Our S3 bucket was fully verified in Google with an HTML verification file placed in the bucket's /sitemaps/ folder.

What fixed it was setting sitemaps_host to the same value as default_host and adding redirects for the sitemaps to our routes file, as shown below. This way every sitemap URL that Google sees points to our website rather than to the S3 bucket.

# Sitemap Index
get "/sitemaps/sitemap.xml.gz", to: redirect("https://OURBUCKET.s3.amazonaws.com/sitemaps/sitemap.xml.gz")

# Each sub sitemap holds 50,000 urls so this is good for 500,000 urls
(1..10).each do |i|
  get "/sitemaps/sitemap#{i}.xml.gz", to: redirect("https://OURBUCKET.s3.amazonaws.com/sitemaps/sitemap#{i}.xml.gz")
end
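
The loop above hard-codes ten sub-sitemaps. As a rough sketch of the arithmetic behind that number, assuming the 50,000-URL-per-file limit mentioned in the comment, the route count and redirect targets could be derived from the URL count. The helper names `sitemap_files_needed` and `redirect_map` are hypothetical and are not part of sitemap_generator or Rails:

```ruby
# Hypothetical helpers; not part of sitemap_generator or Rails.
MAX_URLS_PER_SITEMAP = 50_000 # per-file limit cited in the comment above

# Number of numbered sub-sitemap files needed for url_count URLs.
def sitemap_files_needed(url_count, per_file = MAX_URLS_PER_SITEMAP)
  (url_count + per_file - 1) / per_file # integer ceiling division
end

# Map each site-relative sitemap path to its S3 redirect target.
def redirect_map(url_count, bucket_host)
  names = ["sitemap.xml.gz"] +
          (1..sitemap_files_needed(url_count)).map { |i| "sitemap#{i}.xml.gz" }
  names.to_h do |name|
    ["/sitemaps/#{name}", "https://#{bucket_host}/sitemaps/#{name}"]
  end
end
```

With 500,000 URLs this yields ten numbered files plus the index, which matches the `(1..10)` range in the routes; one more URL would need an eleventh redirect.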

We also changed our robots.txt to use our website's redirect instead of an S3 URL:

Sitemap: https://www.shout.app/sitemaps/sitemap.xml.gz
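
Since a leftover S3 URL in robots.txt is an easy step to miss, a small check along these lines might help. It is only a sketch using Ruby's standard `uri` library; the helper name `off_host_sitemaps` is hypothetical, and it assumes you pass in the robots.txt body as a string:

```ruby
require "uri"

# Hypothetical helper: return the Sitemap URLs in a robots.txt body
# whose host differs from the site's host (case-insensitive).
def off_host_sitemaps(robots_txt, site_host)
  robots_txt.each_line
            .map(&:strip)
            .grep(/\Asitemap:/i)
            .map { |line| line.sub(/\Asitemap:\s*/i, "") }
            .reject { |url| URI.parse(url).host.to_s.casecmp?(site_host) }
end
```

An empty result means every Sitemap directive points at the site itself; anything returned is a URL that still points somewhere else, such as the bucket.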

The readme section https://github.com/kjvarga/sitemap_generator#an-example-of-using-an-adapter didn't work for us because it had us using the S3 URLs rather than our site's URLs. If you'd like, I can submit a PR that updates that section of the readme or adds a new section below it detailing this setup.

kjvarga commented 2 years ago

Thanks for the detailed report! This issue has come up a number of times, but usually users are able to resolve it after realizing they didn't follow all steps, like updating the robots.txt file. I'd like to have another user with the same issue confirm your fix before I update the documentation. I'll leave this issue open for a while to see if we get any +1s.

alessandrostein commented 1 year ago

We found a workaround that works perfectly (using AWS S3 to store our sitemap files):

# config/sitemap.rb

SitemapGenerator::Sitemap.create_index = true
SitemapGenerator::Sitemap.default_host = 'https://yourdomain.com'
SitemapGenerator::Sitemap.sitemaps_host = 'https://yourdomain.com'

# config/routes.rb

base_sitemap_url = "https://#{ENV['AWS_ASSETS_BUCKET']}.s3.amazonaws.com"
get 'sitemap.xml.gz', to: redirect("#{base_sitemap_url}/sitemap.xml.gz")
get 'sitemap:number.xml.gz',
    to: redirect("#{base_sitemap_url}/sitemap%{number}.xml.gz")

Just make sure your sitemap index is 100% correct!
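
One possible reading of "correct" here, given the original report, is that every `<loc>` in the index should sit on the same host as the site. A minimal sketch of such a check follows; the helper name `mismatched_locs` is hypothetical, it uses a simple regex rather than a full XML parser, and it assumes the index has already been decompressed into a string:

```ruby
require "uri"

# Hypothetical helper: list every <loc> URL in a sitemap index
# (already-decompressed XML) that is not on the expected host.
def mismatched_locs(index_xml, expected_host)
  index_xml.scan(%r{<loc>\s*(.*?)\s*</loc>}m).flatten.reject do |loc|
    URI.parse(loc).host.to_s.casecmp?(expected_host)
  end
end
```

An empty array suggests the index only references the site's own host; any URL returned would be one that Google sees on a different host, such as the bucket.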