kjvarga / sitemap_generator

SitemapGenerator is a framework-agnostic XML Sitemap generator written in Ruby with automatic Rails integration. It supports Video, News, Image, Mobile, PageMap and Alternate Links sitemap extensions and includes Rake tasks for managing your sitemaps, as well as many other great features.
MIT License
2.44k stars 275 forks

Google Search Console "Couldn't fetch" for sitemaps, but index is OK #417

Open MiroBabic opened 2 years ago

MiroBabic commented 2 years ago

I get "Couldn't fetch" from Google Search Console (GSC) for the sitemap files, but the index is read fine. When I put my sitemap into any validator, it shows it is OK.

This is my sitemap.rb:

SitemapGenerator::Sitemap.default_host = "https://www.xxxxx.xxxxx"
SitemapGenerator::Sitemap.compress = false
SitemapGenerator::Sitemap.create_index = true

SitemapGenerator::Sitemap.create(:max_sitemap_links=>45000) do

  [:sk, :en].each do |locale|
    add root_path.to_s + locale.to_s
    add "/#{locale}" + list_cities_path
    add "/#{locale}" + invoices_path
    add "/#{locale}" + orders_path
    add "/#{locale}" + contracts_path
    add "/#{locale}" + contractors_path

    City.find_each do |city|
      add "/#{locale}/city/#{city.slug_url}", :lastmod => city.updated_at
    end

    Contractor.find_each do |contractor|
      add "/#{locale}/contractors/#{contractor.id}", :lastmod => contractor.updated_at
    end

    Invoice.find_each do |invoice|
      add "/#{locale}/invoices/#{invoice.id}", :lastmod => invoice.updated_at
    end

    Order.find_each do |order|
      add "/#{locale}/orders/#{order.id}", :lastmod => order.updated_at
    end

    Contract.find_each do |contract|
      add "/#{locale}/contracts/#{contract.id}", :lastmod => contract.updated_at
    end

  end

  # Put links creation logic here.
  #
  # The root path '/' and sitemap index file are added automatically for you.
  # Links are added to the Sitemap in the order they are specified.
  #
  # Usage: add(path, options={})
  #        (default options are used if you don't specify)
  #
  # Defaults: :priority => 0.5, :changefreq => 'weekly',
  #           :lastmod => Time.now, :host => default_host
  #
  # Examples:
  #
  # Add '/articles'
  #
  #   add articles_path, :priority => 0.7, :changefreq => 'daily'
  #
  # Add all articles:
  #
  #   Article.find_each do |article|
  #     add article_path(article), :lastmod => article.updated_at
  #   end
end
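One failure mode worth ruling out before blaming GSC: string-concatenated paths like `"/#{locale}" + invoices_path` can silently produce malformed URLs (double or missing slashes) that lenient validators accept. A minimal sketch of a hypothetical `localized_path` helper (not part of SitemapGenerator) that normalizes the join:

```ruby
# Hypothetical helper: joins a locale prefix onto a path, collapsing
# duplicate slashes and stripping a trailing slash, so that
# "/sk" + "/invoices" becomes "/sk/invoices" rather than "/sk//invoices".
def localized_path(locale, path)
  "/#{locale}/#{path}".squeeze("/").chomp("/")
end

puts localized_path(:sk, "/invoices")   # "/sk/invoices"
puts localized_path(:en, "invoices/")   # "/en/invoices"
puts localized_path(:sk, "/")           # "/sk"
```

Inside the `create` block you could then write `add localized_path(locale, list_cities_path)` and so on, instead of raw concatenation.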

What can I do to make it readable for Google?

kjvarga commented 2 years ago

Have you tried accessing some urls that are in your index file from the public Internet?

What’s the url where your robots.txt is hosted or the url of the index?


MiroBabic commented 2 years ago

Yes, when I copy-paste a link to a specific sitemap from the sitemap index, it works fine. robots.txt is in the web root, and so is the sitemap.

You can check everything here; all the links work (robots.txt and the sitemaps): https://www.openstats.city/robots.txt
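To narrow down which file GSC is choking on, it can help to parse the sitemap index yourself and then fetch every child sitemap URL the way a crawler would (e.g. with curl or Net::HTTP). A sketch of the parsing half, using Ruby's stdlib REXML and run here against an inline example index rather than the live site:

```ruby
require "rexml/document"

# Extract the <loc> URLs listed in a sitemap index document.
def sitemap_locs(xml)
  doc = REXML::Document.new(xml)
  doc.get_elements("//loc").map(&:text)
end

# Example index in the shape SitemapGenerator produces (URLs are illustrative).
index = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap><loc>https://www.openstats.city/sitemap1.xml</loc></sitemap>
    <sitemap><loc>https://www.openstats.city/sitemap2.xml</loc></sitemap>
  </sitemapindex>
XML

puts sitemap_locs(index)
```

Each extracted URL can then be requested with a Googlebot User-Agent string to check whether a CDN or firewall treats bot traffic differently from your browser.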

kjvarga commented 2 years ago

Yes, I'm able to access all the links fine as well, so I'm not sure what the issue is. The sitemaps look fine.

Please be aware that all your Invoice, Contract, and Order data is publicly accessible!! e.g. https://www.openstats.city/en/invoices. You should secure that ASAP to prevent PII exposure, or worse. At the very least, someone could use that data to phish those users and get them to click malicious links, knowing the details of their interactions with your site.

MiroBabic commented 2 years ago

@kjvarga thanks, but that's OK; it is public data (open data) and should be accessible to anybody who wants to check where the money flows. It's all part of an open-data initiative to bring transparency to how public money is spent.

kjvarga commented 2 years ago

haha oops I was worried!

elcuervo commented 1 year ago

I think I saw this behavior as well. I re-ran ping_search_engines with the sitemap index, and eventually the sitemaps got marked correctly. In our scenario I think it was related to the fact that our generated sitemap was huge, and our CDN was probably throttling some of the bot IPs. Also, if you are redirecting to S3/Google Cloud via the app, there might have been app errors.
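If a CDN is throttling bot or ping traffic as described above, simply re-trying the ping after a delay can be enough. `SitemapGenerator::Sitemap.ping_search_engines` is the gem's real API (shown only in a comment here, since it makes a network call); the retry wrapper around it is a hypothetical sketch:

```ruby
# Retry a block up to `attempts` times with linear backoff,
# re-raising the error if every attempt fails.
# Usage idea (network call, not executed here):
#   with_retries { SitemapGenerator::Sitemap.ping_search_engines }
def with_retries(attempts: 3, delay: 0)
  tries = 0
  begin
    yield
  rescue StandardError
    tries += 1
    raise if tries >= attempts
    sleep(delay * tries)
    retry
  end
end

# Demonstrate with a fake "throttled" call that succeeds on the third try.
calls = 0
flaky = -> { calls += 1; raise "throttled" if calls < 3; "ok" }
puts with_retries(attempts: 3) { flaky.call }   # "ok"
```

With `delay: 60` or similar, this gives a throttling CDN time to stop rejecting the ping before the next attempt.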

MiroBabic commented 1 year ago

I see something strange in my Rails app log: it looks like Google is requesting the gzipped version of the sitemap, even though the index links the non-gzipped version.

I, [2022-11-15T19:48:10.902295 #1576179]  INFO -- : [1671072e-6015-44ad-aa6c-914bc73f0917] Started GET "/sitemap12.xml.gz" for 66.249.75.241 at 2022-11-15 19:48:10 +0000
[1671072e-6015-44ad-aa6c-914bc73f0917] ActionController::RoutingError (No route matches [GET] "/sitemap12.xml.gz"):
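Two hedged workarounds for the stray `.gz` requests in the log above: either regenerate with `SitemapGenerator::Sitemap.compress = true` so the index and the files it links are consistently gzipped, or redirect `.gz` requests to the uncompressed files (e.g. via a Rails redirect route or a web-server rewrite). A sketch of the path-rewriting half; the helper name is hypothetical:

```ruby
# Map a gzipped sitemap request path to its uncompressed equivalent,
# e.g. "/sitemap12.xml.gz" => "/sitemap12.xml". Paths that do not end
# in ".xml.gz" pass through unchanged.
def ungzipped_sitemap_path(path)
  path.sub(/\.xml\.gz\z/, ".xml")
end

puts ungzipped_sitemap_path("/sitemap12.xml.gz")   # "/sitemap12.xml"
puts ungzipped_sitemap_path("/sitemap.xml")        # "/sitemap.xml"
```

Redirecting is the gentler fix, since it keeps the URLs Google already has on file returning 301s instead of the 404s produced by the RoutingError above.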