sciencehistory / chf-sufia

sufia-based hydra app

google sitemap #778

Closed jrochkind closed 6 years ago

jrochkind commented 7 years ago

Generate a Google sitemap of all works, for better Google indexing.

Include the "Image" extension for Google Images.
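For illustration only, here is a rough sketch of what a sitemap with the image extension looks like, built with the Python standard library rather than whatever the app ends up using. The `build_sitemap` helper and the example.org URLs are hypothetical; the two namespace URIs are the standard sitemap and Google image-sitemap ones.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"


def build_sitemap(works):
    """Return sitemap XML for an iterable of (page_url, [image_url, ...]).

    Hypothetical helper: each work page becomes a <url> entry, and each of
    its images becomes an <image:image> child with an <image:loc>.
    """
    # Serialize the sitemap namespace as the default and the image one as "image:".
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("image", IMAGE_NS)

    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page_url, image_urls in works:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
        for image_url in image_urls:
            image = ET.SubElement(url, f"{{{IMAGE_NS}}}image")
            ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = image_url
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Made-up URLs, just to show the shape of the output.
    print(build_sitemap([
        ("https://example.org/works/1", ["https://example.org/images/1a.jpg"]),
    ]))
```

A real sitemap file would also need the XML declaration and has to stay under the protocol's per-file limits, which this sketch ignores.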

jrochkind commented 6 years ago

How ScholarSphere does its sitemap:

https://github.com/psu-stewardship/scholarsphere/search?utf8=%E2%9C%93&q=sitemap&type=

https://github.com/psu-stewardship/scholarsphere/blob/develop/config/sitemap.rb

https://github.com/psu-stewardship/scholarsphere/blob/630be14f4b0675a71a8b4b9181c5e263b7d14277/config/schedule.rb#L17

jrochkind commented 6 years ago

@sanfordd and I decided it probably made sense to store in S3.

The sitemap_generator gem has some built-in support for this, via a couple of different methods:

https://github.com/kjvarga/sitemap_generator/wiki/Uploading-the-sitemap-to-S3-with-paperclip,-aws-s3-and-aws-sdk

https://github.com/kjvarga/sitemap_generator/wiki/Generate-Sitemaps-on-read-only-filesystems-like-Heroku

Note that the sitemap URL is given in robots.txt.
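As a quick way to check what a robots.txt advertises, something like the following should work with the Python standard library (`site_maps()` needs Python 3.8+). The robots.txt content and the example.org URL are made up for illustration, and `sitemap_urls` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt; the Sitemap line may point at a different host than the site.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Allow: /
Sitemap: https://example.org/sitemaps/sitemap.xml.gz
"""


def sitemap_urls(robots_txt):
    """Return the sitemap URLs listed in the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap lines are present.
    return parser.site_maps() or []


if __name__ == "__main__":
    print(sitemap_urls(EXAMPLE_ROBOTS_TXT))
```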

jrochkind commented 6 years ago

@sanfordd when you're back, let's talk again about where to store the sitemap and where/how to run the code that generates it.

jrochkind commented 6 years ago

Work is substantially done in #962, but since we don't want to deploy until after the rebrand, we'll wait until then to merge and fully test.

The sitemap includes images, to hopefully get our images into Google Images.

After sitemap exists, we may want to:

  1. Ping search engines
  2. Use Google site admin to 'claim' the S3 hosts that hold our images and sitemap files. This may not be needed: it's not entirely clear, and it's hard to test whether things work without it, so it's safest just to do it. @MDiMeo will need to create a Google account to claim the websites for site admin, maybe one with a digital-tech@ email address? https://support.google.com/webmasters/answer/34592?visit_id=1-636531068483226339-3826640717&rd=1

jrochkind commented 6 years ago

The sitemap exists and is set up in Google Search Console. I'll wait to close until we verify in Search Console that Google has actually crawled some of it, which I think we can do.

https://www.google.com/webmasters/tools/sitemap-list?hl=en&authuser=0&siteUrl=https%3A%2F%2Fs3.amazonaws.com%2Fscih-data-production%2F#MAIN_TAB=0&CARD_TAB=-1

jrochkind commented 6 years ago

Google is being slow at indexing, but everything seems to be set up right, and there's nothing else we can do. Closing this ticket.

At present, Google has indexed 339 of 5,186 works in the sitemap, and zero of 14,443 images. Progress can be checked at the link above.

jrochkind commented 6 years ago

Weekly sitemap generation seems to be working: looking at the S3 bucket, the sitemap file was last updated Feb 13 at 4:08am GMT, so great.

Google currently has indexed 5,114 of the 5,190 pages listed in the sitemap, and zero of 14,630 images.