sciencehistory / scihist_digicoll

Science History Institute Digital Collections

Fix SiteMap storage for new CloudFront situation #2689

Closed jrochkind closed 1 month ago

jrochkind commented 1 month ago

Getting an S3 error when our nightly sitemap generation routine runs. https://app.honeybadger.io/projects/58989/faults/109739811

Aws::S3::Errors::AccessDenied: Access Denied from ./configs/sitemap.rb:31

Diagnosis

The sitemap was being stored on the derivatives bucket with a public ACL. As a result of #2667 and https://github.com/sciencehistory/terraform_scihist_digicoll/pull/84, the bucket is now set to reject public ACLs, so the upload raises the Access Denied error above.

We can store the sitemap without a public ACL, but we need to make sure the sitemap URL we actually deliver to Google and other crawlers is the public CloudFront one that will work!

jrochkind commented 1 month ago
Sitemap: https://s3.amazonaws.com/<%= ScihistDigicoll::Env.lookup("s3_sitemap_bucket") %>/<%= ScihistDigicoll::Env.lookup("sitemap_path") %>sitemap.xml.gz

Yep, we're currently delivering a URL that the public, including search engines like Google, can't actually get to, since it's a direct bucket URL!

Have to change this to something that generates the URL properly, ideally using Shrine storages.
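
For illustration only, here is a minimal plain-Ruby sketch of the URL shape we want to deliver: a public host plus the sitemap path, instead of the direct s3.amazonaws.com bucket URL. The helper name public_sitemap_url, its parameters, and the host cdn.example.org are hypothetical and not from the app. In the app itself the URL would presumably come from the Shrine storage rather than from string-joining like this.

```ruby
require "uri"

# Hypothetical helper, not the app's actual code: builds the public sitemap
# URL from a CDN host and a path prefix.
def public_sitemap_url(cdn_host:, sitemap_path:, filename: "sitemap.xml.gz")
  # Strip leading/trailing slashes from each segment so that "sitemaps/"
  # and "sitemaps" produce the same URL, then drop empty segments.
  segments = [sitemap_path, filename].map { |s| s.to_s.gsub(%r{\A/+|/+\z}, "") }
  path = segments.reject(&:empty?).join("/")

  URI::HTTPS.build(host: cdn_host, path: "/#{path}").to_s
end

# Example with placeholder values:
puts public_sitemap_url(cdn_host: "cdn.example.org", sitemap_path: "sitemaps/")
```

Whatever generates the URL, the point is that the Sitemap: line and the place the file is actually uploaded both derive from the same configuration, so they can't drift apart again.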

Then have to let the sitemap generation succeed by not trying to set a public ACL.
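
A rough sketch of the "don't set a public ACL" part, assuming the upload options are assembled as a plain hash before being handed to whatever does the S3 put. The helper name sitemap_upload_options and the content type are hypothetical. The only real point is that the :acl key is left out entirely unless explicitly requested, since the bucket now rejects public ACLs.

```ruby
# Hypothetical helper, not the app's actual code: assembles options for
# uploading the gzipped sitemap to S3.
def sitemap_upload_options(public_acl: false)
  # Content type is an illustrative assumption for a .xml.gz file.
  options = { content_type: "application/gzip" }

  # Only include an ACL when explicitly asked for. On a bucket that rejects
  # public ACLs, sending acl: "public-read" is what triggers Access Denied.
  options[:acl] = "public-read" if public_acl

  options
end

p sitemap_upload_options                    # no :acl key at all
p sitemap_upload_options(public_acl: true)  # includes acl: "public-read"
```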