jncc / datahub

The JNCC datahub - our online web repository of open data and publications.

Sitemap #56

Closed completer closed 5 years ago

completer commented 5 years ago

To enable Google etc. to search the site, we need to generate a sitemap, or at least a browse page, so that every asset is reachable from the homepage.

mattdebont commented 5 years ago
  1. Calculate sitemap.xml on request
    • Always up to date with the latest in the database
    • Table-scan DynamoDB and create the XML on request
      • Trivial attack vector
        • Anyone can force us to regenerate the XML over and over again, so we would have to scale up to handle robot scans of the system (there are many)
      • The sitemap only changes when assets are uploaded, so regenerating it per request is wasted work; it would need caching
  2. Update sitemap.xml as part of upload
    • 1 to 1 operation, simple to understand in principle
    • What happens with manual deletes? We would need another chain to handle entries that are removed from the hub manually
    • Updates many times per ingested asset
  3. Calculate and cache sitemap statically
    • Daily sitemap update at x time
    • Cache result to S3
    • Run from DynamoDB or Elasticsearch
      • Arguably DynamoDB is the better place to run from: it is the single source of truth, so if something is in DynamoDB it is 'published' whether or not it is in the search results
    • Use 'published since last run' to reduce complexity?
      • What about manual deletes again?
    • https://medium.com/vevo-engineering/scalable-sitemap-solutions-with-aws-lambda-js-fd926c861c0d
      • Lambda job to update the sitemap, run daily at x in the morning?
        • Retrieve all datahub assets
          • For each asset, add an entry to sitemap-assets.xml
          • For each resource in an asset, add an entry to sitemap-resources.xml
          • Store in S3? Have the app just front this delivery?
          • Update sitemap.xml to point to these files (up to 50,000 records / 50MB per file, as per https://www.sitemaps.org/protocol.html#index). Compressed files should be fine here, so the only real limit is likely to be the 50,000 records
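
As a rough illustration of the daily job outlined above, here is a minimal Python sketch that turns a list of asset IDs into one or more urlset documents, splitting at the 50,000-record limit. It is not the datahub's actual implementation: the function names, the `BASE_URL` constant and the flat list of asset IDs are assumptions made for the example, and the DynamoDB read and S3 upload are deliberately left out.

```python
from datetime import datetime, timezone
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file record limit quoted in the thread
# Assumption: asset pages live under this prefix, as in the thread's sample <loc>.
BASE_URL = "https://hub.jncc.gov.uk/asset/"


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_urlset(asset_ids, lastmod):
    """Return one <urlset> document (bytes) listing the given asset IDs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for asset_id in asset_ids:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = BASE_URL + asset_id
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


def build_asset_sitemaps(asset_ids, now=None):
    """Return a list of urlset documents, each within the record limit."""
    now = now or datetime.now(timezone.utc)
    lastmod = now.strftime("%Y-%m-%dT%H:%M:%S+00:00")
    return [
        build_urlset(chunk, lastmod)
        for chunk in chunked(asset_ids, MAX_URLS_PER_FILE)
    ]
```

Because every run rebuilds the files from the full list of assets, a manually deleted asset simply drops out of the next day's sitemap, which sidesteps the manual-delete question raised for the other options. The 50MB size check is not handled in this sketch.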
mattdebont commented 5 years ago

robots.txt =>

user-agent: *
disallow: /
sitemap: https://hub.jncc.gov.uk/sitemap.xml

sitemap.xml =>

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <sitemap>
      <loc>http://www.example.com/sitemap-assets.xml.gz</loc>
      <lastmod>2019-04-01T03:00:00+00:00</lastmod>
   </sitemap>
   <sitemap>
      <loc>http://www.example.com/sitemap-resources.xml.gz</loc>
      <lastmod>2019-04-01T03:00:00+00:00</lastmod>
   </sitemap>
</sitemapindex>

sitemap-assets.xml.gz / sitemap-resources.xml.gz =>

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>https://hub.jncc.gov.uk/asset/xxxx-xxxxxx-xxxxxxx-xxxxx</loc>
      <lastmod>2019-04-01T03:00:00+00:00</lastmod>
      <changefreq>monthly</changefreq>
   </url>
   ....
</urlset> 
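
The index file and the gzipped child files shown above could be produced along these lines. This is a sketch only: `build_sitemap_index` and `gzip_bytes` are hypothetical helper names, and the child URLs are whatever the caller passes in. Each index entry gets only `loc` and `lastmod`, the two child elements the sitemaps.org protocol defines for a `sitemap` entry.

```python
import gzip
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(child_urls, lastmod):
    """Return a <sitemapindex> document (bytes) pointing at the child sitemaps."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for child_url in child_urls:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = child_url
        ET.SubElement(sitemap, "lastmod").text = lastmod
    return ET.tostring(index, encoding="UTF-8", xml_declaration=True)


def gzip_bytes(payload):
    """Compress a sitemap document so it can be stored as a .xml.gz file."""
    return gzip.compress(payload)
```

A caller would gzip each urlset document, store it under a name such as sitemap-assets.xml.gz, and then pass the resulting public URLs to `build_sitemap_index` to produce sitemap.xml.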
jonparsonsjncc commented 5 years ago

Pete to merge

completer commented 5 years ago

@jonparsonsjncc, @mattdebont has pushed this live - please check https://hub.jncc.gov.uk/sitemap.xml

jonparsonsjncc commented 5 years ago

Looks like the correct format.

But only 6 pages are shown, and the site has a lot more than this.

Also, the individual PDFs are not listed in it, nor are the About Us page, the search page, etc.

Please advise

From: Pete Montgomery. Sent: 19 June 2019 15:55. Subject: Re: [jncc/datahub] Sitemap (#56)

@jonparsonsjncc, @mattdebont has pushed this live - please check https://hub.jncc.gov.uk/sitemap.xml
