internetofwater / geoconnex.us

URI registry for https://geoconnex.us based URIs

Create sitemap publishing and versioning control #193

Open ksonda opened 11 months ago

ksonda commented 11 months ago

harvest.geoconnex.us ideally will automatically recrawl all new or modified resources added to the PID registry.

harvest.geoconnex.us uses sitemap.xml to crawl resources

Therefore, we need a way to diff successive versions of sitemap.xml as PIDs are added or as data contributors trigger recrawls.

Suggestion: publish releases of zipped sitemap_XXX.xml files, so that harvest.geoconnex.us can download the last release and compare its contents against the sitemap_XXX.xml files referenced by the live sitemap index at https://geoconnex.us/sitemap.xml
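A minimal sketch of the comparison step, assuming standard sitemap.org urlset XML; the function names and example URLs here are illustrative, not part of the geoconnex.us codebase:

```python
import xml.etree.ElementTree as ET

SM_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def urlset_lastmods(xml_text: str) -> dict:
    """Map each <loc> in a sitemap urlset to its <lastmod> value."""
    root = ET.fromstring(xml_text)
    return {
        url.findtext(f"{SM_NS}loc"): url.findtext(f"{SM_NS}lastmod")
        for url in root.iter(f"{SM_NS}url")
    }

def diff_urlsets(previous: str, current: str) -> list:
    """URLs that are new, or whose lastmod changed, since the last release."""
    prev = urlset_lastmods(previous)
    curr = urlset_lastmods(current)
    return sorted(loc for loc, mod in curr.items() if prev.get(loc) != mod)

# Last-release snapshot vs. the live urlset (toy data):
old = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://geoconnex.us/example/a</loc><lastmod>2023-01-01</lastmod></url>
</urlset>"""
new = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://geoconnex.us/example/a</loc><lastmod>2023-06-01</lastmod></url>
  <url><loc>https://geoconnex.us/example/b</loc><lastmod>2023-06-01</lastmod></url>
</urlset>"""

print(diff_urlsets(old, new))  # both URLs need a recrawl
```

harvest.geoconnex.us would then recrawl only the returned URLs instead of the whole registry.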

Suggestion: change how sitemap.xml is generated so that lastmod reflects the true datetime of the last change to the corresponding csv file in /namespaces
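One way to derive that lastmod value, sketched under the assumption that the filesystem mtime is meaningful; in CI the real implementation would more likely query git history (e.g. `git log -1 --format=%cI -- <file>`), since a fresh checkout does not preserve mtimes:

```python
from datetime import datetime, timezone
from pathlib import Path

def lastmod_for(csv_path: Path) -> str:
    """W3C datetime string for <lastmod>, from the file's modification time.

    Illustrative helper, not part of the existing sitemap generator.
    """
    mtime = csv_path.stat().st_mtime
    return datetime.fromtimestamp(mtime, tz=timezone.utc).strftime(
        "%Y-%m-%dT%H:%M:%SZ"
    )
```

Each `<url>` entry in the generated urlset would carry the lastmod of the csv file in /namespaces that produced it.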

webb-ben commented 11 months ago

I have a working version of sitemap index generation that sets lastmod to the last time the source of the urlset (csv file or xml files) was updated. I am planning to turn this into a standalone GitHub Action that can be run in pids.geoconnex.us.

On generating sitemaps for regex namespaces: the reality of a regex namespace is that it covers a very large number of features. Relying on a download URL from which the corresponding sitemap can be generated becomes a problem as the PID list grows. I suggest we invest more effort in tooling that generates sitemap indexes and urlsets from arbitrary sources (as I am planning to implement in the GitHub Action), and encourage contributors to generate and include their own urlset files to curb this growth.

We should also have a mechanism to regenerate only the sitemaps that have changed, instead of regenerating all sitemaps any time ANY namespace changes.
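A simple way to detect which namespaces changed is to record a content hash per csv file and compare against the last run. This is a sketch under assumed paths and file names (the state-file layout is hypothetical, not anything pids.geoconnex.us currently uses):

```python
import hashlib
import json
from pathlib import Path

def changed_namespaces(namespaces_dir: Path, state_file: Path) -> list:
    """Return csv files whose content hash differs from the recorded state.

    Only the sitemaps derived from these files need regeneration.
    """
    previous = json.loads(state_file.read_text()) if state_file.exists() else {}
    current = {
        str(p.relative_to(namespaces_dir)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(namespaces_dir.rglob("*.csv"))
    }
    state_file.write_text(json.dumps(current, indent=2))
    return [name for name, digest in current.items() if previous.get(name) != digest]
```

In a GitHub Action, the state file could be cached between runs (or committed alongside the generated sitemaps), so each run rebuilds only the urlsets whose source csv actually changed.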

ksonda commented 11 months ago

As per meeting, proposed strategy:

Use a GitHub Action / pygeoapi container to generate sitemaps from:

a) a zipped csv template PR'd directly to GitHub, or
b) an ESRI/CKAN/Socrata or other remote geospatial endpoint with the URIs in the data, plus the attribute name holding the URIs

The User's decision tree is:

1) (<300,000 sites) Submit a regular csv
2) (between 300,000 and 2,000,000 sites) Submit multiple regular csv files with different filenames
3) (>2,000,000 sites, OR can maintain a remote endpoint and don't want to interact with GitHub to update) Submit a regex csv with endpoint + pygeoapi provider name + attribute id
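The decision tree above can be expressed as a small routing function. The thresholds come from the proposal; the return labels are purely illustrative, not actual geoconnex.us API or workflow names:

```python
def submission_route(site_count: int, has_remote_endpoint: bool = False) -> str:
    """Pick a contribution path from the proposed decision tree.

    has_remote_endpoint covers the case where a contributor maintains a
    remote endpoint and prefers not to update via GitHub.
    """
    if has_remote_endpoint or site_count > 2_000_000:
        return "regex csv (endpoint + pygeoapi provider name + attribute id)"
    if site_count > 300_000:
        return "multiple regular csv files"
    return "single regular csv"
```

For example, a contributor with 500,000 sites and no remote endpoint would submit multiple regular csv files.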