Open ksonda opened 11 months ago
Have a working version of the sitemap index generation that reflects the last time an update was made to the source of the urlset (csv file or xml files). Am planning on modifying this to run as a standalone GitHub Action that can be implemented in pids.geoconnex.us.
Thinking about generating sitemaps for regex namespaces: an explicit truth of a regex namespace is that there are a lot of features. Having a URL to download a file from which the corresponding sitemap can be generated becomes a problem as the PID list grows. I suggest we invest more effort in the tooling to generate sitemap indexes and urlsets from arbitrary sources (as I am planning to implement in the GitHub Action), to encourage contributors to generate and include their own urlset files and reduce the exponential growth this entails.
We also need a mechanism to regenerate only the sitemaps whose source has changed, instead of regenerating all sitemaps any time ANY namespace changes.
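One way the "only regenerate what changed" mechanism could work is to keep a manifest of content hashes for the source files and compare against it on each run. This is a minimal sketch, not the actual implementation: the function names, the `*.csv` glob, and the JSON manifest file are all assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path):
    """SHA-256 of a file's bytes, used as a cheap change detector."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def changed_sources(namespace_dir, manifest_path):
    """Return the csv files (relative paths) that are new or modified since
    the last run, then update the manifest. The manifest is a hypothetical
    JSON file mapping relative path -> digest."""
    root = Path(namespace_dir)
    manifest_path = Path(manifest_path)
    previous = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    current = {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in root.rglob("*.csv")
    }
    changed = sorted(name for name, digest in current.items()
                     if previous.get(name) != digest)
    # Deleted sources are not reported here; their sitemaps would need
    # separate cleanup.
    manifest_path.write_text(json.dumps(current, indent=2, sort_keys=True))
    return changed
```

Only the files returned by `changed_sources` would then need their sitemaps rebuilt. Inside a GitHub Action, comparing against the previous commit with `git diff --name-only` would be an alternative to storing a manifest.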
As per meeting, proposed strategy:
Use a GitHub Action / pygeoapi container to generate sitemaps from either
a) a zipped csv template PR'd directly to github, or b) an ESRI/CKAN/Socrata/any remote geospatial file with the URIs in the data, plus the attribute name for the URIs
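For option a), the csv-to-urlset step might look roughly like the sketch below. It assumes the column holding the URIs is passed in by name (the actual template's column names are not stated here), and it splits output at 50,000 URLs per file, which is the limit set by the sitemaps.org protocol. The function name is hypothetical.

```python
import csv
import io
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000  # per-file limit in the sitemaps.org protocol


def urlsets_from_csv(csv_text, uri_column, chunk_size=MAX_URLS_PER_SITEMAP):
    """Turn csv text into a list of urlset XML documents, one per chunk.
    `uri_column` is the name of the column that holds the URIs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    uris = [(row.get(uri_column) or "").strip() for row in reader]
    uris = [u for u in uris if u]  # skip rows with no URI
    documents = []
    for start in range(0, len(uris), chunk_size):
        body = "".join(
            f"  <url><loc>{escape(u)}</loc></url>\n"
            for u in uris[start:start + chunk_size]
        )
        documents.append(
            '<?xml version="1.0" encoding="UTF-8"?>\n'
            f'<urlset xmlns="{SITEMAP_NS}">\n{body}</urlset>\n'
        )
    return documents
```

The same function could serve option b) once the remote file has been fetched and flattened to rows, with `uri_column` set to the contributor-supplied attribute name.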
The user's decision tree is:
1) (<300,000 sites) Submit a regular csv
2) (between 300,000 and 2,000,000 sites) Submit multiple regular csv files with different filenames
3) (>2,000,000 sites, OR can maintain a remote endpoint and don't want to interact with github to update) Submit a regex csv with endpoint + pygeoapi provider name + attribute id
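The decision tree above can be written down as a small function, for example for use in contributor docs or a PR check. This is only a sketch: the function and parameter names are made up, and treating exactly 300,000 and exactly 2,000,000 sites as part of the middle tier is an assumption, since the thresholds as stated leave the boundaries open.

```python
REGULAR_CSV_LIMIT = 300_000    # below this: one regular csv
MULTI_CSV_LIMIT = 2_000_000    # above this: regex csv


def submission_route(n_sites, prefers_remote_endpoint=False):
    """Pick a submission route from the site count.
    `prefers_remote_endpoint` stands for "can maintain a remote endpoint
    and doesn't want to interact with github to update"."""
    if prefers_remote_endpoint or n_sites > MULTI_CSV_LIMIT:
        return "regex csv"
    if n_sites >= REGULAR_CSV_LIMIT:  # boundary handling is an assumption
        return "multiple regular csv"
    return "regular csv"
```

For instance, `submission_route(150_000)` gives `"regular csv"`, while any contributor who sets `prefers_remote_endpoint=True` is routed to the regex csv regardless of size.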
harvest.geoconnex.us ideally will automatically recrawl all new or modified resources added to the PID registry.
harvest.geoconnex.us uses sitemap.xml to crawl resources
Therefore, we need a way to process diffs between versions of sitemap.xml, according to PID additions or other recrawls triggered by data contributors.
Suggestion: add releases of zipped sitemap_XXX.xml files, so that harvest.geoconnex.us can download the last release and compare it with the contents of the sitemap_XXX.xml files that the live sitemap index https://geoconnex.us/sitemap.xml points to
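The comparison between a released sitemap file and its live counterpart could be sketched as below. It assumes both are standard sitemaps.org urlset documents and that a changed `lastmod` is what signals a recrawl; the function names are hypothetical, and downloading/unzipping the release is left out.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def read_urlset(xml_text):
    """Map each <loc> in a urlset to its <lastmod> ('' if absent)."""
    root = ET.fromstring(xml_text)
    entries = {}
    for url in root.findall("sm:url", NS):
        loc = (url.findtext("sm:loc", default="", namespaces=NS) or "").strip()
        lastmod = (url.findtext("sm:lastmod", default="", namespaces=NS) or "").strip()
        if loc:
            entries[loc] = lastmod
    return entries


def diff_urlsets(old_xml, new_xml):
    """Compare a released urlset (old) with the live one (new)."""
    old, new = read_urlset(old_xml), read_urlset(new_xml)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "modified": sorted(loc for loc in new.keys() & old.keys()
                           if new[loc] != old[loc]),
    }
```

The harvester would then only need to crawl the URLs in `added` and `modified`, and could drop those in `removed`.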
Suggestion: change how sitemap.xml is generated so that lastmod reflects the true last file-change datetime of each csv file in /namespaces
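A rough sketch of building the index with a per-file lastmod follows. It makes several assumptions: each csv under the namespaces directory maps to one sitemap at the same relative path with an `.xml` suffix, `base_url` is a placeholder, and the filesystem modification time stands in for "last change". In a fresh git checkout the mtime is the checkout time, so in a GitHub Action the commit date from `git log -1 --format=%cI -- <file>` would likely be the better source.

```python
from datetime import datetime, timezone
from pathlib import Path
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def file_lastmod(path):
    """File modification time as a W3C datetime string in UTC."""
    ts = Path(path).stat().st_mtime
    return (datetime.fromtimestamp(ts, tz=timezone.utc)
            .replace(microsecond=0).isoformat())


def build_sitemap_index(namespace_dir, base_url):
    """Build a sitemapindex with one <sitemap> entry per csv file, where
    <lastmod> is that csv file's own modification time."""
    root_dir = Path(namespace_dir)
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for csv_path in sorted(root_dir.rglob("*.csv")):
        # Hypothetical layout: namespaces/x/y.csv -> <base_url>/x/y.xml
        rel = csv_path.relative_to(root_dir).with_suffix(".xml").as_posix()
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = f"{base_url.rstrip('/')}/{rel}"
        ET.SubElement(entry, "lastmod").text = file_lastmod(csv_path)
    return ET.tostring(index, encoding="unicode")
```

With per-file lastmod values in the index, a crawler can skip any sitemap whose lastmod has not moved since its previous visit.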