ioos / ckanext-ioos-theme

IOOS Catalog as a CKAN extension
GNU Affero General Public License v3.0

Enable CKAN sitemap extension #174

Closed mwengren closed 5 years ago

mwengren commented 6 years ago

Hopefully this will help our Google crawlability and searchability, so users will be able to find our catalog entries more readily.

http://extensions.ckan.org/extension/sitemap/

Also investigate what connections there are with robots.txt, if any.

benjwadams commented 5 years ago

How much of this will be redundant with the features afforded by #160 via ckanext-dcat?

benjwadams commented 5 years ago

This also came up for discussion amongst some of the data.gov folks: https://github.com/GSA/data.gov/issues/769

benjwadams commented 5 years ago

I tried the sitemap extension on my own dev setup today and it is horribly slow and bogs down the system, even with about 10k datasets. Creating some indexes may help somewhat, but as it tries to dynamically map the entire site from the database contents, it takes a long time.

mwengren commented 5 years ago

@benjwadams If it's not practical or performant, then we can skip enabling this extension. This request was from a while ago anyway. The goal is to have the IOOS Catalog searchable (both by regular search engines and by Google's beta Dataset Search), so I'm not sure if the dcat extension will fit the bill entirely.

Please let me know if it does.

mwengren commented 5 years ago

I just read the data.gov discussion you linked to. Yes, if we do end up generating a sitemap.xml, we'd want to do it offline and post the results somewhere for regular HTTP download (similar to the approach they used). We definitely should not generate it dynamically.

This is the same approach that I used when making a data.json file for the NOAA Catalog once upon a time (also would have been impossibly slow to dynamically generate).
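
For illustration, here is a minimal sketch of what offline generation could look like, assuming the dataset URLs have already been collected by some other step (a DB query, an API call, etc.). The function name `build_sitemap` and the example URLs are hypothetical, and this is not the code that the sitemap extension or data.gov uses.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50000  # per-file limit from the sitemaps.org protocol


def build_sitemap(dataset_urls, lastmod=None):
    """Return sitemap XML as bytes for an iterable of absolute URLs.

    `lastmod` is an optional date string (e.g. "2019-01-01") applied to
    every entry; a real job would likely use per-dataset timestamps.
    """
    urls = list(dataset_urls)
    if len(urls) > MAX_URLS_PER_FILE:
        raise ValueError("too many URLs for one file; split into several "
                         "files and reference them from a sitemap index")
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as "&" when serializing.
        ET.SubElement(entry, "loc").text = url
        if lastmod:
            ET.SubElement(entry, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
```

The resulting bytes can be written to disk and served as a plain static file, so no request ever triggers a walk over the whole catalog.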

benjwadams commented 5 years ago

While I haven't tried the other alternative extensions to generate sitemaps, I suspect I'd run into similar speed issues. In light of that, I'm probably just going to generate a static file with cURL against the container and generate the sitemap once a day, which Nginx will serve up. If we need more than that, we can tweak things. One of the potential issues with using Solr as the "source of truth" is that sometimes the DB and Solr can get out of sync and Solr can be pointing at URLs that don't exist in the DB and will 404.
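
A rough sketch of that daily job, written in Python rather than cURL so the sanity check and the atomic swap are explicit. The function names, the timeout, and any paths passed in are assumptions for illustration, not the actual deployment.

```python
import gzip
import os
import tempfile
import urllib.request


def fetch_sitemap(url, timeout=3600):
    """Download the dynamically generated sitemap.

    The timeout is deliberately generous because generation is slow.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def cache_sitemap(xml_bytes, dest_path):
    """Gzip xml_bytes and atomically replace dest_path with the result."""
    # Refuse to overwrite a good cached file with an error page.
    if b"<urlset" not in xml_bytes and b"<sitemapindex" not in xml_bytes:
        raise ValueError("payload does not look like a sitemap")
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Write to a temp file in the same directory so os.replace is a rename.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as raw, \
                gzip.GzipFile(fileobj=raw, mode="wb") as gz:
            gz.write(xml_bytes)
        # mkstemp creates the file as 0600; make it readable by the web server.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, dest_path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```

A scheduled script would call `cache_sitemap(fetch_sitemap(url), path)` with the container's sitemap URL and a file location that Nginx serves. Because the swap is a single rename, crawlers see either the previous file or the new one, never a partial write, and a failed generation leaves the previous file in place.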

benjwadams commented 5 years ago

@mwengren, I managed to get a cached, gzipped version of the sitemap up and running today at https://data.ioos.us/sitemap.xml. There were some bugs in the sitemap generation code that caused the sitemap to not be properly generated, so I addressed those. I also had to optimize some DB queries so that the DB wasn't brought to its knees because of deficiencies in CKAN's default indexing scheme, and had to modify a parameter for streaming requests within CKAN which was causing requests to die off prematurely due to some sort of application threading error.

I'm going to be generating the sitemaps through cron at about 3 AM EST due to the extremely slow speed of the sitemap generation for the number of datasets. The request legitimately took approximately half an hour just to generate the sitemap, so you can hopefully see why caching the sitemap is desirable here! Deleting old harvest objects may speed up the generation some, but I'm not sure by how much. Hopefully, some better code for generating the sitemap will become available for CKAN in the near future. Once I'm satisfied the scheduled jobs are generating up to date sitemaps, I'll close out this issue.

benjwadams commented 5 years ago

Main site is now running with a daily generated cached sitemap.xml file. Closing this issue out. If we want to optimize the speed of the sitemap generation, it could be revisited at some other time.

mwengren commented 5 years ago

@benjwadams Thanks, just read your previous comment. Caching and serving the file definitely makes sense as the way to go. Even if it's only generated once per week, I think that would suffice. The URLs to the datasets should be pretty static, so we don't necessarily need to generate it on a daily basis. Hopefully search engines will pick it up and keep crawling the same URLs regardless of how often they crawl the sitemap itself.

Next step, Schema.org tagging!