Create Sitemaps - day before prep

This is done! It was painful, though. Here's the current sitemap generation process:

On a machine with production Exhibitions data, make sure the UCLDC_EXHIBIT_PREVIEW environment variable is unset, ES_ALIAS is set to rikolti-prd, UCLDC_FRONT is set to "https://calisphere.org./", and the sites fixture is loaded (python manage.py loaddata fixtures/sites.json). Then run python manage.py calisphere_refresh_sitemaps

Because I couldn't connect to the exhibits RDS database from my local machine, I got on one of the production beanstalk instances and ran:

source /var/app/venv/*/bin/activate && /opt/elasticbeanstalk/bin/get-config environment | jq -r 'to_entries | .[] | "export \(.key)=\"\(.value)\""' > /tmp/local-setvars.sh && source /tmp/local-setvars.sh
cd /var/app/current
python manage.py calisphere_refresh_sitemaps
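# --recursive is needed to delete everything under the sitemaps/ prefix
aws s3 rm s3://calisphere-static/sitemaps/ --recursive
# pass the directory itself; a shell glob expands to multiple source arguments
aws s3 cp sitemaps/ s3://calisphere-static/sitemaps/ --recursive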

I've rewritten the sitemap generator to use the OpenSearch Scroll API, which allows deep paging by keeping a search result set (a scrollset) cached for a specified amount of time (I've specified 1 minute). This has a performance cost on OpenSearch, so there is a configurable limit on the number of scrollsets that can be open at the same time. That limit is set to 500, which seems reasonable.
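As a rough illustration of the scroll pattern described above (not the actual code in calisphere/sitemaps.py), here is a minimal sketch. The `post` callable is a hypothetical transport that sends a JSON body to a REST path and returns the parsed response; the function name `scroll_hits` and its parameters are assumptions made for this example.

```python
from typing import Callable, Iterator

# Hypothetical transport: post(path, body) -> parsed JSON response as a dict.
Post = Callable[[str, dict], dict]


def scroll_hits(post: Post, index: str, query: dict,
                page_size: int = 1000, keep_alive: str = "1m") -> Iterator[dict]:
    """Yield every hit for `query`, one page of `page_size` at a time."""
    # The first search opens the scroll context and returns the first page.
    resp = post(f"/{index}/_search?scroll={keep_alive}",
                {"size": page_size, "query": query})
    while resp["hits"]["hits"]:
        yield from resp["hits"]["hits"]
        # Each later page is requested by scroll ID, with the keep-alive
        # value sent again in the payload.
        resp = post("/_search/scroll",
                    {"scroll": keep_alive, "scroll_id": resp["_scroll_id"]})
```

Note that the loop stops only after receiving an empty page, so a collection with N full pages costs N + 1 requests.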

In my case, running locally, 1 minute was long enough to scroll through even the largest collection (in sets of 1000 records), but not so long that we hit the limit of 500 open scrollsets. On the EC2 instance it ran much faster, though, and I hit the 500 scrollset limit pretty quickly. I added time.sleep(2) just before this line: https://github.com/ucldc/public_interface/blob/stage/calisphere/sitemaps.py#L150. Another option would be to reduce the amount of time a scrollset stays available, but that runs the risk of larger collections not quite making it all the way through.
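A back-of-envelope way to see why the delay helps, assuming collections are processed one after another and each scrollset is simply left to expire: a scrollset is still open if its last request fell within the past keep-alive window, so the spacing between requests bounds how many can be open at once. The function name and the 0.1 second figure below are hypothetical, not measurements.

```python
import math


def max_open_scrollsets(keep_alive_s: float, seconds_per_request: float) -> int:
    """Upper bound on scrollsets open at once under sequential processing.

    Worst case: every collection fits in one page, so each request opens a
    new scrollset that lingers for keep_alive_s seconds.
    """
    # At most this many requests fit in a window of keep_alive_s seconds
    # when requests are at least seconds_per_request apart.
    return math.floor(keep_alive_s / seconds_per_request) + 1


# With the 2 second sleep and a 60 second keep-alive:
print(max_open_scrollsets(60, 2))    # 31, well under the limit of 500
# With a hypothetical 0.1 s per request and no sleep, the bound exceeds 500:
print(max_open_scrollsets(60, 0.1) > 500)
```

This is only an upper bound; the actual number of open scrollsets depends on how many collections need more than one request.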

Does the scroll: 1m option sent with each payload refer to the total time the scrollset is open and active, or is the scroll time refreshed each time a request is made against it? It seems to be refreshed on each request. With the 2 second delay, a few collections needed more than 30 requests (collections with more than 30,000 items), and those scrollsets didn't expire even though it took more than 60 seconds to get through them.
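The two interpretations can be compared with a small simulation. This is a toy model, not a query against OpenSearch: `survives` is a hypothetical helper that checks whether every request arrives before the scrollset's expiry under each interpretation. The observation above (more than 30 requests at 2 second spacing without expiry) is consistent only with the refreshed model, which also matches my understanding of the scroll documentation, where each scroll request sets a new expiry time.

```python
from typing import Sequence


def survives(request_times: Sequence[float], keep_alive_s: float,
             refreshed: bool) -> bool:
    """Return True if every request arrives before the scrollset expires.

    refreshed=True:  each request resets the expiry to now + keep_alive_s.
    refreshed=False: the expiry is fixed when the scrollset is opened.
    """
    expiry = request_times[0] + keep_alive_s
    for t in request_times[1:]:
        if t > expiry:
            return False
        if refreshed:
            expiry = t + keep_alive_s
    return True


# 35 requests spaced 2 seconds apart: the last one lands at t = 68 s.
times = [2 * i for i in range(35)]
print(survives(times, 60, refreshed=True))   # True
print(survives(times, 60, refreshed=False))  # False
```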

We could also minimize open scrollsets by first hitting the count API to determine whether the collection has more than 1000 items in the first place. Many collections don't, and so don't require a scrollset.
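A minimal sketch of that count-first idea, under the same assumptions as any REST-level example here: `post` is a hypothetical transport that sends a JSON body to a path and returns the parsed response, and `fetch_collection` is a name invented for illustration. Small collections are fetched with one plain search, so no scrollset is opened for them.

```python
from typing import Callable, List

# Hypothetical transport: post(path, body) -> parsed JSON response as a dict.
Post = Callable[[str, dict], dict]


def fetch_collection(post: Post, index: str, query: dict,
                     page_size: int = 1000) -> List[dict]:
    """Fetch all hits for a collection, scrolling only when necessary."""
    count = post(f"/{index}/_count", {"query": query})["count"]
    if count <= page_size:
        # Everything fits in one page: plain search, no scrollset opened.
        resp = post(f"/{index}/_search", {"size": page_size, "query": query})
        return resp["hits"]["hits"]

    # Larger collection: fall back to scrolling.
    hits: List[dict] = []
    resp = post(f"/{index}/_search?scroll=1m",
                {"size": page_size, "query": query})
    while resp["hits"]["hits"]:
        hits.extend(resp["hits"]["hits"])
        resp = post("/_search/scroll",
                    {"scroll": "1m", "scroll_id": resp["_scroll_id"]})
    return hits
```

The trade-off is one extra count request per collection, which seems cheap next to holding a scrollset open for a minute.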

Create Sitemaps - day before prep #386