ucldc / public_interface

Calisphere public interface source code (UCLDC Project) master branch should match live site
https://calisphere.org/

Create Sitemaps - day before prep #386

Closed: amywieliczka closed this issue 5 months ago

amywieliczka commented 5 months ago

After we delete all the collections not ready for publication from the Rikolti index, we need to manually generate sitemaps, stash them in s3, and then update the public_interface predeploy and postdeploy hooks and confighooks to pull sitemaps from the new s3 bucket rather than the old one.

This will represent a breaking change in Beanstalk app version between the AWS pad-dsc account's calispheres (prod and -test) and the AWS pad-prd account's calispheres (prod and -stage), so we should also take note of the app version just prior, freeze the pad-dsc account's calispheres on that version, and disable calisphere-test's CodeBuild pipeline.
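One way to capture that app version before the change is sketched below with boto3; the environment name is a placeholder, not necessarily the real one:

import boto3

# record the current Beanstalk application version label so the pad-dsc
# calispheres can be frozen on it (environment name is a placeholder)
eb = boto3.client("elasticbeanstalk")
envs = eb.describe_environments(EnvironmentNames=["calisphere-prod"])["Environments"]
for env in envs:
    print(env["EnvironmentName"], env["VersionLabel"])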

amywieliczka commented 5 months ago

This is done! It was painful, though; here's the current sitemap generation process:

From a machine with production Exhibitions data, with the UCLDC_EXHIBIT_PREVIEW environment variable unset, ES_ALIAS configured to rikolti-prd, the UCLDC_FRONT environment variable set to "https://calisphere.org/", and the sites fixture loaded (python manage.py loaddata fixtures/sites.json), you must run python manage.py calisphere_refresh_sitemaps.

Because I couldn't connect to the exhibits RDS database from my local machine, I got on one of the production beanstalk instances and ran:

# activate the app's virtualenv and export the Beanstalk environment variables
source /var/app/venv/*/bin/activate
/opt/elasticbeanstalk/bin/get-config environment | jq -r 'to_entries | .[] | "export \(.key)=\"\(.value)\""' > /tmp/local-setvars.sh
source /tmp/local-setvars.sh

# regenerate the sitemaps from the deployed app
cd /var/app/current
python manage.py calisphere_refresh_sitemaps

# replace the old sitemaps in s3 (note: rm needs --recursive to clear the whole
# prefix, and cp --recursive takes the directory itself rather than a glob)
aws s3 rm s3://calisphere-static/sitemaps --recursive
aws s3 cp sitemaps/ s3://calisphere-static/sitemaps/ --recursive

I've re-written the sitemap generator to use the OpenSearch Scroll API, which allows for deep paging by caching a search result set for a specified amount of time (I've specified 1 minute). This of course has a performance impact on OpenSearch, so there is a configurable limit on the number of scrollsets active at the same time; it's set to 500, which seems reasonable.

In my case, running locally, 1 minute was long enough to scroll through even the largest collection (scrolling in sets of 1000 records), but not so long that we hit the 500 max open scrollsets limit. On the ec2 instance it ran much faster, though, and I hit the 500-scrollset limit pretty quickly. I added time.sleep(2) just before this line: https://github.com/ucldc/public_interface/blob/stage/calisphere/sitemaps.py#L150. Another option would be to reduce the amount of time a scrollset stays available, though that runs the risk of larger collections not quite making it all the way through.

Does the scroll: 1m option sent with each payload refer to the total time the scrollset is open and active, or is the scroll time refreshed each time a request is made against it? It appears to be refreshed with each request: even with the 2 second delay, a few collections had more than 30,000 items (more than 30 requests each), and those scrollsets didn't expire despite taking more than 60 seconds to get through.
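For reference, a minimal sketch of the scroll pattern described above, using the opensearch-py client; the endpoint, the collection_id query field, and the collection ID are placeholders for illustration, not necessarily what sitemaps.py actually uses:

import time
from opensearchpy import OpenSearch

client = OpenSearch("https://localhost:9200")  # placeholder endpoint

# the initial search opens a scrollset kept alive for 1 minute
page = client.search(
    index="rikolti-prd",
    scroll="1m",
    size=1000,
    body={"query": {"term": {"collection_id": "12345"}}},  # placeholder query
)
scroll_id = page["_scroll_id"]

while page["hits"]["hits"]:
    # ... write this page of 1000 records into the sitemap ...
    time.sleep(2)  # throttle so we stay under the 500 open-scrollset limit
    # re-sending scroll="1m" resets the scrollset's expiry on each request,
    # matching the refresh behavior observed above
    page = client.scroll(scroll_id=scroll_id, scroll="1m")

# release the scrollset promptly rather than waiting for it to expire
client.clear_scroll(scroll_id=scroll_id)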

We could also minimize open scrollsets by first hitting the count API to determine whether the collection has more than 1000 items in the first place. Many collections don't, and thus don't require a scrollset at all.
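A sketch of that count-first check, under the same placeholder assumptions as above:

count = client.count(
    index="rikolti-prd",
    body={"query": {"term": {"collection_id": "12345"}}},  # placeholder query
)["count"]

if count <= 1000:
    # one plain search covers the whole collection; no scrollset needed
    page = client.search(
        index="rikolti-prd",
        size=1000,
        body={"query": {"term": {"collection_id": "12345"}}},
    )
else:
    # fall back to the scroll loop sketched above
    ...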