linz / geostore

Central storage, management and access for important geospatial datasets

Performance and Load Testing #1911

Open · Jimlinz opened this issue 2 years ago

Jimlinz commented 2 years ago

User Story

So that I know Geostore can scale and handle vast amounts of data, as a user, I want to be able to keep adding datasets to Geostore without hitting performance bottlenecks. When new datasets are added (and existing datasets retrieved), I want to get a response in a timely manner.

Currently we have no visibility into how Geostore performs under load. How long does it take Geostore to update its catalog when a large number of datasets already exists? Can it scale? Does it suffer from any performance bottlenecks (e.g. can PySTAC traverse the entire tree in S3 efficiently, and can Lambda handle these requests without timing out)? Some metrics would help identify potential roadblocks that should be looked at early on.
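As a first data point, catalog traversal time is easy to measure in isolation. Below is a minimal sketch, assuming a STAC catalog whose root `catalog.json` is reachable over HTTP(S); the URL is a placeholder, not an actual Geostore endpoint:

```python
import time

import pystac

# Placeholder URL; point this at the root catalog.json of a real catalog.
CATALOG_URL = "https://example-bucket.s3.amazonaws.com/catalog.json"


def time_full_traversal(catalog_url: str) -> tuple[int, float]:
    """Walk every item reachable from the catalog root; return (item count, seconds)."""
    start = time.perf_counter()
    catalog = pystac.Catalog.from_file(catalog_url)
    item_count = sum(1 for _ in catalog.get_all_items())
    return item_count, time.perf_counter() - start


if __name__ == "__main__":
    items, elapsed = time_full_traversal(CATALOG_URL)
    print(f"Traversed {items} items in {elapsed:.1f}s")
```

Running this against catalogs of increasing size would show whether traversal time grows linearly or worse.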

Acceptance Criteria

Additional context

The non-functional requirements listed here provide a minimum baseline for testing, but we should probably set a higher threshold than that for our own testing. These are probably the two most relevant NFRs:

- Scalability (users): can scale to 100 concurrent users
- Performance: a large dataset (500 GB and 50,000 files, e.g. Auckland aerial imagery) can be validated, imported and stored within 24 hours
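For the concurrency NFR, a throwaway script is probably enough to get first numbers. A minimal sketch, assuming a hypothetical HTTPS endpoint for listing datasets (the URL below is a placeholder, not the real Geostore API):

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Placeholder endpoint; substitute the real Geostore API URL.
ENDPOINT = "https://example.execute-api.ap-southeast-2.amazonaws.com/datasets"
CONCURRENT_USERS = 100  # matches the NFR above


def one_request(_: int) -> float:
    """Issue one GET and return its latency in seconds."""
    start = time.perf_counter()
    response = requests.get(ENDPOINT, timeout=30)
    response.raise_for_status()
    return time.perf_counter() - start


with ThreadPoolExecutor(max_workers=CONCURRENT_USERS) as pool:
    latencies = sorted(pool.map(one_request, range(CONCURRENT_USERS)))

print(f"p50={latencies[len(latencies) // 2]:.2f}s  max={latencies[-1]:.2f}s")
```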

https://github.com/linz/geostore/blob/master/.github/ISSUE_TEMPLATE/user_story.md#definition-of-done

Tasks

Definition of Ready

Definition of Done

billgeo commented 2 years ago

This seems like a good idea to me. But it would be good to get your opinion on this, @l0b0.

billgeo commented 2 years ago

Updated the acceptance criteria to remove 'data retrieval' (retrieval is done through the standard S3 API, so I don't think we need to test it) and added some useful numbers we can use for this. Probably worth discussing as a team, though. Also added a link to the NFRs for context.

l0b0 commented 2 years ago

Pros:

Cons:

Open questions:

billgeo commented 2 years ago

Decided we should discuss this further to work out what, if anything, we can do for this. But in principle it's a good idea. Might also be worth looking at specific issues as reported by @amfage. @Jimlinz, feel free to add any initial thoughts here to help with that discussion later.

Jimlinz commented 2 years ago

My approach would be to set up a simple performance test (e.g. a preloaded set of data on S3 that we can add to, then measure the time it takes to create and add an additional x datasets), and expand from there as usage grows. That way we can justify the costs (whether AWS or development time) along the way if we need to expand the scope of our performance tests at a later stage.
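A minimal sketch of such a harness, assuming a hypothetical staging bucket name. Note this only times the S3 writes themselves; a full test would trigger the Geostore import and poll until validation completes:

```python
import time
import uuid

import boto3

# Hypothetical bucket name; use a staging bucket preloaded with existing datasets.
STAGING_BUCKET = "geostore-performance-staging"
DATASETS_TO_ADD = 10  # the "additional x datasets" from the comment above

s3 = boto3.client("s3")


def add_one_dataset() -> float:
    """Upload a placeholder metadata object and return the elapsed seconds."""
    key = f"perf-test/{uuid.uuid4()}/metadata.json"
    start = time.perf_counter()
    s3.put_object(Bucket=STAGING_BUCKET, Key=key, Body=b"{}")
    return time.perf_counter() - start


timings = [add_one_dataset() for _ in range(DATASETS_TO_ADD)]
print(f"Added {DATASETS_TO_ADD} datasets, mean {sum(timings) / len(timings):.2f}s each")
```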

For the initial setup:

Doing this properly and getting the most value out of the exercise is going to cost (AWS and development time). That said, having something (even if it is a simple test that gathers basic metrics once in a blue moon) is better than having nothing at all. We don't want to find out at the 11th hour, from end users complaining, that the system is too slow for purpose. Without a benchmark we wouldn't know whether the system can scale (uploading one or two datasets from a developer's machine may not surface underlying performance problems). My vote is to set up a simple test for now, and add further metrics and measurements as the need arises.
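Even an occasional benchmark is more useful if each run leaves a record. One option (an assumption, not something Geostore does today) would be to publish each measured duration as a CloudWatch custom metric so runs can be compared over time:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# "elapsed_seconds" would come from a timed run like the sketch above.
elapsed_seconds = 42.0

cloudwatch.put_metric_data(
    Namespace="Geostore/PerformanceTests",  # assumed namespace, for illustration only
    MetricData=[
        {
            "MetricName": "DatasetImportDuration",
            "Value": elapsed_seconds,
            "Unit": "Seconds",
        }
    ],
)
```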