linz / geostore

Central storage, management and access for important geospatial datasets

Performance and Load Testing #1911

Open · Jimlinz opened this issue 2 years ago

Jimlinz commented 2 years ago

User Story

So that I know Geostore can scale and handle vast amounts of data, as a user, I want to be able to keep adding datasets to Geostore without hitting performance bottlenecks. When new datasets are added (and existing datasets retrieved), I want to get a response in a timely manner.

Currently we have no visibility into how Geostore performs under load. How long does it take Geostore to update its catalog when a large number of datasets already exists? Can it scale? Does it suffer from any performance bottlenecks (e.g. can PySTAC traverse the entire tree in S3 efficiently, and can Lambda handle these requests without timing out)? Some metrics would help identify potential roadblocks that should be looked at early on.
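As a first data point, catalog traversal time is easy to measure in isolation. Below is a minimal sketch, assuming a STAC catalog whose root `catalog.json` is reachable over HTTP(S); the URL is a placeholder, not an actual Geostore endpoint:

```python
import time

import pystac

# Placeholder URL; point this at the root catalog.json of a real catalog.
CATALOG_URL = "https://example-bucket.s3.amazonaws.com/catalog.json"


def time_full_traversal(catalog_url: str) -> tuple[int, float]:
    """Walk every item reachable from the catalog root; return (item count, seconds)."""
    start = time.perf_counter()
    catalog = pystac.Catalog.from_file(catalog_url)
    item_count = sum(1 for _ in catalog.get_all_items())
    return item_count, time.perf_counter() - start


if __name__ == "__main__":
    items, elapsed = time_full_traversal(CATALOG_URL)
    print(f"Traversed {items} items in {elapsed:.1f}s")
```

Running this against catalogs of increasing size would show whether traversal time grows linearly or worse.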

Acceptance Criteria

Additional context

The non-functional requirements listed here provide a minimum baseline for testing, but we should probably set a higher threshold than that for our own testing. These are probably the two most relevant NFRs:

- Scalability (users): can scale to 100 concurrent users
- Performance: a large dataset (500 GB and 50,000 files, e.g. Auckland aerial imagery) can be validated, imported and stored within 24 hours
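For the concurrency NFR, a throwaway script is probably enough to get first numbers. A minimal sketch, assuming a hypothetical HTTPS endpoint for listing datasets (the URL below is a placeholder, not the real Geostore API):

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Placeholder endpoint; substitute the real Geostore API URL.
ENDPOINT = "https://example.execute-api.ap-southeast-2.amazonaws.com/datasets"
CONCURRENT_USERS = 100  # matches the NFR above


def one_request(_: int) -> float:
    """Issue one GET and return its latency in seconds."""
    start = time.perf_counter()
    response = requests.get(ENDPOINT, timeout=30)
    response.raise_for_status()
    return time.perf_counter() - start


with ThreadPoolExecutor(max_workers=CONCURRENT_USERS) as pool:
    latencies = sorted(pool.map(one_request, range(CONCURRENT_USERS)))

print(f"p50={latencies[len(latencies) // 2]:.2f}s  max={latencies[-1]:.2f}s")
```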

https://github.com/linz/geostore/blob/master/.github/ISSUE_TEMPLATE/user_story.md#definition-of-done

Tasks

Definition of Ready

Definition of Done

billgeo commented 2 years ago

This seems like a good idea to me. But it would be good to get your opinion on this, @l0b0.

billgeo commented 2 years ago

Updated the acceptance criteria to remove 'data retrieval' (retrieval is done through the standard S3 API, so I don't think we need to test it) and added some useful numbers we can use for this. Probably worth discussing as a team, though. Also added a link to the NFRs for context.

l0b0 commented 2 years ago

Pros:

Cons:

Open questions:

billgeo commented 2 years ago

Decided we should discuss this further to work out what, if anything, we can do for this. But in principle it's a good idea. Might also be worth looking at specific issues as reported by @amfage. @Jimlinz, feel free to add any initial thoughts here to help with that discussion later.

Jimlinz commented 2 years ago

My approach would be to set up a simple performance test (e.g. a preloaded set of data on S3 that we can add to, then measure the time it takes to create and add an additional x datasets), and expand from there as usage grows. That way we can justify the costs (whether AWS or development time) along the way if we need to expand the scope of our performance tests at a later stage.
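A minimal sketch of such a harness, assuming a hypothetical staging bucket name. Note this only times the S3 writes themselves; a full test would trigger the Geostore import and poll until validation completes:

```python
import time
import uuid

import boto3

# Hypothetical bucket name; use a staging bucket preloaded with existing datasets.
STAGING_BUCKET = "geostore-performance-staging"
DATASETS_TO_ADD = 10  # the "additional x datasets" from the comment above

s3 = boto3.client("s3")


def add_one_dataset() -> float:
    """Upload a placeholder metadata object and return the elapsed seconds."""
    key = f"perf-test/{uuid.uuid4()}/metadata.json"
    start = time.perf_counter()
    s3.put_object(Bucket=STAGING_BUCKET, Key=key, Body=b"{}")
    return time.perf_counter() - start


timings = [add_one_dataset() for _ in range(DATASETS_TO_ADD)]
print(f"Added {DATASETS_TO_ADD} datasets, mean {sum(timings) / len(timings):.2f}s each")
```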

For the initial setup:

Doing this properly and getting the most value out of the exercise is going to cost (AWS and development time). That said, having something (even if it is a simple test that gathers basic metrics once in a blue moon) is better than having nothing at all. We don't want to find out at the 11th hour, from end users complaining, that the system is too slow for purpose. Without a benchmark we wouldn't know whether the system can scale (uploading one or two datasets from a developer's machine may not surface underlying performance problems). My vote is to set up a simple test for now, and add further metrics and measurements as the need arises.
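Even an occasional benchmark is more useful if each run leaves a record. One option (an assumption, not something Geostore does today) would be to publish each measured duration as a CloudWatch custom metric so runs can be compared over time:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# "elapsed_seconds" would come from a timed run like the sketch above.
elapsed_seconds = 42.0

cloudwatch.put_metric_data(
    Namespace="Geostore/PerformanceTests",  # assumed namespace, for illustration only
    MetricData=[
        {
            "MetricName": "DatasetImportDuration",
            "Value": elapsed_seconds,
            "Unit": "Seconds",
        }
    ],
)
```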