nih-cfde / cfde-deriva

Collaboration point for miscellaneous CFDE-deriva scripts

Export stability improvements #330

Closed · karlcz closed this issue 2 years ago

karlcz commented 2 years ago

Creating this stub issue to track QA/review... several maintenance changes have been made while working on other export features.

The export improvements described below are deployed in app-dev and incorporated into the latest "release" catalog as of 2022-02-23.

Problems we noticed during feature testing

  1. many exports could produce overly expensive queries that cannot complete within the current synchronous export API request (see the sketch after this list)
  2. doomed exports could swamp the catalog with useless work and effectively cause a denial of service
  3. adding new C2M2 fields to export BDBags was cumbersome and involved many synchronized changes and redundant enumeration of data model details
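For problem (1), the client-side symptom is simply a timed-out synchronous request. Here is a minimal sketch of that failure mode; the endpoint path, payload shape, and timeout value are illustrative assumptions, not the documented cfde-deriva export API.

```python
# Illustrative only: the endpoint path, payload shape, and timeout value are
# assumptions for this sketch, not the documented cfde-deriva export API.
import requests

EXPORT_URL = "https://app-dev.nih-cfde.org/deriva/export/bdbag"  # assumed path

def request_export(export_spec: dict, timeout_s: float = 120.0) -> str:
    """Submit a synchronous export request and return the BDBag location."""
    try:
        resp = requests.post(EXPORT_URL, json=export_spec, timeout=timeout_s)
        resp.raise_for_status()
        # Assumed response convention: the service replies with the bag's URL.
        return resp.text.strip()
    except requests.Timeout:
        # Problem (1): an overly expensive query cannot finish before the
        # synchronous request window closes, so the client only sees a timeout.
        raise RuntimeError("export timed out; narrow the filters or retry later")
```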

Improvements that are now under test

Expected side-effects

Other notes

We are keeping timeouts rather short on the dev environment to make it easier to test and experience the revised behaviors. We can potentially increase these timeouts a bit in the subsequent prod release, but we should still anticipate that some exports are going to fail. Further development of an asynchronous export API and UX flow should be planned to allow large/slow exports as inventories and use cases expand.

Also, testing folks should realize that there are two reasons exports can be slow and hit timeouts:

  1. A very large number of records are included in the export, i.e. sheer output size and throughput limits
  2. A complex set of filters means more complex query patterns, i.e. query processing cost even when producing smaller outputs

These competing trends may cause some surprises, e.g. adding a filter can make an export slower even though its output would also be smaller if it succeeded. Adding even more filters might then make it faster again, if the output size collapses far enough, even though the constraint set is "more complex" than the prior one.
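For testers who want to see where these two trends cross, a rough approach is to time the same export with progressively more filters. The sketch below reuses the hypothetical endpoint assumed above, and the filter payloads are placeholders rather than real C2M2 filter syntax.

```python
# Rough timing harness for comparing export latency across filter sets.
# The endpoint path and the filter payload shapes are placeholder assumptions.
import time
import requests

EXPORT_URL = "https://app-dev.nih-cfde.org/deriva/export/bdbag"  # assumed path

def time_export(export_spec: dict, timeout_s: float = 300.0) -> tuple[float, bool]:
    """Return (elapsed seconds, success flag) for one export attempt."""
    start = time.monotonic()
    try:
        ok = requests.post(EXPORT_URL, json=export_spec, timeout=timeout_s).ok
    except requests.Timeout:
        ok = False
    return time.monotonic() - start, ok

# Placeholder filter sets: unfiltered, one filter, several filters.
for label, spec in [
    ("no filters", {"filters": []}),
    ("one filter", {"filters": ["anatomy=liver"]}),
    ("many filters", {"filters": ["anatomy=liver", "assay=RNA-seq", "species=human"]}),
]:
    elapsed, ok = time_export(spec)
    print(f"{label}: {elapsed:.1f}s ({'ok' if ok else 'timed out or failed'})")
```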

ACharbonneau commented 2 years ago

So, what if someone truly does want all the files? Can users currently use the API to grab everything (or very large portions) of the database?

mikedarcy commented 2 years ago

In addition to the global export request timeout, a limit has been placed on the concurrency of export requests. Only one export per logged-in user (or client IP address, if not logged in) is allowed to be processed at a time. This means that users will receive a 403 Forbidden error when using multiple browser windows/tabs to perform concurrent exports.

There is a potential side effect for anonymous users whose unique client IP address cannot be determined by inspecting the HTTP_X_FORWARDED_FOR header. In cases where the client IP cannot be determined, or is found to be the same as another current export's client IP (e.g. due to a proxy), this restriction will trigger and the user will have to retry the request later. It is therefore recommended that users log in, so that export services can be provided to them reliably.
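For reviewers, here is a minimal sketch of this kind of per-client gating: one in-flight export per logged-in user, falling back to the forwarded client IP for anonymous requests, and a 403 when the slot is already taken. The function names and header handling are illustrative, not the actual export service code.

```python
# Minimal sketch of per-client export concurrency gating, as described above.
# Illustrative only: names and header handling are not the actual service code.
import threading

_active_exports: set[str] = set()
_lock = threading.Lock()

def client_key(user_id: str | None, headers: dict) -> str:
    """Key exports by logged-in user, else by forwarded client IP."""
    if user_id:
        return f"user:{user_id}"
    # Anonymous: fall back to the first address in X-Forwarded-For. If the
    # header is missing or shared (e.g. behind a proxy), distinct users can
    # collide on the same key, which is the side effect noted above.
    forwarded = headers.get("X-Forwarded-For", "")
    ip = forwarded.split(",")[0].strip() or "unknown"
    return f"ip:{ip}"

def begin_export(key: str) -> None:
    """Claim the single export slot for this client, or raise (maps to 403)."""
    with _lock:
        if key in _active_exports:
            raise PermissionError("403 Forbidden: an export is already in progress")
        _active_exports.add(key)

def end_export(key: str) -> None:
    """Release the slot when the export finishes or fails."""
    with _lock:
        _active_exports.discard(key)
```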

karlcz commented 2 years ago

I think we can adjust the production timeouts to make export of large portions work in practice with the current inventory and these new bits of plumbing. There will likely be some "knee in the curve" where output size and query complexity still lead to timeouts, before things settle back into a regime where small exports work.

If the size of the inventory or other C2M2 usage characteristics shift too quickly again, it is hard to anticipate feasibility. We do need to prioritize the async export API in order to get out from under this risk.
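To make the async direction concrete, the usual shape for such an API is submit-then-poll, so that large or slow exports keep running server-side while the client waits. Everything below (paths, status fields) is hypothetical, an illustration of the flow rather than a committed design.

```python
# Hypothetical submit-then-poll flow for a future asynchronous export API.
# All endpoint paths and response fields here are invented for illustration.
import time
import requests

BASE = "https://app-dev.nih-cfde.org/deriva/export"  # assumed base path

def export_async(export_spec: dict, poll_s: float = 10.0) -> str:
    """Submit an export job and poll its status until a BDBag URL is ready."""
    job = requests.post(f"{BASE}/jobs", json=export_spec).json()  # hypothetical endpoint
    status_url = f"{BASE}/jobs/{job['id']}"                       # hypothetical job id field
    while True:
        state = requests.get(status_url).json()
        if state["status"] == "done":
            return state["result_url"]
        if state["status"] == "failed":
            raise RuntimeError(state.get("error", "export failed"))
        time.sleep(poll_s)  # large/slow exports keep running server-side
```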

karlcz commented 2 years ago

Closing this issue, which was completed in a prior release cycle. A newer issue #353 has some relevance to the topic of export performance, due to changes to query indexes and DB tuning parameters.