andrewjbtw commented 1 year ago

Why reset the testing environments?

The stage environment has been running continuously since 2019. The QA environment has also been running for over a year. Both have accumulated a lot of data, and a lot of that data is bad. Bad data in the testing environments is a tax on all current and future development, and the tax grows every week.

We need the stage and QA environments to reasonably approximate production so that:

We can expect changes that pass tests on stage to work as expected in prod
We can identify bugs in a safe environment for testing without having to use prod objects

Some problem data is inevitable in testing environments because that's a part of testing. Either we create problem data to test the system or the function we are testing creates problem data. When the volume of problem data is small, it can either be remediated or ignored and it's not a drag on the rest of the system.

The problem with our current environments is that the volume of bad data is not small, and the cost of remediating it is also not small. Problem data:

can generate Honeybadger (HB) errors simply by existing (for example, when indexing, validation, or audit finds it)
can interfere with deployment
can interfere with analysis
can interfere with testing (automated and manual)
makes our error reporting dashboards nearly unusable because it's difficult to tell what's a current issue and what is old
can lead to extra time spent investigating problems that turn out not to matter because the test data is irrelevant to current concerns

The development team, product owners, the repository manager, and anyone else who uses stage for testing see the cost of bad data on a regular basis. For the development team and the repository manager, it has become a constant weekly cost.

We are also running out of space in the stage "preservation" storage and should not keep expanding it. SDR has no simple way to delete data on a case-by-case basis without generating more errors.

Since stage is not actually a preservation system, we should be able to clear everything out and start over.

Benefits of starting over

The environment will run faster and cleaner, at least at the start and for quite some time afterwards
We can be more confident that issues we see are new and current
We can be confident that all test data originates in Cocina. We migrated stage and QA when we migrated prod, but prod got dedicated data remediation and the test environments did not. New data will ensure that there are no Fedora-origin problems left over. We left Fedora long enough ago that we no longer need to keep migrated data for testing.
It will help us resolve a longstanding problem where the stage and QA environments share the same preservation storage mount and interfere with each other as a result. This is very hard to clean up in place. We could start over with separate mounts for separate environments.

What do we need to start over?

The SDR environment has to be seeded with certain objects to get started. (See https://github.com/sul-dlss/argo/issues/1782 for earlier discussion, but that was when SDR was based on Fedora.) We would need to make a full list, but in terms of data we need at least:

The Internal System Objects APO
Integration tests APO
Any other APOs needed for specialized workflows and accessioning paths: ETD, GIS, WAS, Goobi
SDR Graveyard APO for decommissions
At least one agreement
???

It would also be nice to have the Canonical objects because they took much effort to create and they represent different types of objects for use in testing Purl and sul-embed behavior.

With the foundational objects in place, we can create other test data as needed. There will be a judgement call on how far to go beyond the absolute minimum necessary to have a functional testing environment.
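The seed list above could be captured as a small checklist so that a freshly reset environment can be verified before anyone starts testing in it. The following is a minimal sketch in Python, not part of SDR: the `SeedObject` class, the `missing_seed_objects` function, the `kind` categories, and the idea of matching objects by label are all assumptions made for illustration. The per-workflow APO names ("ETD APO" and so on) are placeholders derived from the list, not actual object titles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeedObject:
    label: str      # human-readable name (placeholder, not a real title)
    kind: str       # hypothetical category: "apo", "agreement", or "item"
    required: bool  # False for nice-to-have objects


# Hypothetical manifest built from the list in this issue.
SEED_MANIFEST = [
    SeedObject("Internal System Objects APO", "apo", True),
    SeedObject("Integration tests APO", "apo", True),
    SeedObject("ETD APO", "apo", True),
    SeedObject("GIS APO", "apo", True),
    SeedObject("WAS APO", "apo", True),
    SeedObject("Goobi APO", "apo", True),
    SeedObject("SDR Graveyard APO", "apo", True),
    SeedObject("Agreement", "agreement", True),
    SeedObject("Canonical objects", "item", False),
]


def missing_seed_objects(existing_labels, manifest=SEED_MANIFEST,
                         include_optional=False):
    """Return manifest labels that are absent from existing_labels."""
    existing = set(existing_labels)
    return [
        obj.label
        for obj in manifest
        if (obj.required or include_optional) and obj.label not in existing
    ]


if __name__ == "__main__":
    # Example: only two of the foundational objects have been created so far.
    present = ["Internal System Objects APO", "Integration tests APO"]
    for label in missing_seed_objects(present):
        print(f"missing: {label}")
```

Keeping required and nice-to-have objects in one manifest leaves the judgement call about how far to go beyond the minimum as a single flag rather than a separate list.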

Nice to have

Making this a repeatable process would be nice, but it is not strictly necessary.

andrewjbtw commented 1 year ago

This will also require some coordination with stage users, especially Amy and Peter Chan, because they actively use or create test objects there.

andrewjbtw commented 1 year ago

This could potentially be done in conjunction with https://github.com/sul-dlss/preservation_catalog/issues/1986