Closed github-actions[bot] closed 1 month ago
I think our best bet for the future is to cut the release branch on core as the "freeze" trigger. Maybe core should freeze a couple of days before plugins?
Then backports to the release branch on core would only be for critical bug fixes. This would be in comparison to the past ~24 hours, where a flood of features get merged to main, backported to 2.x, and backported to release.
With the update to release+1 across main and 2.x, it would only disrupt merging of a small number of critical bug fixes (versus the current flood of last-minute features).
(Of course we should also stop trying to squeeze features in the last 24-48 hours before a release, but try telling that to people whose paychecks depend on shipping features.)
Coming from https://github.com/opensearch-project/opensearch-build/issues/4908#issuecomment-2330707336, I think that it would be nicer to have a trigger for the manifest update job to publish artifacts for branches that have been cut. Once a branch has been cut, the expectation is that the branch is ready to build and artifacts can be published. We then increment the version on 2.x to the next minor version and bwc gets an update as well. Once 2.x is onto the next minor version and has the current minor version as bwc, gradle checks on main that trigger the bwc test expect to have build artifacts ready for the current minor version. But since the manifest update runs on a cron job, it can take some time for this to happen and PRs on main start failing. This happened adter the 2.16 branch was cut as well as on 2.17.
A better way would be to trigger a manifest update along with the branch cut so that the artifacts are available immediately.
Delay in merging version increment PRs causing build failures and delay in RC generation. reportsDashboards: https://github.com/opensearch-project/dashboards-reporting/pull/404
ISM and ISM Dashboards had failing tests in 2.16, that again got flagged in 2.17 release testing. The tests were failing due to logic in ISM tests that would cleanup test suites by deleting ISM system indices.
Originally planned in 2.16 (now 2.17), the security plugin introduced a change to identify if a request matches system indices by checking a central System Index Registry from core. The index patterns in the registry are indices registered with the SystemIndexPlugin.getSystemIndexDescriptors extension point. Before 2.17, the security plugin relied on an opensearch.yml
setting to be aware of the system index patterns in the cluster. The list that is provided in the demo configuration can be found here and has never included ISM indices. As a result, ISM indices were treated as regular indices (from security plugin perspective) and regular users could access the indices as any other index. Security gives special protections to system indices and forbids regular users from having direct access. Actions like writing to the index and deleting the index are strictly forbidden for regular users. Plugins needing system index access have mechanisms for accessing the system index when needed.
ISM tests were flagged in 2.17 again because the CI Checks in ISM and ISM Dashboards repos are insufficient.
The existing ISM tests with security are only operating on a subset of integ tests. Opened a PR here to address the ISM plugin issues.
ISM Dashboards has cypress tests that are only testing w/o security for PR checks. I opened a PR to address the failing checks for ISM Dashboards, but there should be a follow-up change to add a PR check to run cypress tests with security. There was also a change made in the FTR repo where there is similar test cleanup logic that removes system indices.
@cwperks The real problem is not ISM not making required changes on time to support breaking change pushed by security but security making breaking changes without any campaign. We have seen other instances in case of CA certs where changes were pushed by security which led to failures in other plugins. Adding cypress test in PR is still a reactive approach where a plugin spends time investigating, figures out that issue is not related to the plugin, follows up with security and gets the issue fixed. I don't see any value in this approach although I agree on the overall AI of executing tests on PR workflow runs.
@vikasvb90 The ISM tests are an issue that needs to be addressed. If the ism indices are supposed to be treated like regular indices then there is no need to register them with the SystemIndexPlugin.getSystemIndexDescriptors() extension point.
An index is a system index if its important to the integrity of the cluster and would cause a cluster to enter a corrupted state if deleted. For example, the security index (.opendistro_security
) is a system index that contains the security posture of a cluster. Deleting the security index would render a cluster corrupt because the nodes would not be able to read the security posture.
All plugins should run integ tests with and without security to avoid catching test failures only at release time.
For the demo cert renewal, the demo certs were initially updated to modify the SAN to include IPv6 loopback address. @DarshitChanpura updated the certs and opened PRs on all repos that maintained copies of the certs, but ISM was missed in 2.13: https://github.com/opensearch-project/security/pull/4061
In a future release, I'd like to run a campaign across plugins to remove copies of demo certs and instead to refer to the certs centrally. See example of how SQL plugin pulls them from the security repo here: https://github.com/opensearch-project/sql/blob/4303a2ab755d53903094dd94a5100572677a27a1/integ-test/build.gradle#L107-L111
@cwperks I am not questioning the intention of the change. I do agree that change needs to be there but they need to be called out proactively as a campaign or as any other mechanism. By the time plugin finds this out and calls out, it is already too late. Also, I understand that issues happen and we can take reactive measures to fix them but attempts should at least be made to reduce their possibility by taking proactive measures before pushing breaking changes.
@vikasvb90 One of the biggest pro-active measures to catch these early-on is to add a CI check that runs tests with security enabled, since we run security-enabled tests for all RCs. This would enable plugin owners to debug any failures, and/or reach out to security team to get these addressed as soon as possible.
@DarshitChanpura That's not being pro-active. If you rely on foreign plugins to come and tell you that something is buggy in security then that is being reactive. And it ends up wasting a lot of dev cycles. Tests in other plugins should just be treated as last line of defense.
@vikasvb90 -- what procedural change are you suggesting? You are saying no to improved CI checks in one plugin (which seems like a ridiculous argument to be making -- "No! I don't want to add more tests to cover my plugin in more scenarios!"). What do you think would be better?
Should the security plugin have an integration test that runs all plugins' integ tests with security enabled?
If you rely on foreign plugins to come and tell you that something is buggy in security then that is being reactive.
Also, what's buggy about the security plugin trying to protect system indices? That seems like a good idea and the intended behavior of the change. What's the bug?
Below components had flaky tests during 2.17.0. We request component owners to work with us if you think CI system is the issue.
OpenSearch-Dashboards. See data here&_a=(description:'OpenSearch%20Release%20Build%20and%20Integration%20Test%20Results',filters:!(('$state':(store:appState),meta:(alias:!n,controlledBy:'1721852613904',disabled:!f,index:'16f55f10-4977-11ef-8565-15a1562cd0a0',key:version,negate:!f,params:(query:'2.17.0'),type:phrase),query:(match_phrase:(version:'2.17.0'))),('$state':(store:appState),meta:(alias:!n,controlledBy:'1722482131538',disabled:!f,index:'16f55f10-4977-11ef-8565-15a1562cd0a0',key:rc_number,negate:!f,params:(query:8),type:phrase),query:(match_phrase:(rc_number:8)))),fullScreenMode:!f,options:(hidePanelTitles:!f,useMargins:!t),query:(language:kuery,query:''),timeRestore:!t,title:'OpenSearch%20Release%20Build%20and%20Integration%20Test%20Results',viewMode:view))
OpenSearch components: See data here&_a=(description:'OpenSearch%20Release%20Build%20and%20Integration%20Test%20Results',filters:!(('$state':(store:appState),meta:(alias:!n,controlledBy:'1721852613904',disabled:!f,index:'16f55f10-4977-11ef-8565-15a1562cd0a0',key:version,negate:!f,params:(query:'2.17.0'),type:phrase),query:(match_phrase:(version:'2.17.0'))),('$state':(store:appState),meta:(alias:!n,controlledBy:'1722482131538',disabled:!f,index:'16f55f10-4977-11ef-8565-15a1562cd0a0',key:rc_number,negate:!f,params:(query:6),type:phrase),query:(match_phrase:(rc_number:6))),('$state':(store:appState),meta:(alias:!n,controlledBy:'1721929304719',disabled:!f,index:'16f55f10-4977-11ef-8565-15a1562cd0a0',key:distribution_build_number,negate:!f,params:(query:10292),type:phrase),query:(match_phrase:(distribution_build_number:10292)))),fullScreenMode:!f,options:(hidePanelTitles:!f,useMargins:!t),query:(language:kuery,query:''),timeRestore:!t,title:'OpenSearch%20Release%20Build%20and%20Integration%20Test%20Results',viewMode:view))
@msfroh My intention is to handle common cases like this gracefully by running a campaign where security can ask for sign-offs with a reasonable amount of bake time from all the dependent plugins (List can easily be fetched by looking at build.gradle). Beyond the bake time, security can take a call to merge the change. No response from dependent plugins can be treated as a NO_BREAKING_CHANGE as well.
@msfroh My intention is to handle common cases like this gracefully by running a campaign where security can ask for sign-offs with a reasonable amount of bake time from all the dependent plugins (List can easily be fetched by looking at build.gradle). Beyond the bake time, security can take a call to merge the change. No response from dependent plugins can be treated as a NO_BREAKING_CHANGE as well.
Yeah -- that seems like a reasonable ask. I would skip asking for sign-offs, personally. I would open issues saying "We're going to do X in a few days unless we hear back from you." Essentially, it would be opt-out instead of opt-in.
@gaiksaya Below components had flaky tests during 2.17.0. We request component owners to work with us if you think CI system is the issue. ....
I think we also need a campaign to reduce the flaky tests to 0. The negative impact of existing flaky tests is multi-fold (impacting all components) on development and code merges, with developers waiting/hoping (sometimes days) on retries to succeed because of gradle check failures due to flaky tests.
@gaiksaya Below components had flaky tests during 2.17.0. We request component owners to work with us if you think CI system is the issue. ....
I think we also need a campaign to reduce the flaky tests to 0. The negative impact of existing flaky tests is multi-fold (impacting all components) on development and code merges, with developers waiting/hoping (sometimes days) on retries to succeed because of gradle check failures due to flaky tests.
+1 on this. Core having flaky tests have been flagged so many times. @msfroh one reason of stalled PRs in core could because of gradle checks taking forever to complete and then failing due to flaky tests. I know there has been a lot of brainstorming on this, I have one suggestion that might help reducing the time for gradle checks:
k-NN team had a learning in this release where we identified that all the integTests that runs during CI runs without heap CB enabled but Jenkins pipeline uses the RC to create the cluster. This RC candidate default settings is different from integTest cluster. Heap CB not enabled was one and this lead to integTest failures for k-NN in jenkins pipeline while same tests was successful in the CIs. More details can be found on this GH issue: https://github.com/opensearch-project/OpenSearch/issues/15849 I have added the capability to override the setting to enable heap CB if required by plugins ref: https://github.com/opensearch-project/OpenSearch/pull/15906. But I think this is something which all plugins should enable in their CIs.
The doc team received a lot of last-minute PRs and even issues that were entered beyond first RC date. We are suggesting the following to ensure that the RC entrance criteria from the docs side are met:
Hi,
I tried to summarize the points mentioned in this issue under what needs to improve
section on the retrospective board https://github.com/orgs/opensearch-project/projects/205/views/16?filterQuery=category%3A%22v2.17.0+Retro%22+
Please feel free to review, add and update if required. Thanks!
Thank you for all your comments and suggestion. We have created issues and added the existing ones in the actions items section in the above board. Closing this one.
Related release issue?
https://github.com/opensearch-project/opensearch-build/issues/4908
How to use this issue?
Please add comments to this issue, they can be small or large in scope. Honest feedback is important to improve our processes, suggestions are also welcomed but not required.
What will happen to this issue post release?
There will be a discussion(s) about how the release went and how the next release can be improved. Then this issue will be updated with the notes of that discussion along side action items.