kubeflow / testing

Test infrastructure and tooling for Kubeflow.
Apache License 2.0

Alternative solution to removal of test on optional-test-infra #1006

Open annajung opened 2 years ago

annajung commented 2 years ago

[UPDATE June 14th, 2022] This is no longer a blocker for the 1.6 release, as all WGs have pivoted to using GitHub Actions as a short-term solution.

~This is a blocker for the 1.6 release~

This issue is to track an alternative solution to the recent removal of existing presubmit and postsubmit tests on optional-test-infra.

As of May 31st,

References

@kubeflow/wg-automl-leads @kubeflow/wg-training-leads @kubeflow/wg-notebooks-leads @kubeflow/wg-pipeline-leads @kubeflow/wg-manifests-leads @pvaneck @yuzisun @kubeflow/release-team @akartsky @surajkota

annajung commented 2 years ago

Hi WG leads, there is a discussion happening in parallel with AWS, but I'd also like to kick off a discussion to see if you will be transitioning to a different CI/CD pipeline before the release. If so, do you have any timelines in mind?

jlewi commented 2 years ago

Hi folks; it's been a minute.

As a bit of additional context, the issue of a sustainable and scalable approach to test infrastructure was identified almost two years ago in kubeflow/testing#737.

What is the current thinking in terms of making each WG responsible for its own test-infra?

thesuperzapper commented 2 years ago

@annajung I have created a new private slack channel with all the working group leads to organize the creation of separate AWS accounts for each Working Group, and get AWS credits applied to them.

surajkota commented 2 years ago

As discussed in the Community Meeting on 05/31, each Working Group will need to set up its own testing infrastructure, as the optional-test-infra has been deleted.

AWS is willing to provide credits to the Working Groups for testing, to facilitate this please do the following:

The above is based on the assumption that a decentralised infra is a more scalable and sustainable approach in the long term.

jbottum commented 2 years ago

@johnugeorge @kimwnasptd @yuzisun @james-jwu @annajung This seems like a very generous offer from @surajkota. Would you please reply in a timely manner with your perspective / response? Thanks!

yuzisun commented 2 years ago

Thanks @surajkota !!

james-jwu commented 2 years ago

Overall sounds good. Pipelines will continue to use GCP infra provided by Google. /cc @zijianjoy

annajung commented 2 years ago

Hi everyone, here is an update from the June 6th release team meeting

As for the 1.6 release, we need to confirm with @johnugeorge @andreyvelich whether the Training Operators and AutoML WGs can meet the June 15th feature freeze. If not, we will work with them to determine if another extension is needed and update the community accordingly.

ca-scribner commented 2 years ago

If anyone needs assistance setting up GitHub CI for their working group, etc., please reach out (on this issue or in Kubeflow Slack). I'm not an expert, but I have some experience and am happy to put some time in.

annajung commented 2 years ago

Some notes from the June 7th Community meeting,

surajkota commented 2 years ago

Update: I was trying to find out if there is a way to create an AWS account without payment info such as a credit card, but I haven't found any documentation or a way to get an exception for it. I will reach out to some more folks and will update if it is possible.

As a general requirement under the Customer Agreement, all AWS accounts must have a valid form of payment to access our services. https://aws.amazon.com/agreement/

surajkota commented 2 years ago

Hi folks, unfortunately, creating an AWS account requires adding valid payment information. I could not find any way to request an exception for this, and creating an internal AWS account would not work for our use case.

Here is a proposal which aims to create a maintainable and sustainable path forward on this:

Creating a maintainable AWS Account

  1. Create a non-personal email which can be owned/shared by WG leads. This way it can be handed over if people change over time.
  2. Create an AWS account associated with this email with valid payment information. Let's call this the management account.
  3. Since we want to create separate accounts for each WG, instead of creating individual accounts, we can create an organization within the management account and create member accounts for each WG within this organization. The benefit is that organizations have consolidated billing, so the AWS credits can be applied to the management account and shared by member accounts.
    • With this approach WGs will have flexibility w.r.t. accounts; e.g., each working group can decide to create a testing account and a separate production account, or maybe one production account for hosting released artifacts (samples, container images, charts, etc.) for the whole of Kubeflow.

Making this approach sustainable in the long term

  4. Next, I want to propose creating a mechanism which can be used to ensure the AWS account is accessible and funded appropriately throughout the year.
    • Add an item to the release checklist(in the beginning of release cycle) in which WGs:
    • i. Would review the credits spent over the last quarter and the credits remaining, and determine if there are sufficient credits to complete the current release. If there is a need to renew for the NEXT release, connect with AWS. This will allow sufficient time on both sides to complete the process
    • ii. Baseline the accounts to make sure only active members have permissions to the account
    • iii. Baseline the list of points of contact from AWS
    • iv. Add/update the information related to accounts, maintainers, AWS points of contact, infrastructure access, etc. in a document or README
    • (Optional) We can give a POC from AWS access to these resources if WGs think it is needed (this was brought up in some discussions)
  5. Set up alarms to detect when the account is running low on credits or spending exceeds expectations
  6. Set up best practices and guidelines for adding people to the account, deploying using IaC, etc., which can be fleshed out later
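
To make the credit review in step 4(i) and the alarm in step 5 concrete, here is a minimal sketch of the check a WG could run at the start of a release cycle. It assumes the burn rate stays at last quarter's average; the function name, the fields, and the numbers in the usage lines are all hypothetical and are not taken from any AWS API or from this proposal.

```python
from dataclasses import dataclass


@dataclass
class CreditStatus:
    monthly_burn: float   # average spend per month over the last quarter
    months_left: float    # how long the remaining credits last at that rate
    needs_renewal: bool   # True if credits run out before the release ends


def review_credits(remaining, spent_last_quarter, months_needed):
    """Project credit runway from last quarter's average monthly spend."""
    if remaining < 0 or spent_last_quarter < 0:
        raise ValueError("amounts must be non-negative")
    monthly_burn = spent_last_quarter / 3
    if monthly_burn == 0:
        months_left = float("inf")  # nothing was spent, so no projected end
    else:
        months_left = remaining / monthly_burn
    return CreditStatus(monthly_burn, months_left, months_left < months_needed)


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    status = review_credits(remaining=10000, spent_last_quarter=6000, months_needed=4)
    print(status)
```

If `needs_renewal` comes back true, that would be the signal to contact AWS early, as step 4(i) suggests. A real alarm would more likely be configured in the billing tooling itself; this only shows the arithmetic behind the threshold.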

Action item: The question of adding valid payment information to the account still remains, and hence I would like to ask: is there any other organization/company which is willing to partner here by adding valid payment information to the management account described above?

Please let us know what the community thinks about this proposal.

cc @akartsky @jaypipes

surajkota commented 2 years ago

@kubeflow/wg-manifests-leads @kubeflow/wg-automl-leads @kubeflow/wg-notebooks-leads @yuzisun @james-jwu

Please let me know what the community thinks about the above proposal assuming we have a partner for payment information. This will help us with #1008 as well and hence a timely response will be helpful.

kimwnasptd commented 2 years ago

Thank you very much for driving this @surajkota!

This proposal seems solid for allowing all WGs to share the same credit pool. Thumbs up from manifests and notebooks.

Add an item to the release checklist(in the beginning of release cycle) in which WGs

I really like this approach as well. Since it will ensure we have a cadence for the status checks.

jbottum commented 2 years ago

@surajkota thanks for moving this forward. I have been asking companies (that provide integration services for Kubeflow on AWS) to support this effort. I believe that we need to scope the effort, i.e. one headcount is needed to 1) manage the accounts and credits and 2) configure, operate, and tear down the clusters on the testing infra, 3) for a defined period of time, i.e. 12 months. Additionally, the responsibilities and SLA need to be defined, i.e. only for the current release (1.6), with change requests tracked, acknowledged, and implemented based on a simple approval process. Finally, IMO, the companies that provide testing infrastructure and related services should be given a special designation by the Community. This is an investment that test infra operators are making, and (IMO) the Community should provide a designation / benefit back to these contributors.

charlesa101 commented 2 years ago

Thank you @surajkota. The proposal looks solid. We do a lot of work with Kubeflow, my company MavenCode will be able to provide the needed partnership support to get this going.

songole commented 2 years ago

Thank you @surajkota @jbottum for the proposal. We are very much interested in contributing to the effort. I am part of dkube.io and our product DKube is built on top of Kubeflow and MLflow and provides MLOps and Monitoring solutions to enterprise customers.

We look forward to partnering with other community members and providing the needed support.

jbottum commented 2 years ago

@surajkota I believe that we said that interested parties should respond by COB today. It appears that we have Arrikto, Maven Code, One Convergence and @ca-scribner offering support. @annajung Perhaps we should ask the contributors to select a Working Group to support? @kimwnasptd @charlesa101 @songole do you have a preference for a working group to support? I think it would be good to have representatives from 2+ companies in each working group.

songole commented 2 years ago

@jbottum We would like to represent the following working groups: AutoML, Pipelines, Training and Serving.

charlesa101 commented 2 years ago

@jbottum - automl, notebook, manifest, pipelines but we are open to support any other WG

jbottum commented 2 years ago

@surajkota did you get a credit card for the AWS account from a partner? Do you need the credit card to move forward?

jbottum commented 2 years ago

@kimwnasptd @songole @charlesa101 @ca-scribner In the Release team meeting today, we discussed next steps. We propose that the interested parties (Maven Code, One, Arrikto and CA-Scribner) contact the Working Groups and create a PR for the test-infra config and operations effort. The Issue/PR should propose a design for the test infra and support. Is that a reasonable request?

Please note that this issue (1006) will be used to track the account set-up, and the config and operations of the test infra for each working group should have an independent issue / PR. @surajkota @annajung @DomFleischmann please confirm that I captured this correctly. Thanks.

jbottum commented 2 years ago

@johnugeorge @kimwnasptd @pvaneck - @surajkota needs an estimate of each Working Group's expenses for the next 12 months. Please submit by Friday (July 1) for Manifests, Notebooks, Training, Katib/AutoML, and KServe. Please use the AWS Pricing Calculator (https://calculator.aws/#/). cc'ing @annajung
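
For anyone who wants to sanity-check the figure the calculator produces, the underlying arithmetic for a 12-month estimate can be sketched as below. This is only an illustration: the line items, usage hours, and hourly rates are placeholders invented for the example, not real AWS prices and not any WG's actual numbers.

```python
def annual_estimate(line_items, months=12):
    """Sum count * hours_per_month * hourly_rate over all items, times months.

    Each line item is a (count, hours_per_month, hourly_rate) tuple.
    """
    monthly = sum(count * hours * rate for count, hours, rate in line_items)
    return round(monthly * months, 2)


if __name__ == "__main__":
    # Placeholder inputs, NOT real AWS prices.
    items = [
        (2, 100, 0.10),  # two CI worker nodes, 100 hours/month each
        (1, 730, 0.05),  # one small always-on instance
    ]
    print(f"Estimated 12-month cost: {annual_estimate(items)}")
```

The AWS Pricing Calculator remains the source of truth for rates; a helper like this is just a way to see which line items dominate the total.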

surajkota commented 2 years ago

Hi everyone, the initial proposal required us to attach a credit card per WG account. The current proposal that uses AWS Organization approach requires only one credit card which needs to be added to the management account since it offers consolidated billing. I propose that we move forward with Arrikto's payment information for the management account since @kimwnasptd has been testing it out and was the first one to respond.

Thank you Maven Code, One Convergence, Arrikto and CA-Scribner for the interest in this initiative. Creating the management account is the first step of this project. It is exciting to see all the folks who are interested in contributing to this effort, and I am confident the WGs will appreciate all the help they can get to make this effort useful for the product!

johnugeorge commented 2 years ago

@surajkota One question. Will the credit card again become a single point of failure for the management account, similar to the earlier personal account for the AWS infra? How can this be handled?

surajkota commented 2 years ago

@johnugeorge All credit cards have an expiration date, so adding more than one would not add much value IMO. If a company wants to remove its payment info in the future, we will do another callout, and we also have this issue as a reference in case we want to reach out to others who expressed interest.

kimwnasptd commented 2 years ago

Apologies for the late reply here. First of all @songole @charlesa101 nice to meet you! I'm sure WGs would be more than happy to have some more engineering firepower for the testing, thank you very much for the interest!

As @surajkota described above we are splitting the testing infra migration into 2 orthogonal efforts:

  1. Establishing a process for a team that will be responsible for the root AWS account, which will be funded by AWS, as well as how the WGs can use that account in a secure manner
  2. Deciding what the testing infra for the affected WGs will look like and how we will set up CI/CD, ECR registries, etc.

1. Management root account

For the first part we have made progress and created the initial management account, and we will create an AWS organization, which WGs can join with an email they will own. The practical part of this is almost done, and what remains is for each WG to use the AWS Pricing Calculator to estimate their credit needs for the next 12 months.

The pricing calculation part is crucial, as without it we can't bootstrap the process. So we kindly ask the WGs interested in this to provide such an estimate by the end of this week or early next week. This will hugely help @surajkota push for this as well, since it will require some communication to get the credits in.

For Notebooks WG @thesuperzapper and I are already in the process of calculating the cost and will post an update tomorrow.

Lastly, we are preparing a basic proposal for the team responsible for the management account. Specifically we want to document:

  1. What the selection criteria are for members of that team
  2. What the expectations and time commitments are for that team
  3. Actions and setup that need to happen within that account

2. Setting up the infra per WG

@songole @charlesa101 @ca-scribner for this part I highly suggest reaching out to the WGs you are interested in to discuss your thoughts and expertise on how to set up the CI/CD. We can then form proposals and even generalize a solution across WGs once we have a solid understanding and approach.

You can find links for all the WG's calendars and info in https://github.com/kubeflow/community/blob/master/wgs.yaml#L80

cc @kubeflow/wg-automl-leads @kubeflow/wg-training-leads @pvaneck @yuzisun

surajkota commented 2 years ago

Hi everyone, thanks to all the WGs for the estimates, we got the management account created and credits approved!

Next steps: I have a draft of the design with the next steps on this document: https://docs.google.com/document/d/1Z3K4q21Vko6SzQDu2JSov9DO2fRehsDB_X9Z663fym4/edit?usp=sharing and we will be looking into setting up the AWS organization to get this going.

As we previously discussed, each WG can choose its own test/release infrastructure depending on its requirements, and it would run in separate accounts. I am looking for contributors and WGs to come up with the requirements for the testing infrastructure, or a proposal for the infrastructure based on their requirements (Infrastructure per WG section of the doc). I have laid out a high-level expectation for each of the sections; please ping me or request access on the doc if you would like to contribute. Can we target having a draft by 07/27? cc @songole @charlesa101 @ca-scribner @kubeflow/wg-automl-leads @kubeflow/wg-training-leads @kubeflow/wg-notebooks-leads @kubeflow/wg-manifests-leads @pvaneck @yuzisun

songole commented 2 years ago

Thanks @surajkota. Someone from my team will start with the Training WG. @mak-454 @anil3

charlesa101 commented 2 years ago

Thanks, @surajkota - We can start with the notebooks & manifest wg. Thank you!

surajkota commented 2 years ago

Hi @kubeflow/wg-automl-leads, @kubeflow/wg-training-leads @kubeflow/wg-manifests-leads, @kubeflow/wg-notebooks-leads, @pvaneck

We are looking into creating the AWS organization and organizational units for each WG using infrastructure as code, based on the proposal in this doc. Everyone has already seen a brief overview on this issue, but the document goes into details. If you have any comments or would like to contribute to any of the TODO items, please let us know.

Following are the things we need from your end:

  1. We want to use an IaC tool and not have manual creation for the AWS organization. Does the community have a preference for using CDK or Terraform?
  2. Please send me the email address you would like to use for your WG account by EOD 07/20. Let me know if we should go ahead and create one on your behalf. We can create something like: kf-wg-manifests-test@gmail.com , kf-wg-training-test@gmail.com and share it with each of the WGs.
    • Manifests WG: @kimwnasptd, Email: ?
    • Notebooks WG: @kimwnasptd, Email: ?
    • Training WG: @johnugeorge, Email: ?
    • AutoML WG: @johnugeorge, Email: ?
    • KServe Project: @pvaneck, Email: ?

Please let me know if you want to designate anyone else in the WG for this.
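
As a rough illustration of what the IaC input could look like, the sketch below derives one member-account spec per WG from the contact list above, following the `kf-wg-<name>-test@gmail.com` pattern suggested in item 2. The organizational-unit names, the account names, and the KServe address are assumptions made for the example, and the output is plain data that could be fed to whichever tool (CDK or Terraform) the community picks. It does not call any AWS API.

```python
from dataclasses import dataclass

# Working groups/projects and their contacts, as listed in this thread.
WG_CONTACTS = {
    "manifests": "kimwnasptd",
    "notebooks": "kimwnasptd",
    "training": "johnugeorge",
    "automl": "johnugeorge",
    "kserve": "pvaneck",
}


@dataclass(frozen=True)
class MemberAccount:
    org_unit: str      # hypothetical organizational-unit name
    account_name: str  # hypothetical member-account name
    email: str         # shared, non-personal address owned by the WG
    contact: str       # GitHub handle of the designated contact


def plan_member_accounts(contacts, env="test", domain="gmail.com"):
    """Return one member-account spec per working group, sorted by WG name."""
    plan = []
    for wg in sorted(contacts):
        plan.append(
            MemberAccount(
                org_unit=f"wg-{wg}",
                account_name=f"kf-wg-{wg}-{env}",
                email=f"kf-wg-{wg}-{env}@{domain}",
                contact=contacts[wg],
            )
        )
    return plan


if __name__ == "__main__":
    for account in plan_member_accounts(WG_CONTACTS):
        print(f"{account.org_unit}: {account.email} (contact: @{account.contact})")
```

Keeping the plan as data makes it easy to review in a PR before anything is created, and the same list can be re-baselined at each release as contacts change.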