tensorflow / build

Build-related tools for TensorFlow
Apache License 2.0

Container Road Map #61

Open angerson opened 2 years ago

angerson commented 2 years ago

Road Map for Docker Containers

This is the same roadmap document that I'm using internally, with the internal bits taken out.

To guarantee these containers get continuous support, I'm using them for TF's internal CI: if they don't work, then our tests don't work. While I get that ready during Q4 and Q1, I'm explicitly avoiding features that the TF team is not going to use. Those would be dead on arrival unless we set up more testing for them, and I don't have the cycles to consider that yet.

TF Nightly Milestone - Q4 Q1

Goal: Replicable container build of our tf-nightly Ubuntu packages

Release Test Milestone - Q4 Q1

Goal: Replicable container builds of our release tests, supporting each release

CI & RBE Milestone - Q4 Q1/Q2

Goal: The main tests and our RBE tests use the same Docker container, updated in one place

Forward Planning Milestone - Q2

Goal: Establish a clear plan for any future work related to these containers. This is internal team planning stuff, so I've removed it.

Downstream & OSS Milestone - Q2/Q3

Goal: Downstream users and custom-op developers use the same containers as our CI

bhack commented 2 years ago

Thanks for sharing the roadmap. The steps that mention "internal/our" requirements can be a little hard to follow, but I think that is expected.

Looking at the new GitHub Actions we have here in the repository, it is really clear what we are doing, and when, on the OSS side, at least within the limits of what we have orchestrated with GitHub Actions.

When we mix OSS recipes/code with internal steps that are not visible (e.g. orchestration, or args such as the commits used for nightly), the machinery can be hard to follow unless the invisible part is compensated for by documentation (e.g. which event/cron starts the scripts, how the scripts chain together, what the args are).

But even with this compensating documentation, it will generally be at constant risk of becoming outdated: internal teams probably have direct visibility into internal changes, so their operations are not directly affected by outdated public documentation.

But since GitHub Actions relies on a well-known and popular YAML dialect, and GitHub users/contributors/developers are generally skilled in it, do you think it would be possible to set up TF's own self-hosted GitHub Actions runners on Google Cloud? That would give us a complete overview of the TF OSS build and orchestration, and probably also a little autonomy for the SIG, without adding too much overhead to the system.

A Google Cloud team maintains all the tools to (auto)deploy self-hosted GitHub Actions runners on GKE: https://github.com/terraform-google-modules/terraform-google-github-actions-runners