giantswarm / roadmap

Giant Swarm Product Roadmap
https://github.com/orgs/giantswarm/projects/273
Apache License 2.0

IDP demo journey #3470

Open mproffitt opened 3 months ago

mproffitt commented 3 months ago

User Story

Details, Background

In order to take the user on a journey through the IDP, we have an overall story of creating infrastructure components via crossplane and deploying an app that then consumes that infrastructure.

To accomplish the story and really showcase the capabilities of all components in the pipeline the journey is as follows:

  1. A user will access backstage and either create a new cluster or select an existing cluster to deploy the application to
  2. Once the cluster is ready, the user will select a template inside backstage that is linked to a pre-defined github template repository
  3. The user will fill out the project name and possibly some other details related to the template
  4. When the user clicks "submit", a Kratix Promise CR is created. This CR will set up the github repo and bootstrap the repository into the cluster ready for flux to reconcile against
  5. When flux reconciles the repository it will deploy a crossplane claim that creates infrastructure inside AWS. For the purposes of the demo this will be
    • a VPC with peering connections to the cluster VPC (note - backstage/kratix will have to hand off the cluster VPC name)
    • An RDS Database cluster
    • An Elasticache cluster using redis as a backend
  6. Flux will additionally deploy an App CR which will be consumed by App Platform to be deployed to the cluster. This will be a small application that a user can interact with which:
    • writes information to the database
    • reads information from elasticache
    • if the cache reports cache-miss, updates the cache with the latest information from the database and then repeats the second step to send this back to the user
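
The read/write behaviour described for the demo app in step 6 can be sketched roughly as below. This is only an illustration under assumptions: plain dictionaries stand in for the RDS database and the Elasticache cluster, and the class and method names (`DemoService`, `write`, `read`) are hypothetical, not taken from the actual demo app.

```python
from typing import Dict, Optional


class DemoService:
    """Toy stand-in for the demo app; dicts replace RDS and Elasticache."""

    def __init__(self) -> None:
        self.database: Dict[str, str] = {}  # stands in for the RDS database
        self.cache: Dict[str, str] = {}     # stands in for Elasticache (redis)
        self.cache_misses = 0

    def write(self, key: str, value: str) -> None:
        # Writes go to the database only.
        self.database[key] = value

    def read(self, key: str) -> Optional[str]:
        # Reads are served from the cache when the key is present.
        if key in self.cache:
            return self.cache[key]
        # Cache miss: refresh the cache from the database ...
        self.cache_misses += 1
        if key not in self.database:
            return None  # nothing to cache, so stop instead of retrying forever
        self.cache[key] = self.database[key]
        # ... then repeat the read so the answer comes from the cache.
        return self.read(key)
```

The first `read` of a key that was just written counts one cache miss and fills the cache; later reads of the same key are served from the cache. Cache expiry and invalidation are deliberately left out of this sketch.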

Flow diagrams: https://miro.com/app/board/uXjVKnjQei8=/

Architecture

high-level-platform-arch drawio

Blocked by / depends on

- [ ] https://github.com/giantswarm/roadmap/issues/3472
- [ ] https://github.com/giantswarm/roadmap/issues/3429
- [ ] https://github.com/giantswarm/roadmap/issues/3583
- [ ] https://github.com/giantswarm/roadmap/issues/3469
- [x] Set up payment for DemoTechInc to be able to speed up builds
- [x] Insert the correct module name in go.mod (currently laszlo-kratix-14)
### Improvements
- [x] Remove database cluster creation from scaffolder template
- [x] Scaffolder UI should only provide fields which are needed
- [x] Demo app UI needs improvement in a few places
- [x] Grafana dashboard for metrics and logs exposed by our service
- [x] Link to Grafana dashboard from Deployment
- [x] Scaffolder overview should provide more structure, context per item
- [ ] Improve build times in Github actions
- [x] Remove "Publish Infrastructure PR" action from scaffolder template
- [ ] Provide more description in pull requests created by scaffolder
- [ ] Pull requests should get validated via GitHub actions (yamllint, schema validation)
puja108 commented 3 months ago

Shouldn't it be a Kratix resource request? I thought the kratix promise is the template/definition of what should be done, and then a request basically instantiates that once, or am I mistaken?

mproffitt commented 3 months ago

@puja108 My understanding is that the Promise is kind of like a crossplane composition and the CR is a claim against that promise. Admittedly I'm not familiar with the terminology of Kratix at this point, perhaps @piontec has a clearer definition that can fit in here.

mproffitt commented 3 months ago

Updated to add high level architectural flow

puja108 commented 3 months ago

Nice, I guess "deploy app" step also includes create namespace for you? I would honestly pull that step out and put it next to "create cluster" as for platform teams the creation of a new project/team namespace is quite an important step that includes OIDC/RBAC setup, quotas, etc. so having it more or less as a "module" would make a lot of sense.

mproffitt commented 3 months ago

That's an interesting thought - I think my main question here would be "what would be the delivery mechanism" - I can see two possibilities here - We provide a pre-packaged helm chart that can template the namespace, roles, quotas, etc and can be delivered via the app platform

The second option is to use flux to deliver these via a kustomize base path

I'd probably lean towards flux here for sake of simplicity and extensibility

@piontec What are your thoughts on this?

LutzLange commented 3 months ago

I've seen different scenarios for development environment vs. production environment.

You want to make developers as productive as possible, while production environments are stricter and controlled.

Which use case are we targeting here exactly? Is it A) Resources for developers to iterate fast? B) Rolling an Application out into a production or pre production environment?

I do understand showing infrastructure management, but feel we should be able to highlight the value of what we are delivering. In an Integrated Development Environment, I would want to show a flow that allows a developer to quickly go through code / deploy / feedback / fix cycles.

mproffitt commented 3 months ago

but feel we should be able to highlight the value of what we are delivering

Do you care to elaborate more on how this is not highlighting the value in what we are delivering?

On the contrary I am firmly of the belief this showcases the skills the team has to offer by bringing together a number of disparate tools into a single cohesive journey in a way that customers are already requesting capability towards. Having the ability to demonstrate that is IMHO an incredibly powerful tool, and one we do not have in our arsenal today.

Perhaps I should clarify. An IDP has nothing to do with what happens in an engineer's local development environment. IDP in this instance stands for Internal Developer Platform: an interaction point between engineers and the clusters and, specifically on the portal side (backstage), a place they can go to see what is happening inside the cluster and jump off towards other tools and applications that help them with this understanding.

By understanding what is going on inside the cluster, engineers are empowered towards the products they themselves manage.

The IDP would be a place where they can construct deployments from off-the-shelf products, be that applications delivered as community-driven helm charts, infrastructure delivered as crossplane compositions, or new applications they are building, driven via github template repos represented as kratix promises inside the cluster, and have them delivered to the cluster via continuous deployment (flux). This is what this demo journey is showing.

A) Resources for developers to iterate fast?

This is covered in that engineers can use the platform to quickly bootstrap new applications into the cluster.

B) Rolling an Application out into a production or pre production environment?

We do not care what environment the engineers are rolling to - all clusters are equal. We care only that the journey is the same irrespective of the target environment and, in doing so, take away some of the pain of managing cross-environment deployments.

In an Integrated Development Environment, I would want to show a flow that allows a developer to quickly go through code / deploy / feedback / fix cycles.

I do not get the relevance here. We are not interested in what happens during the application development lifecycle. IDEs are not a topic I plan to support and are certainly out of scope of this journey.

LutzLange commented 3 months ago

This might be the wrong place for this discussion...

I do think your demo is a valuable tool that we don't have today. I'm just looking at the overall story and am wondering if we need another demo as well.

Our current iteration of the story & the value is: We allow your platform engineering team to focus on what is most important in your company: making developers more productive to speed up innovation and drive business agility and business outcomes. If developers can iterate quickly, your business stays agile and can adapt quickly to changing demands.

I've seen many platform demos. And I don't deny there is value in doing them. It is a lot of work to build these. Though the demo can be a bit unimpressive in the end ( It should be ). It often comes down to pushing a button to commit a change and then automation kicking in and delivering a new / changed cluster.

Value to the business is in developers iterating faster, delivering better software quicker. Our story of "freeing up the platform engineers to be able to enable developers" is what we can show with this demo.

This feels like a bit of a stretch to me, like we should try and do more. As if we free you up to do important things but will not be able to help you with these more important tasks. Would it not be even better, if we could show that devs can actually iterate faster with our IdP? A comparison of the situation / work without the IdP and the situation with the IdP. This could be a combination of slide ware and demo ( demo with the IdP ).

I know that this is not where we are today. As I said in the beginning, this might not be the right place.

mproffitt commented 3 months ago

Architecture diagram updated to include the separation of delivery of components such as namespace, quotas, permissions, etc.

These will be delivered to the cluster using the out-of-band delivery method as described in our gitops-template

puja108 commented 3 months ago

We provide a pre-packaged helm chart that can template the namespace, roles, quotas, etc and can be delivered via the app platform

We might need to talk to Big Mac here, as they already have some RBAC helper app IIRC and this gets close to their access management ownership. Maybe that app could be provided by Big Mac and deployed by whatever means Honeybadger feels most adequate.

puja108 commented 3 months ago

As for @LutzLange comments, I feel this is beyond the scope of this PoC/demo.

This here is just about the getting started quickly step, i.e. setting up a new project with all the bells and whistles (we could over time add additional templates like e.g. for security or o11y to this). This is a big value driver that many current and most potential new customers have been asking for or even working on themselves.

Fast iteration cycles once the project is set up might be influenced by this as everything is set up right and we try to have all environments similarly configured. But there might be other things to show there, which would be based on other features we might work on at some point in the future, e.g. Flagger for canary deployments, automatic branch deployments, o11y setup and validation feedback for these,...

weatherhog commented 1 month ago

@mproffitt as the solution architect for the IDP Demo and I sat together to summarise where we are:

intermediate status:

Backstage:

Some complexity got moved to crossplane, but should not be in crossplane. From User perspective this should be outside of crossplane, mainly everything regarding the region. This requires some complexity in Backstage. The question is, is this complexity already available in backstage (based on post in #news-dev, the Installations page already shows Region information).

Kratix

At the moment we need to ask ourselves which role Kratix has in our Developer Platform and whether we really need it. @piontec made it work to bootstrap the app-template Demo. @uvegla is figuring out details to make this work on golem, as it was only working in KinD clusters so far. The initial idea was that Kratix solves some pain points by setting up the gitops repo, bootstrapping it to the cluster and reporting it back to backstage. This should solve the pain point that customers need to create their own gitops repo and create secrets to bootstrap it to the cluster. We have seen issues with this before and need to solve this. It seems like Kratix is not the right solution for this, as Kratix controls the resources. We need to rethink what Kratix's role is in our platform (this is not about dropping Kratix, more about refining what Kratix's role is).

Crossplane

Demo App (app that gets deployed into the demo cluster)

The question we need to ask ourselves is whether we need to write something of our own in Go, or whether we can be happy with the existing app written in Python. The goal is to show the platform and how it works, not to show that we can write apps in Go, which is already clear because we have plenty of apps in Go.

Completeness of our components (approximately):

After these tasks are done, the final task is to put everything together and test the whole platform end-to-end.

Nice to have extension points (can also be discussed):

cc: @marians @gusevda @piontec @mproffitt @uvegla

mproffitt commented 1 month ago

This is the more complete architecture as used in the interim demo

Image

puja108 commented 1 month ago

I'd like to phase this work a bit more so we can focus on getting a minimal thing out that is demoable soon.

To me that would mean

phase 1

phase 2:

And at the same time, once @piontec is back we can discuss direction with Kratix and if what we are intending to do with it (aka API to Git) would work or not, but pull it out of this demo for now and continue in Kratix specific epic.

puja108 commented 1 month ago

As for:

Is the workload cluster a prerequisite or do we want to create the workload cluster with the platform?

WC creation is out of scope. Could be a separate demo where we aim for WC creation from Backstage. Needs issue.

deploying additional resources like namespaces, quotas, permissions, rbac with the platform

Out of scope here, BUT to me this is the next separate demo/feature we should work on. Deploying a ready "environment" for a dev team to a cluster is a super common use case that also a lot of current customers have. This we should then do at least in cooperation with Big Mac, as they do have some early work towards at least the RBAC part of it. Also needs issue.

preselect components for your app in backstage (for example, choosing specific preexisting RDS and then creating Database inside that specific RDS instance), this could be important for demoing as component creation takes time

If possible I'd like to keep this out of phase 1. Could be part of phase 2. For the demo in phase 1 I would rely on two instantiations, one where we have run the demo in advance and everything works, and one where we show the creation, and switch from one to the other when we want to show things working.

mproffitt commented 1 month ago

not sure if transitgateway is needed here (is it?)

I would have very much liked to avoid transit gateway - It's the one network component I have always struggled with but unfortunately yes, it is needed.

The issue with requiring it relates to being able to get crossplane talking from the Management Cluster to inside the RDS database in order to set up the application database, users/roles and grants - this I considered to be important enough to add additional effort to get working - It's all well and good creating a database server/cluster but if that database cannot be used by the app that needs to consume it, then it's not really much use and there is really no other way to communicate to the RDS without using a TGW

It could be argued "just set up an additional peering connection to the MC" but the second part of this is purpose. TGWs and Peering connections serve different purposes with peering connections being best for high-throughput / low latency and TGWs used for everything else.

On a far more positive note, despite it melting my brain, the core of the TGW is now done - I built a basic version this afternoon and I'm fairly confident that this will just work.

No further optimisation of crossplane re region

I was less clear with your meaning on this point. If this is relating to the composition wrapper that first looks up a cluster, retrieves region and availability zone data, then feeds that to the next composition wrapper then the outer wrapper has not been tested and does not require any downstream changes, it is simply a passthrough that looks up some additional details that's probably best visualised as in the diagram below - any coloured box is a separate composition, white boxes are either endpoint compositions or specific MRs (or simply a reference to what data comes from where to start the inner wrapper)
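
As a rough illustration of the passthrough idea described above (not the actual crossplane composition), the outer wrapper can be thought of as a function that looks up extra details for a cluster and hands them, together with the original parameters, to the next wrapper. All names and values here (`CLUSTER_DETAILS`, `example-cluster`, `outer_wrapper`, `inner_wrapper`) are hypothetical placeholders.

```python
from typing import Any, Dict

# Hypothetical lookup data; the real wrapper retrieves this from the cluster.
CLUSTER_DETAILS: Dict[str, Dict[str, Any]] = {
    "example-cluster": {
        "region": "eu-central-1",
        "availability_zones": ["eu-central-1a", "eu-central-1b"],
    },
}


def inner_wrapper(parameters: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for the next composition; it just records what it received."""
    return {"composition": "inner", "parameters": parameters}


def outer_wrapper(cluster_name: str, parameters: Dict[str, Any]) -> Dict[str, Any]:
    """Look up region and AZs for the cluster, then pass everything through."""
    details = CLUSTER_DETAILS[cluster_name]
    enriched = dict(parameters)  # leave the caller's parameters untouched
    enriched["region"] = details["region"]
    enriched["availability_zones"] = list(details["availability_zones"])
    return inner_wrapper(enriched)
```

The point of the sketch is only that the inner step does not need to change: it receives the same parameters as before plus the looked-up region and availability zones.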

Image

piontec commented 1 month ago

Hey all! I just came back and I need to sync with Laszlo, but a few comments from my side:

marians commented 1 month ago

Here is a high-level process diagram I'd like us to maintain, so it represents what we are building. (Not finished yet)

https://miro.com/app/board/uXjVKnjQei8=/

weatherhog commented 1 month ago

The decision was made to follow the path with the Go app, which still needs to be modified.

weatherhog commented 1 month ago

This is the list of apps deployed as part of the release

rdsapp                                1.0.1               2m54s        2m53s           deployed
rdsapp-app-operator                   6.11.0              2m45s        2m43s           deployed
rdsapp-aws-ebs-csi-driver-smons                           2m51s
rdsapp-aws-pod-identity-webhook                           2m51s
rdsapp-capi-node-labeler                                  2m51s
rdsapp-cert-exporter                                      2m51s
rdsapp-cert-manager                                       2m51s
rdsapp-chart-operator                                     2m45s
rdsapp-chart-operator-extensions                          2m51s
rdsapp-cilium-servicemonitors                             2m51s
rdsapp-cluster-autoscaler                                 2m51s
rdsapp-etcd-k8s-res-count-exporter                        2m51s
rdsapp-external-dns                                       2m51s
rdsapp-irsa-servicemonitors                               2m51s
rdsapp-k8s-audit-metrics                                  2m51s
rdsapp-k8s-dns-node-cache                                 2m51s
rdsapp-metrics-server                                     2m51s
rdsapp-net-exporter                                       2m51s
rdsapp-node-exporter                                      2m51s
rdsapp-observability-bundle                               2m51s
rdsapp-prometheus-blackbox-exporter                       2m51s
rdsapp-security-bundle                                    2m51s
rdsapp-teleport-kube-agent                                2m51s
rdsapp-vertical-pod-autoscaler                            2m51s
marians commented 3 weeks ago

Regarding Backstage UI wording:


Card title

Target entity -> Creation progress


Target component is not available yet. See Kratix resources for more information.

should be changed to:

For resources to be created, this pull request must be merged. After merging, it can take several minutes for resource creation to start.

Once resources get created, you can track creation progress in the Kratix resources tab.

Once the catalog entity exists, we show this:

The pull request defining these resources has been merged.

See resource creation details in the Kratix resources tab.

View the entity page for marians-demo-service to see more details about the component and its deployments.

marians commented 1 week ago

We are changing the demo flow as follows:

Rationale:

mproffitt commented 1 week ago

In my opinion this removes the capability of showing a very important aspect of the journey in that an app can be deployed along with any and all infrastructure required for its operation should that infrastructure not already exist.

One of the arguments about VPC CIDRs was "Where does that information come from" with the answer being "The platform team" - this argument was not suitable as, in the opinion of the team, it led to a lack of self-service for application teams who may need to spin up and tear down infrastructure without interaction with the platform team

Now the argument is that "The platform team should provide the RDS database" which contradicts the earlier argument.

If we're to argue that the platform team should handle all infrastructure builds then the purpose of the demo (deploying a new service) becomes moot as it does not demonstrate that a service can be deployed with all required infrastructure.

Whilst the argument made here does carry a lot of merit, it detracts from the capability.

Additionally to this, the arguments only consider RDS, ignoring entirely the Elasticache part of the service which would not work for an additional application as there are no application specific credentials attached and in fact to add credentials for a second application to Elasticache would require a modification to the replication group built by the original deployment and a restart of all replicated clusters.

This capability does not exist today and due to how Elasticache works is not something that can be built separately, as in the case of provisioning users inside RDS.

LutzLange commented 1 week ago

I do see both of your points, @marians and @mproffitt.

The big question here is: Who is the audience of the demo. I think it is targeting developers. And as such it should focus more on speed than on creating an environment that is ready to run production workloads. Setting up a full RDS database feels more like a getting-ready-for-production task.

Developers are used to using virtual or lightweight dbs for testing & QA. Saving costs with smaller dev environments is also quite common.

marians commented 1 week ago

Who is the audience of the demo. I think it is targeting developers.

I would slightly refine this: our target audience are platform teams. The end user we impersonate for the demo flow is a developer.

marians commented 1 week ago

Talked with @mproffitt and @piontec about Elasticache. To keep things simple, we are not going to provision anything new per new service created. All services/apps will use the same Elasticache redis server. Redis supports multiple databases, but the identifier is numeric, and we wouldn't have an easy way to map database and service/project. By writing to the same database, there may be a theoretical risk of key collision, but we accept this for now, as we don't run many demos concurrently and we can set the key lifetime very short in our demo application.
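
To make the trade-off above concrete, here is a small sketch of a shared cache with a short key lifetime. It uses an in-memory dictionary instead of a real redis client, and the service-name key prefix is a hypothetical convention added for illustration; the thread itself only decides to accept the theoretical collision risk and keep key lifetimes short.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class SharedCache:
    """In-memory stand-in for the single shared redis database."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}  # key -> (value, expiry)

    def set(self, service: str, key: str, value: str) -> None:
        # Hypothetical convention: prefix every key with the service name.
        full_key = f"{service}:{key}"
        self._entries[full_key] = (value, self.clock() + self.ttl_seconds)

    def get(self, service: str, key: str) -> Optional[str]:
        full_key = f"{service}:{key}"
        entry = self._entries.get(full_key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._entries[full_key]  # entry has outlived its short lifetime
            return None
        return value
```

With a prefix like this, two demo services writing the same logical key would not overwrite each other, and the short lifetime limits how long any stale or colliding entry can survive.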

puja108 commented 1 week ago

Interesting feedback around the VPC CIDRs that we got from a potential customer when we showed them our IDP demo architecture was that, in their case, there's a team that basically has VPC provisioning as their main service, so they basically separate our demo into several use cases that play into each other. That still does not invalidate our demo; it just shows that different companies might dissect the use cases or services differently.

Similar, I'd say, to how we here now dissect the "creation of an RDS cluster" from the "creation and provisioning of a DB in said cluster".

I think it is good to cut the demo down to something rather small for now, and then be able to show the extended use cases and the complexity that @mproffitt mentioned separately, because customers will not get around the complexity; it will just move somewhere else. In this customer's case it moves to a team that will for now not use our stuff to automate their processes, but that we could maybe convince at some point, which would then make it easier for them to chain and integrate platform services into a coherent user experience.

marians commented 4 days ago

More ideas

mproffitt commented 4 days ago

@marians the first point I'm agnostic towards but I wonder if that's overcomplicating things a little

The second point, no. This would automate too much. It would a) detract from showing what requires or should have human input and b) create a failure point, for example when you have selected the wrong provider config or (in future) assigned permissions to users or roles that are incorrect.

Even though a lot of this is automated, I feel automating the PR approval is a step too far and introduces entropy into the system

marians commented 4 days ago

Automating the PR approval was meant as a fake thing that would simulate what otherwise would of course be done through a human.

LutzLange commented 4 days ago

We used to just say: "And if you want, you can require PR reviews to merge your requests." And then merge them ourselves in the Demos that I did for Weaveworks.

We should be fine addressing this with the audio track.

puja108 commented 4 days ago

Yeah, most companies I've spoken to have some kind of approval process. That said, if automated validation could be done in the PR, at least some would enable some auto-merge functionality. Usually there would be some kind of PR bot checking access control (i.e. is the user allowed to request said resource) and approving, and then, if all validation tests are green, the auto-merge goes through.
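
The decision such a PR bot makes could look roughly like the sketch below. It is not tied to any real bot or GitHub API; the `PullRequest` type, the `allowed` mapping and the function name are hypothetical and only illustrate the two conditions mentioned: access control first, then all validation checks green.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class PullRequest:
    author: str
    requested_resource: str
    checks: Dict[str, bool]  # check name -> passed?


def may_auto_merge(pr: PullRequest, allowed: Dict[str, Set[str]]) -> bool:
    """Return True only if the author may request the resource and all checks pass."""
    # Access control: is the author allowed to request this resource?
    if pr.requested_resource not in allowed.get(pr.author, set()):
        return False
    # Validation: at least one check ran and every check is green.
    return bool(pr.checks) and all(pr.checks.values())
```

Requiring at least one check is a deliberate choice in this sketch, so that a PR with no validation results is never merged automatically.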

marians commented 3 days ago

In the GoReleaser step of the release workflow I see this log message:

only configurations files on version: 2 are supported, yours is version: 0, please update your configuration
marians commented 3 days ago

Is there a technical reason for all workloads landing in the default namespace? Would it make sense to create a namespace after the service name?

mproffitt commented 2 days ago

@marians The main driver for the demo at this stage was simplicity, also see the comment from @puja108 here https://github.com/giantswarm/roadmap/issues/3470#issuecomment-2288134103

deploying additional resources like namespaces, quotas, permissions, rbac with the platform

Out of scope here, BUT to me this is the next separate demo/feature we should work on.

As for technical reasons: in fact the crossplane compositions support a different namespace for delivering the secrets, and we can use any namespace on the workload cluster for application deployment. The only requirement is that the namespace must pre-exist for ESO to send secrets to and, as per Puja's response, we had moved this out of phase 1 delivery.

puja108 commented 1 hour ago

Just putting it here as a sidenote:

Namespace creation for a new project is a thing most companies have as a service and could be a cool module by itself. It could provision a namespace (with RBAC/OIDC, quota, security/network policy setup) for those use cases where there's no golden path (yet), and it could be chained with a golden path like in this demo, to remove the need for a two-step request.

The good thing is, that such a namespace provisioning service could be basically just a helm chart that takes values like project name, team name, OIDC group, and auto-maps things. It can then be extended with things like o11y multi-tenancy or network policy base by other teams like Atlas and Cabbage.
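
To illustrate the "takes values and auto-maps things" idea, here is a sketch that renders the kind of manifests such a service might produce from a project name, team name and OIDC group. It is plain Python rather than a helm chart, and the function name, label keys, object names and default quota values are assumptions made for the example; binding the group to the built-in `edit` ClusterRole is likewise just one possible choice.

```python
from typing import Any, Dict, List


def render_project_namespace(project: str, team: str, oidc_group: str,
                             cpu_requests: str = "4",
                             memory_requests: str = "8Gi") -> List[Dict[str, Any]]:
    """Render Namespace, RoleBinding and ResourceQuota manifests as dicts."""
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        # Label keys are hypothetical; real setups will have their own scheme.
        "metadata": {"name": project, "labels": {"project": project, "team": team}},
    }
    role_binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{team}-edit", "namespace": project},
        # Grants the OIDC group the built-in "edit" ClusterRole in this namespace.
        "roleRef": {"apiGroup": "rbac.authorization.k8s.io",
                    "kind": "ClusterRole", "name": "edit"},
        "subjects": [{"apiGroup": "rbac.authorization.k8s.io",
                      "kind": "Group", "name": oidc_group}],
    }
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{project}-quota", "namespace": project},
        "spec": {"hard": {"requests.cpu": cpu_requests,
                          "requests.memory": memory_requests}},
    }
    return [namespace, role_binding, quota]
```

Network policies, o11y multi-tenancy and similar extensions mentioned above would be additional manifests appended to the same list.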

That said, I'd see that as a complementary thing that we can and should build as it's straightforward and used by many customers, but we should make that a separate project in area platform. cc @teemow this might be a nice project for Q4 or Q1 that aligns different capabilities of different teams and can generate value directly without the need for complex customer customization. We could talk to adidas and some others that already have such a thing about what features they would expect from it.

teemow commented 1 hour ago

Thanks @puja108! I've put this in a separate issue: https://github.com/giantswarm/giantswarm/issues/31767