IDP demo journey

mproffitt commented 5 months ago

User Story

As a customer I want to select a template in a backstage catalog so that I am provided with a skeleton application to work with, based on a template repository and pre-populated with components that make up my deployment.

Details, Background

In order to take the user on a journey through the IDP, we have an overall story of creating infrastructure components via crossplane and deploying an app that then consumes that infrastructure.

To accomplish the story and really showcase the capabilities of all components in the pipeline the journey is as follows:

A user will access backstage and either create a new cluster or select an existing cluster to deploy the application to
Once the cluster is ready, the user will select a template inside backstage that is linked to a pre-defined github template repository
The user will fill out the project name and possibly some other details related to the template
When the user clicks "submit", a Kratix Promise CR is created. This CR will set up the github repo and bootstrap the repository into the cluster ready for flux to reconcile against
When flux reconciles the repository it will deploy a crossplane claim that creates infrastructure inside AWS. For the purposes of the demo this will be:
- a VPC with peering connections to the cluster VPC (note - backstage/kratix will have to hand-off the cluster VPC name)
- An RDS Database cluster
- An Elasticache cluster using redis as a backend
Flux will additionally deploy an App CR which will be consumed by App Platform to be deployed to the cluster. This will be a small application that a user can interact with which:
- writes information to the database
- reads information from elasticache
- if the cache reports cache-miss, updates the cache with the latest information from the database and then repeats the second step to send this back to the user
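The read/write behaviour described for the demo application is a cache-aside pattern. A minimal sketch follows, assuming in-memory stand-ins for the database and the cache; `FakeDatabase`, `FakeCache`, `write_record` and `read_record` are hypothetical names for illustration and are not taken from the actual demo app.

```python
import time
from typing import Optional, Tuple


class FakeDatabase:
    """In-memory stand-in for the RDS database (assumption for this sketch)."""

    def __init__(self) -> None:
        self.rows = {}

    def write(self, key: str, value: str) -> None:
        self.rows[key] = value

    def read(self, key: str) -> Optional[str]:
        return self.rows.get(key)


class FakeCache:
    """In-memory stand-in for Elasticache; entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float = 30.0) -> None:
        self.ttl_seconds = ttl_seconds
        self.entries = {}

    def get(self, key: str) -> Optional[str]:
        entry = self.entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            # Expired entries count as a cache miss.
            del self.entries[key]
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self.entries[key] = (value, time.monotonic() + self.ttl_seconds)


def write_record(db: FakeDatabase, key: str, value: str) -> None:
    """User input goes to the database only; the cache is filled on read."""
    db.write(key, value)


def read_record(db: FakeDatabase, cache: FakeCache, key: str) -> Tuple[Optional[str], str]:
    """Return (value, source), where source says how the value was obtained."""
    value = cache.get(key)
    if value is not None:
        return value, "cache"
    # Cache miss: fetch the latest value from the database.
    value = db.read(key)
    if value is None:
        return None, "missing"
    cache.set(key, value)
    # Repeat the cache read so the response is served from the cache.
    return cache.get(key), "database"
```

The first read after a write reports `"database"` as its source because the cache had to be refilled; later reads within the TTL report `"cache"`.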

Flow diagrams: https://miro.com/app/board/uXjVKnjQei8=/

Architecture

high-level-platform-arch drawio

Blocked by / depends on

- [ ] https://github.com/giantswarm/roadmap/issues/3472
- [ ] https://github.com/giantswarm/roadmap/issues/3429
- [ ] https://github.com/giantswarm/roadmap/issues/3583
- [ ] https://github.com/giantswarm/roadmap/issues/3469
- [x] Set up payment for DemoTechInc to be able to speed up builds
- [x] Insert the correct module name in go.mod (currently laszlo-kratix-14)

### Improvements
- [x] Remove database cluster creation from scaffolder template
- [x] Scaffolder UI should only provide fields which are needed
- [x] Demo app UI needs improvement in a few places
- [x] Grafana dashboard for metrics and logs exposed by our service
- [x] Link to Grafana dashboard from Deployment
- [x] Scaffolder overview should provide more structure, context per item
- [x] Remove "Publish Infrastructure PR" action from scaffolder template
- [x] start releasing and tagging promises and make them use tagged images (not latest)
- [ ] https://github.com/giantswarm/roadmap/issues/3704
- [ ] https://github.com/giantswarm/roadmap/issues/3608
- [ ] https://github.com/giantswarm/roadmap/issues/3728
- [x] extract demo app to a separate repo for other teams to use it
- [ ] https://github.com/giantswarm/giantswarm/issues/31767
- [ ] https://github.com/giantswarm/roadmap/issues/3607
- [ ] https://github.com/giantswarm/giantswarm/issues/31929
- [ ] Improve build times in Github actions through faster runners
- [ ] Provide more description in pull requests created by scaffolder
- [ ] Pull requests should get validated via GitHub actions (yamllint, schema validation)
- [ ] Elasticache user groups per logical database
- [ ] Enable usages between crossplane resources to improve teardown - requires xp 1.18 for beta usages
- [ ] Better way to manage scaling target groups to zero on deletion - potential kyverno cleanup policy
- [ ] Flagger canary deployments / environment promotion
- [ ] Complete IPAM integration for VPC CIDR selection
- [ ] Progress comments after merging a PR
- [ ] o11y integration
- [ ] Auto clean Backstage catalog if the source catalog-info.yaml file was deleted
- [ ] Add automation to delete temporal Kratix Resources from Backstage catalog
- [ ] Review hardcoded parts in Backstage template and improve if possible
- [ ] Add dashboard for python project

### Research/discovery
- [ ] https://github.com/giantswarm/roadmap/issues/3706
- [ ] https://github.com/giantswarm/roadmap/issues/3738
- [ ] https://github.com/giantswarm/roadmap/issues/3739

puja108 commented 5 months ago

Shouldn't it be a Kratix resource request? I thought the kratix promise is the template/definition of what should be done, and then a request basically instantiates that once, or am I mistaken?

mproffitt commented 5 months ago

@puja108 My understanding is that the Promise is kind of like a crossplane composition and the CR is a claim against that promise. Admittedly I'm not familiar with the terminology of Kratix at this point, perhaps @piontec has a clearer definition that can fit in here.

mproffitt commented 5 months ago

Updated to add high level architectural flow

puja108 commented 5 months ago

Nice, I guess "deploy app" step also includes create namespace for you? I would honestly pull that step out and put it next to "create cluster" as for platform teams the creation of a new project/team namespace is quite an important step that includes OIDC/RBAC setup, quotas, etc. so having it more or less as a "module" would make a lot of sense.

mproffitt commented 5 months ago

That's an interesting thought - I think my main question here would be "what would be the delivery mechanism" - I can see two possibilities here - We provide a pre-packaged helm chart that can template the namespace, roles, quotas, etc and can be delivered via the app platform

The second option is to use flux to deliver these via a kustomize base path

I'd probably lean towards flux here for sake of simplicity and extensibility

@piontec What are your thoughts on this?

LutzLange commented 5 months ago

I've seen different scenarios for development environment vs. production environment.

You want to make developers as productive as possible, while production environments are stricter and controlled.

Which use case are we targeting here exactly? Is it A) Resources for developers to iterate fast? B) Rolling an Application out into a production or pre production environment?

I do understand showing infrastructure management, but feel we should be able to highlight the value of what we are delivering. In an Integrated Development Environment, I would want to show a flow that allows a developer to quickly go through code / deploy / feedback / fix cycles.

mproffitt commented 5 months ago

but feel we should be able to highlight the value of what we are delivering

Do you care to elaborate more on how this is not highlighting the value in what we are delivering?

On the contrary I am firmly of the belief this showcases the skills the team has to offer by bringing together a number of disparate tools into a single cohesive journey in a way that customers are already requesting capability towards. Having the ability to demonstrate that is IMHO an incredibly powerful tool, and one we do not have in our arsenal today.

Perhaps I should clarify. An IDP has nothing to do with what happens in an engineer's local development environment. IDP in this instance relates to an Internal Developer Platform, an interaction point between engineers and the clusters, and specifically on the portal side (backstage) a place they can go to see what is happening inside the cluster and jump off towards other tools and applications that help them with this understanding.

By understanding what is going on inside the cluster, engineers are empowered towards the products they themselves manage.

The IDP would be a place they can construct deployments from off-the-shelf products, be that applications delivered as community driven helm charts, infrastructure delivered as crossplane compositions or new applications they are building driven via github template repos represented as kratix promises inside the cluster, and have them delivered to the cluster via continuous deployment (flux). This is what this demo journey is showing.

A) Resources for developers to iterate fast?

This is covered in that engineers can use the platform to quickly bootstrap new applications into the cluster.

B) Rolling an Application out into a production or pre production environment?

We do not care what environment the engineers are rolling to - all clusters are equal. We care only that the journey is the same irrespective of the target environment and in doing so, take away some of the pain of managing cross-environment deployments

In an Integrated Development Environment, I would want to show a flow that allows a developer to quickly go through code / deploy / feedback / fix cycles.

I do not get the relevance here. We are not interested in what happens during the application development lifecycle. IDEs are not a topic I plan to support and are certainly out of scope of this journey.

LutzLange commented 5 months ago

This might be the wrong place for this discussion...

I do think your demo is a valuable tool that we don't have today. I'm just looking at the overall story and am wondering if we need another demo as well.

Our current iteration of the story & the value is: We allow your platform engineering team to focus on what is most important in your company: Making developers more productive to speed up innovation and drive business agility and business outcomes. If developers can iterate quickly, your business stays agile and can adapt quickly to changing demands.

I've seen many platform demos. And I don't deny there is value in doing them. It is a lot of work to build these. Though the demo can be a bit unimpressive in the end ( It should be ). It often comes down to pushing a button to commit a change and then automation kicking in and delivering a new / changed cluster.

Value to the business is in developers iterating faster, delivering better software quicker. Our story of "freeing up the platform engineers to be able to enable developers" is what we can show with this demo.

This feels like a bit of a stretch to me, like we should try and do more. As if we free you up to do important things but will not be able to help you with these more important tasks. Would it not be even better, if we could show that devs can actually iterate faster with our IdP? A comparison of the situation / work without the IdP and the situation with the IdP. This could be a combination of slide ware and demo ( demo with the IdP ).

I know that this is not where we are today. As I said in the beginning, this might not be the right place.

mproffitt commented 5 months ago

Architecture diagram updated to include the separation of delivery of components such as namespace, quotas, permissions, etc.

These will be delivered to the cluster using the out-of-band delivery method as described in our gitops-template

puja108 commented 5 months ago

We provide a pre-packaged helm chart that can template the namespace, roles, quotas, etc and can be delivered via the app platform

We might need to talk to Big Mac here, as they already have some RBAC helper app IIRC and this gets close to their access management ownership. Maybe that app could be provided by Big Mac and deployed by whatever means Honeybadger feels most adequate.

puja108 commented 5 months ago

As for @LutzLange's comments, I feel this is beyond the scope of this PoC/demo.

This here is just about the getting started quickly step, i.e. setting up a new project with all the bells and whistles (we could over time add additional templates like e.g. for security or o11y to this). This is a big value driver that many current and most potential new customers have been asking for or even working on themselves.

Fast iteration cycles once the project is set up might be influenced by this as everything is set up right and we try to have all environments similarly configured. But there might be other things to show there, which would be based on other features we might work on at some point in the future, e.g. Flagger for canary deployments, automatic branch deployments, o11y setup and validation feedback for these,...

weatherhog commented 3 months ago

@mproffitt as the solution architect for the IDP Demo and I sat together to summarise where we are:

intermediate status:

Backstage:

Wireframes were created
PoC based on wireframes (scaffolder PoC)
based on capabilities changes were made (basically to reflect back the complexity of cloud) - mostly additional things that needed to be added, the basic structure stayed the same

Some complexity got moved to crossplane, but should not be in crossplane. From User perspective this should be outside of crossplane, mainly everything regarding the region. This requires some complexity in Backstage. The question is, is this complexity already available in backstage (based on post in #news-dev, the Installations page already shows Region information).

Kratix

At the moment we need to ask ourselves which role Kratix has in our Developer Platform and whether we really need it. @piontec made it work to bootstrap the app-template Demo. @uvegla is figuring out details to make this work on golem, as it was only working in KinD clusters so far. The initial idea was that Kratix solves some pain points by setting up the gitops repo, bootstrapping it to the cluster and reporting it back to backstage. This should solve the pain points that customers need to create their own gitops repo and create secrets to bootstrap it to the cluster. We have seen issues with this before and need to solve this. It seems like Kratix is not the right solution for this, as Kratix controls the resources. We need to rethink what Kratix's role is in our platform (this is not about dropping Kratix, more about refining what Kratix's role is).

Crossplane

Crossplane is bootstrapped into Management Clusters with all providers and functions relevant for the demo.
Crossplane has been integrated with ESO, enabling some dogfooding (we as GS never had a use case for ESO, but now we have)
Crossplane Compositions were written for following AWS components:
- VPC
- Peering
- Resource Access Manager
- Transit Gateway (still WIP)
- RDS
- Elasticache
- provisioning inside RDS (Database creations for example - untested)
- binding that creates VPC, RDS and Elasticache
- wrapper for the binding which enables looking up the workload cluster region and availability zones
Documentation for the Crossplane APIs got created

Demo App (app that gets deployed into the demo cluster)

work has not been started yet on this
App requirements:
- App needs to be deployed into cluster
- Needs to be capable of reading from Elasticache; if there is no value in Elasticache it should then read from RDS, write the value back into Elasticache and return it, preferably from Elasticache.
- Needs to accept data from the user and write the data into RDS (it only goes into Elasticache if requested by a read)
- @mproffitt found a python script which was already capable of most of our requirements, updated the script and made it work (original script: https://github.com/mathurk1/Flask-Postgres-Redis-Docker-App)

The question that we need to ask ourselves, do we need to write something on our own in Go, or can we be happy with the app being written in python, which already exists. The goal is to show the platform and how it works and the goal should not be to show that we can write apps in Go - which is also clear because we have plenty of apps in Go.

Completeness of our components (approximately):

Backstage 70%
Crossplane 70% - 80%
Kratix needs to be discussed
Demo App 0% (or 80% if we take the python one)

Tasks:
[ ] Evaluate and discuss Kratix (like mentioned above)
[ ] Region lookup logic needs to go into Backstage
[x] Transit Gateway needs to be completed
[x] Crossplane needs to be tested on Management cluster end-to-end
[ ] Kratix needs to be implemented on golem and tested
[x] Make a decision on are we going with Go or with Python for the app
- [x] If App needs to be in Go, app needs to be implemented
- [ ] If we go with python app, app needs to be finalised

After these tasks are done, the final task is to put everything together and test the whole platform end-to-end.

Nice to have extension points (can also be discussed):

Is the workload cluster a prerequisite or do we want to create the workload cluster with the platform?
deploying additional resources like namespaces, quotas, permissions, rbac with the platform
preselect components for your app in backstage (for example, choosing specific preexisting RDS and then creating Database inside that specific RDS instance), this could be important for demoing as component creation takes time

cc: @marians @gusevda @piontec @mproffitt @uvegla

mproffitt commented 3 months ago

This is the more complete architecture as used in the interim demo

puja108 commented 3 months ago

I'd like to phase this work a bit more so we can focus on getting a minimal thing out that is demoable soon.

To me that would mean

phase 1

aim to have this by end of month so we can achieve the company goal of doing the demos in September
Focus on getting the python app going and wire everything together into a coherent demo
Leave out Kratix, rely on Backstage git integration
No further optimisation of crossplane re region if it is working right now, consider cutting things out that are not working and moving to phase 2
not sure if transitgateway is needed here (is it?)

phase 2:

improvements to crossplane setup re region, remaining crossplane things

And at the same time, once @piontec is back we can discuss direction with Kratix and if what we are intending to do with it (aka API to Git) would work or not, but pull it out of this demo for now and continue in Kratix specific epic.

puja108 commented 3 months ago

As for:

Is the workload cluster a prerequisite or do we want to create the workload cluster with the platform?

WC creation is out of scope. Could be a separate demo where we aim for WC creation from Backstage. Needs issue.

deploying additional resources like namespaces, quotas, permissions, rbac with the platform

Out of scope here, BUT to me this is the next separate demo/feature we should work on. Deploying a ready "environment" for a dev team to a cluster is a super common use case that also a lot of current customers have. This we should then do at least in cooperation with Big Mac, as they do have some early work towards at least the RBAC part of it. Also needs issue.

preselect components for your app in backstage (for example, choosing specific preexisting RDS and then creating Database inside that specific RDS instance), this could be important for demoing as component creation takes time

If possible I'd like to keep this out of phase 1. Could be part of phase 2. For the demo in phase 1 I would rely on two instantiations, one where we have run the demo in advance and everything works, and one where we show the creation and switch from one to the other when we want to show things working.

mproffitt commented 3 months ago

not sure if transitgateway is needed here (is it?)

I would have very much liked to avoid transit gateway - It's the one network component I have always struggled with but unfortunately yes, it is needed.

The issue with requiring it relates to being able to get crossplane talking from the Management Cluster to inside the RDS database in order to set up the application database, users/roles and grants - this I considered to be important enough to add additional effort to get working - It's all well and good creating a database server/cluster but if that database cannot be used by the app that needs to consume it, then it's not really much use and there is really no other way to communicate to the RDS without using a TGW

It could be argued "just set up an additional peering connection to the MC" but the second part of this is purpose. TGWs and Peering connections serve different purposes with peering connections being best for high-throughput / low latency and TGWs used for everything else.

On a far more positive note, despite it melting my brain, the core of the TGW is now done - I built a basic version this afternoon and I'm fairly confident that this will just work.

No further optimisation of crossplane re region

I was less clear with your meaning on this point. If this is relating to the composition wrapper that first looks up a cluster, retrieves region and availability zone data, then feeds that to the next composition wrapper then the outer wrapper has not been tested and does not require any downstream changes, it is simply a passthrough that looks up some additional details that's probably best visualised as in the diagram below - any coloured box is a separate composition, white boxes are either endpoint compositions or specific MRs (or simply a reference to what data comes from where to start the inner wrapper)
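The passthrough relationship between the outer wrapper and the inner binding can be sketched as plain functions. This is only an illustration of the data flow under stated assumptions: the `CLUSTERS` inventory, its region and availability zone values, and the function names are hypothetical and do not reflect the real composition APIs.

```python
# Hypothetical cluster inventory. In the real setup this data is looked up
# from the workload cluster rather than from a static table.
CLUSTERS = {
    "demo-wc": {
        "region": "eu-central-1",
        "availability_zones": ["eu-central-1a", "eu-central-1b", "eu-central-1c"],
    }
}


def lookup_cluster(name, clusters=CLUSTERS):
    """Return region and availability zone details for a named cluster."""
    try:
        return clusters[name]
    except KeyError:
        raise LookupError(f"unknown cluster: {name}") from None


def inner_binding(params):
    """Stand-in for the binding that creates VPC, RDS and Elasticache."""
    common = {
        "region": params["region"],
        "availability_zones": params["availability_zones"],
    }
    return [
        {"kind": kind, "name": f"{params['name']}-{kind}", **common}
        for kind in ("vpc", "rds", "elasticache")
    ]


def outer_wrapper(request, clusters=CLUSTERS):
    """Look up cluster details, then pass everything through unchanged."""
    details = lookup_cluster(request["cluster"], clusters)
    return inner_binding({**request, **details})
```

The outer function adds nothing beyond the looked-up details, which is why no downstream changes are needed when it is introduced.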

piontec commented 3 months ago

Hey all! I just came back and I need to sync with Laszlo, but a few comments from my side:

we already have a go app (a template), that was used to test and build the demo scenario. This app has many features included, like a helm chart and a ready full-fledged CI/CD process, that encompasses full build pipeline, including signing and verification with cosign. I believe it is much easier to add a simple DB code into this app, than it is to try and port all of that into the python app. We've also tested all the project templating and bootstrapping with it.
As for kratix' role, I'm not sure what happened when I was away. Kratix' role seemed pretty well defined: get an object with a project request from a user, bootstrap everything, create resources to kick off Crossplane and commit the bunch to git. This was tested on a PoC level with KinD before I left. This is also what Dmitry was aiming at with backstage integration. I believe removing kratix now will create way more work than we need to keep it and make it work on our clusters, but again, I need to sync with Laszlo first (tomorrow) to see if there are some new problems I wasn't aware of.

marians commented 3 months ago

Here is a high level process diagram I'd like us to maintain, so it represents what we are building. (Not finished yet)

https://miro.com/app/board/uXjVKnjQei8=/

weatherhog commented 3 months ago

The decision was made to follow the path with the Go app, which still needs to be modified.

weatherhog commented 3 months ago

This is the list of apps deployed as part of the release

rdsapp                                1.0.1               2m54s        2m53s           deployed
rdsapp-app-operator                   6.11.0              2m45s        2m43s           deployed
rdsapp-aws-ebs-csi-driver-smons                           2m51s
rdsapp-aws-pod-identity-webhook                           2m51s
rdsapp-capi-node-labeler                                  2m51s
rdsapp-cert-exporter                                      2m51s
rdsapp-cert-manager                                       2m51s
rdsapp-chart-operator                                     2m45s
rdsapp-chart-operator-extensions                          2m51s
rdsapp-cilium-servicemonitors                             2m51s
rdsapp-cluster-autoscaler                                 2m51s
rdsapp-etcd-k8s-res-count-exporter                        2m51s
rdsapp-external-dns                                       2m51s
rdsapp-irsa-servicemonitors                               2m51s
rdsapp-k8s-audit-metrics                                  2m51s
rdsapp-k8s-dns-node-cache                                 2m51s
rdsapp-metrics-server                                     2m51s
rdsapp-net-exporter                                       2m51s
rdsapp-node-exporter                                      2m51s
rdsapp-observability-bundle                               2m51s
rdsapp-prometheus-blackbox-exporter                       2m51s
rdsapp-security-bundle                                    2m51s
rdsapp-teleport-kube-agent                                2m51s
rdsapp-vertical-pod-autoscaler                            2m51s

marians commented 2 months ago

Regarding Backstage UI wording:

Card title

Target entity -> Creation progress

Target component is not available yet. See Kratix resources for more information.

should be changed to:

For resources to be created, this pull request must be merged. After merging, it can take several minutes for resource creation to start.

Once resources get created, you can track creation progress in the Kratix resources tab.

Once the catalog entity exists, we show this:

The pull request defining these resources has been merged.

See resource creation details in the Kratix resources tab.

View the entity page for marians-demo-service to see more details about the component and its deployments.

marians commented 2 months ago

We are changing the demo flow as follows:

A database server/cluster is required to exist as a prerequisite
For any new service created during the demo, a logical database is added to the cluster/server associated with the deployment k8s cluster

Rationale:

We think that creating an RDS cluster for each service in development creates huge overhead, so the scenario is not very likely in a customer environment.
We want to create a minimum amount of RDS servers for cost saving reasons
RDS cluster provisioning takes time, and we want the demo to run as fast as possible
Creating a database server/cluster requires decisions from the end user that are hard to make for an engineer with limited knowledge of the infrastructure. This is more likely a platform team's job.

mproffitt commented 2 months ago

In my opinion this removes the capability of showing a very important aspect of the journey in that an app can be deployed along with any and all infrastructure required for its operation should that infrastructure not already exist.

One of the arguments about VPC CIDRs was "Where does that information come from" with the answer being "The platform team" - this argument was not suitable as in the opinion of the team it led to a lack of self service for application teams who may need to spin up and tear down infrastructure without interaction with the platform team

Now the argument is that "The platform team should provide the RDS database" which contradicts the earlier argument.

If we're to argue that the platform team should handle all infrastructure builds then the purpose of the demo (deploying a new service) becomes moot as it does not demonstrate that a service can be deployed with all required infrastructure.

Whilst the argument made here does carry a lot of merit, it detracts from the capability.

Additionally to this, the arguments only consider RDS, ignoring entirely the Elasticache part of the service which would not work for an additional application as there are no application specific credentials attached and in fact to add credentials for a second application to Elasticache would require a modification to the replication group built by the original deployment and a restart of all replicated clusters.

This capability does not exist today and due to how Elasticache works is not something that can be built separately, as in the case of provisioning users inside RDS.

LutzLange commented 2 months ago

I do see both your points, @marians and @mproffitt.

The big question here is: Who is the audience of the demo. I think it is targeting developers. And as such it should focus more on speed than on creating an environment that is ready to run production workloads. Setting up a full RDS database feels more like a getting-ready-for-production-workload task.

Developers are used to using virtual or lightweight dbs for testing & QA. Saving costs with smaller dev environments is also quite common.

marians commented 2 months ago

Who is the audience of the demo. I think it is targeting developers.

I would slightly refine this: our target audience are platform teams. The end user we impersonate for the demo flow is a developer.

marians commented 2 months ago

Talked with @mproffitt and @piontec about Elasticache. To keep things simple, we are not going to provision anything new per new service created. All services/apps will use the same Elasticache redis server. Redis supports multiple databases, but the identifier is numeric, and we wouldn't have an easy way to map database and service/project. By writing to the same database, there may be a theoretical risk of key collision, but we accept this for now, as we don't run many demos concurrently and we can set the key lifetime very short in our demo application.
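To make the key-collision trade-off concrete, here is a small sketch of how keys could be built when several services share one Redis database. Prefixing keys with the service name is an assumption added for illustration, not something decided in this thread; `cache_key`, `cache_entry` and the 10-second default are hypothetical.

```python
# Hypothetical "very short" key lifetime, in seconds.
DEFAULT_TTL_SECONDS = 10


def cache_key(service: str, key: str) -> str:
    """Prefix a key with its owning service so two services do not collide."""
    if not service or ":" in service:
        raise ValueError("service name must be non-empty and must not contain ':'")
    return f"{service}:{key}"


def cache_entry(service: str, key: str, value: str, ttl_seconds: int = DEFAULT_TTL_SECONDS):
    """Build the (name, value, ttl) triple that a cache client would store."""
    return cache_key(service, key), value, ttl_seconds
```

Without the prefix, two services writing `user:1` would overwrite each other; the short TTL limits how long a stale or colliding value can survive.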

puja108 commented 2 months ago

Interesting feedback around the VPC CIDRs we got from a potential customer when we showed them our IDP demo architecture was that, in their case, there's a team that has basically VPC provisioning as their main service, so they basically separate our demo into several use cases that play into each other. Still does not invalidate our demo, just that different companies might dissect the use cases or services differently.

Similar, I'd say, to how we here now dissect the "creation of an RDS cluster" from the "creation and provisioning of a DB in said cluster".

I think it is good to cut the demo into something rather small for now, and then be able to show the extended use cases and the complexity that @mproffitt mentioned separately, cause they will not get around the complexity, it will just move somewhere else, in the customer's case actually to a team that will for now not use our stuff to automate their processes, but that we could maybe convince at some point, which then makes it easier for them to chain and integrate platform services into a coherent user experience.

marians commented 2 months ago

More ideas

It would be so nice if we could have progress comments after merging a PR like https://github.com/DemoTechInc/demotech-gitops/pull/82, directly in the same PR.
When running the demo, I am the creator of such a PR, so I am not able to review/approve it myself. To merge it, I have to bypass branch protection rules. It would be more realistic if some user (bot) would approve the PR instead.

mproffitt commented 2 months ago

@marians the first point I'm agnostic towards but I wonder if that's overcomplicating things a little

The second point, no. This would automate too much and detract from showing a) what requires or should have human input and b) creates a failure point such as you selected the wrong provider Config or (future) assigned permissions to users or roles that are incorrect.

Even though a lot of this is automated, I feel automating the PR approval is a step too far and introduces entropy into the system

marians commented 2 months ago

Automating the PR approval was meant as a fake thing that would simulate what otherwise would of course be done through a human.

LutzLange commented 2 months ago

We used to just say: "And if you want, you can require PR reviews to merge your requests." And then merge them ourselves in the Demos that I did for Weaveworks.

We should be fine addressing this with the audio track.

puja108 commented 2 months ago

Yeah, most companies I've spoken to have some kind of approval process. That said, if automated validation could be done in the PR, at least some would enable some auto-merge functionality. Usually it would be some kind of PR bot checking for access control (i.e. is the user allowed to request said resource) and approving, and then if all validation tests are green the auto-merge goes through.
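The bot flow described here boils down to a small decision function. The sketch below is illustrative only: the `PullRequest` shape, the `ALLOWED_KINDS` table and the returned labels are all hypothetical and are not tied to any GitHub API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PullRequest:
    author: str
    requested_kinds: List[str]
    # Validation name mapped to pass/fail, e.g. {"yamllint": True}.
    check_results: Dict[str, bool] = field(default_factory=dict)


# Hypothetical access-control table: which resource kinds a user may request.
ALLOWED_KINDS = {
    "alice": {"database", "cache"},
    "bob": {"cache"},
}


def is_authorised(pr: PullRequest, allowed=ALLOWED_KINDS) -> bool:
    """True if the author may request every resource kind in the PR."""
    granted = allowed.get(pr.author, set())
    return all(kind in granted for kind in pr.requested_kinds)


def decide(pr: PullRequest, allowed=ALLOWED_KINDS) -> str:
    """Return the action the bot would take for this PR."""
    if not is_authorised(pr, allowed):
        return "needs-human-review"
    if not pr.check_results or not all(pr.check_results.values()):
        # Approved on access control, but validation is not green yet.
        return "approved-waiting-for-checks"
    return "auto-merge"
```

Keeping the access check separate from the validation results means a failed lint never auto-merges, and an unauthorised request always falls back to a human.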

marians commented 2 months ago

In the GoReleaser step of the release workflow I see this log message:

only configurations files on version: 2 are supported, yours is version: 0, please update your configuration

marians commented 2 months ago

Is there a technical reason for all workloads landing in the default namespace? Would it make sense to create a namespace after the service name?

mproffitt commented 2 months ago

@marians The main driver for the demo at this stage was simplicity, also see the comment from @puja108 here https://github.com/giantswarm/roadmap/issues/3470#issuecomment-2288134103

deploying additional resources like namespaces, quotas, permissions, rbac with the platform

Out of scope here, BUT to me this is the next separate demo/feature we should work on.

As for technical reasons, in fact the crossplane compositions support a different namespace for delivering the secrets, and we can use any namespace on the workload cluster for application deployment. The only requirement is that the namespace must pre-exist for ESO to send secrets to and, as per Puja's response, we had moved this out of phase 1 delivery.
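
To illustrate that prerequisite, here is a minimal sketch of a namespace manifest that would have to exist on the workload cluster before ESO could deliver secrets into it. The name `my-service` is a hypothetical placeholder for a namespace named after the service, not something defined in the demo:

```yaml
# Hypothetical example: a namespace named after the service.
# It must exist before secrets are delivered into it.
apiVersion: v1
kind: Namespace
metadata:
  name: my-service
```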

puja108 commented 2 months ago

Just putting it here as a sidenote:

Namespace creation for a new project is a thing most companies have as a service and could be a cool module by itself. It could provision a namespace (with RBAC/OIDC, quota, security/network policy setup) for those use cases where there's no golden path (yet), and it could be chained with a golden path like in this demo, to remove the need for a two-step request.

The good thing is that such a namespace provisioning service could be basically just a helm chart that takes values like project name, team name, OIDC group, and auto-maps things. It can then be extended with things like o11y multi-tenancy or network policy base by other teams like Atlas and Cabbage.
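
As a rough sketch of what the input to such a chart might look like, a values file could carry just the three items mentioned above. All key names and values here are hypothetical; no such chart exists yet:

```yaml
# values.yaml for a hypothetical namespace-provisioning chart.
# Every key below is an illustrative assumption, not an existing API.
projectName: demo-project      # would become the namespace name
teamName: team-example         # could be mapped to labels and quota
oidcGroup: example-developers  # could be bound to RBAC roles in the namespace
```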

That said, I'd see that as a complementary thing that we can and should build as it's straightforward and used by many customers, but we should make that a separate project in area platform. cc @teemow this might be a nice project for Q4 or Q1 that aligns different capabilities of different teams and can generate value directly without the need for complex customer customization. We could talk to adidas and some others that already have such a thing, to ask what features they would expect from it.

teemow commented 2 months ago

Thanks @puja108! I've put this in a separate issue: https://github.com/giantswarm/giantswarm/issues/31767

LutzLange commented 2 months ago

There is a lot of value in these basic templates. Another template that I have seen in the wild is "Create a Git repo". Repos need to be set up in the right way to keep things in order. There are naming conventions and security settings to take into consideration. Those should not be left open for developers to choose if you want to keep chaos at bay.

We already have this implemented as part of the IDP demo. It would make sense to pull this out as a separate template as well.

LutzLange commented 1 month ago

Franz wanted us to have some Governance aspects in the demo as well.

Governance has 2 parts: A) Security B) Compliance

A: How do we make sure things are secure? --> Security by default with Kyverno. --> Maybe Content Scanner + Renovate?

B: Compliance is a combination of secure and compliant settings, where a company needs to comply with a certain set of regulations by creating organisational procedures, guides and the correct security settings. A big part of compliance is proving that you are compliant. You need auditable systems for this. If we are using GitOps, it is easy to prove who did what and which settings were put in place at what time.

I think we can cover good parts of this without changing the technical part of the demo, but by addressing these in the audio track.

mproffitt commented 1 month ago

@LutzLange We were planning on addressing some of this with @giantswarm/team-shield next week and have already included Trivy scan integration in the list of potential improvements provided in the description of this issue.

For the moment though, for the audio track we can already highlight how we ensure some security, split into two topics:

- in cluster (image scans, Kyverno, etc.)
- cloud (disable default security groups, allow ingress only, no default VPCs, secure connections by default, encryption at rest, etc.)

We should be careful on the cloud security side though, as this is not a topic we traditionally cover and it would normally be the responsibility of the customer's cloud security team. I would be hesitant to get bogged down here as it's a whole topic unto itself. However, as we're showing building infrastructure, we can anticipate some questions on the topic.

puja108 commented 1 month ago

AFAIK we already have SBOMs and signatures in the build process and store them in the OCI registry. Not sure if we are already checking for those in cluster, but that might be an easy next step (enabled only for the app namespace to not break the whole cluster).

We also already have PSS enforcement in-cluster, not sure if we also have network policies, but that could be added. On this level we could mention that you need a combination of in-cluster enforcement and "adding the actual security rules and exceptions to the app". As in this case we are creating an app from a template this means the template needs to include those things and be "secure by default", which I would guess it is, if it runs smoothly in our clusters.

CVE scans and reporting would be a good next feature for platform in Q4, but we need to discuss that on a general level and I don't think it makes sense to just smash it into the demo right now, as the process there is more important than just showing CVEs.

LutzLange commented 1 month ago

There is a lot of value in simpler templates. You could also call them building blocks. They are valuable PE services on their own:

A) Create a Git Repository (ready to use with security & policy)
B) Create a Namespace (ready to use with security, limits & policy)
C) Create an EC2 instance (...)

The self-service aspect of these templates provides a lot of value. And if we can find a way to combine these building blocks into more complex templates easily, we would have a set of common building blocks and provide a lot of value to possible customers. I know these last points need further thought, investigation and discussion, but we could and should start with these simpler templates first.

mproffitt commented 1 month ago

Whilst I definitely agree that there is a lot of value in simpler templates, this goes far beyond the scope of the demo journey and more towards turning the demo into a fully fledged, ready-to-use platform.

My opinion on the current IDP demo is to attempt to answer some of the hardest questions facing the industry today.

Moving the demo to become a more rounded and evolved product should not be in scope for the demo platform, but should be scoped separately from this current journey, as it involves considerable additional thought, planning and implementation that would significantly impact the delivery of key features not yet given hard consideration.

This will definitely be an iterative process, however trying to implement simpler templates at this stage would have significant impacts on key questions that we've already been asked.

I would propose that discussions on simple templates be moved to a separate "platform progression" epic, except where otherwise in scope for phase 2.

Effectively this leaves B, and potentially A still in scope but C moves out.

puja108 commented 1 month ago

Along those lines, I think we should start closing the first demo issue and create follow-ups, for which we can then discuss priorities, also wrt the many other things Honeybadger should/could do in the next months. I've prepared a list to show the complexity of the roadmap decision for the team, but we need to talk about it soon to get clarity on what we want to do going forward (at least as long as we don't have a concrete customer to work with).

LutzLange commented 3 weeks ago

I just created a separate ticket with my suggestions for improvements. I've scheduled a call for 6-Nov with the Honeybadger team to discuss.
