mproffitt opened 5 months ago
Shouldn't it be a Kratix resource request? I thought the kratix promise is the template/definition of what should be done, and then a request basically instantiates that once, or am I mistaken?
@puja108 My understanding is that the Promise is kind of like a crossplane composition and the CR is a claim against that promise. Admittedly I'm not familiar with the terminology of Kratix at this point; perhaps @piontec has a clearer definition that can fit in here.
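For anyone mapping the terminology: a Promise does play the template role described above, in that it defines an API which users then instantiate with a resource request. A minimal sketch, assuming the current platform.kratix.io/v1alpha1 API, with all names invented for illustration:

```yaml
# Hypothetical Promise: exposes an "App" API on the platform cluster.
# A user later instantiates it by creating an App resource (the "request").
apiVersion: platform.kratix.io/v1alpha1
kind: Promise
metadata:
  name: app-template
spec:
  api:
    apiVersion: apiextensions.k8s.io/v1
    kind: CustomResourceDefinition
    metadata:
      name: apps.demo.example.org
    spec:
      group: demo.example.org
      scope: Namespaced
      names:
        kind: App
        plural: apps
        singular: app
      versions:
        - name: v1alpha1
          served: true
          storage: true
          schema:
            openAPIV3Schema:
              type: object
              properties:
                spec:
                  type: object
                  properties:
                    repoName:
                      type: string
```

A resource request would then just be an `App` object with a `spec.repoName`, and the Promise's workflows turn it into the actual artefacts.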
Updated to add high level architectural flow
Nice, I guess the "deploy app" step also includes creating a namespace for you? I would honestly pull that step out and put it next to "create cluster": for platform teams, the creation of a new project/team namespace is quite an important step that includes OIDC/RBAC setup, quotas, etc., so having it more or less as a "module" would make a lot of sense.
That's an interesting thought. I think my main question here would be "what would be the delivery mechanism?" I can see two possibilities: we provide a pre-packaged helm chart that can template the namespace, roles, quotas, etc. and can be delivered via the app platform.
The second option is to use flux to deliver these via a kustomize base path.
I'd probably lean towards flux here for the sake of simplicity and extensibility.
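To sketch what the flux option could look like — a minimal Kustomization pointing at a shared base path, where the repository name, path and substitution variable are all hypothetical:

```yaml
# Hypothetical sketch: deliver namespace/roles/quotas from a kustomize base.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: team-namespace-base
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: gitops-repo           # assumed to exist already
  path: ./bases/team-namespace  # base holding Namespace, RBAC, ResourceQuota
  prune: true
  postBuild:
    substitute:
      team_name: honeybadger    # templated into the base per team
```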
@piontec What are your thoughts on this?
I've seen different scenarios for development environments vs. production environments.
You want to make developers as productive as possible, while production environments are stricter and more controlled.
Which use case are we targeting here exactly? Is it A) Resources for developers to iterate fast? B) Rolling an Application out into a production or pre production environment?
I do understand showing infrastructure management, but feel we should be able to highlight the value of what we are delivering. In an Integrated Development Environment, I would want to show a flow that allows a developer to quickly go through code / deploy / feedback / fix cycles.
but feel we should be able to highlight the value of what we are delivering
Do you care to elaborate more on how this is not highlighting the value in what we are delivering?
On the contrary, I am firmly of the belief that this showcases the skills the team has to offer, by bringing together a number of disparate tools into a single cohesive journey in a way that customers are already asking for. Having the ability to demonstrate that is IMHO an incredibly powerful tool, and one we do not have in our arsenal today.
Perhaps I should clarify. An IDP is nothing to do with what happens in an engineer's local development environment. IDP in this instance relates to an Internal Developer Platform, an interaction point between engineers and the clusters, and specifically on the portal side (backstage) a place they can go to see what is happening inside the cluster and jump off towards other tools and applications that help them with this understanding.
By understanding what is going on inside the cluster, engineers are empowered to take ownership of the products they themselves manage.
The IDP would be a place they can construct deployments from off-the-shelf products, be that applications delivered as community-driven helm charts, infrastructure delivered as crossplane compositions, or new applications they are building, driven via github template repos, represented as kratix promises inside the cluster, and have them delivered to the cluster via continuous deployment (flux). This is what this demo journey is showing.
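To illustrate the portal entry point of that journey, here is a rough fragment of what such a golden path could look like as a Backstage Software Template; the template name, owner and repo URL are invented for the sketch, not the demo's actual template:

```yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: new-service-with-infra   # hypothetical template name
  title: New service with managed infrastructure
spec:
  owner: team-honeybadger
  type: service
  parameters:
    - title: Service details
      required: [name]
      properties:
        name:
          type: string
          description: Name of the new service
        needsDatabase:
          type: boolean
          description: Provision an RDS database via crossplane
  steps:
    - id: fetch
      action: fetch:template
      input:
        url: ./skeleton            # hypothetical skeleton directory
        values:
          name: ${{ parameters.name }}
    - id: publish
      action: publish:github
      input:
        repoUrl: github.com?owner=DemoTechInc&repo=${{ parameters.name }}
```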
A) Resources for developers to iterate fast?
This is covered in that engineers can use the platform to quickly bootstrap new applications into the cluster.
B) Rolling an Application out into a production or pre production environment?
We do not care what environment the engineers are rolling to - all clusters are equal. We care only that the journey is the same irrespective of the target environment and that, in doing so, we take away some of the pain of managing cross-environment deployments.
In an Integrated Development Environment, I would want to show a flow that allows a developer to quickly go through code / deploy / feedback / fix cycles.
I do not get the relevance here. We are not interested in what happens during the application development lifecycle. IDEs are not a topic I plan to support and are certainly out of scope of this journey.
This might be the wrong place for this discussion...
I do think your demo is a valuable tool that we don't have today. I'm just looking at the overall story and am wondering if we need another demo as well.
Our current iteration of the story & the value is: we allow your platform engineering team to focus on what is most important in your company: making developers more productive to speed up innovation and drive business agility and business outcomes. If developers can iterate quickly, your business stays agile and can adapt quickly to changing demands.
I've seen many platform demos. And I don't deny there is value in doing them. It is a lot of work to build these. Though the demo can be a bit unimpressive in the end (it should be). It often comes down to pushing a button to commit a change and then automation kicking in and delivering a new / changed cluster.
Value to the business is in developers iterating faster, delivering better software quicker. Our story of "freeing up the platform engineers to be able to enable developers" is what we can show with this demo.
This feels like a bit of a stretch to me, like we should try and do more. As if we free you up to do important things but will not be able to help you with these more important tasks. Would it not be even better if we could show that devs can actually iterate faster with our IDP? A comparison of the situation / work without the IDP and the situation with the IDP. This could be a combination of slideware and demo (demo with the IDP).
I know that this is not where we are today. As I said in the beginning, this might not be the right place.
Architecture diagram updated to include the separation of delivery of components such as namespace, quotas, permissions, etc.
These will be delivered to the cluster using the out-of-band delivery method as described in our gitops-template.
We provide a pre-packaged helm chart that can template the namespace, roles, quotas, etc. and can be delivered via the app platform
We might need to talk to Big Mac here, as they already have some RBAC helper app IIRC and this gets close to their access management ownership. Maybe that app could be provided by Big Mac and deployed by whatever means Honeybadger feels most adequate.
As for @LutzLange comments, I feel this is beyond the scope of this PoC/demo.
This here is just about the "getting started quickly" step, i.e. setting up a new project with all the bells and whistles (we could over time add additional templates to this, e.g. for security or o11y). This is a big value driver that many current and most potential new customers have been asking for or have even been working on themselves.
Fast iteration cycles once the project is set up might be influenced by this as everything is set up right and we try to have all environments similarly configured. But there might be other things to show there, which would be based on other features we might work on at some point in the future, e.g. Flagger for canary deployments, automatic branch deployments, o11y setup and validation feedback for these,...
@mproffitt, as the solution architect for the IDP demo, and I sat together to summarise where we are:
Intermediate status:
Backstage:
Based on cloud - mostly additional things that needed to be added; the basic structure stayed the same.
Some complexity got moved to crossplane, but should not be in crossplane. From the user's perspective, everything regarding the region should live outside of crossplane. This requires some complexity in Backstage. The question is whether this complexity is already available in Backstage (based on a post in #news-dev, the Installations page already shows Region information).
Kratix
At the moment we need to ask ourselves what role Kratix has in our Developer Platform and whether we really need it.
@piontec made it work to bootstrap the app-template demo. @uvegla is figuring out details to make this work on golem, as it was only working in KinD clusters so far.
The initial idea was that Kratix solves some pain points by setting up the gitops repo, bootstrapping it to the cluster and reporting it back to backstage. This should address the pain points that customers have to create their own gitops repo and create secrets to bootstrap it to the cluster. We have seen issues with this before and need to solve this. It seems like Kratix is not the right solution for this, as Kratix controls the resources.
We need to rethink what Kratix's role is in our platform (this is not about dropping Kratix, more about refining what that role is).
Crossplane
Demo App (app that gets deployed into the demo cluster)
The question we need to ask ourselves is whether we need to write something of our own in Go, or whether we can be happy with the already existing app written in Python. The goal is to show the platform and how it works; the goal is not to show that we can write apps in Go - which is clear anyway, because we have plenty of apps in Go.
Completeness of our components (approximately):
Demo App: 0% (or 80% if we take the Python one)
Tasks:
Make the bootstrap work on golem and get it tested.
After these tasks are done, the final task is to put everything together and test the whole platform end-to-end.
Nice-to-have extension points (can also be discussed):
deploying additional resources like namespaces, quotas, permissions, rbac with the platform

cc: @marians @gusevda @piontec @mproffitt @uvegla
This is the more complete architecture as used in the interim demo
I'd like to phase this work a bit more so we can focus on getting a minimal thing out that is demoable soon.
To me that would mean
phase 1:
phase 2:
And at the same time, once @piontec is back, we can discuss direction with Kratix and whether what we are intending to do with it (aka API to Git) would work or not, but pull it out of this demo for now and continue in the Kratix-specific epic.
As for:
Is the workload cluster a prerequisite or do we want to create the workload cluster with the platform?
WC creation is out of scope. Could be a separate demo where we aim for WC creation from Backstage. Needs issue.
deploying additional resources like namespaces, quotas, permissions, rbac with the platform
Out of scope here, BUT to me this is the next separate demo/feature we should work on. Deploying a ready "environment" for a dev team to a cluster is a super common use case that a lot of current customers also have. This we should then do at least in cooperation with Big Mac, as they do have some early work towards at least the RBAC part of it. Also needs issue.
preselect components for your app in backstage (for example, choosing specific preexisting RDS and then creating Database inside that specific RDS instance), this could be important for demoing as component creation takes time
If possible I'd like to keep this out of phase 1. Could be part of phase 2. For the demo in phase 1 I would rely on two instantiations: one where we have run the demo in advance and everything works, and one where we show the creation, switching from one to the other when we want to show things working.
not sure if transitgateway is needed here (is it?)
I would have very much liked to avoid the transit gateway - it's the one network component I have always struggled with - but unfortunately yes, it is needed.
The issue with requiring it relates to being able to get crossplane talking from the Management Cluster to inside the RDS database in order to set up the application database, users/roles and grants. This I considered important enough to justify the additional effort to get it working: it's all well and good creating a database server/cluster, but if that database cannot be used by the app that needs to consume it, then it's not really much use, and there is really no other way to communicate with the RDS without using a TGW.
It could be argued "just set up an additional peering connection to the MC", but the second part of this is purpose: TGWs and peering connections serve different purposes, with peering connections being best for high-throughput / low-latency traffic and TGWs used for everything else.
On a far more positive note, despite it melting my brain, the core of the TGW is now done - I built a basic version this afternoon and I'm fairly confident that this will just work.
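For illustration, the shape of the managed resources involved — a rough sketch using the Upbound AWS provider, where names, region, labels and the provider config are all hypothetical:

```yaml
# Hypothetical sketch: transit gateway plus attachment of the MC VPC, using
# ec2.aws.upbound.io resources. Names/region/providerConfig are illustrative.
apiVersion: ec2.aws.upbound.io/v1beta1
kind: TransitGateway
metadata:
  name: demo-tgw
spec:
  forProvider:
    region: eu-west-2
    description: MC-to-RDS connectivity for database provisioning
  providerConfigRef:
    name: default
---
apiVersion: ec2.aws.upbound.io/v1beta1
kind: TransitGatewayVPCAttachment
metadata:
  name: demo-tgw-mc-attachment
spec:
  forProvider:
    region: eu-west-2
    transitGatewayIdRef:
      name: demo-tgw
    vpcIdSelector:
      matchLabels:
        purpose: management-cluster   # hypothetical label on the MC VPC
    subnetIdSelector:
      matchLabels:
        purpose: management-cluster   # hypothetical label on the MC subnets
  providerConfigRef:
    name: default
```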
No further optimisation of crossplane re region
I was less clear on your meaning here. If this relates to the composition wrapper that first looks up a cluster, retrieves region and availability zone data, and then feeds that to the next composition wrapper, then: the outer wrapper has not been tested and does not require any downstream changes. It is simply a passthrough that looks up some additional details. That's probably best visualised as in the diagram below - any coloured box is a separate composition, white boxes are either endpoint compositions or specific MRs (or simply a reference to what data comes from where to start the inner wrapper).
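As a sketch of that passthrough shape — the composite type names, the lookup mechanism and every field path here are hypothetical, the real compositions differ:

```yaml
# Outer wrapper: observe the workload cluster to learn region data, expose it
# on the XR status, then hand it to the inner composition unchanged.
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: app-infra-wrapper
spec:
  compositeTypeRef:
    apiVersion: demo.example.org/v1alpha1
    kind: XAppInfra
  resources:
    - name: cluster-lookup
      base:
        apiVersion: kubernetes.crossplane.io/v1alpha1
        kind: Object
        spec:
          managementPolicy: Observe   # read-only lookup, never mutates
          forProvider:
            manifest:
              apiVersion: cluster.x-k8s.io/v1beta1
              kind: Cluster
              metadata:
                name: demo-wc         # would be patched from the XR in reality
                namespace: org-demo
      patches:
        - type: ToCompositeFieldPath  # copy looked-up data onto the XR
          fromFieldPath: status.atProvider.manifest.metadata.labels[topology.kubernetes.io/region]
          toFieldPath: status.cluster.region
    - name: inner-infra
      base:
        apiVersion: demo.example.org/v1alpha1
        kind: XAppInfraRegional
      patches:
        - type: FromCompositeFieldPath  # pure passthrough into the inner XR
          fromFieldPath: status.cluster.region
          toFieldPath: spec.region
```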
Hey all! I just came back and I need to sync with Laszlo, but a few comments from my side:
The Go app template already covers things like cosign. I believe it is much easier to add simple DB code to this app than it is to try and port all of that into the Python app. We've also tested all the project templating and bootstrapping with it.

Here is a high-level process diagram I'd like us to maintain, so it represents what we are building. (Not finished yet)
The decision was made to follow the path with the Go app, which still needs to be modified.
This is the list of apps deployed as part of the release:

```
rdsapp                                1.0.1    2m54s   2m53s   deployed
rdsapp-app-operator                   6.11.0   2m45s   2m43s   deployed
rdsapp-aws-ebs-csi-driver-smons                2m51s
rdsapp-aws-pod-identity-webhook                2m51s
rdsapp-capi-node-labeler                       2m51s
rdsapp-cert-exporter                           2m51s
rdsapp-cert-manager                            2m51s
rdsapp-chart-operator                          2m45s
rdsapp-chart-operator-extensions               2m51s
rdsapp-cilium-servicemonitors                  2m51s
rdsapp-cluster-autoscaler                      2m51s
rdsapp-etcd-k8s-res-count-exporter             2m51s
rdsapp-external-dns                            2m51s
rdsapp-irsa-servicemonitors                    2m51s
rdsapp-k8s-audit-metrics                       2m51s
rdsapp-k8s-dns-node-cache                      2m51s
rdsapp-metrics-server                          2m51s
rdsapp-net-exporter                            2m51s
rdsapp-node-exporter                           2m51s
rdsapp-observability-bundle                    2m51s
rdsapp-prometheus-blackbox-exporter            2m51s
rdsapp-security-bundle                         2m51s
rdsapp-teleport-kube-agent                     2m51s
rdsapp-vertical-pod-autoscaler                 2m51s
```
Regarding Backstage UI wording:
Card title: "Target entity" -> "Creation progress"

The text

Target component is not available yet. See Kratix resources for more information.

should be changed to:

For resources to be created, this pull request must be merged. After merging, it can take several minutes for resource creation to start.
Once resources get created, you can track creation progress in the Kratix resources tab.
Once the catalog entity exists, we show this:
The pull request defining these resources has been merged.
See resource creation details in the Kratix resources tab.
View the entity page for marians-demo-service to see more details about the component and its deployments.
We are changing the demo flow as follows:
Rationale:
In my opinion this removes the capability of showing a very important aspect of the journey: that an app can be deployed along with any and all infrastructure required for its operation, should that infrastructure not already exist.
One of the arguments about VPC CIDRs was "where does that information come from?", with the answer being "the platform team". This argument was not suitable as, in the opinion of the team, it led to a lack of self-service for application teams who may need to spin up and tear down infrastructure without interaction with the platform team.
Now the argument is that "The platform team should provide the RDS database" which contradicts the earlier argument.
If we're to argue that the platform team should handle all infrastructure builds, then the purpose of the demo (deploying a new service) becomes moot, as it does not demonstrate that a service can be deployed with all required infrastructure.
Whilst the argument made here does carry a lot of merit, it detracts from the capability.
Additionally, the arguments only consider RDS, ignoring entirely the Elasticache part of the service. That would not work for an additional application, as there are no application-specific credentials attached; in fact, adding credentials for a second application to Elasticache would require a modification to the replication group built by the original deployment, and a restart of all replicated clusters.
This capability does not exist today and, due to how Elasticache works, is not something that can be built separately, as is possible when provisioning users inside RDS.
I do see both your points, @marians and @mproffitt.
The big question here is: who is the audience of the demo? I think it is targeting developers. And as such it should focus more on speed than on creating an environment that is ready to run production workloads. Setting up a full RDS database feels more like a getting-ready-for-production task.
Developers are used to using virtual or lightweight DBs for testing & QA. Saving costs with smaller dev environments is also quite common.
Who is the audience of the demo? I think it is targeting developers.
I would slightly refine this: our target audience are platform teams. The end user we impersonate for the demo flow is a developer.
Talked with @mproffitt and @piontec about Elasticache. To keep things simple, we are not going to provision anything new per new service created. All services/apps will use the same Elasticache redis server. Redis supports multiple databases, but the identifier is numeric, and we wouldn't have an easy way to map database and service/project. By writing to the same database, there may be a theoretical risk of key collision, but we accept this for now, as we don't run many demos concurrently and we can set the key lifetime very short in our demo application.
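To make that decision concrete, a hedged sketch of what it could mean for the demo app's configuration — every key and value here is invented for illustration, not the actual chart values:

```yaml
# All services share one Elasticache endpoint and one logical database;
# keys are namespaced per service and expire quickly to bound collision risk.
redis:
  host: demo-elasticache.internal.example   # shared server, hypothetical host
  port: 6379
  database: 0                        # numeric DB IDs don't map to services
  keyPrefix: "marians-demo-service:" # per-service prefix avoids collisions
  keyTTLSeconds: 60                  # short lifetime is acceptable for demos
```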
Interesting feedback around the VPC CIDRs that we got from a potential customer when we showed them our IDP demo architecture was that, in their case, there's a team that has VPC provisioning as their main service, so they would basically separate our demo into several use cases that play into each other. That still doesn't invalidate our demo; it just shows that different companies might dissect the use cases or services differently.
Similar, I'd say, to how we here now dissect the "creation of an RDS cluster" from the "creation and provisioning of a DB in said cluster".
I think it is good to cut the demo into something rather small for now, and then be able to show the extended use cases and the complexity that @mproffitt mentioned separately, because they will not get around the complexity; it will just move somewhere else. In the customer's case it actually moves to a team that will, for now, not use our stuff to automate their processes, but that we could maybe convince at some point, which would then make it easier for them to chain and integrate platform services into a coherent user experience.
More ideas
It would be so nice if we could have progress comments after merging a PR like https://github.com/DemoTechInc/demotech-gitops/pull/82, directly in the same PR.
When running the demo, I am the creator of such a PR, so I am not able to review/approve it myself. To merge it, I have to bypass branch protection rules. It would be more realistic if some user (bot) would approve the PR instead.
@marians the first point I'm agnostic towards, but I wonder if that's overcomplicating things a little.
The second point, no. This would automate too much and detract from showing a) what requires or should have human input, and b) it creates a failure point, e.g. you selected the wrong ProviderConfig or (in future) assigned permissions to users or roles that are incorrect.
Even though a lot of this is automated, I feel automating the PR approval is a step too far and introduces entropy into the system.
Automating the PR approval was meant as a fake thing that would simulate what otherwise would of course be done by a human.
We used to just say: "And if you want, you can require PR reviews to merge your requests." And then merge them ourselves in the Demos that I did for Weaveworks.
We should be fine addressing this with the audio track.
Yeah, most companies I've spoken to have some kind of approval process. That said, if automated validation could be done in the PR, at least some would enable some auto-merge functionality. Usually this would be some kind of PR bot checking for access control (i.e. is the user allowed to request said resource) and approving, and then, if all validation tests are green, auto-merge goes through.
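For the record, a sketch of how that bot approval could be faked with GitHub Actions — the paths, bot account and action choice are assumptions, not something we run today:

```yaml
# Hypothetical workflow: approve resource-request PRs raised by a trusted bot;
# branch protection plus green checks then lets auto-merge complete the flow.
name: auto-approve-resource-requests
on:
  pull_request:
    paths:
      - "resource-requests/**"      # hypothetical path for platform requests
permissions:
  pull-requests: write
jobs:
  approve:
    if: github.actor == 'demo-bot'  # only PRs from the trusted bot account
    runs-on: ubuntu-latest
    steps:
      - uses: hmarr/auto-approve-action@v3
```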
In the GoReleaser step of the release workflow I see this log message:
only configurations files on version: 2 are supported, yours is version: 0, please update your configuration
Is there a technical reason for all workloads landing in the default namespace? Would it make sense to create a namespace named after the service?
@marians The main driver for the demo at this stage was simplicity, also see the comment from @puja108 here https://github.com/giantswarm/roadmap/issues/3470#issuecomment-2288134103
deploying additional resources like namespaces, quotas, permissions, rbac with the platform
Out of scope here, BUT to me this is the next separate demo/feature we should work on.
As for technical reasons: in fact the crossplane compositions support a different namespace for delivering the secrets, and we can use any namespace on the workload cluster for application deployment. The only thing that needs to happen is that the namespace must pre-exist for ESO to send secrets to, and as per Puja's response, we had moved this out of phase 1 delivery.
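To spell out the constraint: the ExternalSecret that delivers the credentials is a namespaced object, so its target namespace has to exist before ESO can write into it. A sketch with hypothetical names throughout:

```yaml
# The namespace must pre-exist; the ExternalSecret below fails without it.
apiVersion: v1
kind: Namespace
metadata:
  name: marians-demo-service          # hypothetical per-service namespace
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: rds-credentials
  namespace: marians-demo-service
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: management-cluster-store    # hypothetical store name
  target:
    name: rds-credentials             # resulting Kubernetes Secret
  data:
    - secretKey: password
      remoteRef:
        key: demo/rds/password        # hypothetical remote key
```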
Just putting it here as a sidenote:
Namespace creation for a new project is a thing most companies have as a service and could be a cool module by itself. It could provision a namespace (with RBAC/OIDC, quota, security/network policy setup) for those use cases where there's no golden path (yet), and it could be chained with a golden path like in this demo, to remove the need for a two-step request.
The good thing is, that such a namespace provisioning service could be basically just a helm chart that takes values like project name, team name, OIDC group, and auto-maps things. It can then be extended with things like o11y multi-tenancy or network policy base by other teams like Atlas and Cabbage.
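A minimal sketch of the values interface such a chart could expose — every key here is invented to show the idea, not an existing Giant Swarm chart:

```yaml
# Hypothetical values.yaml for a namespace-provisioning chart.
project: payments
team: honeybadger
oidcGroup: "org:honeybadger-devs"   # mapped to RoleBindings in the namespace
resourceQuota:
  cpu: "8"
  memory: 16Gi
networkPolicy:
  defaultDeny: true                 # extension point for Cabbage
observability:
  tenant: honeybadger               # extension point for Atlas
```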
That said, I'd see that as a complementary thing that we can and should build, as it's straightforward and used by many customers, but we should make that a separate project in area platform. cc @teemow this might be a nice project for Q4 or Q1 that aligns different capabilities of different teams and can generate value directly without the need for complex customer customization. We could talk to adidas and some others that already have such a thing about which features they would expect from it.
Thanks @puja108! I've put this in a separate issue: https://github.com/giantswarm/giantswarm/issues/31767
There is a lot of value in these basic templates. Another template that I have seen in the wild is: "Create a Git repo". They need to be set up in the right way to keep things in order. There are naming conventions and security settings to take into consideration. Those should not be left open for developers to choose if you want to keep chaos at bay.
We already have this implemented as part of the IDP demo. It would make sense to pull this out as a separate template as well.
Franz wanted us to have some Governance aspects in the demo as well.
Governance has 2 parts: A) Security B) Compliance
A: How do we make sure things are secure? --> Security by default with kyverno. --> Maybe Content Scanner + Renovate?
B: Compliance is a combination of secure & compliant settings, where a company needs to comply with a certain set of regulations by creating organisational procedures, guides and the correct security settings. A big part of compliance is proving that you are compliant; you need auditable systems for this. If we are using GitOps, it is easy to prove who did what and which settings were put in place at what time.
I think we can cover good parts of this without changing the technical part of the demo, but by addressing these in the audio track.
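As an illustration of the "security by default with kyverno" part, this is the kind of policy that could back it — a standard Kyverno example, not necessarily one from our bundle:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  validationFailureAction: Enforce   # audit-only would be "Audit"
  rules:
    - name: require-pinned-image-tag
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        message: "Images must use a pinned tag, not :latest."
        pattern:
          spec:
            containers:
              - image: "!*:latest"   # reject any container running :latest
```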
@LutzLange We were planning on addressing some of this with @giantswarm/team-shield next week and have already included trivy scan integration in the list of potential improvements provided in the description of this issue.
For the moment though, for the audio track we can already highlight how we ensure some security, split into two topics.
We should be careful on the cloud security side though, as this is not a topic we traditionally cover and it would normally be the responsibility of the customer's cloud security team. I would be hesitant to get bogged down here as it's a whole topic unto itself; however, as we're showing building infrastructure, we can anticipate some questions on the topic.
AFAIK we already have SBOMs and signatures in the build process and store them in the OCI registry. Not sure if we are already checking for those in-cluster, but that might be an easy next step (enabled only for the app namespace so as not to break the whole cluster).
We also already have PSS enforcement in-cluster; not sure if we also have network policies, but those could be added. On this level we could mention that you need a combination of in-cluster enforcement and "adding the actual security rules and exceptions to the app". As in this case we are creating an app from a template, this means the template needs to include those things and be "secure by default", which I would guess it is, if it runs smoothly in our clusters.
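If we do add in-cluster signature checks scoped to the app namespace, it could look roughly like this Kyverno verifyImages rule — the namespace, registry path and key are placeholders:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-app-signatures
spec:
  validationFailureAction: Enforce
  webhookTimeoutSeconds: 30
  rules:
    - name: check-cosign-signature
      match:
        any:
          - resources:
              kinds: [Pod]
              namespaces: [marians-demo-service]  # app namespace only
      verifyImages:
        - imageReferences:
            - "ghcr.io/demotechinc/*"             # placeholder registry path
          attestors:
            - entries:
                - keys:
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      ...
                      -----END PUBLIC KEY-----
```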
CVE scans and reporting would be a good next feature for platform in Q4, but we need to discuss that on a general level, and I don't think it makes sense to just smash it into the demo right now, as there the process is more important than just showing CVEs.
There is a lot of value in simpler templates. You could also call them building blocks. They are valuable PE services on their own:
A) Create a Git repository (ready to use with security & policy)
B) Create a namespace (ready to use with security, limits & policy)
C) Create an EC2 instance (...)
The self-service aspect of these templates provides a lot of value. And if we can find a way to combine these building blocks into more complex templates easily, we would have a set of common building blocks and provide a lot of value to possible customers. I know these last points need further thought, investigation and discussion, but we could and should start with these simpler templates first.
Whilst I definitely agree that there is a lot of value in simpler templates, this goes far beyond the scope of the demo journey and more towards turning the demo into a fully fledged, ready-to-use platform.
My opinion on the current IDP demo is that it attempts to answer some of the hardest questions facing the industry today.
Moving the demo to become a more rounded and evolved product should not be in scope for the demo platform, but should be scoped separately from this current journey, as it involves considerable additional thought, planning and implementation that significantly impacts the delivery of key features not yet given hard consideration.
This will definitely be an iterative process; however, trying to implement simpler templates at this stage would have significant impacts on key questions that we've already been asked.
I would propose that discussions on simple templates be moved to a separate "platform progression" epic, except where otherwise in scope for phase 2.
Effectively this leaves B, and potentially A, still in scope, but C moves out.
Along those lines, I think we should start closing the first demo issue and create follow-ups, for which we can then discuss priorities also with regard to the many other things Honeybadger should/could do in the next months. I've prepared a list to show the complexity of the roadmap decision for the team, but we need to talk about it soon to get clarity on what we want to do going forward (at least as long as we don't have a concrete customer to work with).
I just created a separate ticket with my suggestions for improvements. I've scheduled a call for 6-Nov with the Honeybadger team to discuss.
User Story
Details, Background
In order to take the user on a journey through the IDP, we have an overall story of creating infrastructure components via crossplane and deploying an app that then consumes that infrastructure.
To accomplish the story and really showcase the capabilities of all components in the pipeline, the journey is as follows: provision the infrastructure via crossplane, then deploy an app that uses redis as a backend.

Flow diagrams: https://miro.com/app/board/uXjVKnjQei8=/
Architecture
Blocked by / depends on