Closed goern closed 1 year ago
/assign @schwesig
group call: @goern @schwesig @harshad16 @VannTen
it is about:
describe the roles as: App Developer; App Owner - responsible for running the app on Operate First; Operator - operations & on-call support? Or separate the on-call support from the operator --> better to separate
?Application Supporter?
the service is: e.g. a yaml file in a certain folder will create/ collect the data and/ or deliver it
deliveries: --> user stories --> UML/ Workflow/ SwimLane?
maybe include/ ask @HumairAK @durandom about the On Call process/ needs
not an API, but an interface: exchanging data, files, yaml in both directions, maybe including repos/folders etc., triggered by a Pull Request
How and What
Define the personas (who would/could play this role in real life) and ask them about their needs: AppDev, OnCall, Platform.
So, a couple (or maybe a little more ^) of thoughts on the subject.
The interface/standard should allow the described personas to self-service in the good case (meaning, if everything works, I don't need to involve other people who might not be available).
To work with automation, I would make the interface quite precise -> something like "Those files with that syntax in that place for that result"
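To make "those files with that syntax in that place" concrete, here is a minimal sketch of what an automated check could look like. The path `observability/metrics.json`, the required keys, and the `check_contract` helper are all hypothetical; JSON is used only to stay within the Python standard library, and a yaml file would work the same way.

```python
import json
from pathlib import Path

# Hypothetical contract: file path (relative to the app root) -> required top-level keys.
CONTRACT = {
    "observability/metrics.json": {"port", "path"},
}


def check_contract(root: Path, contract=CONTRACT) -> list[str]:
    """Return human-readable violations; an empty list means the app conforms."""
    problems = []
    for rel_path, required_keys in contract.items():
        file_path = root / rel_path
        # "in that place": the file must exist at the agreed location.
        if not file_path.is_file():
            problems.append(f"{rel_path}: missing")
            continue
        # "with that syntax": the file must parse and be a mapping.
        try:
            content = json.loads(file_path.read_text())
        except json.JSONDecodeError as exc:
            problems.append(f"{rel_path}: invalid syntax ({exc.msg})")
            continue
        if not isinstance(content, dict):
            problems.append(f"{rel_path}: expected a mapping at top level")
            continue
        missing = sorted(required_keys - content.keys())
        if missing:
            problems.append(f"{rel_path}: missing keys {missing}")
    return problems
```

A check like this could run on every Pull Request, so an App Developer gets feedback without having to involve anyone else.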
Also, we should probably reuse as much as possible of the already established standards in the cloud observability field (OpenMetrics/OpenTelemetry), because that will make it easier for anyone not familiar with the platform.
It's important to note that observability covers multiple things:
Many applications will not have all of this, so our standard/interface should be composed of mostly independent "modules" (or whatever we call them, "subinterfaces"?) -> ("I can use the log service even if I don't have metrics implemented").
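One possible way to picture independent modules is a spec where every module is optional and only the declared ones are wired up. This is just a sketch: the class names, fields, and defaults below are hypothetical and not an agreed format.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class LogsModule:
    format: str = "json"  # hypothetical default


@dataclass
class MetricsModule:
    port: str = "http"      # hypothetical default
    path: str = "/metrics"  # hypothetical default


@dataclass
class ObservabilitySpec:
    # Each module is optional; leaving one out must not break the others.
    logs: Optional[LogsModule] = None
    metrics: Optional[MetricsModule] = None

    def enabled_modules(self) -> list[str]:
        """Names of the modules the application actually declared."""
        return [f.name for f in fields(self) if getattr(self, f.name) is not None]


# "I can use the log service even if I don't have metrics implemented"
logs_only = ObservabilitySpec(logs=LogsModule())
print(logs_only.enabled_modules())
```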
It might be possible to have two levels in some parts of the interface (for example, with metrics: the first level depends only on the bare defaults of the Prometheus Python client, the second needs some configuration?). I might be going out on a limb here.
It might depend on the application, but in my experience the AppDev does not usually use the same metrics as the OnCall or PlatformOp ("what does it do?" vs "is it up?").
Note: regarding the location of the files, I don't think we should go for a git repository. An application in the k8s space is already a bunch of yaml files + container images, and observability parameters are very much part of the application, so I would bundle them with it (the Prometheus CRDs are a good example, like ServiceMonitor). Most things can go in either a CRD or an appropriately named/labelled ConfigMap, or use annotations on the objects (filebeat log scraping can be configured by Pods using annotations, for example).
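As an illustration of bundling observability parameters with the application, the sketch below generates a ServiceMonitor manifest next to the app's other objects. The `service_monitor` helper, the `app` label, and the default port name and interval are assumptions made for the example; only the ServiceMonitor fields themselves (`selector`, `endpoints`, `port`, `path`, `interval`) come from the Prometheus Operator CRD.

```python
import json


def service_monitor(app_name: str, port: str = "metrics",
                    path: str = "/metrics", interval: str = "30s") -> dict:
    """Build a ServiceMonitor that scrapes the Service labelled app=<app_name>."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {
            "name": f"{app_name}-metrics",
            "labels": {"app": app_name},  # hypothetical labelling convention
        },
        "spec": {
            # Select the application's Service by label.
            "selector": {"matchLabels": {"app": app_name}},
            # `port` is the *name* of a port on that Service.
            "endpoints": [{"port": port, "path": path, "interval": interval}],
        },
    }


if __name__ == "__main__":
    # JSON is valid input for `kubectl apply -f`, so no yaml library is needed.
    print(json.dumps(service_monitor("my-app"), indent=2))
```

Because the manifest ships with the application, it is versioned and reviewed together with the Deployment and Service it describes.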
Some user stories from a PlatformOp POV (some apply equally to OnCallOp):
```mermaid
flowchart TD;
ApLo[App Logs]
OpLo[Operation Logs]
RuBo[Run Books]
ApDe([App Developer])
PlOp([Platform Operator])
OCOp([On Call Operator])
subgraph DePr[Devices and Processes]
App
User
Operator
OS
Support
CoLo[Communication Logs]
end
App --> ApLo
App --> Issues
User --> ApLo
User --> Issues
Operator --> OpLo
OS --> OpLo
Support --> RuBo
Support --> Issues
CoLo --> RuBo
subgraph Out[Output]
ApLo
OpLo
RuBo
Issues
end
subgraph Pe[Personas]
ApDe
PlOp
OCOp
end
subgraph AuGe[Auto Generated]
?A
?B
?C
end
subgraph MaGe[Manually Generated]
?D
?E
?F
end
subgraph Need[Needed]
?G
?H
?I
end
subgraph Miss[Missing]
?J
?K
?L
end
?1 --> ApDe
?2 --> PlOp
?3 --> OCOp
subgraph In[Input]
?1
?2
?3
end
```
```mermaid
sequenceDiagram
participant App Developer
participant Platform Operator
participant User
participant On Call Operator
App Developer->>Platform Operator: Deploy App
Platform Operator->>App Developer: Feedback Running?
loop Get it started
App Developer->>On Call Operator: Issue
On Call Operator->>App Developer: Fixed
App Developer->>Platform Operator: Deploy App
end
App Developer->>User: Accept Users/ Go Live
loop During Lifetime
User->>On Call Operator: Issue
On Call Operator->>App Developer: Fixed
App Developer->>Platform Operator: Deploy App
end
```
/close cleanup after changing orga
As an App Team (Devs, service owner), I want to use a standard interface (or contract) with my platform provider, so that I benefit from a set of standard observability capabilities, and so that I get access to a set of application-specific observability capabilities.
As an App Team, I want to adhere to the platform standards for observability, so that standard metrics and application metrics of my application are scraped automatically.
As an App Team, I want to provide dashboard declarations, so that they get picked up by the observability capabilities of the platform, and so that my application-specific metrics are shown.
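The dashboard story could work roughly like this: the App Team ships a labelled ConfigMap, and the platform picks up every ConfigMap carrying that label. This is only a sketch under stated assumptions: the label key `observability.example.com/dashboard` and both helper functions are hypothetical, and the real platform would define its own convention.

```python
import json

# Hypothetical label key; the real platform would define its own.
DASHBOARD_LABEL = "observability.example.com/dashboard"


def dashboard_configmap(app_name: str, dashboard: dict) -> dict:
    """App Team side: wrap a dashboard declaration in a labelled ConfigMap."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": f"{app_name}-dashboard",
            "labels": {"app": app_name, DASHBOARD_LABEL: "true"},
        },
        # ConfigMap values must be strings, so the dashboard is stored as JSON text.
        "data": {f"{app_name}.json": json.dumps(dashboard)},
    }


def discover_dashboards(configmaps: list[dict]) -> dict[str, dict]:
    """Platform side: collect dashboards from ConfigMaps that carry the label."""
    found = {}
    for cm in configmaps:
        labels = cm.get("metadata", {}).get("labels", {})
        if labels.get(DASHBOARD_LABEL) != "true":
            continue  # not a dashboard declaration, ignore it
        for filename, body in cm.get("data", {}).items():
            found[filename] = json.loads(body)
    return found
```

The App Team only has to know the label and the file format; everything after that is the platform's job.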
References