goern commented 2 years ago

As an App Team (Devs, service owner), I want to use a standard interface (or contract) with my platform provider, so that I benefit from a set of standard observability capabilities, and so that I get access to a set of application-specific observability capabilities.

As an App Team, I want to adhere to the platform standards for observability, so that standard metrics and application metrics of my application are scraped automatically.

As an App Team, I want to provide dashboard declarations, so that they get picked up by the observability capabilities of the platform, and so that my application-specific metrics are shown.

References

check with app-sre and op1st-sre

schwesig commented 2 years ago

/assign @schwesig

schwesig commented 2 years ago

group call: @goern @schwesig @harshad16 @VannTen

NOT some kind of dashboard, but clear interfaces
NOT numbers
As an app dev, I get numbers/ logs/ basic metrics (access, responsiveness, performance, ...) just by using the platform
As an app dev, I create metrics on the app side, like app-level metrics (users, wrong passwords, ...)

it is about:

what data do we already collect
and how
what is missing, and how we can collect that
standards? define? create?

describe as
App Developer
App Owner - Responsible Running App on Operate First
Operator - Operations & On Call Support? or separate the On Call Support from the operator --> better separate

?Application Supporter?

the service is: e.g. a yaml file in a certain folder will create/ collect the data and/ or deliver it

deliveries: --> user stories --> UML/ Workflow/ SwimLane?

schwesig commented 2 years ago

maybe include/ ask @HumairAK @durandom about the On Call process/ needs

not an API, but an interface: exchanging data, files, yaml in both directions, maybe including repos/ folders etc., triggered by a Pull Request

How and What

schwesig commented 2 years ago

Define the personas (who would/ could play this role in real life) and ask them about their needs: AppDev, OnCall, Platform

VannTen commented 2 years ago

So, a couple (or maybe a little more ^) of thoughts on the subject.

Purposes

The interface/standard should allow the described personas to self-serve in the good case (meaning: if everything works, I don't need to involve other people, who might not be available).

Form

To work with automation, I would make the interface quite precise -> something like "Those files with that syntax in that place for that result"
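A minimal sketch of what "those files with that syntax in that place for that result" could look like to the automation. The directory name `observability`, the file suffixes and the result strings are all hypothetical; nothing in this thread has fixed them yet.

```python
from pathlib import Path
from typing import Dict

# Hypothetical convention: file suffix -> what the platform does with the file.
CONVENTION = {
    ".metrics.yaml": "scrape target registered",
    ".dashboard.yaml": "dashboard imported",
    ".alerts.yaml": "alert rules loaded",
}

# Hypothetical folder name, relative to the application root.
OBSERVABILITY_DIR = "observability"


def discover(app_root: Path) -> Dict[str, str]:
    """Map each recognised file under <app_root>/observability to its result.

    Files that do not match the convention are ignored, so the platform
    never has to guess what an unknown file is meant to do.
    """
    found: Dict[str, str] = {}
    base = app_root / OBSERVABILITY_DIR
    if not base.is_dir():
        return found
    for path in sorted(base.iterdir()):
        for suffix, result in CONVENTION.items():
            if path.name.endswith(suffix):
                found[path.name] = result
    return found
```

The point of the sketch is only that the mapping is explicit and mechanical: an App Team can predict the result from the file name alone, without asking anyone.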

Also, we should probably reuse as much as possible of the already established standards in the cloud observability field, because that makes things easier for anyone not familiar with the platform. (OpenMetrics/OpenTelemetry)
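To make the "reuse existing standards" point concrete, here is a rough sketch that renders a gauge in the Prometheus-style text exposition format, which OpenMetrics builds on. The metric name `app_up` and the label `app` are made up for illustration, and the sketch skips label-value escaping and the OpenMetrics `# EOF` terminator; a real application would normally let a client library produce this output.

```python
from typing import Dict, List, Tuple


def render_gauge(name: str, help_text: str,
                 samples: List[Tuple[Dict[str, str], float]]) -> str:
    """Render one gauge family as text exposition lines.

    `samples` is a list of (labels, value) pairs. Label values are not
    escaped here, so this sketch is only safe for simple values.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(
                f'{key}="{val}"' for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical metric and label names.
    print(render_gauge("app_up", "Whether the app is up", [({"app": "demo"}, 1)]))
```

Anyone who has already exposed metrics for Prometheus elsewhere would recognise this output, which is the argument for not inventing a platform-specific format.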

Observability

Multiple dimensions

It's important to note that observability covers multiple things:

Metrics -> mostly numeric aggregated data
Alerts -> derived from metrics
Logs -> some log agents can also produce metrics from the logs
Traces -> following an operation across micro-services boundaries (No experience from me here)

Many applications will not have all of this, so our standard/interface should be composed of mostly independent "modules" (or whatever we call them, "subinterfaces" ?) -> ("I can use the log service even if I don't have metrics implemented").
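One possible shape for such "mostly independent" modules, as a sketch only. The class and field names below are hypothetical. The single dependency it encodes comes from the list above: alerts are derived from metrics, so alert rules without a metrics module are flagged.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MetricsModule:
    port: str                 # name of the Service port exposing metrics
    path: str = "/metrics"


@dataclass
class LogsModule:
    format: str = "json"      # hypothetical: how the app formats its logs


@dataclass
class ObservabilityContract:
    """Every module is optional; an app opts in to what it implements."""
    app: str
    metrics: Optional[MetricsModule] = None
    logs: Optional[LogsModule] = None
    alert_rules: List[str] = field(default_factory=list)
    tracing: bool = False

    def enabled_modules(self) -> List[str]:
        enabled = []
        if self.metrics is not None:
            enabled.append("metrics")
        if self.alert_rules:
            enabled.append("alerts")
        if self.logs is not None:
            enabled.append("logs")
        if self.tracing:
            enabled.append("traces")
        return enabled

    def problems(self) -> List[str]:
        """Return contract violations; an empty list means it is valid."""
        if self.alert_rules and self.metrics is None:
            return ["alerts are derived from metrics, so metrics must be enabled"]
        return []
```

With this shape, `ObservabilityContract(app="demo", logs=LogsModule())` is a valid contract on its own, which is the "log service without metrics" case.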

Levels / Tiers

It might be a possibility to have two levels in some parts of the interface (for example, with metrics: the first relies on the bare defaults of the Prometheus Python client, the second needs some configuration ?). I might be going out on a limb here.

Different users/personas consume different metrics

Might depend on the application, but from my experience, the AppDev does not usually use the same metrics as the OnCall or PlatformOp. ("what does it do ?" vs "is it up ?")

Note: regarding the location of the files, I don't think we should go for a git repository. An application in the k8s space is already a bunch of yaml files + container images, and observability parameters are very much part of the application, so I would bundle them with it (the Prometheus CRDs, like ServiceMonitor, are a good example). Most things can go in either a CRD or an appropriately named/labelled ConfigMap, or use annotations on the objects (for example, filebeat log scraping can be configured by Pods using annotations).
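As an illustration of bundling observability parameters with the application, this sketch builds a ServiceMonitor manifest (the Prometheus Operator CRD mentioned above) as a plain dictionary and prints it as JSON, which Kubernetes tooling accepts alongside YAML. The app name, namespace, the `app` label and the `metrics` port name are assumptions. Whether a given Prometheus instance actually picks the object up depends on that platform's selector configuration, which this thread does not specify.

```python
import json
from typing import Any, Dict


def service_monitor(app: str, namespace: str, port_name: str = "metrics",
                    path: str = "/metrics", interval: str = "30s") -> Dict[str, Any]:
    """Build a ServiceMonitor that scrapes Services labelled app=<app>."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {
            "name": f"{app}-metrics",
            "namespace": namespace,
            "labels": {"app": app},
        },
        "spec": {
            # Selects the Service objects to scrape, by label.
            "selector": {"matchLabels": {"app": app}},
            # `port` refers to a named port on the selected Service.
            "endpoints": [
                {"port": port_name, "path": path, "interval": interval}
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical application and namespace names.
    print(json.dumps(service_monitor("demo-app", "demo"), indent=2))
```

The manifest would ship next to the Deployment and Service of the application, so it is versioned and deployed with them.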

Some user stories from a PlatformOp POV (some apply equally to OnCallOp):

As a Platform operator/On Call app operator, I want a standardised set of metrics exposed by the app answering the questions:
- is the app up ?
- how much of its capacity budget is it using ? (aka, what's the load ?)
As a Platform operator/On call app operator, I want a standard dashboard using the standard metrics applicable on every application so I can check on the current status of that application.
As a Platform operator, I want a standard set of alerts using those metrics for each application so I don't need to watch them and can still know if something does not work (or if everything does not work)
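A small sketch of how a standard alert set could be derived from the two standard questions above ("is the app up ?" and "what's the load ?"). The field names, the alert names and the 0.9 threshold are all hypothetical. In practice this logic would more likely live in alerting rules on the monitoring side than in Python.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StandardMetrics:
    up: bool                     # answers "is the app up ?"
    capacity_used_ratio: float   # answers "what's the load ?", from 0.0 to 1.0


def standard_alerts(app: str, metrics: StandardMetrics,
                    load_threshold: float = 0.9) -> List[str]:
    """Return the standard alerts currently firing for one application."""
    alerts = []
    if not metrics.up:
        # A load figure is meaningless when the app is down,
        # so only the availability alert fires.
        alerts.append(f"{app}: AppDown")
    elif metrics.capacity_used_ratio >= load_threshold:
        alerts.append(f"{app}: AppNearCapacity")
    return alerts
```

Because the inputs are the same for every application, the same rules can be applied platform-wide without per-app tuning, which is what the last user story asks for.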

schwesig commented 2 years ago

flowchart TD;
ApLo[App Logs]
OpLo[Operation Logs]
RuBo[Run Books]

ApDe([App Developer])
PlOp([Platform Operator])
OCOp([On Call Operator])

subgraph DePr[Devices and Processes]
    App
    User
    Operator
    OS
    Support
    CoLo[Communication Logs]
end

App --> ApLo
App --> Issues
User --> ApLo
User --> Issues
Operator --> OpLo
OS --> OpLo
Support --> RuBo
Support --> Issues
CoLo --> RuBo

subgraph Out[Output]
    ApLo
    OpLo
    RuBo
    Issues
end

subgraph Pe[Personas]
    ApDe
    PlOp
    OCOp
end

subgraph AuGe[Auto Generated]
    ?A
    ?B
    ?C
end

subgraph MaGe[Manually Generated]
    ?D
    ?E
    ?F
end

subgraph Need[Needed]
    ?G
    ?H
    ?I
end

subgraph Miss[Missing]
    ?J
    ?K
    ?L
end

?1 --> ApDe
?2 --> PlOp
?3 --> OCOp

subgraph In[Input]
    ?1
    ?2
    ?3
end

schwesig commented 2 years ago


sequenceDiagram

    participant App Developer
    participant Platform Operator
    participant User
    participant On Call Operator

    App Developer->>Platform Operator: Deploy App
    Platform Operator->>App Developer: Feedback Running?
    loop Get it started
        App Developer->>On Call Operator: Issue
        On Call Operator->>App Developer: Fixed
        App Developer->>Platform Operator: Deploy App
    end
    App Developer->>User: Accept Users/ Go Live
    loop During Lifetime
        User->>On Call Operator: Issue
        On Call Operator->>App Developer: Fixed
        App Developer->>Platform Operator: Deploy App
    end

schwesig commented 1 year ago

/close cleanup after changing orga

open-services-group / scrum

Define interface of observability capability between service provider and consumer #34
