open-services-group / scrum

SCRUM issues and user stories
GNU General Public License v3.0
Define interface of observability capability between service provider and consumer #34

Closed goern closed 1 year ago

goern commented 2 years ago

As an App Team (Devs, service owner), I want to use a standard interface (or contract) with my platform provider, so that I benefit from a set of standard observability capabilities, and so that I get access to a set of application-specific observability capabilities.

As an App Team, I want to adhere to the platform standards for observability, so that standard metrics and application metrics of my application are scraped automatically.

As an App Team, I want to provide dashboard declarations, so that they get picked up by the observability capabilities of the platform, and so that my application-specific metrics are shown.

References

schwesig commented 2 years ago

/assign @schwesig

schwesig commented 2 years ago

group call: @goern @schwesig @harshad16 @VannTen

it is about:

describe as
App Developer
App Owner - Responsible for running the app on Operate First
Operator - Operations & On Call Support? or separate the On Call Support from the operator --> better separate

?Application Supporter?

the service is: e.g. a YAML file in a certain folder that creates/collects the data and/or delivers it

deliveries: --> user stories --> UML/ Workflow/ SwimLane?

schwesig commented 2 years ago

maybe include/ ask @HumairAK @durandom about the On Call process/ needs

not an API, but an interface: exchanging data, files, and YAML in both directions, maybe including repos/folders etc., triggered by a Pull Request

How and What

schwesig commented 2 years ago

Define the personas (who would/could play this role in real life) and ask them about their needs:
AppDev
OnCall
Platform

VannTen commented 2 years ago

So, a couple (or maybe a little more ^) of thoughts on the subject.

Purposes

The interface/standard should allow the described personas to self-service in the good case (meaning: if everything works, I don't need to involve other people who might not be available).

Form

To work with automation, I would make the interface quite precise -> something like "Those files with that syntax in that place for that result"
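As a minimal sketch of what "those files with that syntax in that place" could mean to automation, the check below looks into an application directory and reports which expected files are missing or malformed. The `observability/` folder, the file names, the required keys and the use of JSON are all assumptions made for illustration; none of them is an agreed layout.

```python
import json
from pathlib import Path

# Hypothetical contract: file name -> keys that must be present.
# Neither the folder name nor these file names were agreed in this issue.
EXPECTED_FILES = {
    "metrics.json": {"path", "port"},
    "dashboards.json": {"dashboards"},
}


def check_contract(app_dir: Path) -> list[str]:
    """Return human-readable problems; an empty list means the contract is met."""
    problems = []
    obs_dir = app_dir / "observability"  # assumed location
    for name, required_keys in EXPECTED_FILES.items():
        file_path = obs_dir / name
        if not file_path.is_file():
            problems.append(f"{name}: missing")
            continue
        try:
            content = json.loads(file_path.read_text())
        except json.JSONDecodeError as err:
            problems.append(f"{name}: invalid JSON ({err.msg})")
            continue
        if not isinstance(content, dict):
            problems.append(f"{name}: expected an object at top level")
            continue
        missing = sorted(required_keys - content.keys())
        if missing:
            problems.append(f"{name}: missing keys {missing}")
    return problems
```

A check like this could run on every Pull Request, so an AppDev gets feedback without having to involve a PlatformOp.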

Also, we should probably reuse as much as possible of the already established standards in that cloud observability field, because that means it will be easier for anyone not familiar with the platform. (Openmetrics/OpenTelemetry)

Observability

Multiple dimensions

It's important to note that observability covers multiple things:

Many applications will not have all of this, so our standard/interface should be composed of mostly independent "modules" (or whatever we call them, "subinterfaces"?) -> ("I can use the log service even if I don't have metrics implemented").
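To illustrate the independence of modules, here is a rough sketch in which every part of the contract is optional and knows nothing about the others. The class names, fields and defaults are hypothetical and only serve the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogsModule:
    format: str = "json"  # assumed default, for illustration only


@dataclass
class MetricsModule:
    path: str = "/metrics"  # assumed default
    port: int = 8080        # assumed default


@dataclass
class ObservabilityContract:
    """Each module is optional and carries no reference to the others."""
    logs: Optional[LogsModule] = None
    metrics: Optional[MetricsModule] = None

    def enabled_modules(self) -> list[str]:
        # A module counts as enabled as soon as it is declared at all.
        return [name for name, module in vars(self).items() if module is not None]
```

An application that only declares `logs=LogsModule()` would get the log service and nothing else, without any error about missing metrics.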

Levels / Tiers

It might be possible to have two levels in some parts of the interface (for example, with metrics: the first relies on the bare defaults of the Prometheus Python client, the second needs some configuration?). I might be going out on a limb here.
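One way the platform could tell the two levels apart, sketched under assumptions: read the text exposition of an application's metrics endpoint and check whether it contains anything beyond what the Prometheus Python client registers by default (its process, garbage-collector and platform collectors). The tier numbering, including a tier 0 for "nothing exposed", is made up for this example.

```python
# Prefixes of the metrics the Prometheus Python client registers by default
# (process, garbage-collector and platform collectors).
DEFAULT_PREFIXES = ("process_", "python_gc_", "python_info")


def metric_names(exposition: str) -> set[str]:
    """Extract metric names from Prometheus text exposition format."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line is `name{labels} value` or `name value`.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
    return names


def tier(exposition: str) -> int:
    """0: nothing exposed; 1: only client defaults; 2: custom metrics too."""
    names = metric_names(exposition)
    if not names:
        return 0
    if all(name.startswith(DEFAULT_PREFIXES) for name in names):
        return 1
    return 2
```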

Different users/personas consume different metrics

Might depend on the application, but from my experience, the AppDev does not usually use the same metrics as the OnCall or PlatformOp ("what does it do?" vs "is it up?").


Note: regarding the location of the files, I don't think we should go for a git repository. An application in the k8s space is already a bunch of YAML files + container images, and observability parameters are very much part of the application, so I would bundle them with it (the Prometheus CRDs are a good example, like ServiceMonitor). Most things can go in either a CRD or an appropriately named/labelled ConfigMap, or use annotations on the objects (filebeat log scraping can be configured by Pods using annotations, for example).
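For reference, a ServiceMonitor bundled with an application could look roughly like the manifest built below. The app name, the `app` label convention, the port name and the 30s interval are placeholders, not platform decisions; the manifest is emitted as JSON, which `kubectl` accepts just like YAML.

```python
import json


def service_monitor(app_name: str, port_name: str = "metrics", path: str = "/metrics") -> dict:
    """Build a ServiceMonitor manifest selecting Services labelled `app: <app_name>`."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        # The "-monitor" suffix and the `app` label are assumptions for this sketch.
        "metadata": {"name": f"{app_name}-monitor", "labels": {"app": app_name}},
        "spec": {
            "selector": {"matchLabels": {"app": app_name}},
            # `port` refers to a named port on the selected Service.
            "endpoints": [{"port": port_name, "path": path, "interval": "30s"}],
        },
    }


# Print the manifest so it can be stored next to the application's other objects.
print(json.dumps(service_monitor("my-app"), indent=2))
```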


Some user stories from a PlatformOp POV (some apply equally to OnCallOp):

schwesig commented 2 years ago

flowchart TD;
ApLo[App Logs]
OpLo[Operation Logs]
RuBo[Run Books]

ApDe([App Developer])
PlOp([Platform Operator])
OCOp([On Call Operator])

subgraph DePr[Devices and Processes]
    App
    User
    Operator
    OS
    Support
    CoLo[Communication Logs]
end

App --> ApLo
App --> Issues
User --> ApLo
User --> Issues
Operator --> OpLo
OS --> OpLo
Support --> RuBo
Support --> Issues
CoLo --> RuBo

subgraph Out[Output]
    ApLo
    OpLo
    RuBo
    Issues
end

subgraph Pe[Personas]
    ApDe
    PlOp
    OCOp
end

subgraph AuGe[Auto Generated]
    ?A
    ?B
    ?C
end

subgraph MaGe[Manually Generated]
    ?D
    ?E
    ?F
end

subgraph Need[Needed]
    ?G
    ?H
    ?I
end

subgraph Miss[Missing]
    ?J
    ?K
    ?L
end

?1 --> ApDe
?2 --> PlOp
?3 --> OCOp

subgraph In[Input]
    ?1
    ?2
    ?3
end

schwesig commented 2 years ago

sequenceDiagram

    participant App Developer
    participant Platform Operator
    participant User
    participant On Call Operator

    App Developer->>Platform Operator: Deploy App
    Platform Operator->>App Developer: Feedback Running?
    loop Get it started
        App Developer->>On Call Operator: Issue
        On Call Operator->>App Developer: Fixed
        App Developer->>Platform Operator: Deploy App
    end
    App Developer->>User: Accept Users/ Go Live
    loop During Lifetime
        User->>On Call Operator: Issue
        On Call Operator->>App Developer: Fixed
        App Developer->>Platform Operator: Deploy App
    end

schwesig commented 1 year ago

/close cleanup after changing orga