operate-first / blueprint

This is the blueprint for the Operate First Initiative
GNU General Public License v3.0

ADR for Data Collection #80

Closed HumairAK closed 2 years ago

HumairAK commented 3 years ago

The Operate First environments will create a vast amount of operational data from platform systems and user workloads. We want to publish the data under a license agreement that is similar to an open source license agreement. We still have to operate within the boundaries of the law and therefore cannot publish data that would break it.

We need an ADR (architecture decision record) that lays out the options for satisfying this requirement.

msdisme commented 3 years ago

Is there an issue for tracking the details of what we would like to capture? I'll want something similar for Decorus, so that I can use it as a basis for discussions with university legal and the IRB.

tumido commented 3 years ago

Data we want to collect and share:

  1. Application logs from all the applications running in the cluster. These can contain really anything: if an application logs, for example, which users are connecting to it, we will collect that. Example (ODH JupyterHub):

    [I 2021-05-05 10:02:42.206 JupyterHub pages:402] tcoufal@redhat.com is pending spawn
    [I 2021-05-05 10:02:42.210 JupyterHub log:189] 200 GET /hub/spawn-pending/tcoufal@redhat.com (tcoufal@redhat.com@::ffff:10.131.0.1) 13.28ms
    10:02:47.190 [ConfigProxy] info: 200 GET /api/routes
    http://10.131.2.139:9090!=http://10.131.2.139:8080
    2021-05-05 10:03:00.462 JupyterHub proxy:282] Adding user tcoufal@redhat.com to proxy /user/tcoufal@redhat.com/ => http://10.131.3.105:8080
    10:03:00.465 [ConfigProxy] info: Adding route /user/tcoufal@redhat.com -> http://10.131.3.105:8080
    10:03:00.465 [ConfigProxy] info: Route added /user/tcoufal@redhat.com -> http://10.131.3.105:8080
    10:03:00.465 [ConfigProxy] info: 201 POST /api/routes/user/tcoufal@redhat.com
    [I 2021-05-05 10:03:00.468 JupyterHub log:189] 200 GET /hub/api (@10.131.3.105) 1.97ms
    [I 2021-05-05 10:03:00.469 JupyterHub users:671] Server tcoufal@redhat.com is ready
    [I 2021-05-05 10:03:00.471 JupyterHub log:189] 200 GET /hub/api/users/tcoufal@redhat.com/server/progress (tcoufal@redhat.com@::ffff:10.131.0.1) 18057.13ms
    [I 2021-05-05 10:03:00.528 JupyterHub log:189] 200 POST /hub/api/users/tcoufal@redhat.com/activity (tcoufal@redhat.com@10.131.3.105) 33.95ms
    [I 2021-05-05 10:03:00.613 JupyterHub log:189] 302 GET /hub/spawn-pending/tcoufal@redhat.com -> /user/tcoufal@redhat.com/ (tcoufal@redhat.com@::ffff:10.131.0.1) 6.94ms
    [I 2021-05-05 10:03:01.023 JupyterHub log:189] 302 GET /hub/api/oauth2/authorize?client_id=jupyterhub-user-tcoufal%2540redhat.com&redirect_uri=%2Fuser%2Ftcoufal%40redhat.com%2Foauth_callback&response_type=code&state=[secret] -> /user/tcoufal@redhat.com/oauth_callback?code=[secret]&state=[secret] (tcoufal@redhat.com@::ffff:10.131.0.1) 34.50ms
    [I 2021-05-05 10:03:01.215 JupyterHub log:189] 200 POST /hub/api/oauth2/token (tcoufal@redhat.com@10.131.3.105) 53.71ms
    [I 2021-05-05 10:03:01.246 JupyterHub log:189] 200 GET /hub/api/authorizations/token/[secret] (tcoufal@redhat.com@10.131.3.105) 24.52ms
    10:03:02.662 [ConfigProxy] info: 200 GET /api/routes
  2. Application metrics, if the application exposes them. Each application defines which metrics to expose. These may include PII, for example if a username is used to name a pod (labels can be anything, really). Example (ODH JupyterHub):

    # HELP jupyterhub_server_spawn_duration_seconds time taken for server spawning operation
    # TYPE jupyterhub_server_spawn_duration_seconds histogram
    jupyterhub_server_spawn_duration_seconds_bucket{le="0.5",status="success"} 0.0
    jupyterhub_server_spawn_duration_seconds_bucket{le="1.0",status="success"} 0.0
    jupyterhub_server_spawn_duration_seconds_bucket{le="2.5",status="success"} 0.0
    jupyterhub_server_spawn_duration_seconds_bucket{le="5.0",status="success"} 0.0
    jupyterhub_server_spawn_duration_seconds_bucket{le="10.0",status="success"} 1.0
    jupyterhub_server_spawn_duration_seconds_bucket{le="15.0",status="success"} 8.0
    jupyterhub_server_spawn_duration_seconds_bucket{le="30.0",status="success"} 27.0
    jupyterhub_server_spawn_duration_seconds_bucket{le="60.0",status="success"} 42.0
    jupyterhub_server_spawn_duration_seconds_bucket{le="120.0",status="success"} 52.0
    jupyterhub_server_spawn_duration_seconds_bucket{le="+Inf",status="success"} 57.0
    jupyterhub_server_spawn_duration_seconds_count{status="success"} 57.0
    jupyterhub_server_spawn_duration_seconds_sum{status="success"} 3389.5434402088904
  3. Platform events - events generated by the OCP platform itself. Example (spawning a pod):

    {"apiVersion":"v1","count":1,"eventTime":null,"firstTimestamp":"2021-05-05T10:07:31Z","involvedObject":{"apiVersion":"v1","kind":"Pod","name":"jupyterhub-nb-tcoufal-40redhat-2ecom","namespace":"opf-jupyterhub","resourceVersion":"209536743","uid":"7a27741f-a72d-4f0d-bf17-3cd0d3ede494"},"kind":"Event","lastTimestamp":"2021-05-05T10:07:31Z","message":"Add eth0 [10.131.3.106/23]","metadata":{"creationTimestamp":"2021-05-05T10:07:31Z","managedFields":[{"apiVersion":"v1","fieldsType":"FieldsV1","fieldsV1":{"f:count":{},"f:firstTimestamp":{},"f:involvedObject":{"f:apiVersion":{},"f:kind":{},"f:name":{},"f:namespace":{},"f:resourceVersion":{},"f:uid":{}},"f:lastTimestamp":{},"f:message":{},"f:reason":{},"f:source":{"f:component":{}},"f:type":{}},"manager":"multus","operation":"Update","time":"2021-05-05T10:07:31Z"}],"name":"jupyterhub-nb-tcoufal-40redhat-2ecom.167c23baee4d0e1c","namespace":"opf-jupyterhub","resourceVersion":"209537358","selfLink":"/api/v1/namespaces/opf-jupyterhub/events/jupyterhub-nb-tcoufal-40redhat-2ecom.167c23baee4d0e1c","uid":"40c934ef-4bd2-4f41-8686-e5c979adec62"},"reason":"AddedInterface","reportingComponent":"","reportingInstance":"","source":{"component":"multus"},"type":"Normal"}
    {"apiVersion":"v1","count":1,"eventTime":null,"firstTimestamp":"2021-05-05T10:07:32Z","involvedObject":{"apiVersion":"v1","fieldPath":"spec.containers{notebook}","kind":"Pod","name":"jupyterhub-nb-tcoufal-40redhat-2ecom","namespace":"opf-jupyterhub","resourceVersion":"209536741","uid":"7a27741f-a72d-4f0d-bf17-3cd0d3ede494"},"kind":"Event","lastTimestamp":"2021-05-05T10:07:32Z","message":"Container image \"quay.io/thoth-station/s2i-minimal-notebook@sha256:eacfa74842ce6330991d945408bb37c3e8f37246ff3f1b98837cf7ae4f5a78af\" already present on machine","metadata":{"creationTimestamp":"2021-05-05T10:07:32Z","managedFields":[{"apiVersion":"v1","fieldsType":"FieldsV1","fieldsV1":{"f:count":{},"f:firstTimestamp":{},"f:involvedObject":{"f:apiVersion":{},"f:fieldPath":{},"f:kind":{},"f:name":{},"f:namespace":{},"f:resourceVersion":{},"f:uid":{}},"f:lastTimestamp":{},"f:message":{},"f:reason":{},"f:source":{"f:component":{},"f:host":{}},"f:type":{}},"manager":"kubelet","operation":"Update","time":"2021-05-05T10:07:32Z"}],"name":"jupyterhub-nb-tcoufal-40redhat-2ecom.167c23bb0d9cb74e","namespace":"opf-jupyterhub","resourceVersion":"209537393","selfLink":"/api/v1/namespaces/opf-jupyterhub/events/jupyterhub-nb-tcoufal-40redhat-2ecom.167c23bb0d9cb74e","uid":"4677bec6-ee3a-4866-9d0f-b3c3e06f86f6"},"reason":"Pulled","reportingComponent":"","reportingInstance":"","source":{"component":"kubelet","host":"os-wrk-1"},"type":"Normal"}
  4. Platform logs are similar to the application logs, but generated by the OCP platform itself. Example (OAuth logs):

    I0427 19:19:02.124608       1 named_certificates.go:53] loaded SNI cert [1/"sni-serving-cert::/var/config/system/secrets/v4-0-config-system-router-certs/apps.zero.massopen.cloud::/var/config/system/secrets/v4-0-config-system-router-certs/apps.zero.massopen.cloud"]: "api.zero.massopen.cloud" [serving,client] validServingFor=[*.apps.zero.massopen.cloud,api.zero.massopen.cloud] issuer="R3" (2021-03-08 12:41:20 +0000 UTC to 2021-06-06 12:41:20 +0000 UTC (now=2021-04-27 19:19:02.124599505 +0000 UTC))
    I0427 19:19:02.124830       1 named_certificates.go:53] loaded SNI cert [0/"self-signed loopback"]: "apiserver-loopback-client@1619551141" [serving] validServingFor=[apiserver-loopback-client] issuer="apiserver-loopback-client-ca@1619551141" (2021-04-27 18:19:00 +0000 UTC to 2022-04-27 18:19:00 +0000 UTC (now=2021-04-27 19:19:02.124819977 +0000 UTC))
    E0427 19:21:28.160157       1 osinserver.go:91] internal error: system:serviceaccount:opf-jupyterhub:jupyterhub-hub has no redirectURIs; set serviceaccounts.openshift.io/oauth-redirecturi.<some-value>=<redirect> or create a dynamic URI using serviceaccounts.openshift.io/oauth-redirectreference.<some-value>=<reference>
    E0427 19:21:28.160157       1 osinserver.go:91] internal error: system:serviceaccount:openshift-logging:kibana has no redirectURIs; set serviceaccounts.openshift.io/oauth-redirecturi.<some-value>=<redirect> or create a dynamic URI using serviceaccounts.openshift.io/oauth-redirectreference.<some-value>=<reference>
    E0427 19:21:40.496897       1 osinserver.go:91] internal error: system:serviceaccount:openshift-logging:kibana has no redirectURIs; set serviceaccounts.openshift.io/oauth-redirecturi.<some-value>=<redirect> or create a dynamic URI using serviceaccounts.openshift.io/oauth-redirectreference.<some-value>=<reference>
    E0427 19:21:40.496905       1 osinserver.go:91] internal error: system:serviceaccount:opf-jupyterhub:jupyterhub-hub has no redirectURIs; set serviceaccounts.openshift.io/oauth-redirecturi.<some-value>=<redirect> or create a dynamic URI using serviceaccounts.openshift.io/oauth-redirectreference.<some-value>=<reference>
    E0427 19:21:46.638010       1 osinserver.go:91] internal error: system:serviceaccount:opf-jupyterhub:jupyterhub-hub has no redirectURIs; set serviceaccounts.openshift.io/oauth-redirecturi.<some-value>=<redirect> or create a dynamic URI using serviceaccounts.openshift.io/oauth-redirectreference.<some-value>=<reference>
    E0428 14:28:36.180088       1 osinserver.go:91] internal error: system:serviceaccount:opf-monitoring:grafana-serviceaccount has no redirectURIs; set serviceaccounts.openshift.io/oauth-redirecturi.<some-value>=<redirect> or create a dynamic URI using serviceaccounts.openshift.io/oauth-redirectreference.<some-value>=<reference>
    E0503 19:37:02.866939       1 authentication.go:53] Unable to authenticate the request due to an error: [invalid bearer token, context canceled]
  5. Platform metrics - same data structure as the application metrics, but generated by OCP itself. Sample of the kube_pod_info metric:

    kube_pod_info{container="kube-rbac-proxy-main", created_by_kind="", created_by_name="", endpoint="https-main", host_ip="192.12.185.110", job="kube-state-metrics", namespace="openshift-etcd", node="os-ctrl-0", pod="revision-pruner-5-os-ctrl-0", pod_ip="10.130.0.7", priority_class="system-node-critical", service="kube-state-metrics", uid="c5a1bc73-f28b-4e0f-9cc6-c2c7abd5b0b8"} 1
    kube_pod_info{container="kube-rbac-proxy-main", created_by_kind="", created_by_name="", endpoint="https-main", host_ip="192.12.185.110", job="kube-state-metrics", namespace="openshift-kube-apiserver", node="os-ctrl-0", pod="revision-pruner-22-os-ctrl-0", pod_ip="10.130.0.139", priority_class="system-node-critical", service="kube-state-metrics", uid="721d7288-3b7a-4460-be92-bea36e3539fa"} 1
    kube_pod_info{container="kube-rbac-proxy-main", created_by_kind="", created_by_name="", endpoint="https-main", host_ip="192.12.185.110", job="kube-state-metrics", namespace="openshift-kube-controller-manager", node="os-ctrl-0", pod="revision-pruner-12-os-ctrl-0", pod_ip="10.130.0.141", priority_class="system-node-critical", service="kube-state-metrics", uid="c92a8983-6ba6-42ed-af6f-535aed848e67"} 1
    kube_pod_info{container="kube-rbac-proxy-main", created_by_kind="", created_by_name="", endpoint="https-main", host_ip="192.12.185.110", job="kube-state-metrics", namespace="openshift-kube-scheduler", node="os-ctrl-0", pod="revision-pruner-11-os-ctrl-0", pod_ip="10.130.0.140", priority_class="system-node-critical", service="kube-state-metrics", uid="e799030c-703f-4654-b896-8493f3e2dd35"} 1
    kube_pod_info{container="kube-rbac-proxy-main", created_by_kind="", created_by_name="", endpoint="https-main", host_ip="192.12.185.111", job="kube-state-metrics", namespace="openshift-etcd", node="os-ctrl-1", pod="revision-pruner-5-os-ctrl-1", pod_ip="10.128.0.121", priority_class="system-node-critical", service="kube-state-metrics", uid="e8df07cc-9ad9-4479-8922-556b2a1cc2ae"} 1



  6. We also collect data derived from the above, such as alerts, which are calculated directly from metrics, e.g.: https://github.com/operate-first/alerts/issues/5609
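
Several of the samples above carry user identifiers in plain text (for example `tcoufal@redhat.com` in the JupyterHub logs). Below is a minimal sketch of one way such lines might be pseudonymized before publication. The `pseudonymize` helper, the regular expression, and the salt are assumptions made for illustration, not part of any existing Operate First tooling. It would not catch escaped forms such as the pod name `jupyterhub-nb-tcoufal-40redhat-2ecom`, so those would need separate handling.

```python
import hashlib
import re

# Hypothetical pattern: matches plain e-mail addresses like the ones in the
# JupyterHub log sample. It requires a dot in the domain part, so strings such
# as "apiserver-loopback-client@1619551141" are left untouched.
EMAIL_RE = re.compile(r"[A-Za-z0-9._+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def pseudonymize(line: str, salt: str = "example-salt") -> str:
    """Replace each e-mail address in a log line with a stable, salted token.

    The same address always maps to the same token (for a given salt), so
    events belonging to one user can still be correlated after replacement.
    """

    def _token(match):
        raw = (salt + match.group(0)).encode("utf-8")
        return "user-" + hashlib.sha256(raw).hexdigest()[:12]

    return EMAIL_RE.sub(_token, line)


sample = "[I 2021-05-05 10:03:00.469 JupyterHub users:671] Server tcoufal@redhat.com is ready"
print(pseudonymize(sample))
```

A salted hash keeps the mapping stable without publishing the address itself. Whether that is sufficient anonymization is a legal question this sketch does not answer.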

Data we host for users and their applications. We do not collect this data intentionally, but users can share it via our platform:
- We provide block storage which is used by applications and users to store data. Direct access to this block storage is available within the platform only and data can be retrieved only via proxy (the application mounting the storage itself).
- We provide object storage that can be accessed externally: users can reach this data from outside the platform if they have credentials for their object storage bucket.

msdisme commented 3 years ago

Thanks, this is great! Should I break the details in the comment above out into a separate issue, or does it make sense for them to live here?

msdisme commented 3 years ago

A quick update: I met with the folks who handle IRB review and am scheduling a follow-up discussion with them to dive deeper into the data.

durandom commented 3 years ago

Operational data specifically excludes users' own data sets, i.e. it is only data that is generated by the platform: logs, metrics, telemetry. For logs, it excludes logs from the workload pods but includes logs from the platform pods, e.g. JupyterHub vs. etcd. For metrics, it will include CPU metrics for workload pods, but not metrics that the application exposes, e.g. JupyterHub metrics vs. pod metrics.

The same definition can be made for workload data, which should be governed by an opt-in or opt-out policy - see https://github.com/operate-first/blueprint/issues/87
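
The operational-data definition above can be read as a small decision rule. A rough sketch follows, under the assumption that each record is tagged with its kind and origin. The `Record` type, its field names, and the string labels are hypothetical and invented for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical description of one piece of collected data."""

    kind: str                     # "log", "metric", or "dataset"
    source: str                   # "platform" or "workload": which pods it concerns
    emitted_by_app: bool = False  # True if the application itself exposed the metric


def is_operational(record: Record) -> bool:
    """Return True if the record counts as operational data under the rule above."""
    if record.kind == "dataset":
        # Users' own data sets are never operational data.
        return False
    if record.kind == "log":
        # Platform pod logs (e.g. etcd) are in; workload pod logs (e.g. JupyterHub) are out.
        return record.source == "platform"
    if record.kind == "metric":
        # Pod-level metrics such as CPU are in, even for workload pods;
        # metrics exposed by the application itself are out.
        return not record.emitted_by_app
    # Assumption of this sketch: anything unrecognized is treated as not operational.
    return False


print(is_operational(Record("log", "platform")))                       # etcd logs
print(is_operational(Record("metric", "workload", emitted_by_app=True)))  # JupyterHub metrics
```

Anything that falls outside the rule would then be workload data, subject to whatever opt-in or opt-out policy comes out of the linked issue.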

billburnseh commented 3 years ago

No updates from BU yet.

billburnseh commented 3 years ago

The Data Usage Agreement (DUA) is on the table and being discussed, including access to telemetry without anonymization.

quaid commented 3 years ago

> We want to publish the data under a license agreement that is similar to an open source license agreement. We still have to operate in the boundaries of law and therefore cannot publish data that would break the law.

Let's pull together a workstream to study this and advise on an approach from an open source licensing perspective:

https://github.com/operate-first/community/issues/79

sesheta commented 2 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

sesheta commented 2 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

sesheta commented 2 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

/close

sesheta commented 2 years ago

@sesheta: Closing this issue.

In response to [this](https://github.com/operate-first/blueprint/issues/80#issuecomment-1019570271):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.