m-lab / traceroute-caller

A sidecar service which runs traceroute after a connection closes
Apache License 2.0

Add bigquery key-value tags to ScamperOutput struct #123

Closed cristinaleonr closed 2 years ago

cristinaleonr commented 2 years ago

When loading data into BigQuery, the Gardener cannot find some fields of the ScamperOutput struct, such as CycleStart.ListName, because the struct has no bigquery tag mapping that field to the list_name field in BigQuery.

This PR adds bigquery tags to the struct.
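A minimal sketch of the kind of tagging this describes. The `CycleStart` type below is a hypothetical, trimmed-down stand-in for the real struct, and `bigQueryFieldName` is an illustrative helper, not part of traceroute-caller or of the BigQuery client. It assumes that a schema built from struct tags falls back to the Go field name when no `bigquery` tag is present, which is consistent with the error shown below.

```go
package main

import (
	"fmt"
	"reflect"
)

// CycleStart is a hypothetical, trimmed-down stand-in for the cycle-start
// record nested in ScamperOutput; the real struct may have other fields.
type CycleStart struct {
	Type     string `json:"type" bigquery:"type"`
	ListName string `json:"list_name" bigquery:"list_name"`
}

// bigQueryFieldName returns the column name that a tag-driven schema would
// use for the named field: the bigquery tag if present, otherwise the Go
// field name. v must be a struct value (not a pointer).
func bigQueryFieldName(v interface{}, field string) string {
	f, ok := reflect.TypeOf(v).FieldByName(field)
	if !ok {
		return ""
	}
	if tag, ok := f.Tag.Lookup("bigquery"); ok && tag != "" {
		return tag
	}
	return f.Name
}

func main() {
	// With the tag in place, the column name matches the JSON key.
	fmt.Println(bigQueryFieldName(CycleStart{}, "ListName")) // prints "list_name"
}
```

Without the `bigquery:"list_name"` tag, the same helper would return `ListName`, so a JSON row keyed by `list_name` would not match the schema.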

Example error:

{
  "textPayload": "2021/09/20 10:58:31 actions.go:126: --- 20210919:ndt/scamper1 Load {Location: \"gs://etl-mlab-sandbox/ndt/scamper1/2021/09/19/20210919T003001.876612Z-scamper1-mlab1-bog03-ndt.tgz.json\"; Message: \"Error while reading data, error message: JSON parsing error in row starting at position 0: No such field: raw.CycleStart.list_name.\"; Reason: \"invalid\"}\n",
  "insertId": "8jl2j1xpvjsntyhv",
  "resource": {
    "type": "k8s_container",
    "labels": {
      "cluster_name": "data-processing",
      "location": "us-east1",
      "pod_name": "etl-gardener-universal-67fb879b4b-kfk7t",
      "namespace_name": "default",
      "project_id": "mlab-sandbox",
      "container_name": "etl-gardener"
    }
  },
  "timestamp": "2021-09-20T10:58:31.514515778Z",
  "severity": "ERROR",
  "labels": {
    "k8s-pod/pod-template-hash": "67fb879b4b",
    "k8s-pod/run": "etl-gardener-universal",
    "compute.googleapis.com/resource_name": "gke-data-processing-default-pool-f4fc41ed-xqmo"
  },
  "logName": "projects/mlab-sandbox/logs/stderr",
  "receiveTimestamp": "2021-09-20T10:58:35.364763641Z"
}

https://pantheon.corp.google.com/logs/query;cursorTimestamp=2021-09-20T10:58:31.514515778Z;query=resource.type%3D%22k8s_container%22%0Aresource.labels.project_id%3D%22mlab-sandbox%22%0Aresource.labels.location%3D%22us-east1%22%0Aresource.labels.cluster_name%3D%22data-processing%22%0Aresource.labels.namespace_name%3D%22default%22%0Alabels.k8s-pod%2Frun%3D%22etl-gardener-universal%22%20severity%3E%3DDEFAULT%0A%22scamper1%22%0Atimestamp%3D%222021-09-20T10:58:31.514515778Z%22%0AinsertId%3D%228jl2j1xpvjsntyhv%22%0Atimestamp%3D%222021-09-20T10:58:31.514515778Z%22%0AinsertId%3D%228jl2j1xpvjsntyhv%22;timeRange=2021-09-20T13:56:20.777Z%2F2021-09-20T13:56:20.777Z--PT6H?project=mlab-sandbox



cristinaleonr commented 2 years ago

As a background question, I'd like to know what the best practice is for cases like this. In other words, can we have a single source of truth?

These fields are defined only in traceroute-caller (TRC), which the pipeline references. If we repeated these fields in the pipeline, we wouldn't need this TRC change, but we want them defined in only one place (TRC).

That said, I understand your point that development in the pipeline shouldn't require going back and making changes in TRC. My suggestion would be to have someone from the pipeline review the common schema with this in mind while it's in design/development. (I could have done it, but didn't know; at least now we know for next time.)