tensorflow / io

Dataset, streaming, and file system extensions maintained by TensorFlow SIG-IO
Apache License 2.0
696 stars 282 forks source link

Reading from Prometheus yields incorrect tensor format #717

Closed sbaier1 closed 4 years ago

sbaier1 commented 4 years ago

I'm trying to ingest data from Prometheus for multivariate time-series analysis and am running into some issues. My query returns multiple range vectors from different metrics, like such: Bildschirmfoto 2020-01-07 um 18 32 34 with a long values array for each metric.

However, when running that same query using TFIO's from_prometheus method:

>>> IOTensor.from_prometheus(query)
<PrometheusIOTensor: spec=(TensorSpec(shape=(491,), dtype=tf.int64, name=None), TensorSpec(shape=(491, 1), dtype=tf.float64, name=None))>

The shape is not what the metric API returns, the shape should be (n,4) for the n amount of data points in the range vectors.

an example query that would give you a similar result on any Prometheus deployment for testing would be something similar to: ({__name__ =~ "apiserver_request_.+"})[1h:1m] you can use a "subquery" to get a range vector for an instant vector query.

sbaier1 commented 4 years ago

it appears that the go binding only writes values for the first metric in the matrix received from Prometheus https://github.com/tensorflow/io/blob/master/tensorflow_io/core/go/prometheus.go#L38

yongtang commented 4 years ago

@sbaier1 Thanks for raising the issue. I will take a look.

sbaier1 commented 4 years ago

great!

as a side note, i just thought about how this would be implemented (building a workaround for myself right now) and i believe it might make sense to assign the metric dictionary as the name field of the TensorSpec, or setting a selected field (probably __name__) in the metric dictionary as the name of the tensorspec. (if this is even possible or makes sense)

Prometheus will, by design, return the result columns in a non-deterministic order meaning that the resulting dataset isn't consistent when querying repeatedly. In general sorting the output tensor columns by __name__ should work but i'm sure that there are use cases that would want to sort by other or multiple labels from the metric dictionary.

So in summary, not sure how to handle this properly in general.

Another option would be continuing to only return the (timestamp, value) tensor format but allowing proper range queries in the signature. This would allow the user to query each metric in the specific timeframe separately and then build a dataset from that. The problem at the moment is that using subqueries if we query multiple times the timeframe shifts because it always uses the timeframe starting from the instant the query arrives at prometheus.

What do you think?

yongtang commented 4 years ago

@sbaier1 I noticed that __name__ could have a collision in case of multiple jobs or multiple instances. For example, a query with up might be:

curl http://localhost:9090/api/v1/query?query=up

with response:

{"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"up","instance":"localhost:9090","job":"prometheus"},"value":[1578935231.08,"1"]},{"metric":{"__name__":"up","instance":"localhost:9153","job":"coredns"},"value":[1578935231.08,"1"]}]}}

In that situation, I think an option is to return a structured entries in:

{
   'job1': {
      'instance1': {
        'up': ...,
        'other_metric': ...,
      }
      'instance2': {
        'up': ...,
        'other_metric': ...,
      }
    },
   'job2': {

    }
}

TF is capable of processing structured dataset so this might not be an challenge. In eager mode the structure could be probed automatically. (In graph mode user may need to provide a structure).

Will spend some time to see if the above route is feasible.

sbaier1 commented 4 years ago

sounds like that would be the preferrable approach indeed moving forward 👍

yongtang commented 4 years ago

Added PR #746 for structured value support.

The structure of the value (in a complex query) is automatically probed in eager mode.

In graph mode due to the way graph mode works, the spec has to be provided by user (through passing a spec arg).

I also moves the API from tfio.IODataset.from_prometheus to tfio.experimenta.IODataset.from_prometheus, as there are API changes. We will move back after API stabilize in (maybe) one or two releases.