e2e read storeAPI -> aggregate -> write to parquet

iNecas commented 4 years ago

On the path for the first working e2e:

[x] - basic structures
[x] - finish StoreAPI input to connect to the gRPC API and produce the input.SeriesIterator
[x] - parquet implementation for writer interface: should be straightforward at this point, as it should mainly be mapping the dataframe.Schema to the one the parquet library understands
[x] - usage from cmd

Example (current state):

# Pull data for a specific metric from a StoreAPI (sidecar or store) and save into parquet
$ go run ./cmd/obslytics export --input-config='{"endpoint":"127.0.0.1:10901","tls_config":{"insecure_skip_verify":true}}'\
      --match="net_conntrack_dialer_conn_attempted_total"\
      --resolution=1h --min-time="$(date -uI)T00:00:00Z" --max-time="$(date -uI)T23:59:59Z"\
      --out=net_conntrack_dialer_conn_attempted_total.parquet\
      --debug                                         
| dialer_name     instance        job         prometheus  _sample_start  _sample_end  _min_time  _max_time  _count  _sum    _min  _max  |
| default         localhost:9090  prometheus  prom-0      11:00:00       12:00:00     11:37:21   11:59:56   272     0       0     0     |
| prometheus      localhost:9090  prometheus  prom-0      11:00:00       12:00:00     11:37:21   11:59:56   272     37128   1     272   |
| thanos-query    localhost:9090  prometheus  prom-0      11:00:00       12:00:00     11:37:21   11:59:56   272     544     2     2     |
| thanos-receive  localhost:9090  prometheus  prom-0      11:00:00       12:00:00     11:37:21   11:59:56   272     110840  1     814   |
| thanos-sidecar  localhost:9090  prometheus  prom-0      11:00:00       12:00:00     11:37:21   11:59:56   272     272     1     1     |
| thanos-store    localhost:9090  prometheus  prom-0      11:00:00       12:00:00     11:37:21   11:59:56   272     271     0     1     |
| default         localhost:9090  prometheus  prom-0      12:00:00       13:00:00     12:00:01   12:34:11   411     0       0     0     |
| prometheus      localhost:9090  prometheus  prom-0      12:00:00       13:00:00     12:00:01   12:34:11   411     196458  273   683   |
| thanos-query    localhost:9090  prometheus  prom-0      12:00:00       13:00:00     12:00:01   12:34:11   411     822     2     2     |
| thanos-receive  localhost:9090  prometheus  prom-0      12:00:00       13:00:00     12:00:01   12:34:11   411     588552  817   2047  |
| thanos-sidecar  localhost:9090  prometheus  prom-0      12:00:00       13:00:00     12:00:01   12:34:11   411     411     1     1     |
| thanos-store    localhost:9090  prometheus  prom-0      12:00:00       13:00:00     12:00:01   12:34:11   411     411     1     1     |
level=info ts=2020-08-14T12:34:13.687660836Z caller=main.go:108 msg=exiting cmd=export

# Example of loading the parquet file from Python:
$ ipython -c 'import pandas as pd; pd.read_parquet("net_conntrack_dialer_conn_attempted_total.parquet")'
Out[1]: 
       dialer_name        instance         job prometheus       _sample_start         _sample_end           _min_time           _max_time  _count      _sum   _min    _max
0          default  localhost:9090  prometheus     prom-0 2020-08-14 11:00:00 2020-08-14 12:00:00 2020-08-14 11:37:21 2020-08-14 11:59:56     272       0.0    0.0     0.0
1       prometheus  localhost:9090  prometheus     prom-0 2020-08-14 11:00:00 2020-08-14 12:00:00 2020-08-14 11:37:21 2020-08-14 11:59:56     272   37128.0    1.0   272.0
2     thanos-query  localhost:9090  prometheus     prom-0 2020-08-14 11:00:00 2020-08-14 12:00:00 2020-08-14 11:37:21 2020-08-14 11:59:56     272     544.0    2.0     2.0
3   thanos-receive  localhost:9090  prometheus     prom-0 2020-08-14 11:00:00 2020-08-14 12:00:00 2020-08-14 11:37:21 2020-08-14 11:59:56     272  110840.0    1.0   814.0
4   thanos-sidecar  localhost:9090  prometheus     prom-0 2020-08-14 11:00:00 2020-08-14 12:00:00 2020-08-14 11:37:21 2020-08-14 11:59:56     272     272.0    1.0     1.0
5     thanos-store  localhost:9090  prometheus     prom-0 2020-08-14 11:00:00 2020-08-14 12:00:00 2020-08-14 11:37:21 2020-08-14 11:59:56     272     271.0    0.0     1.0
6          default  localhost:9090  prometheus     prom-0 2020-08-14 12:00:00 2020-08-14 13:00:00 2020-08-14 12:00:01 2020-08-14 12:34:11     411       0.0    0.0     0.0
7       prometheus  localhost:9090  prometheus     prom-0 2020-08-14 12:00:00 2020-08-14 13:00:00 2020-08-14 12:00:01 2020-08-14 12:34:11     411  196458.0  273.0   683.0
8     thanos-query  localhost:9090  prometheus     prom-0 2020-08-14 12:00:00 2020-08-14 13:00:00 2020-08-14 12:00:01 2020-08-14 12:34:11     411     822.0    2.0     2.0
9   thanos-receive  localhost:9090  prometheus     prom-0 2020-08-14 12:00:00 2020-08-14 13:00:00 2020-08-14 12:00:01 2020-08-14 12:34:11     411  588552.0  817.0  2047.0
10  thanos-sidecar  localhost:9090  prometheus     prom-0 2020-08-14 12:00:00 2020-08-14 13:00:00 2020-08-14 12:00:01 2020-08-14 12:34:11     411     411.0    1.0     1.0
11    thanos-store  localhost:9090  prometheus     prom-0 2020-08-14 12:00:00 2020-08-14 13:00:00 2020-08-14 12:00:01 2020-08-14 12:34:11     411     411.0    1.0     1.0

bwplotka commented 4 years ago

Added some comments, but it's hard to speak about interfaces without actual implementations first.

iNecas commented 4 years ago

Agreed, I mainly wanted to be transparent about the direction. Thanks for the comments; I will incorporate the parts that are clear and leave the foggy ones until we have something e2e.

iNecas commented 4 years ago

@bwplotka I've filled in more gaps in the aggregator implementation. The interfaces are starting to take shape. What's missing for end2end is:

[x] - finish StoreAPI input to connect to the gRPC API and produce the input.SeriesIterator
[x] - parquet implementation for writer interface: should be straightforward at this point, as it should mainly be mapping the dataframe.Schema to the one the parquet library understands
[x] - add command line options to allow running the code end2end.

I would like to finish the first two points, but it would be great for the command line options to be picked up by somebody else, as I might not have much time left this week to finish those. It can also be a good exercise to play with the code a bit and refactor if needed.
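To illustrate the kind of mapping the parquet writer item describes, here is a minimal Python sketch. It is not the obslytics code: the `ColumnKind` enum, the `to_parquet_schema` helper, and the choice of column kinds are all hypothetical, since the thread does not show the actual dataframe.Schema types or which parquet library is used. Only the parquet type names (BYTE_ARRAY, INT64, DOUBLE, UTF8, TIMESTAMP_MILLIS, UINT_64) come from the Parquet format itself.

```python
from enum import Enum
from typing import List, Optional, Tuple


class ColumnKind(Enum):
    """Hypothetical column kinds a dataframe schema might expose."""
    STRING = "string"
    TIME = "time"
    UINT = "uint"
    FLOAT = "float"


# (physical type, logical/converted type) pairs from the Parquet format.
PARQUET_TYPES = {
    ColumnKind.STRING: ("BYTE_ARRAY", "UTF8"),
    ColumnKind.TIME: ("INT64", "TIMESTAMP_MILLIS"),
    ColumnKind.UINT: ("INT64", "UINT_64"),
    ColumnKind.FLOAT: ("DOUBLE", None),
}


def to_parquet_schema(
    schema: List[Tuple[str, ColumnKind]],
) -> List[Tuple[str, str, Optional[str]]]:
    """Translate (name, kind) columns into (name, physical, logical) triples."""
    return [(name, *PARQUET_TYPES[kind]) for name, kind in schema]


# A few of the columns seen in the example output, with assumed kinds.
parquet_schema = to_parquet_schema([
    ("dialer_name", ColumnKind.STRING),
    ("_sample_start", ColumnKind.TIME),
    ("_count", ColumnKind.UINT),
    ("_sum", ColumnKind.FLOAT),
])
```

Keeping the translation in a single lookup table means a new column kind needs only one extra entry, which is roughly why the item can be expected to be straightforward.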

4n4nd commented 4 years ago

but would be great for the command line options to be picked by somebody else,

@iNecas I can try to do that, but I might need some hand holding initially.

iNecas commented 4 years ago

One step closer: the StoreAPI input is connected, with some initial wiring to allow running almost end2end:

go run ./cmd/obslytics export --input='{"endpoint":"127.0.0.1:10901","tls_config":{"insecure_skip_verify":true}}' --min-time="2020-07-03T13:15:04Z" --max-time="2020-07-03T18:00:00Z" --metric="net_conntrack_dialer_conn_attempted_total" --resolution=30m
| _id  dialer_name   instance        job         _sample_start  _sample_end  _min_time  _max_time  _count  _sum  _min  _max  |
| 1    alertmanager  localhost:9090  prometheus  13:30:00       14:00:00     13:30:04   13:39:49   40      0     0     0     |
| 1    alertmanager  localhost:9090  prometheus  16:00:00       16:30:00     16:29:49   16:29:49   1       0     0     0     |
| 1    alertmanager  localhost:9090  prometheus  16:30:00       17:00:00     16:30:04   16:59:49   120     0     0     0     |
| 1    alertmanager  localhost:9090  prometheus  17:00:00       17:30:00     17:00:04   17:29:49   120     0     0     0     |
| 1    default       localhost:9090  prometheus  13:30:00       14:00:00     13:30:04   13:39:49   40      0     0     0     |
| 1    default       localhost:9090  prometheus  16:00:00       16:30:00     16:29:49   16:29:49   1       0     0     0     |
| 1    default       localhost:9090  prometheus  16:30:00       17:00:00     16:30:04   16:59:49   120     0     0     0     |
| 1    default       localhost:9090  prometheus  17:00:00       17:30:00     17:00:04   17:29:49   120     0     0     0     |
| 1    prometheus    localhost:9090  prometheus  13:30:00       14:00:00     13:30:04   13:39:49   40      40    1     1     |
| 1    prometheus    localhost:9090  prometheus  16:00:00       16:30:00     16:29:49   16:29:49   1       1     1     1     |
| 1    prometheus    localhost:9090  prometheus  16:30:00       17:00:00     16:30:04   16:59:49   120     120   1     1     |
| 1    prometheus    localhost:9090  prometheus  17:00:00       17:30:00     17:00:04   17:29:49   120     120   1     1     |
| 1    alertmanager  localhost:9090  prometheus  17:30:00       18:00:00     17:30:04   17:59:49   120     0     0     0     |
| 1    default       localhost:9090  prometheus  17:30:00       18:00:00     17:30:04   17:59:49   120     0     0     0     |
| 1    prometheus    localhost:9090  prometheus  17:30:00       18:00:00     17:30:04   17:59:49   120     120   1     1     |

I've tested only with my very limited Thanos instance; it would be great to see some real-world performance.

Next step: add the parquet piece.

@4n4nd once it gets end2end, it might be more clear how to move the thing forward. It's getting close.

iNecas commented 4 years ago

Btw. I've got a bit further with the CLI part, so the initial usage might be there by the time the parquet writer is finished.

iNecas commented 4 years ago

So I've reached the point of being able to run this thing end2end:

# Pull data for a specific metric from a StoreAPI (sidecar or store) and save into parquet
$ go run ./cmd/obslytics export --input-cfg='{"endpoint":"127.0.0.1:10901","tls_config":{"insecure_skip_verify":true}}'\
      --metric="net_conntrack_dialer_conn_attempted_total"\
      --resolution=1h --min-time="$(date -uI)T00:00:00Z" --max-time="$(date -uI)T23:59:59Z"\
      --out=net_conntrack_dialer_conn_attempted_total.parquet\
      --debug                                         
| dialer_name     instance        job         prometheus  _sample_start  _sample_end  _min_time  _max_time  _count  _sum    _min  _max  |
| default         localhost:9090  prometheus  prom-0      11:00:00       12:00:00     11:37:21   11:59:56   272     0       0     0     |
| prometheus      localhost:9090  prometheus  prom-0      11:00:00       12:00:00     11:37:21   11:59:56   272     37128   1     272   |
| thanos-query    localhost:9090  prometheus  prom-0      11:00:00       12:00:00     11:37:21   11:59:56   272     544     2     2     |
| thanos-receive  localhost:9090  prometheus  prom-0      11:00:00       12:00:00     11:37:21   11:59:56   272     110840  1     814   |
| thanos-sidecar  localhost:9090  prometheus  prom-0      11:00:00       12:00:00     11:37:21   11:59:56   272     272     1     1     |
| thanos-store    localhost:9090  prometheus  prom-0      11:00:00       12:00:00     11:37:21   11:59:56   272     271     0     1     |
| default         localhost:9090  prometheus  prom-0      12:00:00       13:00:00     12:00:01   12:34:11   411     0       0     0     |
| prometheus      localhost:9090  prometheus  prom-0      12:00:00       13:00:00     12:00:01   12:34:11   411     196458  273   683   |
| thanos-query    localhost:9090  prometheus  prom-0      12:00:00       13:00:00     12:00:01   12:34:11   411     822     2     2     |
| thanos-receive  localhost:9090  prometheus  prom-0      12:00:00       13:00:00     12:00:01   12:34:11   411     588552  817   2047  |
| thanos-sidecar  localhost:9090  prometheus  prom-0      12:00:00       13:00:00     12:00:01   12:34:11   411     411     1     1     |
| thanos-store    localhost:9090  prometheus  prom-0      12:00:00       13:00:00     12:00:01   12:34:11   411     411     1     1     |
level=info ts=2020-08-14T12:34:13.687660836Z caller=main.go:108 msg=exiting cmd=export

# Example of loading the parquet file from Python:
$ ipython -c 'import pandas as pd; pd.read_parquet("net_conntrack_dialer_conn_attempted_total.parquet")'
Out[1]: 
       dialer_name        instance         job prometheus       _sample_start         _sample_end           _min_time           _max_time  _count      _sum   _min    _max
0          default  localhost:9090  prometheus     prom-0 2020-08-14 11:00:00 2020-08-14 12:00:00 2020-08-14 11:37:21 2020-08-14 11:59:56     272       0.0    0.0     0.0
1       prometheus  localhost:9090  prometheus     prom-0 2020-08-14 11:00:00 2020-08-14 12:00:00 2020-08-14 11:37:21 2020-08-14 11:59:56     272   37128.0    1.0   272.0
2     thanos-query  localhost:9090  prometheus     prom-0 2020-08-14 11:00:00 2020-08-14 12:00:00 2020-08-14 11:37:21 2020-08-14 11:59:56     272     544.0    2.0     2.0
3   thanos-receive  localhost:9090  prometheus     prom-0 2020-08-14 11:00:00 2020-08-14 12:00:00 2020-08-14 11:37:21 2020-08-14 11:59:56     272  110840.0    1.0   814.0
4   thanos-sidecar  localhost:9090  prometheus     prom-0 2020-08-14 11:00:00 2020-08-14 12:00:00 2020-08-14 11:37:21 2020-08-14 11:59:56     272     272.0    1.0     1.0
5     thanos-store  localhost:9090  prometheus     prom-0 2020-08-14 11:00:00 2020-08-14 12:00:00 2020-08-14 11:37:21 2020-08-14 11:59:56     272     271.0    0.0     1.0
6          default  localhost:9090  prometheus     prom-0 2020-08-14 12:00:00 2020-08-14 13:00:00 2020-08-14 12:00:01 2020-08-14 12:34:11     411       0.0    0.0     0.0
7       prometheus  localhost:9090  prometheus     prom-0 2020-08-14 12:00:00 2020-08-14 13:00:00 2020-08-14 12:00:01 2020-08-14 12:34:11     411  196458.0  273.0   683.0
8     thanos-query  localhost:9090  prometheus     prom-0 2020-08-14 12:00:00 2020-08-14 13:00:00 2020-08-14 12:00:01 2020-08-14 12:34:11     411     822.0    2.0     2.0
9   thanos-receive  localhost:9090  prometheus     prom-0 2020-08-14 12:00:00 2020-08-14 13:00:00 2020-08-14 12:00:01 2020-08-14 12:34:11     411  588552.0  817.0  2047.0
10  thanos-sidecar  localhost:9090  prometheus     prom-0 2020-08-14 12:00:00 2020-08-14 13:00:00 2020-08-14 12:00:01 2020-08-14 12:34:11     411     411.0    1.0     1.0
11    thanos-store  localhost:9090  prometheus     prom-0 2020-08-14 12:00:00 2020-08-14 13:00:00 2020-08-14 12:00:01 2020-08-14 12:34:11     411     411.0    1.0     1.0

There is still a lot of room for improvement, but it should be enough to start playing with, use for some real use case, and enhance wherever needed. I would like to get this PR merged with a limited set of additional changes (that I might not have capacity for in the following weeks) and continue in further increments on top of the main branch.

iNecas commented 4 years ago

@bwplotka it should actually be flushing the data at each sample: the finalize is just for dealing with the rest of the data that has not reached the end of the next sample.
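As a rough illustration of that flush-then-finalize behaviour, here is a minimal Python sketch. It is not the obslytics aggregator: the `Window` class, the `aggregate` function, and the assumption that input arrives as time-sorted `(unix_seconds, value)` pairs are all hypothetical. The fields mirror the `_sample_start`, `_sample_end`, `_min_time`, `_max_time`, `_count`, `_sum`, `_min` and `_max` columns from the example output.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Tuple


@dataclass
class Window:
    """One aggregated row for a single resolution-sized sample window."""
    sample_start: int
    sample_end: int
    min_time: int
    max_time: int
    count: int
    value_sum: float
    value_min: float
    value_max: float


def aggregate(
    samples: Iterable[Tuple[int, float]], resolution: int
) -> Iterator[Window]:
    """Yield one Window per bucket; assumes samples are sorted by time."""
    current: Optional[Window] = None
    for ts, value in samples:
        start = ts - ts % resolution
        if current is not None and start != current.sample_start:
            # The sample belongs to a later window: flush the finished one.
            yield current
            current = None
        if current is None:
            current = Window(start, start + resolution, ts, ts, 1,
                             value, value, value)
        else:
            current.max_time = ts
            current.count += 1
            current.value_sum += value
            current.value_min = min(current.value_min, value)
            current.value_max = max(current.value_max, value)
    if current is not None:
        # Finalize: emit the last window, which never saw a later sample.
        yield current


# Two samples in the first hour, one in the second.
rows = list(aggregate([(0, 1.0), (10, 3.0), (3600, 2.0)], resolution=3600))
```

Here the first window is emitted as soon as the sample at 3600 arrives, while the second is emitted only by the final step, because no later sample ever closes it.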

bwplotka commented 4 years ago

Feel free @iNecas to create PRs without a fork (on an obslytics branch) - it might be easier for us to collaborate

thanos-community / obslytics

e2e read storeAPI -> aggregate -> write to parquet #3