prometheus / prometheus

The Prometheus monitoring system and time series database.
https://prometheus.io/
Apache License 2.0

prometheus read remote influxdb Invalid after a period of time #4739

Closed chenxu1990 closed 4 years ago

chenxu1990 commented 5 years ago


Bug Report

What did you do? Two Prometheus servers write data to InfluxDB. Another Prometheus server reads data back from InfluxDB through InfluxDB's API, and Grafana generates charts from it. What did you expect to see? Prometheus can read the data from InfluxDB.

What did you see instead? Under which circumstances? It worked properly at first, but after some hours it could no longer see newly added data. If you restart the reading Prometheus, it works again.

Environment centos 7

```yaml
# Alertmanager configuration
alerting:
  alertmanagers:

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']

  # Override the global default and scrape targets from this job every 30 seconds.
  - job_name: 'pushgateway'
    scrape_interval: 30s
    static_configs:
      - targets: ['172.16.68.221:9091']
        labels:
          group: 'pushgateway'

  - job_name: 'ecs_group'
    scrape_interval: 30s
    file_sd_configs:
      - refresh_interval: 1m
        files:
          - ./conf.d/*.json

  - job_name: 'speech_gw'
    metrics_path: '/debug/metrics'
    scrape_interval: 1m
    file_sd_configs:
      - refresh_interval: 1m
        files:
          - ./conf.d/sc-speech-gw.yml

remote_write:
  - url: "http://influxdb_adapter:9201/write"

remote_read:
```
simonpasquier commented 5 years ago

Please share the logs of the Prometheus server. Anything relevant in the InfluxDB logs?

chenxu1990 commented 5 years ago

@simonpasquier Hi Simon, there are no obvious errors in the logs, and the Grafana chart is below (screenshot attached). As the picture shows, Prometheus cannot get the data after 20:00. If I restart it, the chart is OK again.

simonpasquier commented 5 years ago

Try running with --log.level=debug. You can also take a look at the net_conntrack*{dialer_name="remote_storage"} metrics.

chenxu1990 commented 5 years ago

@simonpasquier logs:

```
pro_prometheus-front.1.2tw5pr40pqrz@rokid-ops-1.hz.rokid.com | level=info ts=2018-10-15T08:26:15.762344509Z caller=main.go:523 msg="Server is ready to receive web requests."
pro_prometheus-front.1.2tw5pr40pqrz@rokid-ops-1.hz.rokid.com | level=debug ts=2018-10-15T08:26:15.762769295Z caller=manager.go:183 component="discovery manager notify" msg="discoverer exited" provider=string/0
pro_prometheus-front.1.2tw5pr40pqrz@rokid-ops-1.hz.rokid.com | level=info ts=2018-10-15T11:27:14.608373524Z caller=compact.go:398 component=tsdb msg="write block" mint=1539590400000 maxt=1539597600000 ulid=01CSVQNS4D6G07DPAJNW4VCJE4
pro_prometheus-front.1.2tw5pr40pqrz@rokid-ops-1.hz.rokid.com | level=info ts=2018-10-15T11:27:14.613941143Z caller=head.go:446 component=tsdb msg="head GC completed" duration=1.73509ms
```

`net_conntrack*{dialer_name="remote_storage"}` returns no data

simonpasquier commented 5 years ago

`net_conntrack*{dialer_name="remote_storage"}` returns no data

Try `{__name__=~"net_conntrack.+",dialer_name="remote_storage"}` instead.
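For reference, that selector can also be run as an instant query against the Prometheus HTTP API. A minimal sketch using only the Python standard library (the `localhost:9090` address is an assumption taken from the scrape config earlier in the thread; adjust to your server):

```python
from urllib.parse import urlencode

# The regex-based selector suggested above; it matches every net_conntrack_*
# series labelled with the remote-storage dialer.
query = '{__name__=~"net_conntrack.+",dialer_name="remote_storage"}'

# Build the instant-query URL for the Prometheus HTTP API.
# (localhost:9090 is assumed; substitute your own server address.)
url = "http://localhost:9090/api/v1/query?" + urlencode({"query": query})
print(url)
```

Fetching that URL (e.g. with `curl` or `urllib.request`) returns JSON; an empty `data.result` list confirms the "no data" observation above.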

chenxu1990 commented 5 years ago

(screenshot attached)

@simonpasquier

chenxu1990 commented 5 years ago

Hi Simon, I checked the InfluxDB logs:

```
172.18.0.12,172.16.68.221 - - [15/Oct/2018:21:30:52 +0800] "POST /query?db=prometheus&epoch=ms&params=%7B%7D&q=SELECT+value+FROM+%22autogen%22.%2F%5Enet_conntrack.%2B%24%2F+WHERE+%22dialer_name%22+%3D+%27remote_storage%27+AND+time+%3E%3D+1539566700000ms+AND+time+%3C%3D+1539604800000ms+GROUP+BY+%2A HTTP/1.1" 200 4285 "-" "InfluxDBClient" 8950769b-d07e-11e8-ba06-000000000000 22427
```

I queried the data at 21:30:52, but Prometheus cut off the query window at 20:00 (1539604800000). There are other similar log lines; the last query time always stops at 20:00... @simonpasquier
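Decoding that request makes the cutoff visible. A small sketch using only the Python standard library (the encoded query and timestamps are taken verbatim from the log line above):

```python
from urllib.parse import unquote_plus
from datetime import datetime, timezone, timedelta

# The URL-encoded `q` parameter from the InfluxDB access log above.
q = ("SELECT+value+FROM+%22autogen%22.%2F%5Enet_conntrack.%2B%24%2F"
     "+WHERE+%22dialer_name%22+%3D+%27remote_storage%27"
     "+AND+time+%3E%3D+1539566700000ms+AND+time+%3C%3D+1539604800000ms+GROUP+BY+%2A")
print(unquote_plus(q))
# SELECT value FROM "autogen"./^net_conntrack.+$/ WHERE "dialer_name" = 'remote_storage'
#   AND time >= 1539566700000ms AND time <= 1539604800000ms GROUP BY *  (one line)

# The upper bound of that window is exactly 20:00 in the reporter's UTC+8 zone:
end = datetime.fromtimestamp(1539604800000 / 1000, tz=timezone.utc)
print(end)                                           # 2018-10-15 12:00:00+00:00
print(end.astimezone(timezone(timedelta(hours=8))))  # 2018-10-15 20:00:00+08:00
```

So the query itself is well-formed; the problem is that the reading Prometheus stops moving the upper bound of the time range forward.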

simonpasquier commented 5 years ago

You may need to tweak the read_recent flag.

https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_read
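For illustration only, the flag sits under the `remote_read` entry. The URL below is an assumption (it mirrors the `remote_write` adapter address from the config earlier in the thread; the actual `remote_read` URL was not shown in the issue):

```yaml
remote_read:
  - url: "http://influxdb_adapter:9201/read"  # assumed adapter read endpoint
    read_recent: true  # also forward queries for recent data to remote storage
```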

chenxu1990 commented 5 years ago

The parameter means to query data from remote storage each time, but I don't have local storage... What is the impact of this parameter? I am a bit confused. Thanks :)

simonpasquier commented 5 years ago

Which flags do you use to start Prometheus?

chenxu1990 commented 5 years ago

@simonpasquier

  - '--config.file=/etc/prometheus/prometheus.yml'
  - '--web.enable-lifecycle'
  - '--log.level=debug'
simonpasquier commented 5 years ago

but I don't have local storage

There's always local storage.

chenxu1990 commented 5 years ago

You mean Prometheus gets the data and caches it in memory? But when I refresh Grafana, I always see a read request to InfluxDB, and the time range of the fetched data is not correct, just like the log above. @simonpasquier

simonpasquier commented 5 years ago

Can you confirm that you use the native InfluxDB remote read endpoint? Can you check that all clocks are synchronized? Have you tried setting read_recent to true? When I say that there's always local storage, it means that Prometheus will always write the samples to its local storage even when remote write/read is used.

chenxu1990 commented 5 years ago

I set read_recent to true and the problem has been solved, thank you. But I am wondering why this problem does not occur when remote_read and remote_write are on one machine? @simonpasquier

simonpasquier commented 5 years ago

Can you check that all clocks are synchronized?

chenxu1990 commented 5 years ago

All clocks are synchronized, but they are in different time zones. One writing Prometheus is in the UTC time zone and the others are in the UTC+8 time zone. @simonpasquier

simonpasquier commented 5 years ago

It shouldn't matter for Prometheus as all times are converted to UTC. I can't say for InfluxDB.
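A quick sanity check of that claim with standard-library Python (the two zones below are the ones mentioned in this thread):

```python
from datetime import datetime, timezone, timedelta

# One instant, expressed in the two zones used by the reporter's servers.
utc_zone = timezone.utc
cst_zone = timezone(timedelta(hours=8))  # UTC+8

t_utc = datetime(2018, 10, 15, 12, 0, tzinfo=utc_zone)
t_cst = t_utc.astimezone(cst_zone)  # same instant, rendered as 20:00 local

# Prometheus exchanges epoch timestamps, which are zone-independent:
assert t_utc.timestamp() == t_cst.timestamp() == 1539604800.0
```

Since both remote write and remote read transmit epoch milliseconds, wall-clock time zones cannot by themselves shift the data; only a clock that is actually wrong (unsynchronized) would.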

chenxu1990 commented 5 years ago

It shouldn't matter for InfluxDB either, because writing to InfluxDB works fine. The error is that the time period for fetching data is incorrect when remote_read and remote_write are assigned to different machines.

ghost commented 5 years ago

I have the same problem.

tehlers320 commented 5 years ago

Seeing this problem with Prometheus v2.12.0 and v2.11.2 with InfluxDB 1.6.6. There's nothing in the logs when this bad state occurs, even with debug logging enabled.

I've set the following on the latest attempt to "fix it". Should I raise the retention? It seems to fail right after the 6 hours are up:

            "--storage.tsdb.path=/prometheus",
            "--web.console.libraries=/usr/share/prometheus/console_libraries",
            "--web.console.templates=/usr/share/prometheus/consoles",
            "--storage.tsdb.allow-overlapping-blocks",
            "--storage.tsdb.retention.time=6h",
            "--storage.tsdb.no-lockfile",
            "--storage.tsdb.retention.size=5GB",
            "--query.max-samples=50000000",
            "--query.max-concurrency=20",
            "--query.timeout=2m",
            "--query.lookback-delta=5m",
            "--storage.remote.read-concurrent-limit=10",
            "--storage.remote.read-sample-limit=5e7",
            "--storage.remote.flush-deadline=5s",
            "--web.max-connections=512",
            "--web.read-timeout=5m",
            "--log.level=debug"

edit: I have not set read_recent=true; that does not seem like an elegant solution, and it sounds like it will cause writing to disk.

brian-brazil commented 4 years ago

We've looked at this as part of our bug scrub, and this appears to be a support request that doesn't indicate any particular bug in Prometheus.

If you've further questions they'd be best asked on the prometheus-users mailing list

tehlers320 commented 4 years ago

@brian-brazil InfluxDB 1.7.x is particularly sensitive to Prometheus versions. I've found rolling all the way back to Prometheus 2.4.3 is the most stable combination for Influx/Prometheus, and the libraries in InfluxDB reflect that as well; they haven't been updated in the 1.x line in some time.

Use 2.4.3 or below.

ctmuthu commented 4 years ago
(screenshot attached)

@tehlers320 I'm unable to fetch the measurements from InfluxDB to prometheus.

Prometheus config file:

```yaml
global:
  scrape_interval: 15s      # Set the scrape interval to every 15 seconds.
  evaluation_interval: 15s  # Evaluate rules every 15 seconds.
  scrape_timeout: 15s       # scrape_timeout is set to the global default (10s).

remote_read:
```

Can someone help me here or point out the mistake I'm making?

tehlers320 commented 4 years ago

Add read_recent: true, and add &rp=autogen (or whatever your retention policy is) to the end of the URL. Influx at some point made this required on the API and didn't document it.
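As a sketch, the resulting entry would look like the fragment below. The host, database, and retention-policy names are placeholders (assumptions); substitute your own:

```yaml
remote_read:
  - url: "http://localhost:8086/api/v1/prom/read?db=prometheus&rp=autogen"
    read_recent: true
```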

ctmuthu commented 4 years ago

@tehlers320 I updated the configuration:

```yaml
global:
  scrape_interval: 15s      # Set the scrape interval to every 15 seconds.
  evaluation_interval: 15s  # Evaluate rules every 15 seconds.
  scrape_timeout: 15s       # scrape_timeout is set to the global default (10s).

remote_read:
```

Still it doesn't seem to work.

ctmuthu commented 4 years ago

Dear @simonpasquier @brian-brazil @tehlers320 @chenxu1990 ,

Should I use some additional Prometheus exporter along with these configurations? Has anyone recently imported measurements from an InfluxDB database into Prometheus using remote_read or by some other means? This info would be really helpful to me. Thanks in advance.

aqua-terra commented 4 years ago

@ctmuthu were you able to find a resolution to this issue? I'm having the same problem. Thanks.

ctmuthu commented 4 years ago

Dear @aqua-terra,

Use this configuration for InfuxDB

```toml
bind-address = ":8088"

[meta]
  dir = "/var/lib/influxdb/meta"
  retention-autocreate = true
  logging-enabled = true

[data]
  dir = "/var/lib/influxdb/data"
  engine = "tsm1"
  wal-dir = "/var/lib/influxdb/wal"
  cache-max-memory-size = "4g"

[http]
  enabled = true
  bind-address = ":8086"
  auth-enabled = false
  log-enabled = true
  write-tracing = false
  pprof-enabled = false
  https-enabled = false
  max-row-limit = 10000
  realm = "InfluxDB"

[retention]
  enabled = true
  check-interval = "30m"

[subscriber]
  enabled = true
  http-timeout = "30s"

[continuous_queries]
  log-enabled = true
  enabled = true
```

It should work fine.


aqua-terra commented 4 years ago

@ctmuthu still doesn't work for me. I'm seeing the read request coming through in InfluxDB, but nothing is returned to Prometheus. I'm using Grafana to read from the Prometheus data source. Can you share your remote read configuration if it has changed since the last time you posted it? Thanks.

```
[httpd] 10.3.98.160 - admin [05/Jun/2020:16:25:01 +0000] "POST /api/v1/prom/read?db=prometheus HTTP/1.1" 200 4 "-" "Prometheus/2.18.1" 1ad1911f-a749-11ea-84ce-2265cdc67e1e 148
```

aqua-terra commented 4 years ago

here's my influxdb config:

```toml
reporting-disabled = false
bind-address = ":8088"

[meta]
  dir = "/var/lib/influxdb/meta"
  retention-autocreate = true
  logging-enabled = true

[data]
  dir = "/var/lib/influxdb/data"
  index-version = "inmem"
  wal-dir = "/var/lib/influxdb/wal"
  wal-fsync-delay = "0s"
  validate-keys = false
  query-log-enabled = true
  cache-max-memory-size = 1073741824
  cache-snapshot-memory-size = 26214400
  cache-snapshot-write-cold-duration = "10m0s"
  compact-full-write-cold-duration = "4h0m0s"
  compact-throughput = 50331648
  compact-throughput-burst = 50331648
  max-series-per-database = 1000000
  max-values-per-tag = 100000
  max-concurrent-compactions = 0
  max-index-log-file-size = 1048576
  series-id-set-cache-size = 100
  trace-logging-enabled = false
  tsm-use-madv-willneed = false

[coordinator]
  write-timeout = "10s"
  max-concurrent-queries = 0
  query-timeout = "0s"
  log-queries-after = "0s"
  max-select-point = 0
  max-select-series = 0
  max-select-buckets = 0

[retention]
  enabled = true
  check-interval = "30m0s"

[shard-precreation]
  enabled = true
  check-interval = "10m0s"
  advance-period = "30m0s"

[monitor]
  store-enabled = true
  store-database = "_internal"
  store-interval = "10s"

[subscriber]
  enabled = true
  http-timeout = "30s"
  insecure-skip-verify = false
  ca-certs = ""
  write-concurrency = 40
  write-buffer-size = 1000

[http]
  enabled = true
  bind-address = ":8086"
  auth-enabled = false
  log-enabled = true
  suppress-write-log = false
  write-tracing = false
  flux-enabled = false
  flux-log-enabled = false
  pprof-enabled = false
  pprof-auth-enabled = false
  debug-pprof-enabled = false
  ping-auth-enabled = false
  https-enabled = false
  https-certificate = "/etc/ssl/influxdb.pem"
  https-private-key = ""
  max-row-limit = 10000
  max-connection-limit = 0
  shared-secret = ""
  realm = "InfluxDB"
  unix-socket-enabled = false
  unix-socket-permissions = "0777"
  bind-socket = "/var/run/influxdb.sock"
  max-body-size = 25000000
  access-log-path = ""
  max-concurrent-write-limit = 0
  max-enqueued-write-limit = 0
  enqueued-write-timeout = 30000000000

[logging]
  format = "auto"
  level = "info"
  suppress-logo = false

[[graphite]]
  enabled = false
  bind-address = ":2003"
  database = "graphite"
  retention-policy = ""
  protocol = "tcp"
  batch-size = 5000
  batch-pending = 10
  batch-timeout = "1s"
  consistency-level = "one"
  separator = "."
  udp-read-buffer = 0

[[collectd]]
  enabled = false
  bind-address = ":25826"
  database = "collectd"
  retention-policy = ""
  batch-size = 5000
  batch-pending = 10
  batch-timeout = "10s"
  read-buffer = 0
  typesdb = "/usr/share/collectd/types.db"
  security-level = "none"
  auth-file = "/etc/collectd/auth_file"
  parse-multivalue-plugin = "split"

[[opentsdb]]
  enabled = false
  bind-address = ":4242"
  database = "opentsdb"
  retention-policy = ""
  consistency-level = "one"
  tls-enabled = false
  certificate = "/etc/ssl/influxdb.pem"
  batch-size = 1000
  batch-pending = 5
  batch-timeout = "1s"
  log-point-errors = true

[[udp]]
  enabled = false
  bind-address = ":8089"
  database = "udp"
  retention-policy = ""
  batch-size = 5000
  batch-pending = 10
  read-buffer = 0
  batch-timeout = "1s"
  precision = ""

[continuous_queries]
  log-enabled = true
  enabled = true
  query-stats-enabled = false
  run-interval = "1s"

[tls]
  min-version = ""
  max-version = ""
```

Kampe commented 4 years ago

Seeing some of the same issues here. How do you specify the specific measurement to use within the database you supply for Prometheus to read?

aqua-terra commented 4 years ago

@Kampe I gave up on this issue and ended up just using InfluxDB data source directly in Grafana instead of going through prometheus remote read.

ctmuthu commented 4 years ago

Dear All,

Prometheus Config:

```yaml
# Global config
global:
  scrape_interval: 15s      # Set the scrape interval to every 15 seconds.
  evaluation_interval: 15s  # Evaluate rules every 15 seconds.
  scrape_timeout: 15s       # scrape_timeout is set to the global default (10s).

remote_read:
```

InfluxDB Configuration:

```toml
bind-address = ":8088"

[meta]
  dir = "/var/lib/influxdb/meta"
  retention-autocreate = true
  logging-enabled = true

[data]
  dir = "/var/lib/influxdb/data"
  engine = "tsm1"
  wal-dir = "/var/lib/influxdb/wal"
  cache-max-memory-size = "4g"
  max-series-per-database = 0

[http]
  enabled = true
  bind-address = ":8086"
  auth-enabled = false
  log-enabled = true
  write-tracing = false
  pprof-enabled = false
  https-enabled = false
  max-row-limit = 10000
  realm = "InfluxDB"

[retention]
  enabled = true
  check-interval = "30m"

[subscriber]
  enabled = true
  http-timeout = "30s"

[continuous_queries]
  log-enabled = true
  enabled = true
```

I've been using this config for the last couple of months and have had no trouble. Try this config and report back here.


ctmuthu commented 4 years ago

Hello all,

If nothing works, I can debug it together with you over the weekend.
