taosdata / TDengine

High-performance, scalable time-series database designed for Industrial IoT (IIoT) scenarios
https://tdengine.com
GNU Affero General Public License v3.0
23.37k stars 4.86k forks source link

Prometheus remote write #17038

Closed LeoDemon closed 5 months ago

LeoDemon commented 2 years ago

Bug Description

Prometheus remote write error after insert 1 million records and high machine usage(cpu/memory) of taosd/taosadapter process.

To Reproduce

  1. deploy taosd on hostA and startup.
  2. deploy taosadapter on hostA and startup.
  3. prometheus and some exporters (node_exporter, cadvisor, etc.) have been running on hostB for several weeks (via docker container), and config prometheus storage.tsdb.retention.time=1d.
  4. modify prometheus config for remote read/write:
    
    remote_write:
    - url: "http://hostA:6041/prometheus/v1/remote_write/prometheus_data"
    basic_auth:
      username: root
      password: taosdata

remote_read:

Server is Community Edition.

taos> show databases; name |

information_schema | performance_schema | demo_dev | prometheus_data | Query OK, 4 rows in database (0.001345s)

taos> use prometheus_data; Database changed.

taos> show stables; stable_name |

metrics | Query OK, 1 rows in database (0.001225s)

taos> select count() from metrics; count() |

           1159251 |

Query OK, 1 rows in database (0.081566s)

taos>

```shell
[root@1012708 taos]# ll
total 12
drwxr-xr-x 2 root root 4096 Sep 22 15:55 dnode
drwxr-xr-x 6 root root 4096 Sep 23 11:52 mnode
drwxr-xr-x 6 root root 4096 Sep 22 16:24 vnode
[root@1012708 taos]# du -sh
18M     .
[root@1012708 taos]# 
  1. then remote write error, see logs below:
    
    # prometheus error logs:

... ... level=info ts=2022-09-22T08:24:31.912Z caller=main.go:818 msg="TSDB started" level=info ts=2022-09-22T08:24:31.912Z caller=main.go:944 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml ts=2022-09-22T08:24:31.913Z caller=dedupe.go:112 component=remote level=info remote_name=ad36f9 url=http://hostA:6041/prometheus/v1/remote_write/prometheus_data msg="Starting WAL watcher" queue=ad36f9 ts=2022-09-22T08:24:31.913Z caller=dedupe.go:112 component=remote level=info remote_name=ad36f9 url=http://hostA:6041/prometheus/v1/remote_write/prometheus_data msg="Starting scraped metadata watcher" ts=2022-09-22T08:24:31.913Z caller=dedupe.go:112 component=remote level=info remote_name=ad36f9 url=http://hostA:6041/prometheus/v1/remote_write/prometheus_data msg="Replaying WAL" queue=ad36f9 level=info ts=2022-09-22T08:24:31.924Z caller=main.go:975 msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml totalDuration=12.841996ms remote_storage=553.029µs web_handler=440ns query_engine=744ns scrape=259.402µs scrape_sd=125.461µs notify=24.508µs notify_sd=14.147µs rules=10.946317ms level=info ts=2022-09-22T08:24:31.924Z caller=main.go:767 msg="Server is ready to receive web requests." ts=2022-09-22T08:24:36.997Z caller=dedupe.go:112 component=remote level=info remote_name=ad36f9 url=http://hostA:6041/prometheus/v1/remote_write/prometheus_data msg="Done replaying WAL" duration=5.083726968s level=info ts=2022-09-22T09:00:01.069Z caller=compact.go:507 component=tsdb msg="write block" mint=1663826400000 maxt=1663833600000 ulid=01GDJ6SRGFZB9Y769JBP371XXN duration=158.068357ms level=info ts=2022-09-22T09:00:01.076Z caller=db.go:1230 component=tsdb msg="Deleting obsolete block" block=01GDFV8SV6K0YX9AW4GGKR0CBQ level=info ts=2022-09-22T09:00:01.084Z caller=head.go:880 component=tsdb msg="Head GC completed" duration=8.251081ms level=info ts=2022-09-22T09:00:01.097Z caller=checkpoint.go:95 component=tsdb msg="Creating checkpoint" from_segment=282 to_segment=286 mint=1663833600000 level=info ts=2022-09-22T09:00:01.201Z caller=head.go:977 component=tsdb msg="WAL checkpoint complete" first=282 last=286 duration=103.818061ms ts=2022-09-22T09:06:23.119Z caller=dedupe.go:112 component=remote level=warn remote_name=ad36f9 url=http://hostA:6041/prometheus/v1/remote_write/prometheus_data msg="Failed to send batch, retrying" err="Post \"http://hostA:6041/prometheus/v1/remote_write/prometheus_data\": context deadline exceeded" ts=2022-09-22T09:07:23.282Z caller=dedupe.go:112 component=remote level=warn remote_name=ad36f9 url=http://hostA:6041/prometheus/v1/remote_write/prometheus_data msg="Failed to send batch, retrying" err="Post \"http://hostA:6041/prometheus/v1/remote_write/prometheus_data\": context deadline exceeded" ts=2022-09-22T09:08:23.484Z caller=dedupe.go:112 component=remote level=warn remote_name=ad36f9 url=http://hostA:6041/prometheus/v1/remote_write/prometheus_data msg="Failed to send batch, retrying" err="Post \"http://hostA:6041/prometheus/v1/remote_write/prometheus_data\": context deadline exceeded" ts=2022-09-22T09:09:23.686Z caller=dedupe.go:112 component=remote level=warn remote_name=ad36f9 url=http://hostA:6041/prometheus/v1/remote_write/prometheus_data msg="Failed to send batch, retrying" err="Post \"http://hostA:6041/prometheus/v1/remote_write/prometheus_data\": context deadline exceeded" .... .... ....

taosadapter error logs:

.... .... 09/22 17:05:20.514168 00032544 TAOS_ADAPTER info "| 202 | 294.24148ms | 172.16.0.171 | POST | /prometheus/v1/remote_read/prometheus_data " sessionID=16988 model=web 09/22 17:05:20.801576 00032544 TAOS_ADAPTER info "| 202 | 286.039756ms | 172.16.0.171 | POST | /prometheus/v1/remote_read/prometheus_data " model=web sessionID=16989 09/22 17:05:32.068881 00032544 TAOS_ADAPTER error "

^[[31m2022/09/22 17:05:32 write tcp 172.16.0.217:6041->172.16.0.171:32802: write: broken pipe POST /prometheus/v1/remote_read/prometheus_data HTTP/1.1^M Host: hostA:6041^M Accept-Encoding: snappy^M Authorization: ^M Content-Encoding: snappy^M Content-Length: 165^M Content-Type: application/x-protobuf^M User-Agent: Prometheus/2.26.0^M X-Prometheus-Remote-Read-Version: 0.1.0^M ^M ^[[0m " model=web id=16990 09/22 17:05:32.070565 00032544 TAOS_ADAPTER info "| 202 | 11.26762798s | 172.16.0.171 | POST | /prometheus/v1/remote_read/prometheus_data " model=web sessionID=16990 09/22 17:05:32.071968 00032544 TAOS_ADAPTER info "| 202 | 410.972µs | 172.16.0.171 | POST | /prometheus/v1/remote_write/prometheus_data " model=web sessionID=16996 09/22 17:05:38.092067 00032544 TAOS_ADAPTER info "| 202 | 16.381650639s | 172.16.0.171 | POST | /prometheus/v1/remote_write/prometheus_data " model=web sessionID=16991 ... ... 09/22 17:35:31.917105 00032544 TAOS_ADAPTER info "| 202 | 302.912µs | 172.16.0.171 | POST | /prometheus/v1/remote_write/prometheus_data " model=web sessionID=18421 09/22 17:36:31.917862 00032544 TAOS_ADAPTER info "| 202 | 337.44µs | 172.16.0.171 | POST | /prometheus/v1/remote_write/prometheus_data " model=web sessionID=18466 09/22 17:37:31.916853 00032544 TAOS_ADAPTER info "| 202 | 238.442µs | 172.16.0.171 | POST | /prometheus/v1/remote_write/prometheus_data " model=web sessionID=18512 09/22 17:38:31.108564 00032544 TAOS_ADAPTER error "select metrics.,tbname from metrics where ts >= '2022-09-22T08:56:01.175Z' and ts <= '2022-09-22T09:06:01.175Z' and labels->'instance' match '$server:.*' and labels->'name' = 'node_load1' and labels->'cluster' = 'executor-server-test'" model=prometheus error=[0x2501] scheduler status error 09/22 17:38:31.917356 00032544 TAOS_ADAPTER info "| 202 | 555.607µs | 172.16.0.171 | POST | /prometheus/v1/remote_write/prometheus_data " model=web sessionID=18557 09/22 17:38:37.282855 00032544 TAOS_ADAPTER error "

^[[31m2022/09/22 17:38:37 write tcp 172.16.0.217:6041->172.16.0.171:34974: write: broken pipe POST /prometheus/v1/remote_read/prometheus_data HTTP/1.1^M Host: hostA:6041^M Accept-Encoding: snappy^M Authorization: *^M Content-Encoding: snappy^M Content-Length: 143^M Content-Type: application/x-protobuf^M User-Agent: Prometheus/2.26.0^M X-Prometheus-Remote-Read-Version: 0.1.0^M ^M ^[[0m " model=web id=17067 09/22 17:38:37.282896 00032544 TAOS_ADAPTER info "| 202 | 32m44.712724231s | 172.16.0.171 | POST | /prometheus/v1/remote_read/prometheus_data " model=web sessionID=17067 09/22 17:38:38.075700 00032544 TAOS_ADAPTER info "| 202 | 31m44.892738996s | 172.16.0.171 | POST | /prometheus/v1/remote_write/prometheus_data " model=web sessionID=17117 09/22 17:38:38.076164 00032544 TAOS_ADAPTER error "

^[[31m2022/09/22 17:38:38 write tcp 172.16.0.217:6041->172.16.0.171:34964: write: broken pipe POST /prometheus/v1/remote_read/prometheus_data HTTP/1.1^M Host: hostA:6041^M Accept-Encoding: snappy^M Authorization: ^M Content-Encoding: snappy^M Content-Length: 142^M Content-Type: application/x-protobuf^M User-Agent: Prometheus/2.26.0^M X-Prometheus-Remote-Read-Version: 0.1.0^M ^M ... ... 09/22 17:40:59.072895 00032544 TAOS_ADAPTER info "| 202 | 25.977438038s | 172.16.0.171 | POST | /prometheus/v1/remote_write/prometheus_data " model=web sessionID=18653 09/22 17:42:51.083694 00032544 TAOS_ADAPTER error "select metrics.,tbname from metrics where ts >= '2022-09-22T03:07:48.518Z' and ts <= '2022-09-22T09:07:48.518Z' and labels->'fstype' != '' and labels->'job' = 'node-exporter' and labels->'name' = 'node_filesystem_readonly' and labels->'cluster' = 'executor-server-test'" model=prometheus error=[0x2501] scheduler status error 09/22 17:42:51.083788 00032544 TAOS_ADAPTER info "| 400 | 30m17.933627094s | 172.16.0.171 | POST | /prometheus/v1/remote_read/prometheus_data " model=web sessionID=17375 09/22 17:42:51.084049 00032544 TAOS_ADAPTER error "select metrics.,tbname from metrics where ts >= '2022-09-22T09:03:18.808Z' and ts <= '2022-09-22T09:08:18.808Z' and labels->'name' != '' and labels->'name' = 'container_spec_memory_limit_bytes' and labels->'cluster' = 'executor-server-test'" model=prometheus error=[0x2501] scheduler status error 09/22 17:42:52.064444 00032544 TAOS_ADAPTER info "| 400 | 33m49.474018172s | 172.16.0.171 | POST | /prometheus/v1/remote_read/prometheus_data " model=web sessionID=17213 09/22 17:58:07.105048 00032544 TAOS_ADAPTER error "select metrics.,tbname from metrics where ts >= '2022-09-22T09:02:54.146Z' and ts <= '2022-09-22T09:07:54.146Z' and labels->'name' = 'up' and labels->'cluster' = 'executor-server-test'" model=prometheus error=[0x2501] scheduler status error 09/22 17:58:07.105135 00032544 TAOS_ADAPTER info "| 400 | 50m2.939550977s | 172.16.0.171 | POST | /prometheus/v1/remote_read/prometheus_data " model=web sessionID=17170 09/22 17:58:07.105340 00032544 TAOS_ADAPTER error "select metrics.*,tbname from metrics where ts >= '2022-09-22T09:07:18.808Z' and ts <= '2022-09-22T09:12:18.808Z' and labels->'name' != '' and labels->'name' = 'container_cpu_cfs_throttled_periods_total' and labels->'cluster' = 'executor-server-test'" model=prometheus error=[0x2501] scheduler status error 09/22 17:58:08.067337 00032544 TAOS_ADAPTER info "| 400 | 44m55.448291031s | 172.16.0.171 | POST | /prometheus/v1/remote_read/prometheus_data " model=web sessionID=17403 ... ... ...


**Expected Behavior**
continuous insertion 

**Screenshots**
![](https://github.com/LeoDemon/imgbox/blob/main/taosd-and-taosadapter.png?raw=true)

**Environment:**
 - OS:  CentOS Linux release 7.5.1804 (Core) , Linux 3.10.0-862.el7.x86_64
 - Intel(R) Xeon(R) Gold 6278C CPU @ 2.60GHz  ,  8C 16 G
 - taosd community version: 3.0.1.1 compatible_version: 3.0.0.0
  ```shell
  [root@1012708 taos]# taosadapter --version
  taosadapter v_branch_HEAD-71e7ccf
  [root@1012708 taos]# 

Additional Context 2022-09-22 logs of taosd not found, I don't know why.

[root@1012708 taos]# pwd
/var/log/taos
[root@1012708 taos]# 
[root@1012708 taos]# ll
total 652432
-rw-r--r-- 1 root root   3416031 Sep 22 19:40 taosadapter_2022_09_22_00_00.log
-rw-r--r-- 1 root root      1065 Sep 23 09:00 taosadapter_2022_09_23_00_00.log
-rwxrwxrwx 1 root root 651403287 Sep 23 11:52 taosdlog.0
-rwxrwxrwx 1 root root   1385292 Sep 23 17:11 taosdlog.1
-rwxrwxrwx 1 root root  11815977 Sep 23 16:58 taoslog0.0
-rwxrwxrwx 1 root root      7498 Sep 22 16:57 taoslog1.0
-rwxrwxrwx 1 root root     31480 Sep 23 11:52 udfdlog.0
[root@1012708 taos]# head taosdlog.0
09/23 06:35:03.825698 00031917 MND queries updated in conn 2431803480, num:0
09/23 06:35:03.854233 00031917 MND queries updated in conn 1115368117, num:0
09/23 06:35:03.854263 00031917 MND stb:1.prometheus_data.metrics, start to retrieve meta
09/23 06:35:03.854276 00031917 MND queries updated in conn 2340104777, num:0
09/23 06:35:03.854291 00031917 MND queries updated in conn 2642094640, num:0
09/23 06:35:03.854306 00031917 MND queries updated in conn 2994416165, num:0
09/23 06:35:03.854320 00031917 MND queries updated in conn 3397069456, num:0
09/23 06:35:03.854346 00031917 MND queries updated in conn 3682282182, num:0
09/23 06:35:03.854361 00031917 MND queries updated in conn 4001049336, num:0
09/23 06:35:03.854376 00031917 MND queries updated in conn 41626126, num:0
[root@1012708 taos]# 
yu285 commented 5 months ago

The bugs in earlier versions of 3.0 have been fixed, but 3.0.2.5 May not be compatible with the current version (3.3.0.0), so it is not possible to upgrade directly. Therefore, it is recommended to redeploy the latest 3.3.0.0, the subsequent upgrade is very convenient, directly install the latest version of the software.

Range of open source version of the operating system support: https://docs.taosdata.com/reference/support-platform/ The bugs from earlier versions of 3.0 have been fixed, but 3.0.2.5 May not be compatible with the current version (3.3.0.0), so it is not possible to upgrade directly. Therefore, it is recommended to redeploy the latest 3.3.0.0, the subsequent upgrade is very convenient, directly install the latest version of the software.

Range of open source version of the operating system support: https://docs.taosdata.com/reference/support-platform/