influxdata / influxdb

Scalable datastore for metrics, events, and real-time analytics
https://influxdata.com
Apache License 2.0

Very slow Influxql performance (Influxdb 2.6) #24150

Open harmv opened 1 year ago

harmv commented 1 year ago

description

We are evaluating a migration from InfluxDB 1.8 to 2.6, and we noticed that some very simple queries became very slow and trigger very high CPU load. This seems to be a performance regression, and I suspect it is a bug.

The queries that are failing are exactly the type of queries that InfluxDB is designed for: querying data in a certain (small) time window. Such queries should be very fast.

I reproduced the problem on a small database. It's a database collecting p1 measurements from a device (~1 insert per second, 1 device).

I query for the latest data (all data for the last 5 seconds). Using InfluxQL I get very bad performance.

Please advise.

Refiling

I reported this performance issue on the Influx Community site here, but from the replies I gathered that it is not the proper place to report such problems. Therefore I am refiling the same issue here. Let me know if this is not the proper location to file such issues.

Steps to reproduce

InfluxQL query (5 s range):

select * from p1_values where time > '2023-03-15T15:10:25+00:00' and time <= '2023-03-15T15:10:30+00:00';

results

1 query took 0.124614s (returned 5 rows). This query was run on localhost, so there is no network overhead. Running it periodically (2 times per second, over a persistent connection to the database) triggers a significant CPU load on the database server (CPU consumed by the influxd process).

Expected behavior

I'd expect such a query to finish in the 1–2 ms range, not > 120 ms. On InfluxDB 1.8 similar queries take < 1 ms on a much larger database (~200 GB).

database

small database

# du -sh /var/lib/influxdb/engine/data/
2.2G    /var/lib/influxdb/engine/data/

I have a low cardinality dataset.

import "influxdata/influxdb"

influxdb.cardinality(bucket: "energy", start: -1y)

gives

Result: _result
Table: keys: []
                _value:int
--------------------------
                         4
show tag keys

┏━━━━━━━┳━━━━━━━━┓
┃ index ┃ tagKey ┃
┣━━━━━━━╋━━━━━━━━┫
┃      1┃region  ┃
┃      2┃sensor  ┃
┣━━━━━━━┻━━━━━━━━┫
┃    2 Columns, 2┃
┃  Rows, Page 1/1┃
┃      Table 1/1,┃
┃   Statement 1/1┃
┗━━━━━━━━━━━━━━━━┛
show field keys

Name: p1_values
┏━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ index ┃                fieldKey                ┃ fieldType ┃
┣━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━┫
┃      1┃CURRENT_ELECTRICITY_DELIVERY            ┃float      ┃
┃      2┃CURRENT_ELECTRICITY_USAGE               ┃float      ┃
┃      3┃DEVICE_TYPE                             ┃float      ┃
┃      4┃ELECTRICITY_DELIVERED_TARIFF_1          ┃float      ┃
┃      5┃ELECTRICITY_DELIVERED_TARIFF_2          ┃float      ┃
┃      6┃ELECTRICITY_USED_TARIFF_1               ┃float      ┃
┃      7┃ELECTRICITY_USED_TARIFF_2               ┃float      ┃
┃      8┃HOURLY_GAS_METER_READING                ┃float      ┃
┃      9┃INSTANTANEOUS_ACTIVE_POWER_L1_NEGATIVE  ┃float      ┃
┃     10┃INSTANTANEOUS_ACTIVE_POWER_L1_POSITIVE  ┃float      ┃
┃     11┃INSTANTANEOUS_ACTIVE_POWER_L2_NEGATIVE  ┃float      ┃
┃     12┃INSTANTANEOUS_ACTIVE_POWER_L2_POSITIVE  ┃float      ┃
┃     13┃INSTANTANEOUS_ACTIVE_POWER_L3_NEGATIVE  ┃float      ┃
┃     14┃INSTANTANEOUS_ACTIVE_POWER_L3_POSITIVE  ┃float      ┃
┃     15┃INSTANTANEOUS_CURRENT_L1                ┃float      ┃
┃     16┃INSTANTANEOUS_CURRENT_L2                ┃float      ┃
┃     17┃INSTANTANEOUS_CURRENT_L3                ┃float      ┃
┃     18┃INSTANTANEOUS_VOLTAGE_L1                ┃float      ┃
┃     19┃INSTANTANEOUS_VOLTAGE_L2                ┃float      ┃
┃     20┃INSTANTANEOUS_VOLTAGE_L3                ┃float      ┃
┃     21┃LONG_POWER_FAILURE_COUNT                ┃float      ┃
┃     22┃SHORT_POWER_FAILURE_COUNT               ┃float      ┃
┃     23┃VOLTAGE_SAG_L1_COUNT                    ┃float      ┃
┃     24┃VOLTAGE_SAG_L2_COUNT                    ┃float      ┃
┃     25┃VOLTAGE_SAG_L3_COUNT                    ┃float      ┃
┃     26┃VOLTAGE_SWELL_L1_COUNT                  ┃float      ┃
┃     27┃VOLTAGE_SWELL_L2_COUNT                  ┃float      ┃
┃     28┃VOLTAGE_SWELL_L3_COUNT                  ┃float      ┃
┣━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━┫
┃                                3 Columns, 28 Rows, Page 1/1┃
┃                                    Table 1/1, Statement 1/1┃

build-tsi

I've tried rebuilding the indexes using influxd inspect build-tsi. This had no effect (positive or negative) on performance.

version

InfluxDB v2.6.1 (git: 9dcf880fe0) build_date: 2022-12-29T15:53:07Z, on Ubuntu 20.04.6 LTS

Queries are executed on localhost from a Python script, using the client influxdb==5.3.1 (from influxdb import InfluxDBClient).

Unfortunately, running the query in the CLI (influx v1 shell) does not show how long the query takes (it shows the result but not the duration), so I couldn't test without a Python client. On InfluxDB 1.8 our Python clients perform just fine (as mentioned above, < 1 ms query times).
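
For reference, the timing loop looks roughly like this. This is a minimal sketch, not the actual script from the thread; the host, the v1 credentials/DBRP mapping, and the relative time window are assumptions.

import time
from influxdb import InfluxDBClient  # influxdb==5.3.1, talking to the 1.x compatibility API

# Assumes a v1 authorization (or DBRP mapping) exists that maps database
# "energy" onto the energy bucket.
client = InfluxDBClient(host="localhost", port=8086,
                        username="v1user", password="v1password",
                        database="energy")

while True:
    t0 = time.monotonic()
    result = client.query("select * from p1_values "
                          "where time > now() - 5s and time <= now()")
    rows = sum(1 for _ in result.get_points())
    print(f"influxql query took {time.monotonic() - t0:.6f}s ({rows} rows)")
    time.sleep(2.5)  # the query is fired every 2.5 s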

Performance

Find attached the generated profiles: iostat.txt, profiles.tar.gz

What is further interesting to see is that the CPU load triggered by this particular query behaves like a sawtooth over a 24 h window.

(attached image: CPU load graph of the influxd process showing the sawtooth pattern)

NB: the CPU graph is from a dedicated test system that only runs this particular query periodically (plus the 1 insert/sec write load), set up to investigate this problem.

jeffreyssmith2nd commented 1 year ago

A couple of questions: does the equivalent Flux query show the same slowness, and how does this machine compare in size to the 1.8 machine? For reference, your InfluxQL query would translate to

from(bucket: "energy")
  |> range(start: 2023-03-15T15:10:25Z, stop: 2023-03-15T15:10:30Z)
  |> filter(fn: (r) => r._measurement == "p1_values")

(note, start is inclusive and stop is exclusive so you might see a slightly different set of points)

harmv commented 1 year ago

flux <-> influxql

Flux seems to perform just as badly.

Running these queries (1 Flux and 1 InfluxQL) every 2.5 s results in a 20% CPU load:

influxql query took 0.35062s (4 rows)
flux query took 0.433571s (28 measurements, each 5 records)
influxql query took 0.234293s (4 rows)
flux query took 0.310905s (28 measurements, each 4 records)
influxql query took 0.202471s (5 rows)
flux query took 0.358151s (28 measurements, each 4 records)
influxql query took 0.149741s (5 rows)
flux query took 0.215236s (28 measurements, each 5 records)

These queries look up data from the last 5 seconds (start: now - 5s, end: now), so every query differs from the previous one. Each runs once every 2.5 seconds.
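
The Flux side of the test could be timed along these lines. This is a minimal sketch under assumptions: the thread does not show the actual script, and the influxdb-client package, token, and org below are placeholders.

import time
from influxdb_client import InfluxDBClient  # assumed 2.x Python client

FLUX = '''
from(bucket: "energy")
  |> range(start: -5s)
  |> filter(fn: (r) => r._measurement == "p1_values")
'''

client = InfluxDBClient(url="http://localhost:8086", token="MY_TOKEN", org="MY_ORG")
query_api = client.query_api()

t0 = time.monotonic()
tables = query_api.query(FLUX)  # returns one table per field, 28 with this schema
records = sum(len(t.records) for t in tables)
print(f"flux query took {time.monotonic() - t0:.6f}s "
      f"({len(tables)} tables, {records} records)")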

When I stop the script, the server goes back to 99% idle (just doing its 1 insert/sec).

Expected behaviour

size machine

No, the machines are not comparable (this particular VPS is smaller: a Linode 2 GB VPS with 1 CPU), so you can't compare them directly. However... the order of magnitude seems wrong.

I started these tests on this VPS just to isolate and analyze this particular performance issue. I'm running a typical query for a time-series database, a query that InfluxDB is specifically designed for. Queries fired every 2.5 s (which is a low frequency) should not trigger a noticeable load (on any database) IMHO.

Or is this too small a machine to test on (2 GB, 1 CPU)? The test database is pretty small.

# free
              total        used        free      shared  buff/cache   available
Mem:        2005820      555780      241380      493596     1208660      780448
Swap:        524284      399836      124448

Did the iostat.txt, profiles.tar.gz give any insights?

[EDIT] Added the section below as my clock just passed 19:00–19:15 local time (unsure exactly when)

Saw tooth

There seems to be something weird going on: the performance varies over time (see the CPU graph in my initial comment).

After 19:00–19:15 local time, performance seems OK:

influxql query took 0.012164s (5 rows)
flux query took 0.0946983s (28 measurements, each 5 records)
influxql query took 0.0105179s (4 rows)
flux query took 0.0594368s (28 measurements, each 4 records)
influxql query took 0.00680626s (5 rows)
flux query took 0.0719373s (28 measurements, each 5 records)

These are the times I'd expect. If they were consistent, there would be no problem. However...

The queries become slower and slower, until around ~18:00 local time the next day they are at their worst (see the 200–400 ms response times above).

Note: the machine is doing nothing else but testing InfluxDB. When the machine consumes CPU, it is all in the influxd process (according to top).

Is there anything in InfluxDB that would explain this 24 h periodicity? Thinking out loud: some rollover to a new data file in the storage? I do see a matching timestamp in the data directory:

ls -l /var/lib/influxdb/engine/data/97b908d30850795b/autogen/
       ... <snipped>
drwxr-x--- 3 influxdb influxdb 4096 Mar 13 04:10 169
drwxr-x--- 3 influxdb influxdb 4096 Mar 20 04:10 186
drwxr-x--- 3 influxdb influxdb 4096 Mar 20 18:13 203           <---- that's UTC, i.e. 19:00/19:15 local time

Side observation (not related to the reported issue): the InfluxQL queries seem faster in this set than the (same) Flux queries. Is that expected?

bucket info

root@linode1:~# influx bucket list -o XXX
ID          Name        Retention   Shard group duration    Organization ID     Schema Type
3e1763f95061aed5    _monitoring 168h0m0s    24h0m0s         6144dcaee51b2d34    implicit
15879a95c8811b45    _tasks      72h0m0s     24h0m0s         6144dcaee51b2d34    implicit
97b908d30850795b    energy      8760h0m0s   168h0m0s        6144dcaee51b2d34    implicit
3f234d72a5f484cd    energy_day  infinite    168h0m0s        6144dcaee51b2d34    implicit
2fd1ca03bef2561a    energy_hour infinite    168h0m0s        6144dcaee51b2d34    implicit

energy is the bucket that I query in my tests.

jeffreyssmith2nd commented 1 year ago

Thanks for calling out the profiles, I missed those the first time. I assume those profiles were taken while seeing the slow queries?

The profile shows most of the time spent in Deduplicate. Are you frequently writing points that overlap in time? That would explain the sawtooth nature: if you're hitting a retention boundary, queries would be fast at the start and then slow down as more and more duplicates pile in.

harmv commented 1 year ago

Yes, those profiles were generated while seeing the problem.

Deduplicate? No, I don't expect any duplicates at all. The p1 device reports one measurement per second (~one POST /write?db=energy per second).

It sends measurements over HTTP without a timestamp, so influxd attaches a timestamp itself. Hence there cannot be any duplicates or overlapping points, as the measurements have 1 s gaps. Right?

Is the fact that influxd creates the timestamp maybe causing/triggering this problem?

According to the (1.8) docs it should be allowed: "Timestamp: Supplied at the end of the line in Unix time in nanoseconds since January 1, 1970 UTC - is optional. If you do not specify a timestamp, InfluxDB uses the server’s local nanosecond timestamp in Unix epoch. Time in InfluxDB is in UTC format by default."

EDIT: influxd creating the timestamp does not cause/trigger this problem. When I add a timestamp myself (instead of letting influxd do it), performance stays the same.

Note: there is a low influx of data; there is just this single (test) device, firing writes at 1 Hz.

See the attached nginx logging (with POST bodies; no timestamps in the requests); a minimal sketch of such a writer follows the log excerpt:

 - energy [22/Mar/2023:07:52:05 +0000] "POST /write?db=energy HTTP/1.1" 204 0 "-" "python-requests/2.28.1" 
p1_values,region=brabant,sensor=337891 CURRENT_ELECTRICITY_DELIVERY=0.0,CURRENT_ELECTRICITY_USAGE=0.376,DEVICE_TYPE=3.0,ELECTRICITY_DELIVERED_TARIFF_1=0.0,ELECTRICITY_DELIVERED_TARIFF_2=0.0,ELECTRICITY_USED_TARIFF_1=7153.472,ELECTRICITY_USED_TARIFF_2=6722.854,HOURLY_GAS_METER_READING=2319.383,INSTANTANEOUS_ACTIVE_POWER_L1_NEGATIVE=0.0,INSTANTANEOUS_ACTIVE_POWER_L1_POSITIVE=0.157,INSTANTANEOUS_ACTIVE_POWER_L2_NEGATIVE=0.0,INSTANTANEOUS_ACTIVE_POWER_L2_POSITIVE=0.103,INSTANTANEOUS_ACTIVE_POWER_L3_NEGATIVE=0.0,INSTANTANEOUS_ACTIVE_POWER_L3_POSITIVE=0.115,INSTANTANEOUS_CURRENT_L1=1.0,INSTANTANEOUS_CURRENT_L2=0.0,INSTANTANEOUS_CURRENT_L3=0.0,INSTANTANEOUS_VOLTAGE_L1=227.0,INSTANTANEOUS_VOLTAGE_L2=227.0,INSTANTANEOUS_VOLTAGE_L3=226.0,LONG_POWER_FAILURE_COUNT=99.0,SHORT_POWER_FAILURE_COUNT=56.0,VOLTAGE_SAG_L1_COUNT=22.0,VOLTAGE_SAG_L2_COUNT=17.0,VOLTAGE_SAG_L3_COUNT=64.0,VOLTAGE_SWELL_L1_COUNT=24.0,VOLTAGE_SWELL_L2_COUNT=27.0,VOLTAGE_SWELL_L3_COUNT=37.0\x0A

 - energy [22/Mar/2023:07:52:06 +0000] "POST /write?db=energy HTTP/1.1" 204 0 "-" "python-requests/2.28.1" 
p1_values,region=brabant,sensor=337891 CURRENT_ELECTRICITY_DELIVERY=0.0,CURRENT_ELECTRICITY_USAGE=0.377,DEVICE_TYPE=3.0,ELECTRICITY_DELIVERED_TARIFF_1=0.0,ELECTRICITY_DELIVERED_TARIFF_2=0.0,ELECTRICITY_USED_TARIFF_1=7153.472,ELECTRICITY_USED_TARIFF_2=6722.855,HOURLY_GAS_METER_READING=2319.383,INSTANTANEOUS_ACTIVE_POWER_L1_NEGATIVE=0.0,INSTANTANEOUS_ACTIVE_POWER_L1_POSITIVE=0.156,INSTANTANEOUS_ACTIVE_POWER_L2_NEGATIVE=0.0,INSTANTANEOUS_ACTIVE_POWER_L2_POSITIVE=0.107,INSTANTANEOUS_ACTIVE_POWER_L3_NEGATIVE=0.0,INSTANTANEOUS_ACTIVE_POWER_L3_POSITIVE=0.114,INSTANTANEOUS_CURRENT_L1=1.0,INSTANTANEOUS_CURRENT_L2=0.0,INSTANTANEOUS_CURRENT_L3=0.0,INSTANTANEOUS_VOLTAGE_L1=227.0,INSTANTANEOUS_VOLTAGE_L2=227.0,INSTANTANEOUS_VOLTAGE_L3=227.0,LONG_POWER_FAILURE_COUNT=99.0,SHORT_POWER_FAILURE_COUNT=56.0,VOLTAGE_SAG_L1_COUNT=22.0,VOLTAGE_SAG_L2_COUNT=17.0,VOLTAGE_SAG_L3_COUNT=64.0,VOLTAGE_SWELL_L1_COUNT=24.0,VOLTAGE_SWELL_L2_COUNT=27.0,VOLTAGE_SWELL_L3_COUNT=37.0\x0A
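
The writer implied by this log is presumably something like the sketch below: line protocol POSTed once per second to the 1.x compatibility /write endpoint, without a timestamp. The URL, tag values, and fields are placeholders taken from the log; the actual script is not in the thread.

import time
import requests

WRITE_URL = "http://localhost:8086/write?db=energy"  # placeholder; auth omitted

def write_point(fields):
    # Build one line-protocol point without a timestamp, so influxd assigns
    # the server time itself (as described above).
    field_str = ",".join(f"{k}={v}" for k, v in fields.items())
    line = f"p1_values,region=brabant,sensor=337891 {field_str}"
    resp = requests.post(WRITE_URL, data=line)
    resp.raise_for_status()  # the log shows 204 responses

while True:
    # In reality all 28 fields come from the p1 meter; two are shown here.
    write_point({"CURRENT_ELECTRICITY_USAGE": 0.376, "INSTANTANEOUS_VOLTAGE_L1": 227.0})
    time.sleep(1)  # ~1 insert per second, per the issue
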
harmv commented 1 year ago

NB: the 24 h periodicity is not accurate; it is actually more like an ~18 h periodicity. (I misread the graphs.)

harmv commented 1 year ago

For comparison I started up an instance of TimescaleDB (on the same machine), loaded it with the same historical dataset, inserted the same data (1 insert/sec), and periodically fired the same (well, similar... it's SQL) query.

All running simultaneously shows that:

I do expect that we are looking at some low-hanging performance issue/bug on the InfluxDB side here.

NB: all tests are run simultaneously, on the same machine.

@jeffreyssmith2nd: Let me know what I can test further. Would additional profile runs help?

harmv commented 1 year ago

Upgraded today. Same perf on 2.7.0, no improvements.

[edit]: same on 2.7.1-1