influxdata / influxdb

Scalable datastore for metrics, events, and real-time analytics
https://influxdata.com
Apache License 2.0
28.97k stars 3.56k forks source link

Slow geo query #25272

Open tyren234 opened 2 months ago

tyren234 commented 2 months ago

Describe Development Experience Issue:

Hi, so I've noticed that spatial querying is extremely slow on my machine and was wondering if that's normal for Influx or am I'm doing something wrong.

I have a dataset of ships' positions from NOAA, one day of which I downloaded and saved in InfluxDB.

Unfortunately when querying the data I get results after extremely long time. I'd want to query data at level=24 and time range of 24h, but as things are now it takes 72s to query ships within a bounding box and a time range of 1 minute.

Steps to reproduce:

  1. Install InfluxDB 2.7.6 from Docker
  2. Download ships' positions from one day
  3. Write them to influx using python InfluxDBClient:
    write_api.write(bucket=bucket_name, org=organization_id,
                        record=df,
                        data_frame_measurement_name="vessels_ais_31_12",
                        data_frame_tag_columns=data_frame_tag_columns,
                        data_frame_timestamp_column="BaseDateTime",
                        )
  4. Create a new bucket with just spatial index and basic information:
    
    level = 24
    for start, end in generate_time_ranges_for_day(day, interval):
    flux_query = f"""
    import "experimental/geo"

from(bucket: "{raw_data_bucket}") |> range(start: {start}, stop: {end}) |> filter(fn: (r) => r._measurement == "vessels_ais_31_12") |> filter(fn: (r) => r._field == "LAT" or r._field == "LON") |> geo.shapeData(latField: "{lat_field_name}", lonField: "{lon_field_name}", level: {level}) |> to (bucket: "{indexed_data_bucket}", tagColumns: ["s2_cell_id", "MMSI"], fieldFn: (r) => ({{"lat": r.lat, "lon": r.lon}})) """

![Screenshot from 2024-08-26 21-18-24](https://github.com/user-attachments/assets/886d59e1-6ebe-4cba-a54a-e283e7e8c324)
5. Query the data:
```python
    region = {{
        minLat: {min_lat},
        maxLat: {max_lat},
        minLon: {min_lon},  
        maxLon: {max_lon},
    }}

    from(bucket: "{bucket}")
        |> range(start: {start_date}, stop: {stop_date})
        |> filter(fn: (r) => r._measurement == "vessels_ais_31_12")
        |> geo.filterRows(region: region, level: {level}, strict: {strict})
    """

Desired result:

Ideal

Returns result from 24h period on level 24 in less than 10 seconds.

Less ideal

Returns ideal result in less than 10 minutes.

Actual result:

Range of just one minute

On 5GB memory: Lvl 10 - 72s
Lvl 11 - 122s
Lvl 12 - 113s
Lvl 13 - 134s
Lvl 14 - 122s
Lvl 15 - 139s
Lvl 16 - 222s
Lvl 17 - 573s
Lvl 18 - timeout after 600 seconds

On 16GB memory: Lvl 10 - 109s
Lvl 11 - 109s
Lvl 12 - 104s
Lvl 13 - 105s
Lvl 14 - 110s
Lvl 15 - 131s
Lvl 16 - 220s
Lvl 17 - 536s
Lvl 18 - timeout after 600 seconds

Hardware Environment:

Operating System:

Laptop:

Ubuntu 22.04.4 LTS x86_64

Docker image:

Linux 36a415262a30 6.4.16-linuxkit #1 SMP PREEMPT_DYNAMIC Fri Nov 10 14:51:57 UTC 2023 x86_64 GNU/Linux

Code Editing Tool:

PyCharm and influx in browser control panel

Build Environment:

Docker. influx:latest. Influx version 2.7.6, CLI vesion 2.7.3.

tyren234 commented 2 months ago

Doesn't work on 2.7.10 either.