openaq / openaq-api-v2

OpenAQ API
https://api.openaq.org
43 stars 9 forks source link

Latest endpoint is not sorting by distance correctly #112

Closed caparker closed 2 years ago

caparker commented 2 years ago

When requesting data from the latest endpoint the results are returned in a seemingly arbitrary order

For example https://api.openaq.org/v2/locations?coordinates=45.512230,-122.658722&radius=2000&order_by=distance&limit=10&sort=asc

caparker commented 2 years ago

I think it boils down to the row_number window function shown here. This is the statement from the debugger, with some things we dont need commented out.

\set lon -122.658722
\set lat 45.512230
\set radius 5000
\set limit 10
\set offset 0
WITH t1 AS (
            SELECT
                id,
                name,
                --"sensorType",
                --entity,
                --"isMobile",
                --"isAnalysis",
                --city,
                --country,
                --sources,
                --manufacturers,
                --case WHEN "isMobile" then null else coordinates end as coordinates,
                --measurements,
                "firstUpdated",
                "lastUpdated",
                --json as "rawData",
                --geog,
                --bounds,
                --parameters,
                st_distance(st_makepoint(:lon,:lat)::geography, geog) as dist,
                row_number() over () as row
            FROM locations_base_v2
            WHERE
             st_dwithin(st_makepoint(:lon, :lat)::geography, geog, :radius)  AND  id not in (61485,61505,61506)
            --ORDER BY st_distance(st_makepoint(:lon,:lat)::geography, geog) desc nulls last
            LIMIT :limit
            OFFSET :offset
        ),
        nodes AS (
            SELECT count(distinct id) as nodes
            FROM locations_base_v2
            WHERE
             st_dwithin(st_makepoint(:lon, :lat)::geography, geog, :radius)  AND  id not in (61485,61505,61506)
        ),
        t2 AS (
        SELECT
        *
        --row,
        --jsonb_strip_nulls(
        --    to_jsonb(t1) - '{rawData,source_name,geog, row}'::text[]
        --) as json
        FROM t1
        --group by row, t1, json
        )
        SELECT *
        --nodes as count
        --, json
        FROM t2
        --, nodes
        ORDER BY row;

The row_number is not doing what is expected because the order is not explicitly provided to the window function This results in something that looks like this

  id   |                 name                  |      firstUpdated      |      lastUpdated       |     dist      | row 
--------+---------------------------------------+------------------------+------------------------+---------------+-----
   1064 | Portland - SE Lafaye                  | 2016-03-10 08:00:00+00 | 2022-06-27 14:00:00+00 | 4695.72085254 |   1
  69252 | Eastmoreland                          | 2021-01-19 23:59:05+00 | 2021-10-28 15:29:50+00 |  4732.6373918 |   9
  66710 | NE 54th Outdoor                       | 2021-01-20 00:00:09+00 | 2021-09-10 16:14:10+00 | 4671.45320776 |  11
  68138 | SEL                                   | 2021-01-19 23:59:08+00 | 2022-06-27 14:41:59+00 | 4692.63867283 |  17
  64146 | Bridlemile                            | 2021-01-20 00:00:13+00 | 2022-06-27 14:41:31+00 |  4985.0287048 |  25
 227197 | Eastmoreland                          | 2021-05-31 17:05:20+00 | 2022-06-27 14:41:21+00 | 4690.67850493 |  27
  64112 | NE Regents Drive                      | 2021-01-19 23:59:24+00 | 2022-06-27 14:42:53+00 | 4717.71691655 |  48
 235191 | 738 Sellwood Blvd, Portland, OR 97202 | 2021-09-08 18:33:22+00 | 2022-06-27 14:42:17+00 | 4695.70631895 |  62
 270939 | Eastmoreland - Tolman                 | 2022-01-20 16:28:37+00 | 2022-06-27 14:41:35+00 | 4736.50913972 |  73
 229739 | Fiver                                 | 2021-07-17 23:14:12+00 | 2022-06-27 14:41:32+00 | 4845.62890587 |  74
(10 rows)

Where the row_numbers are obviously not what we want. Removing the order by row will fix this

  id   |         name         |      firstUpdated      |      lastUpdated       |     dist      | row 
--------+----------------------+------------------------+------------------------+---------------+-----
   1064 | Portland - SE Lafaye | 2016-03-10 08:00:00+00 | 2022-06-27 14:00:00+00 | 4695.72085254 |   1
 277836 | Ladd's Addition      | 2022-01-23 17:04:59+00 | 2022-06-27 08:32:33+00 |  765.36603846 |   2
 230547 | Patton Rd            | 2021-07-30 19:20:01+00 | 2022-05-30 23:26:11+00 | 4275.30362346 |   3
  69822 | Ardea Apts 4th floor | 2021-01-19 23:59:24+00 | 2022-05-25 16:34:37+00 | 2006.54977501 |   4
 232023 | Home                 | 2021-08-12 23:07:05+00 | 2022-03-25 21:30:59+00 | 4009.31860293 |   5
  73435 | Healy Heights        | 2021-01-19 23:58:56+00 | 2022-03-24 21:31:43+00 | 3973.01820498 |   6
 229251 | 2871 SE Division St  | 2021-07-09 15:55:44+00 | 2022-03-07 15:27:03+00 | 1941.87789715 |   7
  64106 | NW PDX               | 2021-01-19 23:59:54+00 | 2021-12-15 19:27:20+00 | 4215.64193465 |   8
  69252 | Eastmoreland         | 2021-01-19 23:59:05+00 | 2021-10-28 15:29:50+00 |  4732.6373918 |   9
 235562 | NE 54th (Inside)     | 2021-09-11 21:42:07+00 | 2021-09-12 00:13:38+00 | 4665.55044115 |  10
(10 rows)
caparker commented 2 years ago

Removing the order by row does not actually solve the issue because of the json work that is commented out in the statement above. When you look at the results of the full query, again, with some fields commented out, you can see that the json aggregator sorts by the first key

 count |                                                                    json                                                                     
-------+---------------------------------------------------------------------------------------------------------------------------------------------
    18 | {"id": 67465, "name": "PSU_GBRL_Home", "lastUpdated": "2021-12-08T20:22:05+00:00", "firstUpdated": "2021-01-19T23:59:20+00:00"}
    18 | {"id": 70408, "name": "24th and Sherman", "lastUpdated": "2022-06-27T14:42:24+00:00", "firstUpdated": "2021-01-19T23:58:28+00:00"}
    18 | {"id": 218844, "name": "Strand East", "lastUpdated": "2022-06-27T14:41:29+00:00", "firstUpdated": "2021-01-28T18:33:18+00:00"}
    18 | {"id": 224865, "name": "Harrison West 26th floor", "lastUpdated": "2021-04-20T19:07:00+00:00", "firstUpdated": "2021-04-20T18:35:00+00:00"}
    18 | {"id": 233545, "name": "Portland SE 12th and", "lastUpdated": "2022-06-27T14:00:00+00:00", "firstUpdated": "2021-08-26T21:00:00+00:00"}
    18 | {"id": 273423, "name": "Grant and 24th", "lastUpdated": "2022-06-27T14:41:56+00:00", "firstUpdated": "2022-01-20T16:32:17+00:00"}
    18 | {"id": 277490, "name": "Habitatgarden", "lastUpdated": "2022-06-27T14:41:56+00:00", "firstUpdated": "2022-01-20T16:29:39+00:00"}
    18 | {"id": 277836, "name": "Ladd's Addition", "lastUpdated": "2022-06-27T08:32:33+00:00", "firstUpdated": "2022-01-23T17:04:59+00:00"}
    18 | {"id": 290941, "name": "Kat", "lastUpdated": "2022-06-27T14:43:02+00:00", "firstUpdated": "2022-02-18T23:50:09+00:00"}
    18 | {"id": 303816, "name": "SE Harrison & 27th", "lastUpdated": "2022-06-27T14:42:29+00:00", "firstUpdated": "2022-03-27T20:15:14+00:00"}

So in order to fix this so that we can still use the json aggregating method we will need to add the order by statement to the window function as well. For example

row_number() over (ORDER BY st_distance(st_makepoint(:lon,:lat)::geography, geog) asc nulls last) as row

This ensures that the data is sorted by distance and that we can recover this sorting after doing the aggregating later in the query.