isamplesorg / isamples_inabox

Provides functionality intermediate to a collection and central

API endpoint for h3 match counting #237

Closed datadavev closed 1 year ago

datadavev commented 1 year ago

We need an API that can return the number of documents given one or more H3 values.

Something like:

counts_by_h3_indices(h3s: set[str], q: str = "*:*") -> dict

The returned dict has the provided h3 values as keys; each value is the number of documents that match both the query q and that h3 value.

Note that it will be necessary to compute the appropriate h3 field by getting the resolution of each h3 value (https://uber.github.io/h3-py/api_reference.html#h3.get_resolution).

Not sure if there's a way to do this in solr without simply issuing a query per h3 value.
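The resolution lookup mentioned above can be done without a per-cell library call: the resolution occupies bits 52–55 of the 64-bit H3 index, so it can be extracted directly (equivalent to h3-py's `h3.get_resolution`). A minimal sketch of grouping requested cells by their Solr field; the `counts_by_h3_indices`-style field template is taken from the queries in this issue, the helper names are hypothetical:

```python
# Group H3 values by resolution so each set can be counted against the
# matching Solr field (e.g. producedBy_samplingSite_location_h3_1).
from collections import defaultdict

FIELD_TEMPLATE = "producedBy_samplingSite_location_h3_{res}"

def h3_resolution(h3_index: str) -> int:
    # Resolution is stored in bits 52-55 of the 64-bit H3 index;
    # same result as h3.get_resolution() from h3-py.
    return (int(h3_index, 16) >> 52) & 0xF

def fields_by_resolution(h3s: set[str]) -> dict[str, list[str]]:
    groups: dict[str, list[str]] = defaultdict(list)
    for h3_index in h3s:
        field = FIELD_TEMPLATE.format(res=h3_resolution(h3_index))
        groups[field].append(h3_index)
    return dict(groups)
```

For example, `fields_by_resolution({"8126fffffffffff", "8366f5fffffffff"})` puts the first cell under the `_h3_1` field and the second under `_h3_3`.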

datadavev commented 1 year ago

For a single resolution, we can use result grouping. e.g., given two resolution=1 H3s:

h3s = [
  "8126fffffffffff",
  "8149bffffffffff"
]
q = "*:*"

request:

params = {
  "q":"producedBy_samplingSite_location_h3_1:(8126fffffffffff 8149bffffffffff)",
  "fl":"id",
  "group":"true",
  "group.limit":0,
  "group.field"="producedBy_samplingSite_location_h3_1"
}

response:

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":83,
    "params":{
      "q":"producedBy_samplingSite_location_h3_1:(8126fffffffffff 8149bffffffffff)",
      "fl":"id",
      "group.limit":"0",
      "q.op":"OR",
      "group.field":"producedBy_samplingSite_location_h3_1",
      "_":"1669838164153",
      "group":"true"}},
  "grouped":{
    "producedBy_samplingSite_location_h3_1":{
      "matches":6978,
      "groups":[{
          "groupValue":"8126fffffffffff",
          "doclist":{"numFound":3292,"start":0,"numFoundExact":true,"docs":[]
          }},
        {
          "groupValue":"8149bffffffffff",
          "doclist":{"numFound":3686,"start":0,"numFoundExact":true,"docs":[]
          }}]}}}
datadavev commented 1 year ago

It does not appear that multiple resolutions can be handled without at least one query per resolution.

I'm changing the requirements to support a set of H3 values at a single resolution rather than mixed resolutions.

datadavev commented 1 year ago

The above can also be done, perhaps more efficiently, by faceting:

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":2,
    "params":{
      "q":"producedBy_samplingSite_location_h3_1:(8126fffffffffff 8149bffffffffff)",
      "facet.field":"producedBy_samplingSite_location_h3_1",
      "fl":"id",
      "q.op":"OR",
      "facet.mincount":"1",
      "rows":"0",
      "facet":"true",
      "_":"1669838164153"}},
  "response":{"numFound":6978,"start":0,"numFoundExact":true,"docs":[]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "producedBy_samplingSite_location_h3_1":[
        "8149bffffffffff",3686,
        "8126fffffffffff",3292]},
    "facet_ranges":{},
    "facet_intervals":{},
    "facet_heatmaps":{}}}
datadavev commented 1 year ago

Counts can be made for multiple resolutions in a single request; however, the number of facets in the response can be quite large, depending on the resolutions requested. For two at res=1 and two at res=3:

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":4,
    "params":{
      "q":"producedBy_samplingSite_location_h3_1:(8126fffffffffff 8149bffffffffff) OR producedBy_samplingSite_location_h3_3:(8366f5fffffffff 8366f5fffffffff)",
      "facet.field":["producedBy_samplingSite_location_h3_1",
        "producedBy_samplingSite_location_h3_3"],
      "fl":"id",
      "q.op":"OR",
      "facet.mincount":"1",
      "rows":"0",
      "facet":"true"}},
  "response":{"numFound":7064,"start":0,"numFoundExact":true,"docs":[]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "producedBy_samplingSite_location_h3_1":[
        "8149bffffffffff",3686,
        "8126fffffffffff",3292,
        "8166fffffffffff",86],
      "producedBy_samplingSite_location_h3_3":[
        "83498bfffffffff",1036,
        "834990fffffffff",688,
        "8326eefffffffff",543,
        "8326e8fffffffff",451,
        "8326e9fffffffff",341,
        "83499cfffffffff",311,
        "8326ebfffffffff",289,
        "8326edfffffffff",216,
        "83499dfffffffff",213,
        "834449fffffffff",204,
        "8326c1fffffffff",175,
        "834999fffffffff",151,
        "8326cafffffffff",143,
        "834991fffffffff",142,
        "8326ccfffffffff",123,
        "8326dcfffffffff",110,
        "8326cdfffffffff",107,
        "8326c3fffffffff",106,
        "834996fffffffff",104,
        "834982fffffffff",97,
        "8349b2fffffffff",95,
        "8326c2fffffffff",87,
        "8349b6fffffffff",86,
        "8366f5fffffffff",86,
        "8326c9fffffffff",82,
        "834995fffffffff",82,
        "834983fffffffff",74,
        "8326c8fffffffff",65,
        "834988fffffffff",63,
        "834994fffffffff",54,
        "83498afffffffff",53,
        "834986fffffffff",47,
        "8326defffffffff",44,
        "83498efffffffff",44,
        "8349adfffffffff",40,
        "834998fffffffff",33,
        "83499afffffffff",28,
        "83489bfffffffff",26,
        "8349a2fffffffff",26,
        "8349a3fffffffff",24,
        "8349b3fffffffff",23,
        "8326c6fffffffff",22,
        "8326cbfffffffff",20,
        "8326e3fffffffff",20,
        "83499efffffffff",20,
        "8349b0fffffffff",19,
        "8326d1fffffffff",18,
        "83499bfffffffff",18,
        "8348a4fffffffff",15,
        "834989fffffffff",15,
        "836d36fffffffff",14,
        "8349a8fffffffff",13,
        "8326cefffffffff",12,
        "8326e5fffffffff",12,
        "8326ecfffffffff",12,
        "8326e2fffffffff",11,
        "834980fffffffff",11,
        "8326ddfffffffff",10,
        "834985fffffffff",10,
        "8326d0fffffffff",9,
        "8349b5fffffffff",9,
        "8349abfffffffff",8,
        "8349aefffffffff",7,
        "8349b1fffffffff",7,
        "8326d9fffffffff",6,
        "8326d5fffffffff",5,
        "8326e4fffffffff",5,
        "8326c5fffffffff",4,
        "8326d3fffffffff",4,
        "8326e0fffffffff",3,
        "8326d6fffffffff",2,
        "8326d8fffffffff",2,
        "83498dfffffffff",2,
        "8349aafffffffff",2,
        "8326c4fffffffff",1,
        "8326d2fffffffff",1,
        "8326f2fffffffff",1,
        "834984fffffffff",1,
        "83498cfffffffff",1]},
    "facet_ranges":{},
    "facet_intervals":{},
    "facet_heatmaps":{}}}
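Building the request for such a multi-resolution facet query can be mechanical: one OR'd clause in `q` and one `facet.field` per resolution. A sketch with a hypothetical helper name; note that since faceting returns every bucket in the field, the caller still has to filter the response back down to the requested cells (which is exactly why the response above gets so large):

```python
# Build Solr params for a single multi-resolution facet request:
# one q clause per field, OR'd together, and one facet.field each.
def facet_params(h3s_by_field: dict[str, list[str]], q: str = "*:*") -> dict:
    clauses = [
        "{}:({})".format(field, " ".join(cells))
        for field, cells in sorted(h3s_by_field.items())
    ]
    return {
        "q": "({}) AND ({})".format(q, " OR ".join(clauses)),
        "facet": "true",
        "facet.field": sorted(h3s_by_field),  # repeated param, one per field
        "facet.mincount": "1",
        "rows": "0",
    }
```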
datadavev commented 1 year ago

Streaming expressions seem to offer the best solution, e.g. two res=1, seven res=3:

list(
  facet(
    isb_core_records,
    q="producedBy_samplingSite_location_h3_1:(8126fffffffffff 8149bffffffffff)",
    buckets="producedBy_samplingSite_location_h3_1",
    count(*)
  ),
  facet(
    isb_core_records,
    q="producedBy_samplingSite_location_h3_3:(8366c0fffffffff 8366c1fffffffff 8366c2fffffffff 8366c3fffffffff 8366c4fffffffff 8366c5fffffffff 8366c6fffffffff)",
    buckets="producedBy_samplingSite_location_h3_3",
    count(*)
  )
)

http://localhost:8984/solr/isb_core_records/stream?expr=list... result:

{
  "result-set": {
    "docs": [
      {
        "producedBy_samplingSite_location_h3_1": "8149bffffffffff",
        "count(*)": 3686
      },
      {
        "producedBy_samplingSite_location_h3_1": "8126fffffffffff",
        "count(*)": 3292
      },
      {
        "producedBy_samplingSite_location_h3_3": "8366c6fffffffff",
        "count(*)": 50
      },
      {
        "producedBy_samplingSite_location_h3_3": "8366c3fffffffff",
        "count(*)": 42
      },
      {
        "producedBy_samplingSite_location_h3_3": "8366c0fffffffff",
        "count(*)": 40
      },
      {
        "producedBy_samplingSite_location_h3_3": "8366c2fffffffff",
        "count(*)": 26
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 58
      }
    ]
  }
}
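Composing the `list(facet(...), ...)` expression from a resolutions-to-cells mapping is straightforward string assembly; a minimal sketch (helper name hypothetical, collection name from the example above):

```python
# Compose the list(facet(...), ...) streaming expression shown above,
# one facet() stream source per resolution field.
def stream_expression(h3s_by_field: dict[str, list[str]],
                      collection: str = "isb_core_records") -> str:
    facets = [
        'facet({},q="{}:({})",buckets="{}",count(*))'.format(
            collection, field, " ".join(cells), field)
        for field, cells in sorted(h3s_by_field.items())
    ]
    return "list({})".format(",".join(facets))
```

The resulting string would be URL-encoded and passed as the `expr` parameter to the `/stream` handler.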
datadavev commented 1 year ago

Here's a rough prototype: https://github.com/datadavev/seeh3

The main method to look at is main.get_h3_grid, which, given a bounding box and optional resolution and query, computes cell counts for that resolution and returns the h3 cells as geojson polygon features with the counts attached. It was all hacked together quickly while I was looking into different approaches, so there's extra stuff in there that isn't needed.

The script should be runnable locally, but you will need to make solr available on 8984, e.g. by tunneling to mars.

If resolution isn't provided, a rough estimate is made based on the longitudinal spread of the bounding box; that choice should probably really be made by the client.
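One purely illustrative way to make such an estimate (not the prototype's actual code): H3 cell edge length shrinks by roughly a factor of sqrt(7) per resolution step, so pick the resolution whose cell width gives about a target number of cells across the box. The constants here are rough assumptions:

```python
import math

RES0_DEG = 20.0    # rough angular width of a res-0 cell (assumption)
TARGET_CELLS = 40  # desired cell count across the bounding box (assumption)

def estimate_resolution(min_lon: float, max_lon: float) -> int:
    # Cell linear size at resolution r is ~ RES0_DEG / sqrt(7)**r, so
    # solve for the r whose cells are about spread/TARGET_CELLS wide.
    spread = max(abs(max_lon - min_lon), 1e-6)
    wanted_cell_deg = spread / TARGET_CELLS
    res = math.log(RES0_DEG / wanted_cell_deg, math.sqrt(7))
    return max(0, min(15, round(res)))
```

A whole-planet box lands at a coarse resolution, while a one-degree box lands somewhere around resolution 7.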

Right now the maximum number of facet counts is unlimited (-1). It may make sense to impose some sensible limit (perhaps 100k?) to prevent things like streaming all resolution 16 counts for the whole planet.

A couple issues encountered:

  1. Cesium does not like geojson that includes the poles. Hence h3 cells that cover the poles are excluded from the generated geojson.
  2. Cesium and leaflet do not like polygons that cross the antimeridian (±180 longitude). I found a convenient lib (antimeridian-splitter) that splits polygons crossing the antimeridian, which seems to work well.

Geojson is pretty verbose. It could be shrunk a little by reducing the number of decimal places in the coordinates (6 decimal places would be needed at the higher resolutions; 4 would be sufficient for h3 resolutions up to 10).
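The coordinate-precision reduction could be a small post-processing pass over each polygon feature. A minimal sketch, assuming plain GeoJSON Polygon geometries (lists of rings of [lon, lat] pairs); 6 places is roughly 0.1 m, 4 places roughly 10 m:

```python
# Shrink GeoJSON polygon output by rounding coordinate precision in place.
def round_coords(feature: dict, places: int = 6) -> dict:
    geom = feature["geometry"]
    geom["coordinates"] = [
        [[round(lon, places), round(lat, places)] for lon, lat in ring]
        for ring in geom["coordinates"]
    ]
    return feature
```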