usgin / usgin-cache

Cache a whole system in your CouchDB

State of the App #35

Closed rclark closed 10 years ago

rclark commented 10 years ago

@smrazgs @jalisdairi @asonnenschein

The short version

We've got an effective way to go out and cache ALL the data from the NGDS. Visualization remains a hard problem because the clustering algorithm that we'd hoped to use can't handle the number of points we're throwing at it.

The long version

We worked hard last week and we got this indexing code going. What we've accomplished is a server-side program that:

This is then "demonstrable", but unfortunately it won't scale. Everything up until the indexing is pretty bomb-proof; the clustering is the weak link. Basically, the clusterer runs out of memory when I ask it to build clusters across ~200,000 features. This isn't because my machine is out of RAM. Think of it more like a Java VM hitting its memory limit. In Node.js, there isn't a way to increase that memory allocation.

So a Node.js-based clustering mechanism is not the right one for this much data. I'm not, however, aware of great alternatives.

Adrian pointed out a Stack Exchange post about how you might be able to do clustering in PostGIS. This might be promising.

I've written the code so that it's capable of pushing data from the OGC cache into PostGIS tables. I'll make sure that's working smoothly, and this might give us an alternative way to cluster points.

Personally I think we have to do some serious brainstorming about how to visualize all this data. Some other ideas include:

Basically, open to suggestions...

rclark commented 10 years ago

The kmeans thing was easy to install and use.

git clone https://github.com/umitanuki/kmeans-postgresql.git
cd kmeans-postgresql
export USE_PGXS=1
make
make install

Then I had to find where it put kmeans.sql. On my system it ended up here:

/usr/local/Cellar/postgresql/9.3.1/share/postgresql/extension/kmeans.sql

After finding it, I made a database, and ran this script to add the clustering routines:

createdb ngds
psql -d ngds -c "create extension postgis;"
psql -d ngds -f /usr/local/Cellar/postgresql/9.3.1/share/postgresql/extension/kmeans.sql

I pushed 280,000-some borehole temp points into the database, and tried the same query that was suggested in the Stack Exchange post:

SELECT kmeans, count(*), ST_Centroid(ST_Collect(wkb_geometry)) AS geom
FROM (
  SELECT kmeans(ARRAY[ST_X(wkb_geometry), ST_Y(wkb_geometry)], 5) OVER (), wkb_geometry
  FROM boreholetemperature
) AS ksub
GROUP BY kmeans
ORDER BY kmeans;

Which worked just fine, but that 5 is the important toggle. How many "cluster points" you want is a kind of function of which zoom level you're working at. For this app we essentially need to cluster from zoom 0 to about 10. At zoom level 0, just a point or two will suffice, but I don't really know how to scale up through zoom levels.
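
One way to think about scaling the cluster count through zoom levels: in the usual web-map tiling scheme each zoom level doubles the resolution in both directions, so the number of clusters could grow geometrically with zoom. The sketch below only illustrates that idea in Python. `clusters_for_zoom`, `base`, and `growth` are hypothetical names and values, not something taken from the app or from the kmeans extension.

```python
def clusters_for_zoom(zoom, total_points, base=2, growth=2.0, max_zoom=10):
    """Hypothetical heuristic for the k passed to kmeans() at a zoom level.

    Starts with `base` clusters at zoom 0 and multiplies by `growth` for
    each zoom level, never asking for more clusters than there are points.
    """
    if not 0 <= zoom <= max_zoom:
        raise ValueError("zoom must be between 0 and max_zoom")
    k = int(round(base * growth ** zoom))
    # Clamp: at least one cluster, at most one cluster per point.
    return max(1, min(k, total_points))


if __name__ == "__main__":
    # Print the cluster count this heuristic would suggest for zoom 0..10,
    # using the approximate size of the borehole temperature table.
    for z in range(0, 11):
        print(z, clusters_for_zoom(z, 280000))
```

With these made-up defaults, zoom 0 gets 2 clusters and zoom 10 gets 2048. Setting `growth=4.0`, to match the fourfold increase in tiles per zoom level, reaches the cap of one cluster per point by zoom 9, so the right growth factor would have to be tuned by eye.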

The next thing I did was run it with 500 cluster points. The query has been running for ~6 min and hasn't returned yet. I suspect it will return, but it won't be fast. I'm also going to guess that we'll need many more than 500 cluster points at zoom level 10.
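
A possible reason the 500-cluster run is so much slower: in a textbook k-means (Lloyd's algorithm), every iteration compares every point against every cluster center, so the work grows in proportion to the number of clusters. The Python sketch below is a minimal version of that algorithm that also counts distance evaluations. It assumes the PostgreSQL extension behaves roughly like this, which has not been checked against its source. The function name and parameters are illustrative.

```python
import random


def kmeans(points, k, iterations=10, seed=0):
    """Plain Lloyd's k-means on a list of (x, y) tuples.

    Returns (labels, centers, distance_evaluations). Runs a fixed number
    of iterations with no early stopping, to keep the cost easy to count.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k distinct positions as seeds
    labels = [0] * len(points)
    evaluations = 0
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        for i, (x, y) in enumerate(points):
            best, best_d = 0, float("inf")
            for c, (cx, cy) in enumerate(centers):
                d = (x - cx) ** 2 + (y - cy) ** 2
                evaluations += 1
                if d < best_d:
                    best, best_d = c, d
            labels[i] = best
        # Update step: move each center to the mean of its members.
        sums = [[0.0, 0.0, 0] for _ in range(k)]
        for (x, y), lab in zip(points, labels):
            sums[lab][0] += x
            sums[lab][1] += y
            sums[lab][2] += 1
        centers = [
            (s[0] / s[2], s[1] / s[2]) if s[2] else centers[c]
            for c, s in enumerate(sums)
        ]
    return labels, centers, evaluations
```

Under that assumption, roughly 280,000 points with 5 clusters cost about 1.4 million distance evaluations per iteration, while 500 clusters cost about 140 million, before counting any extra iterations needed to converge.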

rclark commented 10 years ago

Not very happy with the result when I map it. Here's the result for my BHT data clustered into 50 points: https://gist.github.com/rclark/f032681a066cc7490675

rclark commented 10 years ago

Another view of the clustering. Each poly is one cluster, and the extent of the poly is the convex hull around the points it contains. Click to see how many points there are per polygon.

https://gist.github.com/rclark/7837c665def236cec899

PostGIS is a powerful thing.

SELECT kmeans, count(*), 
       st_asgeojson(st_transform(ST_Centroid(ST_Collect(proj_geom)), 4326)) AS geom,
       st_asgeojson(st_transform(st_convexhull(st_collect(proj_geom)), 4326)) as poly
FROM (
  SELECT kmeans(ARRAY[ST_X(proj_geom), ST_Y(proj_geom)], 50) OVER (), proj_geom
  FROM (
    SELECT st_transform(st_setsrid(wkb_geometry,4326), 3857) as proj_geom
    FROM boreholetemperature
  ) as projected
) AS ksub
GROUP BY kmeans
ORDER BY kmeans;
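
To make the shape of that result concrete, here is a rough Python sketch of what the outer query produces for each cluster: a count, a centroid (for a collection of points, the mean of the coordinates), and a convex hull. It assumes the cluster labels have already been assigned and skips the reprojection between EPSG:4326 and EPSG:3857. The hull uses Andrew's monotone chain, a swapped-in textbook algorithm that is not necessarily what PostGIS uses internally. The function names are hypothetical.

```python
from collections import defaultdict


def cross(o, a, b):
    """Z component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        # Drop points that would make a clockwise or straight turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def summarize(points, labels):
    """Group points by cluster label; report count, centroid, and hull."""
    groups = defaultdict(list)
    for p, lab in zip(points, labels):
        groups[lab].append(p)
    rows = []
    for lab in sorted(groups):  # mirrors ORDER BY kmeans
        members = groups[lab]
        n = len(members)
        centroid = (sum(x for x, _ in members) / n,
                    sum(y for _, y in members) / n)
        rows.append({"kmeans": lab, "count": n,
                     "geom": centroid, "poly": convex_hull(members)})
    return rows
```

Each returned row corresponds to one row of the SQL result, minus the GeoJSON encoding.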

rclark commented 10 years ago

@smrazgs @asonnenschein @jalisdairi

Quick demo of tile-based visualization: https://a.tiles.mapbox.com/v3/azgs.gia5klal/page.html?#5/38.238/-101.250

rclark commented 10 years ago

The postgis branch is starting to be the solution to our clustering problem.

Problems / Solutions