State of the App - Githubissues

rclark commented 10 years ago

@smrazgs @jalisdairi @asonnenschein

The short version

We've got an effective way to go out and cache ALL the data from the NGDS. Visualization remains a hard problem because the clustering algorithm that we'd hoped to use can't handle the number of points we're throwing at it.

The long version

We worked hard last week and we got this indexing code going. What we've accomplished is a server-side program that:

Is controlled through a simple command-line interface
Harvests a CSW by paginating GetRecords requests and then making GetRecordById requests. Responses are just cached in CouchDB. Since there's no processing, the whole thing takes a matter of seconds.
Finds WFS URLs in the metadata records based only on regular expression matching.
Makes and caches GetCapabilities requests for all the WFS URLs that it finds
Allows the user to specify a FeatureType, and makes all the GetFeature requests that the cache knows about, and stashes those GML response documents.
Again, given a particular FeatureType, converts all the GML docs into GeoJSON, and indexes the features in Solr based on user-defined "mapping" functions.
Can build and store "clustered" points for specific zoom levels including whatever is in the Solr index.
Has a simple client interface that pulls clustered points from the cache until you zoom in close enough, at which point it starts showing data directly from the Solr index (by BBOX of where your map is viewing).
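To make the "regular expression matching" step concrete, here is a minimal sketch of how WFS URLs might be pulled out of a raw metadata record. The pattern and the function name find_wfs_urls are hypothetical; the actual expression used by the harvester is not shown in this thread.

```python
import re

# Hypothetical pattern: any http(s) URL whose text contains "service=wfs",
# matched case-insensitively. The harvester's real expression may differ.
WFS_URL = re.compile(r"""https?://[^\s"'<>]+service=wfs[^\s"'<>]*""", re.IGNORECASE)


def find_wfs_urls(metadata_text):
    """Return the unique WFS-looking URLs in a metadata record, in order of appearance."""
    found = []
    for url in WFS_URL.findall(metadata_text):
        url = url.replace("&amp;", "&")  # undo XML escaping of the query string
        if url not in found:
            found.append(url)
    return found


# Made-up record fragment, for illustration only.
record = (
    "<gmd:URL>http://example.org/geoserver/wfs?service=WFS&amp;request=GetCapabilities</gmd:URL>\n"
    "<gmd:URL>http://example.org/geoserver/wms?service=WMS&amp;request=GetCapabilities</gmd:URL>"
)
print(find_wfs_urls(record))
```

Matching on text alone is cheap, which fits the "no processing" approach of the harvest step, but it will miss WFS endpoints whose URLs do not mention the service name.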

This is then "demonstrable", but unfortunately, it won't scale. Everything up until the indexing is pretty bomb-proof. The clustering is the weak link. Basically, the clusterer runs out of memory when I start asking it to build across ~200,000 features. This isn't because my machine is out of RAM; think of it more like when a Java VM runs out of heap. Node.js caps its heap by default, and although the V8 option --max-old-space-size can raise that cap, a bigger heap looks like a stopgap rather than a fix.

So, a Node.js-based clustering mechanism is not the right one for so much data. I'm not, however, aware of great alternatives.

Adrian pointed out a Stack Exchange article about how you might be able to do clustering in PostGIS. This might be promising.

I've written the code so that it's capable of pushing data from the OGC cache into PostGIS tables. I'll make sure that's working smoothly, and this might give us an alternative way to cluster points.

Personally I think we have to do some serious brainstorming about how to visualize all this data. Some other ideas include:

A WebGL-based map without clustering. OpenLayers3 is a place to start, but it isn't working smoothly and there's a big learning curve. Also, even if WebGL is capable of rendering a million points, how long do you have to wait to get them transferred to the browser? BoreholeTemperatures alone are about 20MB.
A tile-based approach. Build a fancy TileMill project that depicts striking images of how much data is in the system. Here is an example showing Twitter data. The major drawback is limited potential for interactivity. We could build multiple "views" that users could toggle between (as in this example), which could give the effect of filtering through the data. Click and hover interactions would be almost impossible, however, without some effective way to cluster...
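The transfer-time question in the WebGL idea is simple arithmetic. A rough sketch follows; only the 20MB figure comes from the thread, and the link speeds are hypothetical. Latency, compression, and protocol overhead are ignored.

```python
def transfer_seconds(payload_megabytes, link_megabits_per_second):
    """Seconds to move a payload over a link, ignoring latency, compression and overhead."""
    payload_megabits = payload_megabytes * 8  # 8 bits per byte
    return payload_megabits / link_megabits_per_second


# Hypothetical link speeds in Mbit/s; 20 MB is the BoreholeTemperatures size quoted above.
for label, mbps in [("slow DSL", 2), ("typical broadband", 10), ("fast office line", 100)]:
    print(f"{label:>18}: {transfer_seconds(20, mbps):6.1f} s")
```

Gzip on the wire would shrink these numbers, but by how much depends on the payload format, so it is left out here.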

Basically, open to suggestions...

rclark commented 10 years ago

The kmeans thing was easy to install and use.

git clone https://github.com/umitanuki/kmeans-postgresql.git
cd kmeans-postgresql
export USE_PGXS=1
make
make install

Then I had to find where it put kmeans.sql. On my system it ended up here:

/usr/local/Cellar/postgresql/9.3.1/share/postgresql/extension/kmeans.sql

After finding it, I made a database, and ran this script to add the clustering routines:

createdb ngds
psql -d ngds -c "CREATE EXTENSION postgis;"
psql -d ngds -f /usr/local/Cellar/postgresql/9.3.1/share/postgresql/extension/kmeans.sql

I pushed 280,000-some borehole temp points into the database, and tried the same query that was suggested in the Stack Exchange post:

SELECT kmeans, count(*), ST_Centroid(ST_Collect(wkb_geometry)) AS geom
FROM (
  SELECT kmeans(ARRAY[ST_X(wkb_geometry), ST_Y(wkb_geometry)], 5) OVER (), wkb_geometry
  FROM boreholetemperature
) AS ksub
GROUP BY kmeans
ORDER BY kmeans;

Which worked just fine, but that 5 is the important toggle. How many "cluster points" you want is a kind of function of which zoom level you're working at. For this app we essentially need to cluster from zoom 0 to about 10. At zoom level 0, just a point or two will suffice, but I don't really know how to scale up through zoom levels.

The next thing I did was run it with 500 cluster points. The query has been running for ~6 min and hasn't returned yet. I suspect it will return, but it won't be fast. I'm also going to guess that we'll need many more than 500 cluster points at zoom level 10.
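One possible way to scale the cluster count through zoom levels is geometric interpolation between a small count at zoom 0 and a large one at zoom 10. This is only a sketch: the function name clusters_for_zoom and the k_max=2000 endpoint are assumptions, picked because the thread suggests "a point or two" at zoom 0 and well over 500 at zoom 10.

```python
def clusters_for_zoom(zoom, k_min=2, k_max=2000, max_zoom=10):
    """Interpolate the k-means cluster count geometrically between k_min and k_max.

    k_min applies at zoom 0 and k_max at max_zoom. Both endpoints are
    hypothetical tuning knobs, not values measured against the data.
    """
    if not 0 <= zoom <= max_zoom:
        raise ValueError("zoom must be between 0 and max_zoom")
    growth = (k_max / k_min) ** (zoom / max_zoom)
    return round(k_min * growth)


for z in range(0, 11):
    print(z, clusters_for_zoom(z))
```

With these defaults the count roughly doubles per zoom level, which is slower than the fourfold growth in map area per level. Whether that looks right on the map would need checking against real data.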

rclark commented 10 years ago

Not very happy with the result when I map it. Here's the result for my BHT data clustered into 50 points: https://gist.github.com/rclark/f032681a066cc7490675

rclark commented 10 years ago

Another view of the clustering. Each poly is one cluster, and the extent of the poly is the convex hull around the points it contains. Click to see how many points there are per polygon.

https://gist.github.com/rclark/7837c665def236cec899

PostGIS is a powerful thing.

SELECT kmeans, count(*), 
       st_asgeojson(st_transform(ST_Centroid(ST_Collect(proj_geom)), 4326)) AS geom,
       st_asgeojson(st_transform(st_convexhull(st_collect(proj_geom)), 4326)) as poly
FROM (
  SELECT kmeans(ARRAY[ST_X(proj_geom), ST_Y(proj_geom)], 50) OVER (), proj_geom
  FROM (
    SELECT st_transform(st_setsrid(wkb_geometry,4326), 3857) as proj_geom
    FROM boreholetemperature
  ) as projected
) AS ksub
GROUP BY kmeans
ORDER BY kmeans;
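For anyone who wants to see what the hull step does outside the database, here is a minimal pure-Python convex hull using Andrew's monotone chain algorithm. It is an illustration of the concept only, not how PostGIS implements st_convexhull, and the sample points are made up.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Drop the last vertex while it would make a clockwise or straight turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


# Made-up cluster: four corners plus one interior point.
cluster = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
print(convex_hull(cluster))
```

The interior point drops out, which is why a cluster's polygon shows its extent rather than its density.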

rclark commented 10 years ago

@smrazgs @asonnenschein @jalisdairi

Quick demo of tile-based visualization: https://a.tiles.mapbox.com/v3/azgs.gia5klal/page.html?#5/38.238/-101.250

rclark commented 10 years ago

The postgis branch is starting to be the solution to our clustering problem.

adds a function capable of reading from the mapping functions (the same functions that feed Solr) and writing those features to PostGIS
adds a function that returns clustered points in a particular area: you pass in a BBOX, it gives you back cluster points for data in that area.
adjusts the client to pull clusters from PostGIS if there are more than 3000 data points on your map. If there are fewer than 3000 points, it just displays the points directly.
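The switching rule in the last item can be sketched as follows. The function names and the in-memory point list are hypothetical stand-ins for the real PostGIS and Solr queries; only the 3000-point threshold comes from the thread. What happens at exactly 3000 points is not specified above, so this sketch treats it as "show points".

```python
def points_in_bbox(points, bbox):
    """Filter (lon, lat) tuples to those inside bbox = (west, south, east, north)."""
    west, south, east, north = bbox
    return [(lon, lat) for lon, lat in points
            if west <= lon <= east and south <= lat <= north]


def choose_layer(points, bbox, threshold=3000):
    """Return which layer the client should draw and how many points are in view."""
    visible = points_in_bbox(points, bbox)
    if len(visible) > threshold:
        return "clusters", len(visible)
    return "points", len(visible)


# Made-up data: 3001 points along a short line of latitude.
demo = [(-110.0 + i * 1e-4, 32.0) for i in range(3001)]
print(choose_layer(demo, (-111.0, 31.0, -109.0, 33.0)))
```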

Problems / Solutions

Still doesn't scale: it takes PostGIS ~8 seconds to cluster 200,000-ish points. That's too long to wait for a map. The solution might be to pre-cache using PostGIS and store the cached cluster points. This is going to mean another pre-processing step and will probably involve several thousand individual PostGIS queries. More on that later.
The PostGIS query currently does not give us any information about the nature of the data points contained in each cluster. It knows how many points are in a cluster, but that's all. At a minimum we want to know how many of each feature type are in each cluster.
Actually, the query can only cluster on one feature type at a time. Ideally we're going to want to cluster across feature types, I think.
The switch from clusters to data points is very abrupt. May want to implement client-side clustering when the server is returning < 3000 features, just to smooth the transition.
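To get a feel for the size of the pre-caching step, here is a rough count of how many queries it would take if one cluster query were run per map tile over zooms 0 to 10. One query per tile is an assumption, not something the thread settles, and the bounding box is a hypothetical stand-in for the data extent. The tile math is the standard Web Mercator (slippy map) scheme.

```python
import math


def lonlat_to_tile(lon, lat, zoom):
    """Standard Web Mercator XYZ tile index for a lon/lat, clamped to the valid range."""
    n = 2 ** zoom
    lat = max(min(lat, 85.0511), -85.0511)  # Web Mercator latitude limit
    x = math.floor((lon + 180.0) / 360.0 * n)
    y = math.floor((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tiles_in_bbox(west, south, east, north, zoom):
    """Number of tiles at one zoom level that touch the bounding box."""
    x_min, y_min = lonlat_to_tile(west, north, zoom)  # tile y grows southward
    x_max, y_max = lonlat_to_tile(east, south, zoom)
    return (x_max - x_min + 1) * (y_max - y_min + 1)


def precache_query_count(bbox, min_zoom=0, max_zoom=10):
    """Total tiles, and so total queries at one query per tile, across the zoom range."""
    return sum(tiles_in_bbox(*bbox, z) for z in range(min_zoom, max_zoom + 1))


# Hypothetical extent roughly covering the contiguous United States.
print(precache_query_count((-125.0, 24.0, -66.0, 50.0)))
```

Skipping tiles that contain no data, or stopping the cache at a lower zoom, would cut the count substantially.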

usgin / usgin-cache

State of the App #35
