Features are being dropped randomly with no systematic order

Zia- commented 2 months ago

Hi there,

I'm new to Martin. I've been using Tippecanoe so far, and it offers some ways to control how features should be dropped if they exceed a certain limit per tile. I'm not able to control this in Martin. Kindly look at the following two scenarios:

Slightly left view: Slightly right view:

Here, I'm simply panning slightly to my right and the big yellow cluster suddenly appears.

Also, when I start zooming out it randomly drops all features from random locations giving an impression that there is nothing in there. I need to zoom in enough to make sure.

Low Zoom:

High Zoom:

I have made sure it's not network timing out while loading the tile. Below is what my config.yaml looks like, which I copied from https://maplibre.org/martin/config-file.html. Am I missing something? Any kind of help to make it performant while not loosing the impression of where the polygons are at various zoom levels and pans would be extremely helpful.

keep_alive: 75
listen_addresses: '0.0.0.0:3000'
base_path: /tiles
worker_processes: 3
cache_size_mb: 1024
preferred_encoding: gzip
web_ui: enableforall
postgres:
  connection_string: '<my connection string>'
  default_srid: 4326
  pool_size: 20
  max_feature_count: 1000
  auto_bounds: skip
  auto_publish:
    from_schemas:
      - public
    tables:
      source_id_format: 'table.{schema}.{table}.{column}'
      from_schemas: my_other_schema
      id_columns: feature_id
      clip_geom: false
      buffer: 64
      extent: 4096
    functions:
      source_id_format: '{schema}.{function}'
  tables:
    table_source_id:
      layer_id: id
      schema: public
      table: test_table
      srid: 4326
      geometry_column: geom
      id_column: ~
      minzoom: 0
      maxzoom: 10
      bounds: [ -180.0, -90.0, 180.0, 90.0 ]
      extent: 4096
      buffer: 64
      clip_geom: true
      geometry_type: GEOMETRY

CptHolzschnauz commented 2 months ago

I am also a newbie but maybe I can help a bit here on GitHub because Martin is such a GREAT piece of software. Thanks to the contributors!! So, IMHO you show here the example without adaptions? You do auto_publish AND declare tables so I assume you removed the connection sring and Martin found your table by i'ts own, otherwise you would see nothing. To my understanding, your question is related more to the renderer which has it's issues to display the geometries?

nyurik commented 2 months ago

i think we had a few asks about controlling which features go into a tile. At the moment, we only support max_feature_count, but i'm always open to discussing more options... Note that more complex algorithms may result in substantial perf costs

CptHolzschnauz commented 2 months ago

Perfomace is everything! According to my understanding (the doc is not complete on this) it should be possible to provide arguments to PSQL which makes it possible to achieve this in a PSQL function? IMHO this should be either a task for the DB or the viewer, not Martin.

nyurik commented 2 months ago

yes, all this is possible in a pg function. Take a look at the demo site which passes and uses params

CptHolzschnauz commented 2 months ago

Yes, the Taxi sensation, you're right... I meant "TODO: Modify this example to actually use the query parameters." in https://maplibre.org/martin/sources-pg-functions.html

Anyway, I just discovered Martin a few days ago and I'm thrilled! Thank you and all the contributors for this fantastic piece of software!

Zia- commented 2 months ago

Thanks for your comments @nyurik and @CptHolzschnauz . Let me give you one example to explain the kind of behavior I'm trying to achieve.

Below is the screenshot of 2000 randomly generated points (can't upload the geojson but happy to share if needed for testing). Kindly note how they are spread. I have spread them into two clusters, each one having 1000 points each.

Now when I upload this table, setting max_feature_count: 1000, in QGIS, at zoom 5 I see the data like this:

When I create vector tiles against the same data using Tippecanoe using the following command: tippecanoe -e tiles_dir/ 2000_points.geojson -Z0 -z10 -d10 -O1000 -pk at zoom 5 it looks like this:

My say is, apart from all the possibilities of having a pg function that can be used to neatly filter features on the fly always keeping them well within the _max_featurecount limit, if there is a scenario where there is no pg function or the function happened to return more than 1000 features there is no systematic order in which Martin drops them. IMHO, tile coming from Tippecanoe is more representative of what's there on the ground (although those points are mostly synthetic, guess by taking an average of nearby points and so you won't find them exactly at the same location in the original data; but from the visual point of view it's acceptable IMHO). At least we can say there are roughly two clusters with around equal number of features. In case of Martin, it looks like there is nothing in South-East region. One only needs to zoom in enough to figure out the actual spread of points.

Having an inbuilt feature to drop features beyond the _max_featurecount limit in a way that doesn't loose the essence of the data (mainly their spatial spread and density) would be great.

Hope I managed to give a reasonable explanation.

Just sharing one more screenshot, this is what Martin shows at zoom 4

nyurik commented 2 months ago

Thx for all the in depth explanation! So, two things:

the database side: if Martin is to remain a thin and fast wrapper, it should rely on PG to get the needed data. Which means using SELECT ... LIMIT 1000. In other words, which features are actually included is decided as part of the query. Downloading all features from PG to Martin, and then doing filtering/clustering/randomness and dropping them might be inefficient. In theory, we could use ORDER BY RANDOM(...) LIMIT N, but again the performance implications will need to be evaluated, and even if desired, this will probably be an opt-in feature.
the tippecanoe aspect - this one is easier - could you take a look at their code and see what they do? :)

Zia- commented 2 months ago

Thanks for your quick reply @nyurik So, from my end, I'm pretty happy doing the PG part of it and not touching Martin. It was just a suggestion given my experience with Tippecanoe. As far as Tippecanoe is concerned all I know is EricaF https://github.com/e-n-f https://www.linkedin.com/in/erica-fischer-916a3615b/ is the main developer and is written in C (something that I know very little about :) It has been forked here and currently maintained by Felt https://github.com/felt/tippecanoe . Maybe you can create an issue to ask exactly the technical part of it (it's a pretty active repo).

nyurik commented 2 months ago

@e-n-f is great! Maybe we can get their feedback on this?

e-n-f commented 2 months ago

Hi! Just to make sure, you're saying that you want to emulate the tippecanoe behavior, not that you have found a bug in tippecanoe?

If tippecanoe needs to --drop-fraction-as-needed within a tile, it does it in a sequence that is meant to be as spatially uniform as possible. Specifically, it calculates a 64-bit quadkey or 64-bit hilbert index from the X and Y coordinates, and then reverses the order of the bits to give each feature a "drop sequence" number. The features that need to be dropped are dropped in drop sequence order.

Spatialite can give you a quadkey equivalent with st_geohash. I'm not sure how to do the bit reversal in SQL, but maybe string reversal would be close enough to give you a pretty good drop sequence.

e-n-f commented 2 months ago

For polygons I would recommend --drop-densest-as-needed instead of --drop-fraction-as-needed, which calculates its pseudo-density by putting the features in quadkey order and then calculating the distance from the centroid of the feature to the most distant vertex in the following feature. (The reason for using the most distant vertex is to keep large features from being considered high-density just because their centroid happens to be near the previous feature's centroid.)

Both methods have been tuned recently to be more consistent between zoom levels. I can talk about the old behaviors if you want the details, though.

nyurik commented 2 months ago

thanks for such awesome in-depth info! So it sounds like Tippecanoe does not rely on SQL to get just the needed data, and rather retrieves all data, computes geohash (quadkey) of the centroid, and then incrementally drops last digits (picking one item per shortened key) - until it has small-enough dataset. Essentially an incremental clustering approach

e-n-f commented 2 months ago

Right, yes, tippecanoe does not use SQL, but does all of its ordering and processing internally from the full set of features, and yes, the result is effectively clustering, but locating the feature representing the cluster at the location of an arbitrary feature in the cluster rather than at the mean of the clustered features' locations.

nyurik commented 2 months ago

thx for all the feedback @e-n-f! So to make this work, I see two main paths:

(simpler) come up with some opt-in SQL query-based selection method for tables. Evaluate performance impact and add it as a new cool feature.
(complex) implement Martin-based tile encoding, and allow Martin to select which features it may want to encode if there are too many of them. While complex, Martin-based tile creation would offer significant benefits long term, allowing new tile format (MLT) support, more control of how things get created, new data sources (like geojson or geometries from sqlite), etc

maplibre / martin

Features are being dropped randomly with no systematic order #1455