[2.1] Feature ID datatype

ARolek commented 7 years ago

We recently received a request to support text ids for Feature IDs on our MVT server. Looking at the spec, it appears there is no restriction on the datatype for Feature ID:

A feature MAY contain an id field. If a feature has an id field, the value of the id SHOULD be unique among the features of the parent layer.

The provided protocol buffer file has type uint64 for Feature IDs. Was the datatype for Feature ID left out of the spec intentionally to allow for future support of different data types, or is the intent to always require a uint64 type for Feature IDs?

flippmoke commented 7 years ago

The intent was to always rely on a uint64 type for the version 2.x release; it is good to note that this is perhaps not well described in the specification. In a future release of the specification we are considering the addition of non-integer ids. Until then, many people are simply using a property field to supply an id of non-integer formatting.
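
The property-field workaround could look roughly like the sketch below, applied to a GeoJSON-style feature before it is encoded into a tile. The function name `move_string_id_to_property` and the property key `string_id` are made up for illustration; nothing in the spec names them.

```python
from typing import Any, Dict


def move_string_id_to_property(
    feature: Dict[str, Any], key: str = "string_id"
) -> Dict[str, Any]:
    """Return a copy of the feature whose non-uint64 id lives in properties[key].

    `key` is a hypothetical property name chosen for this example.
    """
    result = dict(feature)
    # Copy the properties so the caller's feature is not modified.
    result["properties"] = dict(feature.get("properties") or {})
    feature_id = result.get("id")
    is_uint64 = (
        isinstance(feature_id, int)
        and not isinstance(feature_id, bool)
        and 0 <= feature_id < 2**64
    )
    if feature_id is not None and not is_uint64:
        # The id cannot be stored in the uint64 field, so keep it as a property.
        result["properties"][key] = feature_id
        del result["id"]
    return result
```

Integer ids that fit in a uint64 pass through unchanged; anything else is moved into the properties, where a client can still read it.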

ARolek commented 7 years ago

Got it. The property field is a great workaround.

e-n-f commented 6 years ago

For VT3, I propose that there should be a new optional field in the feature:

                optional uint64 id_value = 5;

that makes the ID be a reference into the layer values table, supporting any data type that is allowed there.
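
A reader could resolve such an ID roughly as sketched below. This models a decoded layer and feature as plain Python structures rather than using any real protobuf API, and the function name `resolve_feature_id` is an assumption; only the field name `id_value` comes from the proposal.

```python
from typing import Any, Dict, List, Optional


def resolve_feature_id(
    feature: Dict[str, Any], layer_values: List[Any]
) -> Optional[Any]:
    """Look up a feature's ID, preferring the proposed id_value reference.

    `feature` and `layer_values` are simplified stand-ins for a decoded
    feature and its layer's values table.
    """
    if "id_value" in feature:
        # id_value is an index into the layer's values table.
        return layer_values[feature["id_value"]]
    # Fall back to the existing integer id field, if present.
    return feature.get("id")
```

Because the values table can hold strings, numbers, and booleans, the lookup can return any of those types.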

anandthakker commented 6 years ago

@ericfischer how would this interact with https://github.com/mapbox/vector-tile-spec/issues/75#issuecomment-250011684? Similarly, does this mean that true and false would be valid feature ids? Unless there are compelling use cases for ids that aren't strings or numbers, I'd be in favor of sticking with those types so as to not have VT feature ids be unrepresentable in GeoJSON (the inverse of the current problem).

e-n-f commented 6 years ago

@anandthakker Oh you're right: GeoJSON only allows strings or numbers for IDs, so we should probably follow that precedent. I had thought any JSON object was a legal GeoJSON ID.

sgillies commented 6 years ago

@flippmoke @anandthakker @ericfischer please don't allow numbers or strings only because GeoJSON does. This is a defect in the IETF spec that only exists because standards making is messy and compromises have to be made sometimes. Standardize on strings. All integers are representable as strings (at the cost of a few extra bytes) and strings are much more expressive than numbers.

e-n-f commented 6 years ago

@sgillies So a future revision of the GeoJSON spec will treat numeric IDs and stringified numbers as canonically equivalent?

Putting the backward compatibility hat on, I would urge that even if strings are the canonical ID form, we should still represent any IDs that happen to be 64-bit integers as numbers, because that's what they are now, and files and tools that work now should continue to work.

andrewharvey commented 6 years ago

Mapbox Datasets use strings as autogenerated Feature IDs and it would be nice to have stable IDs between the Dataset and the Tileset created from the Dataset. So unless Mapbox Datasets changes I think that means vector tiles need to support string ids.

asheemmamoowala commented 6 years ago

Having arbitrary types for IDs seems unnecessary. I second @sgillies that it would be best for IDs to be a single known type, string, instead of supporting variants. In addition to being able to convert between GeoJSON's string ids, it would also allow conversion to string representation from other types, JSON serialization, and unproblematic representation in JavaScript.

flippmoke commented 6 years ago

Despite their use in a lot of datasets, strings are often terrible IDs. Searching and indexing of IDs with strings is much more difficult than simply using an integer. I know that string ids are something that is often requested, but I feel like removing integers is a bad decision. I think large integers are likely the best solution: strings could be computed into bit hashes, which could prove to be more useful than full IDs. However, I know that this is not an ideal solution in all cases.
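
One way to compute a string into a 64-bit hash is sketched below. Truncating SHA-256 is a choice made for this example, not something named in the thread, and the function name `string_id_to_uint64` is hypothetical.

```python
import hashlib


def string_id_to_uint64(text: str) -> int:
    """Map a string ID to an integer in the uint64 range.

    Uses the first 8 bytes of a SHA-256 digest (an assumption made for
    this sketch). The mapping is deterministic but one-way, and distinct
    strings can in principle collide.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")
```

The original string cannot be recovered from the hash, so a client that needs it would still have to carry it in a property.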

I am willing to yield that we could provide string ids in VTs, but I think it's often a bad design (for the data or for systems that would choose to use it). I do not think that dropping integers would help here, as it would push even simple numbers to be represented as strings, causing extra bloat in all vector tiles.

sgillies commented 6 years ago

@flippmoke a problem with integer ids that I see cropping up at Mapbox is this: when you combine two sets of features that each have integer ids, we get id collisions. With string ids, collisions are very avoidable: GUIDs and UUIDs collide rarely and are cheap to generate. Avoiding integer collision looks like a much harder problem to me, requiring all sets of features everywhere to be reconciled to a single list of integers.
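
The collision problem can be illustrated with a small sketch. The two datasets and their sequential numbering are invented for the example.

```python
import uuid

# Two independently produced datasets that both number features from 1.
dataset_a_int_ids = [1, 2, 3]
dataset_b_int_ids = [1, 2, 3]
int_collisions = set(dataset_a_int_ids) & set(dataset_b_int_ids)

# The same datasets with randomly generated UUID strings as ids.
dataset_a_uuid_ids = [str(uuid.uuid4()) for _ in range(3)]
dataset_b_uuid_ids = [str(uuid.uuid4()) for _ in range(3)]
uuid_collisions = set(dataset_a_uuid_ids) & set(dataset_b_uuid_ids)

print(len(int_collisions), len(uuid_collisions))
```

Every sequential integer id collides on merge, while random UUIDs need no coordination between the two producers to stay distinct.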

e-n-f commented 6 years ago

We have to continue to support integers because they are in use by existing code and existing tiles. We can't remove existing features.

GeoJSON wants strings and users want strings so we should allow the use of strings, even though they are not optimally efficient.

The big question for me is whether string representations of numbers are canonically equivalent to those numbers. Is "1234" the same ID as 1234?

If they are equivalent, we can tell tile readers and writers to always represent IDs as strings in their internal data structures, and encourage them to use the numeric representation in tiles for strings that have the form of numbers.

If they are not canonically equivalent, then everybody's software needs to either support variant keys or have two tables of keys so they can look up IDs of either type.
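
The two alternatives can be sketched with ordinary Python dictionaries. The table names and the `normalize` helper are hypothetical and only illustrate the difference in lookup behavior.

```python
from typing import Dict, Union

FeatureId = Union[int, str]

# Case 1: IDs are NOT canonically equivalent. A variant-keyed table keeps
# the integer 1234 and the string "1234" as two distinct entries.
variant_index: Dict[FeatureId, str] = {}
variant_index[1234] = "feature A"
variant_index["1234"] = "feature B"


# Case 2: IDs ARE canonically equivalent. Every ID is normalized to a
# string before lookup, so both forms land on the same entry.
def normalize(feature_id: FeatureId) -> str:
    return str(feature_id)


string_index: Dict[str, str] = {}
string_index[normalize(1234)] = "feature A"
string_index[normalize("1234")] = "feature B"  # replaces "feature A"
```

In the first table both features survive; in the second they are treated as the same feature, which is exactly the behavior the equivalence question decides.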

joto commented 6 years ago

Is there anything in the vector tile spec itself that needs IDs for "internal vector tile" needs? I don't think so. Which means the vector tile spec shouldn't put any meaning into IDs, so it should not prescribe that "1234" is the same as 1234. It would make everything hugely difficult anyways, because what about "01234", or an integer string that is so long it doesn't fit into our integer type, or spaces in the string, etc.
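
These edge cases can be made concrete with a strict check that accepts a string only if it is exactly the decimal form of a uint64. The function name `is_canonical_uint64` and the round-trip rule are assumptions for this sketch, not anything the spec defines.

```python
UINT64_MAX = 2**64 - 1


def is_canonical_uint64(text: str) -> bool:
    """True only if text is exactly the decimal rendering of a uint64."""
    # Reject empty strings, whitespace, signs, and non-ASCII digits.
    if not (text.isascii() and text.isdigit()):
        return False
    # UINT64_MAX has 20 digits, so anything longer cannot fit.
    if len(text) > 20:
        return False
    value = int(text)
    # The round trip rejects leading zeros such as "01234".
    return value <= UINT64_MAX and str(value) == text


for candidate in ["1234", "01234", " 1234", "18446744073709551616"]:
    print(repr(candidate), is_canonical_uint64(candidate))
```

Only "1234" passes; the leading zero, the leading space, and the value one past the uint64 maximum are all rejected, so any equivalence rule would have to spell out each of these cases.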

I am still not convinced that string ids are a good idea. Yes, being compatible with GeoJSON would be nice, but then again, being compatible with Shapefiles is also nice, and they don't have string ids. However you turn it, this will be compatible with some things and incompatible with others.

Maybe before we decide this we have to ask what those IDs are even used for. Why can't the string id that some user has simply be stored in a property? We allow any number of properties; is this not enough? Datasets often don't just have a simple id, but several, based on where the data comes from. All of this can be put into multiple id properties. And you can even give them more sensible names (like UUID, or building_id_from_registry_foo).

This discussion is similar to the one we have in OSM all the time: people want OSM ids to be stable. But OSM ids don't refer to real-world objects; they are just an internal handle used to connect data inside OSM together. If you need a stable ID, just put the one you have from outside OSM into an OSM tag and everything is fine.

Why do we even have an ID that's special and not just another property? Because everybody else has? It only makes sense to have this ID at all if we attach any special meaning to it really, and we don't. We say it has to be unique, presumably to allow merging split objects, but this is all rather vague...

mapbox / vector-tile-spec

[2.1] Feature ID datatype #94