mapbox / geobuf

A compact binary encoding for geographic data.

Streamable, flattened geobuf format revision #51

Closed (mourner closed this 8 years ago)

mourner commented 9 years ago

A new format revision to enable the possibility of streaming reads and writes, discussed in #37.

The encoding/decoding implementation here isn't streaming — I think this is a task for a separate repo (geobuf-stream). The spec should also be in a separate geobuf-spec repo.

The current Geobuf encodes nested protobuf messages to reflect the GeoJSON structure (nested objects, e.g. a FeatureCollection with a bunch of Features, each having a Geometry or a GeometryCollection with a bunch of geometries). The idea behind the new revision is that objects are instead encoded as a flat sequence like this:

FeatureCol # (top level)
Feature # child of FeatureCol
GeometryCol # child of Feature
Geometry # child of GeometryCol
Geometry # child of GeometryCol
Feature # child of FeatureCol
Geometry # child of Feature
Feature # (top level)
Geometry # child of Feature

The nature of nesting in GeoJSON allows us to derive the structure, e.g. Feature can only belong to a FeatureCollection (so if it follows one, it’s a child).

This flat way of encoding makes streaming and partial reads and writes much easier while simplifying the format itself significantly.

To support streaming multiple GeoJSON datasets and concatenating Geobufs, there's an is_top_level field that indicates whether an encoded object is the root of a new GeoJSON object (rather than a child of the previous one).
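
A rough proto2-style sketch of what such a flat message could look like (apart from is_top_level, lengths and coords, the field and enum names and all field numbers here are assumptions, not the actual schema linked below):

    message Object {
        enum ObjectType {
            FEATURECOLLECTION = 0;
            FEATURE = 1;
            GEOMETRYCOLLECTION = 2;
            POINT = 3;
            MULTIPOINT = 4;
            LINESTRING = 5;
            // ... remaining geometry types
        }
        optional ObjectType type = 1;

        // true when this object starts a new root GeoJSON object rather than
        // being a child of the previous one (allows concatenating Geobufs)
        optional bool is_top_level = 2;

        // geometry-only fields
        repeated uint32 lengths = 6 [packed = true]; // coordinate structure in lengths
        repeated sint64 coords = 7 [packed = true];  // delta-encoded integer values

        // feature-only fields (id, properties) would follow; nesting is
        // derived from the order and types of consecutive objects
    }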

The only case we’re dropping is support for a GeometryCollection nested within another GeometryCollection, which @sgillies called bullshit and said we can safely not support.

Here's the new schema proto: https://github.com/mapbox/geobuf/blob/stream/geobuf.proto

mourner commented 9 years ago

Just got a cool new idea: ditch the Header message (expected at the beginning) and embed the keys/precision/dimensions info in objects directly, omitting it from subsequent objects when it's the same as in the current one. This keeps the compression ratio while making the format freely concatenatable (and also allows having data with different precision/dimensions/keys in the same file).
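
As a sketch of that idea (field numbers and defaults here are assumptions, not a committed schema), the former Header info would simply live on the object itself and be written only when it changes:

    // inside the same Object message (sketch); formerly in the Header,
    // encoded only when different from the values currently in effect,
    // so identical subsequent objects add no overhead:
    repeated string keys = 8;                      // property key dictionary
    optional uint32 dimensions = 9 [default = 2];  // 2D vs 3D coordinates
    optional uint32 precision = 10 [default = 6];  // decimal digits of coordinate precision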

mourner commented 9 years ago

@joto @sgillies @springmeyer @mick @artemp @brendan-ward would love more eyes on this so that I can go forward with the approach, in case I missed anything important. I've updated the PR description with the main idea.

mourner commented 9 years ago

Also of note: with a simple approach like this, we could still easily make a more complex wrapper format that chunks the data as proposed by @joto here, in case we need it later — the approaches are compatible.

joto commented 9 years ago

The ideas with the flattening of the hierarchy and the removal of the header are interesting, but I am not sure it is going in the right direction. As I see it, the purpose of the header is to store metadata that's valid for the whole file. That allows you to get an idea of what's in a file without reading the whole file. This isn't possible any more in the new format. Say you have a piece of software that can only work with 2D data, not 3D. It has to read the whole file before it can be sure it will work, because only after reading the last object does it know there is no 3D data in there. I think it is important to have a proper header, or at least the space for one if it is needed later.

The whole format is now rather confusing, because you are not using Protobuf in the way it was intended. Instead of having messages of different types there is basically only one type called Object (always a dead giveaway if something has such a generic name), and a new type system is added on top of Protobuf using the ObjectType enum. Similarly with the Value. This is the result of trying to encode a rather lax format (GeoJSON) in something that is made for strongly typed data (Protobuf). Instead of using Protobuf's strengths, it is only used as part of the encoding, leaving a lot to be implemented outside the protobuf part. For instance, we probably have to document and check somewhere that a FeatureCollection can only be followed by a Feature, or that a Feature can not contain coords or whatever (not sure about those details myself).

If we are going this way, I could imagine that a cleaner approach would be to take the object type out of the Protobuf part. This would look like this:
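
A rough sketch of that approach, with hypothetical message names and field numbers: each object would be written as a small type tag outside Protobuf, followed by a length-delimited message of the matching type.

    // framing outside Protobuf, per object: [type byte][varint length][message bytes]
    // e.g. type byte 0 = FeatureCollection, 1 = Feature, 2 = Geometry, ...

    message FeatureCollection {
        // nothing but optional custom members; its Features follow as
        // separately framed messages
    }

    message Feature {
        optional string id = 1;
        repeated uint32 properties = 2 [packed = true]; // key/value index pairs
    }

    message Geometry {
        repeated uint32 lengths = 1 [packed = true]; // coordinate structure
        repeated sint64 coords = 2 [packed = true];  // delta-encoded integers
    }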

This way each protobuf message can be of a different type and it doesn't have to be Object.

As a side note: default values in Protobuf should probably be avoided; version 3 doesn't support them any more, and protozero can't handle them transparently.

mourner commented 9 years ago

Thanks for the feedback, super-helpful! I respectfully disagree with your main point though.

As I see it, the purpose of the header is to store metadata that's valid for the whole file. That allows you to get an idea of what's in a file without reading the whole file. This isn't possible any more in the new format.

I don't think we care much about being able to describe the whole file. Having strict limits on what can be within one file is not compatible with our goal of retaining all of GeoJSON's ambiguity in the data. If we wanted to make a strict geo format, we wouldn't consider converting from/to GeoJSON at all — maybe that's a task for a completely separate format design project.

Say you have a piece of software that can only work with 2D data, not 3D. It has to read the whole file before it can be sure it will work, because only after reading the last object it knows there is no 3D data in there.

With my proposed approach, this piece of software could read the first few bytes of the file and already know whether the first dataset encoded in that file is 2D (because this info is encoded in the top-level object), and deal with subsequent datasets separately. Generally I think we need to think about Geobuf in terms of streams rather than files — the file separation is an implementation detail that shouldn't be a concern of software that consumes the data. The latter should care about individual datasets (and the features/geometries within them).

Instead of using Protobufs strengths, it is only used as part of the encoding leaving a lot to be implemented outside the protobuf part.

Not a lot — the encoding and decoding code actually got simpler and shorter with this PR, as you can see. I don't see any problem with having a generic Object message, because there'd be pretty much no difference in the code if we moved the type byte outside of the message — we would handle the contents of each message the same way. The types are also not too different — actually, the only difference is that only geometries can have coordinates, and typically only features have ids and properties. If we make the code for encoding/decoding each message type separate, it will get more complex and verbose.

For instance we probably have to document and check somewhere that a FeatureCollection can only be followed by a Feature

This is easy to do, and the alternative would be to use Protobuf embedded messages, which is slow and not compatible with streaming.

or that a Feature can not contain coords or whatever

It can contain anything, and so can GeoJSON. I think it is the responsibility of the consumer to interpret things like features with coordinates, even if that means ignoring them completely.

As a side note: default values in Protobuf should probably be avoided, version 3 doesn't support them any more and protozero can't handle them transparently.

They're used more as documentation than as an instruction to a protobuf decoder. I could just as well remove them from the proto schema and still describe the handling of defaults in the spec.

mourner commented 9 years ago

Looks like we need some more eyes on this to make a decision. Let's wait for some people to get back from vacation and weigh in. :)

brendan-ward commented 9 years ago

@mourner I have not had a chance to take this new version for a test drive, but I like the move toward simplicity that I am seeing here.

I think some additional comments in geobuf.proto specifying which fields are only valid for a particular ObjectType would add some helpful clarity, e.g.:

    // Applies only to ObjectType in (POINT, MULTIPOINT, LINESTRING...)
    // geometry coordinates (where the bulk of the size goes)
    repeated uint32 lengths = 6 [packed = true]; // coordinate structure in lengths
    repeated sint64 coords = 7 [packed = true]; // delta-encoded integer values

But that could also just go into documentation alongside the spec in the dedicated repo you suggested above.

And since this represents a new version, a bit more bikeshedding: what about reordering the fields in Value to either be alphabetical or based on increasing complexity:

    message Value {
        oneof value_type {
           bool bool_value = 1;
           uint64 pos_int_value = 2; 
           uint64 neg_int_value = 3; 
           double double_value = 4;  
           string string_value = 5;
           string json_value = 6;
        }
    }

To address some of the concerns raised by @joto above about checking validity, I wonder if that doesn't belong in a separate repo / project from this implementation; something whose sole purpose is to check the validity of the geobuf and make sure that your rules about layout of fields with respect to type are properly adhered to. Would likely be handy for implementers of the spec. E.g., what @tmcw and others did for GeoJSON...

mourner commented 9 years ago

One more idea: instead of is_top_level, we can do is_last. This way we can emit a GeoJSON object as soon as it is parsed without the need to parse further.

mourner commented 8 years ago

Going to deep dive into this again (really want us to get Geobuf out the door sooner).

One casual thought I just had: either is_top_level or is_last can be ambiguous in certain situations, but we can make things more explicit by introducing an enum field instead, with e.g. 1 indicating the start of a nested structure, 2 indicating the last object in it, and 0 (the default, not usually encoded) for everything else (derived from the object type/order).
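
For illustration (names and the field number are assumptions), that could look like:

    // inside the Object message, replacing is_top_level / is_last (sketch):
    enum Position {
        MIDDLE = 0; // default, usually not encoded; derived from type/order
        START = 1;  // first object of a nested structure
        LAST = 2;   // last object within it
    }
    optional Position position = 3;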

tmcw commented 8 years ago

I think I agree with @joto's feelings around a generic Object type, since it really does seem to make the protobuf file itself less explanatory and could use protobuf's types more than it does. Moving object types from the enum field into the protobuf schema itself seems like it'd make the format more straightforward, since you wouldn't have unused 'coordinates' within a Feature object.

A global header at the top of the file, though, seems like it runs counter to streaming, unless implementations do fancy magic to append at the beginning of the file.

mourner commented 8 years ago

@tmcw this is tricky — because of the arbitrary nature of GeoJSON, if we do a separate Protobuf type for each object type, most of the fields have to be duplicated, in addition to the complexity of defining the type outside of the message (making streaming more cumbersome). I can make a .proto version with separate messages, then we can compare and make a decision.

The types are also not too different — actually, the only difference is that only geometries can have coordinates, and typically only features have ids and properties. If we make the code for encoding/decoding each message type separate, it will get more complex and verbose.

mattwigway commented 8 years ago

This is also important, streaming reads/writes aside, to allow large datasets to be stored in Geobuf. Google's CodedInputStream has a default maximum size limit of 64 MB, and changing it in the context of Protobuf-generated code is tricky. (Experienced using Java and conveyal/geobuf-java, but likely to be a problem with any tool that makes use of Google's protobuf libraries.)

Background: I have Census data for 195,000 Census blocks in the Los Angeles metro region (produced using our seamless census toolset). The PBF is 87MB, which is too large to decode.

mourner commented 8 years ago

@mattwigway that's only if you use Google's protobuf libraries, which are very inefficient. The Geobuf encoder/decoder uses the pbf module (which could probably be ported to Java easily too).

mourner commented 8 years ago

Made an attempt at creating a schema with explicit top-level types that's now much closer to @joto's vision of the format: geobuf-explicit.proto. It's much more verbose (a lot of fields are duplicated across types), but at the same time should be easier to understand and implement while enabling easy streaming reads/writes and multithreaded reading. Main points:

@joto @tmcw @sgillies @springmeyer @flippmoke @brendan-ward would appreciate any feedback.

joto commented 8 years ago

@mourner Just a few things I noticed right away:

The "repeated Value" is inefficient, because it can not be packed. For small integers for instance this will have some overhead because the type has to be encoded again and again. Not sure we can avoid this.

I am not sure I understand how the block_size is supposed to work. You can not store the size of the Metadata block inside itself, so I assume it is the sum of the lengths of all the Object messages after the current one, until a new Metadata block appears? This mixes different logical layers. I have to look into the Metadata inside the Object before I can figure out how much more data to read? I would prefer a block structure that's completely outside everything else, allowing for a clear separation of concerns.

mourner commented 8 years ago

@joto excellent points!

The "repeated Value" is inefficient, because it can not be packed. For small integers for instance this will have some overhead because the type has to be encoded again and again. Not sure we can avoid this.

Initially I did it this way because this is how values are encoded in the Vector Tile spec. But it looks like there's a way to take advantage of packed encoding by having multiple arrays of values (one for each type). We should also probably store values in the metadata rather than in each feature: if we split the data into blocks, there's no risk of the values blowing up in size, so I moved them to the metadata. This scheme will make encoding/decoding a bit trickier, but it's probably worth it.
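
A sketch of what those packed per-type value arrays in the metadata could look like (field names and numbers are assumptions); exactly how properties would index into several arrays is discussed below:

    message Metadata {
        repeated string keys = 1; // property keys, referenced by index

        // values split into one array per type so the numeric ones can use
        // packed encoding; properties reference keys and values by index
        repeated string string_values = 2;
        repeated double double_values = 3 [packed = true];
        repeated uint64 pos_int_values = 4 [packed = true];
        repeated uint64 neg_int_values = 5 [packed = true];
        repeated bool bool_values = 6 [packed = true];
        repeated string json_values = 7; // JSON-encoded fallback for complex values
    }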

I have to look into the Metadata inside the Object before I can figure out how much more data to read? I would prefer a block structure that's completely outside everything else, allowing for a clear separation of concerns.

Yeah, and if we're doing multi-threaded reading, we would need to read keys/values in the same thread that reads everything else, so we need the block size outside. Would the following be OK? (reflected in the last gist revision)

block_size
metadata
features/geometries
block_size
metadata
features/geometries
...

mourner commented 8 years ago

@joto @tmcw any feedback on the last comment? Relevant proto.

joto commented 8 years ago

@mourner: I don't know how you want to address the multiple packed arrays for each type from the one key/value "pointer" in the features. If you number through all the arrays, you have to know how many int values there are going to be before you can start numbering the double values etc. Such a format would be very expensive to write, because you'll need multiple passes over the input data. If you use different numbered ranges, you lose the advantage of the varint encoding of the properties.

Regarding the blocks: yes, that's basically how it would look. In addition, a fixed header with some magic data as the first thing in the file would be nice, for instance 8 bytes GEOBUFvv, with vv being the version in ASCII. That makes it really easy to detect the file format.
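
Combined with the block layout proposed above, the whole file would then look something like this (a sketch, assuming block_size covers the metadata plus the objects that follow it):

GEOBUFvv # 8-byte ASCII magic, vv = version
block_size # varint, outside Protobuf; length of the block below
metadata # keys/values/precision/dimensions for this block
features/geometries
block_size
metadata
features/geometries
...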

mourner commented 8 years ago

If you number through all the arrays, you have to know how many int values there are going to be before you can start numbering the double values etc. Such a format would be very expensive to write, because you'll need multiple passes over the input data.

@joto But don't you have to do 2 passes per block anyway because you have to encode unique keys/values in the meta header, and need to write it first before writing any features? With multiple value arrays, it would be the same process. Am I missing something?

joto commented 8 years ago

@mourner There are, of course, different ways to implement the writer, depending on the details of how you can access the data that you want to write. Say you have all your objects in some kind of array. You can walk over that array and do three things for each object you encounter:

  1. Generate some kind of data structure (like a tree) to keep attribute -> index mapping and de-duplicate attributes
  2. Write out new attributes into the metadata block
  3. Write out objects into the features/geometries block using the mapping from step 1.

When you are done, discard the data structure from step 1 and write out the results of steps 2 and 3. This only needs one pass. It needs some memory to hold intermediate results, but not too much if the blocks are not that large.

mourner commented 8 years ago

@joto it seems to come down to a choice between taking a hit in the resulting file size (encoding values in one non-packed array of embedded messages) and taking a hit in encoding speed due to the need to read properties twice. What's the better route here?

joto commented 8 years ago

@mourner I don't think we have exhausted the design space yet. For instance, in most cases all attribute values with the same key will have the same type, because that's basically what a "layer" is in the traditional GIS sense. Maybe we can take this information into account somehow.

mourner commented 8 years ago

@joto in most cases, but not always. Given that we want lossless GeoJSON <-> Geobuf conversion, I wouldn't rely on this.

I don't think we have exhausted the design space yet.

This may be worth deferring to future iterations, since with the current pace, exhausting the design space may take a few years if we want the format to be flawless.

mourner commented 8 years ago

@joto thought about this for a while and I can't come up with a flawless design that would avoid the overhead of values stored as non-packed repeated messages without introducing the overhead of double-pass encoding. How about we go with repeated values for now? I just don't want Geobuf development to be stuck because we haven't exhausted the design space — it could drag on forever. cc @tmcw

joto commented 8 years ago

@mourner I can only help with ideas, but somebody else has to make the decision when Geobuf is good enough for a release. I don't have a use case for Geobuf at the moment, so I can't compare what we have with what I need. You, obviously, need this thing, so if it is good enough for you, go for it.

tmcw commented 8 years ago

get some feedback

From my perspective, this is good to go.

mourner commented 8 years ago

It regressed on decoding speed, so I want to delve into this more deeply again before considering a merge.

mourner commented 8 years ago

The PR is out of date now, so I'm closing it and will follow up later with a new PR. The latest revision of the schema I'm envisioning for the next major version of Geobuf is https://gist.github.com/mourner/8754884156f3f34d8ee9. Changes since the last revision: