mapbox / tippecanoe

Build vector tilesets from large collections of GeoJSON features.
BSD 2-Clause "Simplified" License
2.72k stars, 432 forks

Stream in geojson/geobuf and stream out tiles #675

Open dylrich opened 5 years ago

dylrich commented 5 years ago

Hi! I'm having some trouble setting up tippecanoe to do tiling without actually writing to a file, and I would love some assistance.

I was able to get geojson to stream with the following command:

echo '{"type":"Polygon","coordinates":[[[0,0],[0,1],[1,1],[1,0],[0,0]]]}' | tippecanoe -L'{"file":"", "layer":"test", "description":"test"}' -e test

That didn't feel quite right, as I am basically passing in an almost-empty layer JSON just to get it to stream. What is the intended method to accomplish this with GeoJSON? Related to that, how would I pipe in a Geobuf with the same method? When I try inserting an example base64-encoded Geobuf, e.g. CgRuYW1lGAAiHQobCgwIBBoIAAAAAgIAAAFqBwoFdGVzdDFyAgAA, into stdin, I get an error about an unexpected character, which leads me to believe that it is expecting GeoJSON. I would love some guidance on this.

I am also wondering if it is possible to stream the output to stdout in some structured way instead of writing to files/.mbtiles. From the documentation it looks like this may be much more complicated, or not feasible for me to write, but if there are any known methods to do this, that would be most appreciated.

Thanks!

e-n-f commented 5 years ago

My typical usage with streaming input is

echo '{"type":"Polygon","coordinates":[[[0,0],[0,1],[1,1],[1,0],[0,0]]]}' | tippecanoe -zg -e test

or, if I want a different layer name,

echo '{"type":"Polygon","coordinates":[[[0,0],[0,1],[1,1],[1,0],[0,0]]]}' | tippecanoe -zg -l layer -e test

There is currently no support for streaming Geobuf, because protozero's interface takes a string or a memory buffer, not a stream, so that it can expose submessages as slices of memory.

The error message you are seeing occurs because, if the attempt to memory-map the input fails, the regular JSON parser gets a chance to run on the input. This makes sense for memory-mapped JSON files vs. JSON streams, but not for Geobuf, so I should fix that.

It would certainly be possible to do streaming output of GeoJSON or something similar, replacing the call to mbtiles_write_tile/dir_write_tile in tile.cpp with something that writes to a stream instead. What sort of format would be useful here?

dylrich commented 5 years ago

Thanks for the response! I am not sure what I did wrong originally with streaming in geojson, I'm sure I made some typo and "fixed" it when I tried with the layer json. Will use your method.

Good to know on the geobuf - not a deal breaker at all for me and GeoJSON is of course usable.

For streaming output I basically just want to pipe the pbf tippecanoe produces, along with its XYZ, into a different program which will directly work with the data there and finally write the output to my own database instead. So a JSON with x, y, z, and pbf would work for me. Not sure how you'd handle the tileset metadata, though.

e-n-f commented 5 years ago

Maybe it would be reasonable to write out the tiles in tar format, since that is meant to be a stream? The metadata would be in metadata.json, just like if you are writing tiles to a directory instead of to an mbtiles file.

dylrich commented 5 years ago

tar would certainly work for my purposes!

stdmn commented 5 years ago

@ericfischer Quick follow-up on this:

I'm trying to use AWS Lambda to update a tile directory on S3 in realtime (related: #776, #741) from a Postgres database. I've created an AWS Lambda Layer (arn:aws:lambda:us-east-1:003014164496:layer:Mapbox_Tippecanoe-1_34_3:9; pull request coming shortly to add it to the README) and have been able to successfully run Tippecanoe using Lambda's /tmp folder. I then read the temporary mvt file and push it to S3.

I can foresee this approach being problematic for very large files. What I'd prefer is a way to output each tile to stdout as it is generated, so that I could manually push it to S3. Would this be feasible? Even if the output was the entire directory, I could recursively push each file to S3, but I can't think of another way to skip the /tmp folder step.

Example of current (not ideal) process:

var path = require('path');
var exec = require('child_process').exec;
var aws = require('aws-sdk');
var s3 = new aws.S3();
var fs = require('fs');

exports.handler = function(event, context, callback) {
  var exePath = path.resolve(__dirname, '');

  function processFile(content) {
    var params = {
      Body: content,
      Bucket: <BUCKETNAME>,
      Key: 'out.mbtiles'
    };
    s3.putObject(params, function(err, data) {
      // Signal completion only after the upload finishes, so Lambda
      // does not freeze the process while the upload is in flight.
      if (err) callback(err);
      else callback(null, data);
    });
  }

  exec(
    `/opt/bin/tippecanoe -o /tmp/out.mbtiles -zg --drop-densest-as-needed ./input.geojson`,
    { env: process.env, cwd: exePath },
    (error, stdout, stderr) => {
      if (error) {
        return callback(error);
      }

      // Read the temporary mbtiles file back and upload it to S3.
      fs.readFile('/tmp/out.mbtiles', function read(err, data) {
        if (err) {
          return callback(err);
        }
        processFile(data);
      });
    }
  );
};
e-n-f commented 5 years ago

If you just want to write each tile to stdout in a way that you can read back in from something further down the pipeline in a streaming way, the easiest thing would be to replace mbtiles_write_tile in mbtiles.cpp with something like this:

void mbtiles_write_tile(sqlite3 *outdb, int z, int tx, int ty, const char *data, int size) {
        printf("%d %d %d ", z, tx, ty);
        for (int i = 0; i < size; i++) {
                printf("%02x", (unsigned char) data[i]);
        }
        printf("\n");
}

For each tile it will write out a line with the zoom, x, and y coordinates, and the hex-encoded content of the tile, which you can then decode back into the tile data in your reader.
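On the reading side, a line in this format can be decoded back into tile coordinates and raw tile bytes. A minimal sketch in Node (parseTileLine is a hypothetical helper name, not part of tippecanoe):

```javascript
// Decode one "z x y hexstring" line, as emitted by the patched
// mbtiles_write_tile above, into tile coordinates and raw tile bytes.
// parseTileLine is a hypothetical helper name, not part of tippecanoe.
function parseTileLine(line) {
  const parts = line.trim().split(' ');
  if (parts.length !== 4) return null; // not a well-formed tile line
  const [z, x, y] = parts.slice(0, 3).map(Number);
  const data = Buffer.from(parts[3], 'hex'); // hex -> raw tile bytes
  return { z, x, y, data };
}
```

A consumer can then route each decoded tile to whatever storage it likes, keyed by z/x/y.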

Writing a tar file would be quite similar, except that the tar format contains a checksum, and I haven't looked up how to calculate the checksum.
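For reference, the ustar header checksum is simple: it is the byte sum of the whole 512-byte header with the 8-byte checksum field (at offset 148) treated as if it were filled with ASCII spaces, conventionally stored as six octal digits followed by NUL and space. A sketch in Node (hypothetical helper names, not tippecanoe code):

```javascript
// Compute the checksum of a 512-byte ustar header block. The 8-byte
// checksum field at offset 148 is counted as if it held ASCII spaces.
function tarChecksum(header) {
  let sum = 0;
  for (let i = 0; i < 512; i++) {
    sum += (i >= 148 && i < 156) ? 0x20 : header[i];
  }
  return sum;
}

// Store it in the conventional form: six octal digits, NUL, space.
function writeChecksum(header) {
  const sum = tarChecksum(header);
  header.write(sum.toString(8).padStart(6, '0') + '\0 ', 148, 'latin1');
  return header;
}
```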

stdmn commented 5 years ago

Awesome! I'll give it a go. If I can come up with a solid, fully Lambda-hosted solution, I'll make a note for future users who are hoping to go serverless with Tippecanoe.

stdmn commented 5 years ago

@ericfischer OK, I think I've got just about everything working, except that I'm trying to figure out the best way to send the hex-encoded data to S3 to create a PBF file. I'm using Node.js's child_process.spawn() with Tippecanoe and piping the output data. I then take the output string, split it by newline, and split each line by space. Three questions:

  1. I seem to be getting 2 hex strings when I output the data. The format is essentially: 'hexCode\n z x y hexCode'. What is the first hex code? Something to do with metadata?
  2. I'm trying to figure out the best way to send each streaming tile as a PBF file to S3. Should I use SQLite (as you do in the original mbtiles.cpp file) to convert and then send the tile?
  3. What code should I edit to allow for streaming tile output from tile-join?
    • As a reminder, the process I'm hoping to reproduce in Lambda is this example

Here's an example output array (from question 1, above, and split by \n for readability):

['c3366249d3da0ab6d45a376457459a633b5f6deb11526f853a78629d168558f35c0a96d4dfad54b399bcded99e0ed1c00fa3eb2f21f9345f2c48b8f4a7942272b70c665312ce6ee6d35910cdfdc507122c23f2d90faefd28f203ea22e7ab3ff183883ac89e4efc1b6a935f0f26358951df11b5c8df0eb550e7b6a6871f149e9eb7e7f123f510ba20179757efc89bf3338a903b09fdbbdbf1783c8c4eace1a561743aa66959b60d8003bac0051020d0031e78e63ceff6dd017c0187e025188123740c5ef54ebc538081fbfbf12d1dc1c1bfee7bcbeb1947b6d13f36916377fbff0198c58e182a020000', 
'12 1204 1542 1f8b08000000000002036590cd6ed34010c7edd8d9d81ba74ed2a64db780b62b84ca01413f4142a8588da111c601272d6a0f584ebc692cb976646f808a0b275ea02790403c011f27ae1ce105fa027d05eeb0767a4062a4917e33daf9cf7f167d01af0a506634652b6f30d2d28051d7f3fd84a629d2d9c998ba9394ba3e1dc43e45ca118ddcac898af1cb8826480963e67a09f5101c518f513f6775101f1f4fdb2a170aa6588e87c360402f8a84322f08a785124d8edd7ee81f2198d1308ce324450a8bd9240a588ad46ccb30892336459f8ed908a94192d0a349e86536bcc8775f7821d2f85006ae97a63e8227d44bdcfe2408d9057b21e3b6d5bce50eb3d1dc604eb9eb8ce4c93888901cf5475c2266239a4c7d22168cc774e045317587fcdc4992fd4cc4ffec842cc0c6eaeae606369cdeee9e831fb52d0b3b1da345aec1ab1ddbc4dd5ec739c037f013d3796cd8a6ddc3567bbf6d3fc44ff7f888e9903254bbbcb6cc07c663a2c38a6959ed2e365bcf0ca7d525eae29f2c6e7ebb4f24fc1b900216788aa4d810841fdb1c5778464482a2cd1f7cbc4764fcf5b3c8f1bbce67dfbfcbe2d736c7f53cceb78906e126dedcba731bafae6d1008951dc7383c585e5e267558fdef0a02f1213fa3851db3db6e71fb6dc3ba8bed4e0ff36b768d5ecfb08902c1beb163d83d02a0dcda31badcc4dbd3023736c78dfc143377079c4e5fffbb7cfd56dd6a4af52d4110c542419264b95804a05454800a60a9ac68a58a3a03f57255ab55ea33b3604e6f54e7c142ad595f0408281fced788a68a673a1241591244bdfe3c579b15e70a0d695e5ee062a004723550061aa834671675545daa2dd5015703b9daa50bb54f5ced8a4acef4beda2c0a825493af03e17253980182280bf31ad0f5bfe11677c732030000' ]

It would be amazing if we could get a new potential option, similar to -o and -e, that would stream the tiles so that they could easily be sent to a remote server. Something like --output-to-stream

e-n-f commented 5 years ago

It's not intentional that there are two hex strings in the line, so maybe there is some multithreaded locking problem going on. The format of each line is intended to be just

zoom x y hexstring

Do the hex strings decode to valid tiles for you?

The use of sqlite is just to package the tiles into mbtiles format. If you are not using mbtiles, you could use the aws s3 cp command or something like that to copy the file for the tile to s3.

Tile-join also calls mbtiles_write_tile, so this change will make it also write to the standard output instead of creating a tileset.

I agree that an --output-to-stream option would be the right way to do this instead of editing the code. If this issue results in a generally-useful output format that many programs will be able to take advantage of, I will turn it into a real option.

stdmn commented 5 years ago

It's not intentional that there are two hex strings in the line, so maybe there is some multithreaded locking problem going on

It looks like every time I run it, I receive 2 additional hex codes (unrelated to the actual tiles).

Do the hex strings decode to valid tiles for you?

Seems to be working. I ended up converting the hex code to a buffer using Javascript's Buffer.from() command.

Is there an easy way to also output the metadata?

I agree that an --output-to-stream option would be the right way to do this instead of editing the code. If this issue results in a generally-useful output format that many programs will be able to take advantage of, I will turn it into a real option.

It seems to me that this would be a nice agnostic way to push to any sort of database (see #87, #751, #741, et al.). Especially as the world moves increasingly towards serverless, I think this will be an invaluable option for many users.

In my mind, the best format would be to output each tile as you have above (z x y hex; alternatively, hex could be a buffer) and also output the metadata. My guess is that this would be sufficient for most users.

For reference, here's an example of my finalized code for creating tiles in S3 using Lambda, driven by a separate S3 event trigger. The function also uses a Lambda Layer (arn:aws:lambda:us-east-1:003014164496:layer:Mapbox_Tippecanoe-1_34_3:9) generated from the modified code above. Note that this is fully streaming on both input and output:

var path = require('path');
var aws = require('aws-sdk');
var s3 = new aws.S3();
var spawn = require('child_process').spawn;

exports.handler = function(event, context, callback) {
  var srcBucket = <BUCKETNAME>;
  var srcKey = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, ' '));

  function processFile(buf, zxy) {
    const prefix  = 'main/test';
    const filePath = `${path.join(prefix, ...zxy)}.pbf`;
    const content = Buffer.from(buf, 'hex');
    var params = {
      Body: content,
      Bucket: <BUCKETNAME>,
      Key: filePath,
      ContentEncoding: 'gzip'
    };
    s3.putObject(params, function(err, data) {
      if (err) console.log(err, err.stack);
      else console.log(data);
    });
  }

  const tippecanoe = '/opt/bin/tippecanoe';
  const tippArgs = '-f -z14 -l test -o /tmp/test.mbtiles'.split(' ');
  const tipp = spawn(tippecanoe, tippArgs, { shell: true });

  tipp.stdout.on('data', data => {
    var arr = data.toString().split('\n');
    arr.forEach(string => {
      var output = string.split(' ');
      if (output.length > 1) {
        const buf = output.pop();
        processFile(buf, output);
      }
    });
  });

  tipp.stderr.on('data', data => {
    // pass; *NOTE: this is where the progress prints out, 
    // so if you want to capture that in a Cloudwatch log, 
    // you should throw a console.log here
  });

  tipp.on('close', code => {
    if (code !== 0) {
      console.log(`tipp process exited with code ${code}`);
    }
  });

  s3.getObject({
    Bucket: srcBucket,
    Key: srcKey
  })
    .createReadStream()
    .pipe(tipp.stdin);
};
stdmn commented 5 years ago

One more specific use case where --output-to-stream would be useful:

If I try to recreate this on AWS Lambda, I have to choose whether to use the original output to .mbtiles or the updated code above for streaming.

My ideal workflow would be:

stdmn commented 5 years ago

@ericfischer Quick question: re solution above:

How do I generate metadata? Based on my ignorant glance at the code, it looks like the metadata creation step requires a sqlite database to be created (which the above code skips). Is that correct? Any tips on a succinct way to stream the output while also creating a metadata file?

e-n-f commented 5 years ago

It really should have been somewhere in dirtiles.cpp, but there is code to write the metadata as JSON in mbtiles.cpp:

https://github.com/mapbox/tippecanoe/blob/d96b521570dd9af522349d753368e0faecbb4243/mbtiles.cpp#L493

In this situation it writes the metadata into a temporary sqlite database, allocated here:

https://github.com/mapbox/tippecanoe/blob/d96b521570dd9af522349d753368e0faecbb4243/mbtiles.cpp#L277

and then copies the metadata from that table into the metadata.json file.

stdmn commented 5 years ago

Gotcha. So in order to output the metadata to a stream, I'll need to get rid of if (outdir != NULL) and add an echo? Sorry again for the basic question--not really up to speed on C++.

EDIT: I suppose I could also just output to the Lambda tmp folder and then copy from there as well.

e-n-f commented 5 years ago

Sorry it's not clearer, but yes: getting rid of the if (outdir != NULL) and then changing the code inside to write the metadata to wherever you actually want it, instead of to the fp that is opened on metadata.json, is the way to do it.

Were you able to get straightened out whatever was causing the extra hex codes?

stdmn commented 5 years ago

Were you able to get straightened out whatever was causing the extra hex codes?

I wasn't but it doesn't seem to be a problem. I was able to get my serverless workflow working (although a bit hack-y but doesn't seem to be a problem so far) which was the main goal.

Thanks for the help. I'll see if I can figure out how to stream the metadata.

e-n-f commented 5 years ago

Great, I hope it works!

I think I'm going to go ahead and put up a branch that writes to tar format (including the metadata), since that seems like it might be a useful generalization even if it's not quite what you want.

e-n-f commented 5 years ago

https://github.com/mapbox/tippecanoe/pull/789 adds the --output-to-tar option, which I hope will also provide a good example of how to handle other streaming output types.
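To consume such a tar stream on the Node side without a dependency, a minimal reader only needs two header fields: the entry name (offset 0, 100 bytes) and the octal size (offset 124, 12 bytes); each 512-byte header is followed by the entry data padded to a 512-byte boundary, and the archive ends with zero blocks. A rough sketch (readTarEntries is a hypothetical helper, not validated against #789's exact output):

```javascript
// Walk a complete tar archive held in a Buffer and return its entries.
// Only the fields needed here are read: name (offset 0, 100 bytes)
// and size (offset 124, 12 bytes, octal). readTarEntries is a
// hypothetical helper name.
function readTarEntries(tar) {
  const entries = [];
  let off = 0;
  while (off + 512 <= tar.length && tar[off] !== 0) { // zero block ends archive
    const name = tar.toString('latin1', off, off + 100).replace(/\0.*$/, '');
    const size = parseInt(tar.toString('latin1', off + 124, off + 136), 8);
    const data = tar.slice(off + 512, off + 512 + size);
    entries.push({ name, data });
    off += 512 + Math.ceil(size / 512) * 512; // data is padded to 512 bytes
  }
  return entries;
}
```

For a streaming consumer the same logic would be applied incrementally as chunks arrive, rather than on a complete Buffer.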

tolgaozkan commented 5 years ago

@stdmn I wanted to try your Tippecanoe Lambda layer, following your README. I am receiving a "You are not authorized to perform: lambda:GetLayerVersion on resource: arn:aws:lambda:us-east-1:003014164496:layer:Mapbox_Tippecanoe-1_34_3:3" error. I think you need to add these permissions to your Lambda layer. Also, is this the latest stable version? I have seen different versions mentioned earlier in this topic.

stdmn commented 5 years ago

Try the new version: arn:aws:lambda:us-east-1:003014164496:layer:Mapbox_Tippecanoe-1_34_3:9

Quick note: I've frankensteined together a version that includes two new commands, tippecanoe-stream and tile-join-stream, which stream the output instead of writing to files. You can still use tippecanoe and tile-join if you want to write to files.

Quick note 2: This layer works on

tolgaozkan commented 5 years ago

Unfortunately, that version didn't work either, but I managed to set it up myself. You probably need to assign additional policies to make it publicly available. Anyway, thank you for your response.

kylebarron commented 4 years ago

In case anyone else is interested, I made a Dockerfile setup for creating a Tippecanoe Lambda layer here: https://github.com/kylebarron/tippecanoe-lambda. I also published the layer to a few U.S. regions and I believe made it public.