mbloch / mapshaper

Tools for editing Shapefile, GeoJSON, TopoJSON and CSV files
http://mapshaper.org

[feature request] support gz compressed GeoJSONs #522

Closed: indus closed this issue 1 year ago

indus commented 2 years ago

It would be nice to support gz-compressed GeoJSONs for input and output for bigger datasets. Using streams, it could look something like this:

const fs = require('fs');
const zlib = require('zlib');

let outStream;
if (this.options.gz) {
    // Compress on the fly and pipe the gzip stream into a .gz file
    outStream = zlib.createGzip();
    outStream.on('error', (err) => console.log(err.stack));
    const writeStream = fs.createWriteStream(`${file}.gz`);
    outStream.pipe(writeStream);
} else {
    outStream = fs.createWriteStream(file);
}


chapmanjacobd commented 2 years ago

It would probably be more useful to support FlatGeobuf ~which can be a lot smaller~ than geojson.gz

edit: FlatGeobuf has no built-in compression, but it has a built-in spatial index and it streams well, so performance is likely better in many cases. Another nice thing about FlatGeobuf is that it forces consistency with geometry types.

http://switchfromshapefile.org/
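
As a point of comparison, GDAL can produce a FlatGeobuf from GeoJSON on the command line. A minimal sketch, assuming GDAL 3.1+ (which ships the FlatGeobuf driver); the file names are placeholders:

# Convert GeoJSON to FlatGeobuf (hypothetical file names)
ogr2ogr -f FlatGeobuf counties.fgb counties.geojson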

indus commented 2 years ago

The tool I feed the data into doesn't support FlatGeobuf, but it does support gzipped JSONs. So "more useful" depends.

indus commented 2 years ago

I've changed my mind. tippecanoe now supports FlatGeobuf input. So now this would be useful for me as well ;-)
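
As a sketch of the resulting workflow, a FlatGeobuf could then be fed straight to tippecanoe. This assumes a tippecanoe version with FlatGeobuf input support, as mentioned above; the file names are placeholders:

# -zg guesses an appropriate max zoom; -o names the output tileset
tippecanoe -zg -o counties.mbtiles counties.fgb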

ThomasG77 commented 2 years ago

A demo of how to do it using only the command line. It may not fit your use case, as it isn't inside the mapshaper code itself and needs a Unix-like system:

# Get data
wget https://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_admin_2_counties.zip
# uncompress
unzip ne_10m_admin_2_counties.zip
# Convert to GeoJSON
ogr2ogr -f GeoJSON ne_10m_admin_2_counties.geojson ne_10m_admin_2_counties.shp -lco WRITE_NAME=NO -lco RFC7946=YES
# Compress to gz and keep original geojson
gzip -k ne_10m_admin_2_counties.geojson
# Uncompress the gzipped GeoJSON on the fly; use the - arg for input and
# output to read from stdin and write to stdout, then compress the result
zcat ne_10m_admin_2_counties.geojson.gz \
        | mapshaper -i - -filter '"ME,VT,NH,MA,CT,RI".indexOf(REGION) > -1' -o - format=geojson \
        | gzip -c > filtered.geojson.gz

# Note that you can use GDAL to uncompress the gz instead, e.g. in place
# of the "zcat ne_10m_admin_2_counties.geojson.gz" part:
# ogr2ogr -f GeoJSON /vsistdout/ /vsigzip/ne_10m_admin_2_counties.geojson.gz ne_10m_admin_2_counties.geojson
indus commented 2 years ago

@ThomasG77 I tried something similar on Windows but it didn't work for me. I was unable to get an output from mapshaper bigger than 2GB. Gzip would bring down the size of the final file, but it hasn't helped with mapshaper's limits.

So I've tried to split the output like so:

mapshaper-xl 12GB ./veryLarge_OSM_extract.shp `
<# some processing #>
-each "part = this.id % 5" `
-split part `
-o "./outpath/" extension=.ndjson format=geojson ndjson

This splits the output into multiple parts that can be concatenated easily (see the sketch below). But this doesn't work with pipes at all, as the output gets closed after every file part, which stops the pipe. Another drawback is that you have to add a property to split on. It would be nice if split would accept an expression (like the one shown for each) on its own. But that's not a big issue for me, as I can strip the temporary property later on.
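
Since ndjson output is line-delimited, the parts can be stitched together with plain shell tools. A minimal sketch, assuming the split layers end up as .ndjson files in ./outpath/:

# Concatenate the split parts and gzip the combined stream
cat ./outpath/*.ndjson | gzip > merged.ndjson.gz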

indus commented 1 year ago

I just found out that the split function actually supports expressions. So what it boils down to is this:

mapshaper-xl 12GB ./veryLarge_OSM_extract.shp \
# some processing #
-split "this.id % 5" \
-o "./outpath/" extension=.ndjson format=geojson ndjson

I have added a PR to make this clearer in the REFERENCE.

indus commented 1 year ago

@mbloch I was working on GZ support for an hour today before I realized that you implemented this just yesterday 😅 Thank you!

mbloch commented 1 year ago

@indus I added GZ support in the simplest possible way, and there is room for improvement. The web interface doesn't support .gz files, and mapshaper uncompresses the entire file into memory, which limits the uncompressed file size to ~2GB (I think, typically). Mapshaper is able to read uncompressed CSV and JSON files incrementally, which means you can load larger files if they are uncompressed.
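
Until the .gz path reads incrementally, one workaround is to decompress to disk first and let mapshaper stream the uncompressed file. A sketch with placeholder file names:

# Decompress up front so mapshaper can read the JSON incrementally
zcat big.geojson.gz > big.geojson
mapshaper big.geojson -o processed.geojson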

indus commented 1 year ago

I've just seen these drawbacks as well. For writing, I find it a huge improvement to have an option to gzip the output. Reading is actually limited to 512 MB: Buffer.toString() throws an error when the string gets bigger.

indus commented 1 year ago

Here is a description of the error:
https://cmdcolin.github.io/posts/2021-10-30-spooky
https://stackoverflow.com/questions/68230031/cannot-create-a-string-longer-than-0x1fffffe8-characters-in-json-parse
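
The limits in question are exposed by Node's buffer module, so they can be checked directly; the exact values depend on the Node/V8 build:

node -p "require('buffer').constants.MAX_STRING_LENGTH"  # 0x1fffffe8 chars, ~512 MB
node -p "require('buffer').constants.MAX_LENGTH"         # maximum Buffer size in bytes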

mbloch commented 1 year ago

Of course you're right, it's 512 MB... I'll look into increasing that limit.

mbloch commented 1 year ago

I published an update that increases the maximum uncompressed size of gzipped GeoJSON and CSV files. The new maximum should be around 2GB (the max size of a Buffer in most environments, the last time I checked). After the update, I was able to import a gzipped 1.82GB GeoJSON file.
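
With the update, a gzipped file should be importable directly; a sketch with a placeholder file name:

mapshaper-xl big.geojson.gz -o out.geojson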