nvkelso / natural-earth-vector

A global, public domain map dataset available at three scales and featuring tightly integrated vector and raster data.
https://www.naturalearthdata.com/

Maintain CSV files for attribute details instead of DBF #328

Open nickpeihl opened 4 years ago

nickpeihl commented 4 years ago

This issue was sparked by this comment.

Currently, attributes for (all?) shapefiles are stored in git as binary DBF files in the ./housekeeping directory. The attributes are joined to geometries to create shapefiles via mapshaper commands in the Makefile (example). Unfortunately, storing attributes in binary files makes it impossible to see diffs and thus much harder to QA pull requests.

So I propose storing the attribute data in CSV files which can be easily diffed and QA'd.

Unfortunately, unlike DBF files, CSV files do not store field types. When mapshaper joins a CSV file, it guesses each field's type from the data, which may result in unwanted field types in the output shapefiles (for example, Integer where String is appropriate).
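To illustrate the risk, here is a minimal sketch of naive type guessing on a CSV column. It is not mapshaper's actual inference logic; `looks_numeric`, `guess_field_types`, and the sample rows are made up for illustration. A code column that should stay a string is classified as a number, which would drop its leading zeros.

```python
import csv
import io

# Made-up sample rows; the field names and values are illustrative only.
SAMPLE = "ADM0_A3,ISO_N3,NAME\nAFG,004,Afghanistan\nALB,008,Albania\n"


def looks_numeric(values):
    """Return True if every non-empty value parses as a float."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return False
    try:
        for v in non_empty:
            float(v)
    except ValueError:
        return False
    return True


def guess_field_types(csv_text):
    """Guess 'num' or 'str' per column, the way a naive reader might."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    fields = list(rows[0].keys())
    return {
        f: ("num" if looks_numeric([r[f] for r in rows]) else "str")
        for f in fields
    }


guessed = guess_field_types(SAMPLE)
print(guessed)
```

Here ISO_N3 comes out as `num`, so "004" would be stored as 4 unless the field is explicitly forced to a string type.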

Mapshaper does have field-types and string-fields parameters on join, but they only support two types of fields: str (String) and num (Number).

GDAL has a concept of *.csvt files, which contain the OGRFieldType of each column in a CSV file. But mapshaper does not support csvt. The CSVT file can be created from the DBF file using GDAL (example: ogr2ogr -f CSV -lco CREATE_CSVT=YES ./housekeeping/ne_admin_0_details_level_5_disputed.csv ./housekeeping/ne_admin_0_details_level_5_disputed.dbf)
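For context, a .csvt sidecar is a one-line file that sits next to the CSV, shares its base name, and lists one quoted type per column. The sketch below writes and reads such a pair using only the Python standard library; the helper names, field names, and type widths are hypothetical, and it does not call GDAL.

```python
import csv
import os
import tempfile


def write_csv_with_csvt(csv_path, header, types, rows):
    """Write a CSV plus a sidecar .csvt with one quoted type per column."""
    if len(header) != len(types):
        raise ValueError("need exactly one type per column")
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    # "foo.csv" + "t" -> "foo.csvt", so the sidecar shares the base name.
    with open(csv_path + "t", "w", newline="") as f:
        csv.writer(f, quoting=csv.QUOTE_ALL).writerow(types)


def read_csvt(csv_path):
    """Return the list of type names from the sidecar .csvt."""
    with open(csv_path + "t", newline="") as f:
        return next(csv.reader(f))


# Usage with made-up fields: keep ISO_N3 a string so "004" survives.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "details.csv")
    write_csv_with_csvt(
        path,
        ["ADM0_A3", "ISO_N3", "POP_EST"],
        ["String(3)", "String(3)", "Integer"],
        [["AFG", "004", "1"]],
    )
    print(read_csvt(path))
```

Since both files are plain text, a type change shows up in a diff just like a value change.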

One possible solution is to create an intermediary DBF file using ogr2ogr from the CSV file, then use that intermediary DBF file in the mapshaper join command.
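A rough sketch of that two-step flow, written as Python that only builds the command lines without running them. The file paths, the join key names, and the helper functions are placeholders rather than the repository's actual Makefile targets, and the exact flags should be checked against the installed GDAL and mapshaper versions.

```python
import shlex


def csv_to_dbf_command(csv_path, dbf_path):
    """ogr2ogr call that converts a CSV (with its .csvt) to a DBF table."""
    return ["ogr2ogr", "-f", "ESRI Shapefile", dbf_path, csv_path]


def join_command(geometry_shp, dbf_path, target_key, source_key, out_shp):
    """mapshaper call that joins the intermediary DBF onto the geometry."""
    return [
        "mapshaper", geometry_shp,
        "-join", dbf_path, "keys={},{}".format(target_key, source_key),
        "-o", out_shp,
    ]


# Placeholder paths and key names, for illustration only.
steps = [
    csv_to_dbf_command("housekeeping/details.csv", "intermediate/details.dbf"),
    join_command(
        "geometry/countries.shp", "intermediate/details.dbf",
        "KEY_A", "KEY_B", "out/countries.shp",
    ),
]

for step in steps:
    # Print instead of executing; subprocess.run(step, check=True) would run it.
    print(" ".join(shlex.quote(part) for part in step))
```

With this arrangement only the CSV and CSVT files would need to live in git; the intermediary DBF is a build artifact.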

I've created a proof of concept and I'm happy to discuss further or create a PR.

nvkelso commented 4 years ago

Hi @nickpeihl, thanks for posting the POC! I like where this is headed, and thanks for the read into related Mapshaper workflow options.

It's slightly more complicated for the admin0 "level" files because they are themselves derived from an OpenOffice file that includes a few field calculations for backfilling "unknowns" :\

Another thought I've had is now that QGIS supports editing GeoJSON... The current GeoJSON exports could become master and the SHP could instead derive from them. I've had good success in Who's On First doing line-delimited properties (there's a generic Python exportify script that could be used as a starting point) and then smooshing all the geometry into a single line. So it's very easy in GitHub to look at the property diff, and the geometry diff can be viewed using their visual diff tool. We'd need to have a think about if the admin0 and admin0 properties could also be stored in GeoJSON or if the CSV approach you POC'd is better.
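A minimal sketch of that layout, assuming one property per line in sorted order and the whole geometry compacted onto a single line. This is not the Who's On First exportify script; `format_feature` and the sample feature are made up for illustration.

```python
import json


def format_feature(feature):
    """Serialize a GeoJSON Feature with diff-friendly line breaks."""
    props = feature.get("properties") or {}
    # One property per line, sorted by key so diffs stay stable.
    prop_lines = [
        "    " + json.dumps(key) + ": " + json.dumps(props[key], ensure_ascii=False)
        for key in sorted(props)
    ]
    # All geometry on a single line, with no extra whitespace.
    geometry = json.dumps(feature.get("geometry"), separators=(",", ":"))
    lines = [
        "{",
        '  "type": "Feature",',
        '  "properties": {',
        ",\n".join(prop_lines),
        "  },",
        '  "geometry": ' + geometry,
        "}",
    ]
    return "\n".join(lines) + "\n"


# Made-up sample feature; the values are illustrative only.
sample = {
    "type": "Feature",
    "properties": {"NAME": "Example", "POP_EST": 1},
    "geometry": {"type": "Point", "coordinates": [-19.0, 65.0]},
}

text = format_feature(sample)
print(text)
```

Changing one attribute then touches exactly one line of the file, while any geometry edit is confined to the single geometry line.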

I'm in the midst of some COVID changes, so I likely won't have a chance to look at this in depth for another week.