project-open-data / project-open-data.github.io

Open Data Policy — Managing Information as an Asset
https://project-open-data.cio.gov/

Add Best Practices for Geospatial Data Standards (KML, KMZ files) #364

Jelfff opened this issue 9 years ago

Jelfff commented 9 years ago

I am an online map developer working with the Google map API and am outside of government looking in.

Lots of agencies produce KML/KMZ files. A well-prepared KML/KMZ file is easy to display with either Plain Old Google Maps (POGM) or the Google map API. However, far too often the story that these files would tell remains untold when they fail to display on Google maps.

Suggestion: There should be a policy someplace saying that unless the 3D features of Google Earth are really needed in order to properly present the data, KML/KMZ files should comply with the 2D Google Maps specifications Google has published at https://developers.google.com/kml/documentation/mapsSupport and https://developers.google.com/kml/documentation/kmlelementsinmaps.

Note: Per a Google engineer, the “Maximum number of total document-wide features” is much higher than 1,000 for the Google map API.

In my experience the most common reason that a KML/KMZ fails to display on Google maps is because the file is full of what I refer to as “junk” KML tags. I define a junk KML tag as one that (1) is 3D-specific and (2) is not needed to convey the data’s story to the user. These junk tags often break the KML parser used by Google maps (despite Google’s claim that such KML tags are ignored by its parser).
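Stripping those 3D-only elements before publishing can be scripted. Below is a minimal sketch using Python's standard library; the tag list is illustrative (my own pick of 3D-specific elements), not an official list of what breaks Google's parser:

```python
import xml.etree.ElementTree as ET

# Illustrative examples of 3D-specific KML elements; adjust for your files.
JUNK_TAGS = {"altitudeMode", "extrude", "tessellate", "Camera", "LookAt"}

def strip_junk_tags(kml_text):
    """Return KML text with the listed 3D-specific elements removed."""
    # Register the default KML namespace so the output serializes cleanly.
    ET.register_namespace("", "http://www.opengis.net/kml/2.2")
    root = ET.fromstring(kml_text)
    for parent in root.iter():
        for child in list(parent):
            # Local tag name without the {namespace} prefix.
            local = child.tag.split("}")[-1]
            if local in JUNK_TAGS:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")
```

A pass like this keeps the placemarks and coordinates intact while dropping the elements a 2D map viewer never needs.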

Another common problem is KML/KMZ files that exceed Google’s size limits. (See above links.) I have been successful at solving this problem by (1) breaking a large dataset into smaller KML/KMZ files, (2) hosting these files on a reasonably fast server, (3) making a small KML file that uses a NetworkLink tag for each data file and (4) displaying the small KML file.
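The index file in steps (3)–(4) can be generated with a few lines of code. A rough sketch (the element layout follows the KML 2.2 NetworkLink/Link/href structure; the names and URLs are placeholders):

```python
def make_networklink_kml(name, urls):
    """Build a small index KML whose NetworkLink entries point at the
    split-out data files. `urls` is a list of hosted KML/KMZ links."""
    links = "\n".join(
        "    <NetworkLink>\n"
        "      <name>Part {}</name>\n"
        "      <Link><href>{}</href></Link>\n"
        "    </NetworkLink>".format(i + 1, url)
        for i, url in enumerate(urls)
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<kml xmlns="http://www.opengis.net/kml/2.2">\n'
        "  <Document>\n"
        "    <name>{}</name>\n{}\n"
        "  </Document>\n"
        "</kml>".format(name, links)
    )
```

You would then host the generated index file and hand that single URL to the map viewer; Google’s parser fetches each part on demand.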

It is super quick and easy to test a KML/KMZ to see if Google maps can display it. Just put the file online and paste its URL over the underline in the following: https://www.google.com/maps?output=classic&q=______________

And just to show you that the above technique does, in fact, work, here is a KMZ file from the fish and wildlife folks that displays fine. https://www.google.com/maps?output=classic&q=http://www.fws.gov/gis/data/national/FWS_CMT_locations.kmz

However, if you use POGM to test your files then you are not going to see the actual error message that the KML parser sends to POGM when something goes awry with your file. To solve that problem people can use the Gmap4 enhanced Google map viewer I developed. When Google’s KML parser sends an error message (which is in all caps) Gmap4 shows it to you. Below is the syntax for using Gmap4. Put your KML/KMZ file online and replace the underline with the link to your file. http://www.mappingsupport.com/p/gmap4.php?q=___________

Suggestion: There should also be a policy to require that KML/KMZ files pass a syntax validation check. There is a free online validation tool at http://kmlvalidator.com/home.htm. But note that some files will be too large for this validation tool.
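Short of a full schema check like the one kmlvalidator performs, a basic XML well-formedness test already catches many broken files, and it works locally on files of any size. A minimal sketch in Python (note this is a strictly weaker check than true KML validation):

```python
import xml.etree.ElementTree as ET

def check_kml_syntax(kml_text):
    """Return (ok, message). Only checks XML well-formedness, which is
    weaker than validating against the KML schema, but catches many
    broken files."""
    try:
        ET.fromstring(kml_text)
        return True, "well-formed"
    except ET.ParseError as err:
        return False, str(err)
```

Running something like this in a publishing pipeline would flag unclosed tags and encoding errors before a file ever reaches the public.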

Suggestion: There needs to be a policy telling everyone to not zip KMZ files! The definition of a KMZ file is a KML file that has been compressed. I make my KMZ files by compressing the KML into a zip file and then changing the extension from zip to kmz. For example, the following is silliness: http://www.epa.gov/waters/data/beach_act_kmz.zip
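The zip-then-rename step is easy to automate, which also avoids the double-zipping mistake entirely. A sketch using Python's zipfile module (naming the archive entry doc.kml follows common KMZ convention):

```python
import zipfile

def kml_to_kmz(kml_path, kmz_path):
    """Package a KML file as a KMZ: a zip archive whose main entry is
    conventionally named doc.kml. Do NOT zip the resulting .kmz again."""
    with zipfile.ZipFile(kmz_path, "w", zipfile.ZIP_DEFLATED) as z:
        z.write(kml_path, arcname="doc.kml")
```

The output is already compressed; wrapping it in another zip only forces every consumer to unpack it by hand before a map viewer can use it.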

Below are just a few examples of KMZ files that do not display or do not display correctly with the Google map API for various reasons. Many more examples can be found on federal sites.

Too big: http://www.mappingsupport.com/p/gmap4.php?q=http://www.srh.noaa.gov/gis/kml/hurricanetrack/Atlantic%20Hurricanes.kmz

Invalid file: http://www.mappingsupport.com/p/gmap4.php?q=http://www.epa.gov/enviro/html/frs_demo/geospatial_data/region_02.kmz

Map legend repeats: http://www.mappingsupport.com/p/gmap4.php?q=http://maritimeboundaries.noaa.gov/downloads/USMaritimeLimitsAndBoundariesKML.kmz

Joseph Elfelt

nsinai commented 9 years ago

Joseph, thanks for the input! Where do GeoJSON files fit into your thinking?

Jelfff commented 9 years ago

Time for me to 'fess up that I know next to nothing about GeoJSON, for the simple reason that my work developing online maps is based on the Google Maps API, which does not have any built-in ability to display GeoJSON.

Edit: Oops! My mistake. The Google Maps API does support GeoJSON. But I have never had any need to make use of this feature, so I cannot shed any light on how well (or poorly) federally produced GeoJSON files can be imported by that API.

Joseph

nsinai commented 9 years ago

Got it. Thanks!

rebeccawilliams commented 8 years ago

@Jelfff FGDC is referenced here: https://project-open-data.cio.gov/open-standards/. While open data certainly includes geospatial data, I think the policy goals you have listed above probably make the most sense via OMB A-16 (https://www.fgdc.gov/policyandplanning/a-16) and FGDC (and of course reporting issues via Data.gov).

Do you think the open formats for geospatial data listed at http://simpleopendata.com/ would be a good reference to add to Project Open Data?

Jelfff commented 8 years ago

Thanks for the ideas Rebecca. I looked at OMB A-16 which then referred me to OMB-something-else which put me to sleep. Sorry.

On the other hand your link to http://simpleopendata.com/ was great! Here is my post for those folks: https://github.com/tmcw/simpleopendata/issues/21 My goal to educate people about the vast amount of open federal data available today via ArcGIS services seems to fit perfectly with the message on the simpleopendata page.

Joseph

rebeccawilliams commented 8 years ago

:+1: Renaming this issue Add Best Practices for Geospatial Data Standards.

Jelfff commented 8 years ago

Great rename. Thanks.

JJediny commented 8 years ago

The Geospatial Data Abstraction Library (GDAL) is a conversion library for both raster- and vector-based geospatial data. In all there are over 142 drivers for raster formats and 84 drivers for vector formats.

Raster formats (i.e. Images or Grids)

For most general-use raster or image formats (i.e. not 4- or 5-dimensional datasets), GeoTIFF can handle nearly all needs. However, for high-resolution imagery GeoTIFFs can become extremely large (e.g. 6–9 GB for 30-meter resolution of CONUS), so most larger raster files are published as Erdas Imagine, which translates and stores imagery data as numeric grids. So I'd say raster isn't the main issue, but GeoTIFF should be the default...

Vector formats (i.e. Geometry - Points/Lines/Polygons and their Attributes)

Vector files are not nearly as straightforward because of the pros/cons unique to each format. For "lightweight" data (i.e. less than ~250 MB) the most versatile vector format is by far and away GeoJSON (but there are still concerns with it). For "medium weight" data (250 MB up to 1 GB), Shapefiles, KML, GPX, and CSV are the most commonly used publishing formats; these are tabular or XML-based formats that can be read/written at these sizes (note: reference each format's pros and cons in the future). "Heavyweight" datasets, or collections of datasets, are currently published mostly as Esri Geodatabase or SQLite/GeoPackage; the volume of data handled in these cases calls for a true "database" with indexing, which lets a program access/update portions of the data rather than crashing when it tries to load the whole thing. However, while there are conversion libraries for Esri Geodatabase, the format/spec is proprietary and therefore should not be used to publish open data. Shapefiles become unstable at 2 GB, but in reality they can still be used up to 4 GB. GeoPackage is being worked on with the hope of replacing Shapefiles; it is a customized SQLite database derived from SpatiaLite that allows an SQLite DB to store both raster and vector data as a single collection.
http://www.geopackage.org/

GeoJSON

GeoJSON is unique because it can be accessed through an API or as a single static file. Moreover, unlike most other formats, it can store multi-geometry collections: in layman's terms, you can store multiple points, lines, and polygons together in a single file, and each entry can have its own attributes independently. This versatility is a double-edged sword, as it can make data quality/conformity a nightmare if you assume each data entry is comparable to any other; with variable attributes and geometries, that would be a bad assumption.
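A minimal illustration of that flexibility: a FeatureCollection mixing geometry types, where each feature carries its own unrelated properties (the names and values here are invented):

```python
import json

# A FeatureCollection mixing a Point and a LineString; note the two
# features have completely different property keys.
collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [-77.0, 38.9]},
            "properties": {"name": "station", "capacity": 12},
        },
        {
            "type": "Feature",
            "geometry": {
                "type": "LineString",
                "coordinates": [[-77.0, 38.9], [-77.1, 39.0]],
            },
            "properties": {"route": "A"},  # different keys than the Point
        },
    ],
}

geojson_text = json.dumps(collection)
```

A consumer that assumes every feature shares one schema and one geometry type will choke on a file like this, which is exactly the conformity problem described above.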

Shapefile

For comparison, the most commonly used format to this day is the Shapefile, which is not a single file but a collection of files that cumulatively store the data: geometry (.shp), attributes (.dbf), projection (.prj), and a shape index (.shx). You don't have a Shapefile without those four files. All Shapefiles should be distributed as WGS 84, or EPSG:4326, the de facto standard projection, moving away from the legacy datum-based projections agencies still struggle with. Web Mercator is the standard for web map services like Google, Bing, and Mapbox, but it is not a good projection for modeling/analysis, and most web map clients assume the data will be WGS 84. High-accuracy datasets should be published twice: once for ease of use in WGS 84 and once for modeling/analysis in Albers Equal Area.

JJediny commented 8 years ago

http://www.digitalpreservation.gov/formats/fdd/gis_fdd.shtml http://www.gdal.org/ogr_formats.html http://www.gdal.org/formats_list.html

Online Vector Conversion Tool (using GDAL ogr2ogr): http://ogre.adc4gis.com/

akuckartz commented 8 years ago

Please do not ignore WKT.

migurski commented 8 years ago

The flexibility of GeoJSON is not necessarily helpful. Often, users will need to translate into other formats, and concerns like field name length (limited in shapefile to 10 chars) will gate what they can do. In my experience, zipped shapefiles are typically the most translatable alternative: everything reads and writes them, and will continue to for at least another decade. Character set will be a challenge; UTF-8 is best but ESRI’s default assumption in shapefile is Win-1252.
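To make the 10-character gate concrete, here is a toy truncation routine in the spirit of what converters do. The collision-suffix rule is my own assumption for illustration; real tools differ in the details:

```python
def truncate_dbf_names(names, limit=10):
    """Illustrate DBF/shapefile field-name truncation: names are cut to
    `limit` characters and collisions are disambiguated with a numeric
    suffix. The suffix scheme is a hypothetical example, not the exact
    behavior of any particular converter."""
    out, seen = [], set()
    for name in names:
        short = name[:limit]
        i = 1
        while short in seen:  # two long names can truncate identically
            suffix = str(i)
            short = name[: limit - len(suffix)] + suffix
            i += 1
        seen.add(short)
        out.append(short)
    return out
```

Two descriptive names like "population_density" and "population_total" collapse into near-identical 10-character stubs, which is why translated attributes so often need a data dictionary to decipher.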

migurski commented 8 years ago

For GeoTIFF, Schuyler and Chris Schmidt showed that JPEG encoding, YCbCr color space, and WGS 84 projection made for an excellent set of space and portability trade-offs with aerial imagery.

For other kinds of data like elevation, best to avoid lossy compression.

JJediny commented 8 years ago

@akuckartz While I agree Well-Known Text (WKT) has a lot of utility, it only stores geometries (POINT/LINE/POLYGON), not the data (i.e. attributes) associated with those geometries. WKT is still commonly used for map widgets that draw an area of interest or bounding box to query/filter/process. However, its use is fading as GeoJSON allows for this kind of geometry "creation" from scratch; see GeoJSON.io as an example.

@migurski Spot on :+1: Conversion between vector formats is a big concern, as you can lose a lot of usability. For example, Shapefiles limit attribute names to 10 characters, which leaves attributes needing to be deciphered. At the same time, Shapefiles are advanced in that columns can have restrictions on what kind of data can be stored (string, number, integer, date) and the length of characters permitted. A bad example of this would be converting a Shapefile to a CSV file: you lose these constraints/validation. Going the other way, from CSV to Shapefile, you can cut off column names, thereby losing meaning.

But I agree Shapefiles still have the best record of stability through any workflow. It's not so much about which format is best, but sometimes about which format can maintain the data's integrity as it's passed along through the Extract-Transform-Load process. This is part of the reason we started using GeoNode, an open-source spatial data infrastructure, as it had the best approach: requiring all user uploads to be Shapefiles in WGS 84 for vector data and GeoTIFFs in WGS 84 for raster data. Once uploaded, however, that same data becomes consumable in over 10 vector download formats and as live OGC-compliant web services.

JJediny commented 8 years ago

@migurski Thanks for bringing up compression. When saving raster data you can often adjust the level of compression. While high compression can save space in transit in the form of MBs (dare I speculate GBs), the trade-off in the additional processing power needed to decompress the imagery on the fly can be counterproductive in the long term if that file is hosted on a geospatial server, where it will decrease rendering performance and the number of concurrent requests that can be handled.