opengeospatial / geoparquet

Specification for storing geospatial vector data (point, line, polygon) in Parquet
https://geoparquet.org
Apache License 2.0

geoparquet coordinate reference system default / format / extension #3

Closed · cholmes closed this issue 2 years ago

cholmes commented 2 years ago

There are a lot of options for how we approach coordinate reference systems, and beyond the ones I'll touch on below there are obviously more.

My general take is that we should have a default and expect most data to use it, but we should specify it in a way that could become an extension point in the future. So rather than just saying 'everything is 4326' in the spec, we should have a field that is always 4326 for the core spec but could take other values later.

So I think we do the first version with just 4326, and then when people ask for more we can have an extension.

One thing I'm not sure about is whether we should use 'epsg' as the field name. EPSG covers most projections people want, but not all. GeoPackage creates a whole SRID table to refer to, so the SRIDs used are defined in the file; usually the full EPSG database is included, but users can add other options.

One potential option would be to follow OGC API - Features and use URIs. I'm not sure how widely accepted that approach is, e.g. whether the full EPSG database is already referenced online. In that case, instead of 'epsg' as the field we'd have 'crs', and its value would be a string URI.
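To make that concrete, here is a rough sketch (as a Python dict, purely illustrative; none of these key names are settled) of the two shapes I'm describing:

```python
# Purely illustrative sketches of the two metadata shapes discussed above.
# The key names and layout are hypothetical, not part of any spec draft.

# Option A: an integer EPSG code field, with the core spec pinning it to 4326.
geo_metadata_epsg = {
    "geometry_columns": {"geometry": {"epsg": 4326}},
}

# Option B: following OGC API - Features, a 'crs' field holding a URI string.
geo_metadata_uri = {
    "geometry_columns": {
        "geometry": {"crs": "http://www.opengis.net/def/crs/EPSG/0/4326"},
    },
}
```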

cholmes commented 2 years ago

Some progress on this - I like the geo-arrow approach, and since it has already set some precedent I say we start with it. They went with just defining a crs field and using the actual CRS well-known text string. I like it because it doesn't limit us to EPSG codes, as there is real-world data that does not use EPSG.

But in #4 I did narrow it a bit. They allow anything that PROJ understands, but I don't think we should assume that all software using a general spec uses PROJ. So I just did WKT2, which is compatible with what they did, just narrower.

I also went with a strongly recommended default CRS of 4326. I thought about making it a hard requirement and then doing a 'crs extension', and would be open to that. But hopefully saying 'strongly recommended' encourages most people to just use the default. I do want some feedback from the cloud data warehouses, though, on their geodetic spheres and whether it'd be better to make use of those.

Also note I'm not super deep on CRS stuff, so I'm not sure I'm referencing exactly the right spec, and I'm not 100% sure my CRS text is exactly right. I grabbed it from https://spatialreference.org/ref/epsg/4326/ and assumed their 'wkt' is actually WKT2. I do think it's better to link to the OGC spec than the ISO one, since you can actually link to it. And I think the ISO year updates just mean they 'review' it, while the OGC spec remains up to date. But if anyone knows more than me here, please chime in.
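For anyone who wants to double-check the WKT2 question, here's one hedged way to generate a known WKT2 string, assuming a recent pyproj/PROJ is available (the spec itself shouldn't depend on PROJ; this is just for sanity checking the text we put in the examples):

```python
# Generate a WKT2 (2019) string for EPSG:4326 using pyproj, just as a sanity
# check against the text copied from spatialreference.org.
from pyproj import CRS

wkt2 = CRS.from_epsg(4326).to_wkt(version="WKT2_2019", pretty=True)
print(wkt2)
# Starts with GEOGCRS["WGS 84", ...] -- a WKT2 string, whereas older WKT1
# output starts with GEOGCS[...], which is one quick way to tell them apart.
```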

jorisvandenbossche commented 2 years ago

So I just did WKT2, which is compatible with what they did, just narrower.

Narrowing it down to WKT2 sounds good. PROJ actually has different flavors of WKT2 (https://proj.org/development/reference/cpp/io.html#_CPPv4N5osgeo4proj2io12WKTFormatter11Convention_E): according to the docs, there has been an update to the ISO standard, although the differences between the two might be small enough to ignore for our purpose (I am no expert on this).

I also went with a strongly recommended default CRS of 4326. ... But hopefully saying 'strongly recommended' encourages most people to just use the default. I do want some feedback from the cloud data warehouses, though, on their geodetic spheres and whether it'd be better to make use of those.

Personally, I don't see the need for such a "strong" recommendation. I am fine with it being recommended, in the sense that I don't care too much: I suppose in GeoPandas we will honor the CRS of the data being written to a parquet file, so it is up to the user whether to follow the recommendation or not.

But to explain this from a GeoPandas point of view: parquet is meant (among other things) as a fast and efficient file format for working with large datasets. So IMO people should store their data in the CRS they will most likely need for their analyses, to improve the performance of loading the data (avoiding a reprojection on load). When working with relatively local data (so not global or continent-wide datasets), and when doing typical spatial operations, you will most of the time want a projected CRS (at least in GeoPandas, since we assume coordinates are in a cartesian plane and don't have native support for geodetic/spherical operations like https://s2geometry.io/ does). This trade-off can of course be different for other use cases / systems (that might have better support for geographic coordinates).
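As a minimal sketch of that trade-off (the file names and EPSG code below are just illustrative):

```python
# Sketch of the reprojection-on-load cost: if the file is stored in the CRS
# you analyse in, the to_crs() step disappears from every load.
import geopandas as gpd

gdf = gpd.read_parquet("parcels.parquet")         # stored as lon/lat (EPSG:4326)
gdf_projected = gdf.to_crs(epsg=27700)            # reprojection paid on every load

# Writing it once in the analysis CRS avoids that cost on subsequent loads:
gdf_projected.to_parquet("parcels_27700.parquet")
```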

cholmes commented 2 years ago

Cool, thanks for the explanation of the GeoPandas point of view. I could be OK with not making it 'strong'. The main thing for me is that I want this to be accessible to people who don't know geo, and 'projections' are a pretty advanced topic. So to me the recommendation is mostly there to say 'if you don't know what your data should be, then just use this'.

EFT-Defra commented 2 years ago

Hi @cholmes, I'm a data engineer at the Department for Environment, Food and Rural Affairs in the UK. We're currently moving our spatial data processing and analysis into the cloud, where we're hoping that Spark will help us exploit data sources that were previously beyond the capabilities of our aging IT infrastructure. As such, we've been watching the work that @jorisvandenbossche and the GeoPandas team have been doing on geoparquet with anticipation, so we're really pleased that you're all working together on this. Is there any way we can help or contribute?

paleolimbot commented 2 years ago

Personally, I don't see the need for such a "strong" recommendation.

I agree with Joris that recommending a default CRS is not needed, and that encouraging the use of a suitable projected coordinate system will probably improve speed (fewer reprojections) and accuracy (because most geometry engines are designed to work with projected coordinates).

It's also worth noting that QGIS just started warning users about the default WGS84 ellipsoid (on which EPSG:4326 is based), since it can only guarantee an accuracy of ~2 m because the continents have moved since 1984 (fine for many uses, but not for some). There is also an axis-order problem with EPSG:4326: some standards mandate interpreting those coordinates as longitude, latitude, and others as latitude, longitude (the latter is the default in PROJ, but overridden by many libraries that use PROJ). A safer alternative is OGC:CRS84, which has the same axis-order interpretation in all standards.
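A quick way to see the axis-order difference, assuming a recent pyproj/PROJ is installed:

```python
# Compare the authority-defined axis order of EPSG:4326 with OGC:CRS84.
from pyproj import CRS

print([axis.direction for axis in CRS("EPSG:4326").axis_info])
# ['north', 'east']  -> latitude, longitude per the EPSG definition

print([axis.direction for axis in CRS("OGC:CRS84").axis_info])
# ['east', 'north']  -> longitude, latitude, matching common GIS usage
```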

cholmes commented 2 years ago

@EFT-Defra - I'm really sorry for the slow response to your query; I somehow didn't see it until a couple of days ago. It would be great to have your help! I think the easiest way to help out would be to make some data available for others in any version of a geoparquet spec. That means testing out the tools that support it and giving others test data. And then, of course, contributing to the actual spec.

At my company (Planet) we're getting some more resources to start on this, so I'm hoping to gather a group on it soon.

cholmes commented 2 years ago

@paleolimbot - I think I'm good with OGC:CRS84 - you make good points about different standards (I hadn't heard of that specifically, but I'd believe it, since lat/long ordering is a mess - which standards are the ones that do it differently?)

I do agree that using a suitable projected coordinate system will improve speed and accuracy when used right. But I'm not sure it will always be used right / selected properly. There are a lot of parquet users who don't know anything about geospatial, so I think we should make some recommendation for what they should use - I'm not at all attached to which one. But I don't think the core spec should make people understand projections.

paleolimbot commented 2 years ago

For sure! It seems like the main thing is to keep users from putting longitude, latitude (which users are often handed in some way) into a parquet file without labelling it as such via a CRS. I think that may be easier to solve at the implementation level rather than the spec level though. It's rare in R to get handed a dataset without a CRS already attached, so in that case a parquet/Arrow writer would just pass on that CRS (converting to WKT2 if it wasn't that way already). If there was no CRS present, a parquet/Arrow writer should probably error by default with a message that tells somebody what the CRS is for lon/lat since that's probably what they have?
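Roughly what I have in mind, as a sketch only (the function and error message are made up, and it uses GeoPandas/pyproj purely as an example of a writer):

```python
# Hypothetical pre-write check: refuse to write GeoParquet without a CRS and
# hint at how to label lon/lat data. Not part of any spec or library.
import geopandas as gpd

def check_crs_before_write(gdf: gpd.GeoDataFrame) -> str:
    """Return the WKT2 CRS string to embed, or raise if none is set."""
    if gdf.crs is None:
        raise ValueError(
            "No CRS set on this GeoDataFrame. If the coordinates are "
            "longitude/latitude, set it explicitly first, e.g. "
            "gdf = gdf.set_crs('OGC:CRS84'), before writing GeoParquet."
        )
    return gdf.crs.to_wkt(version="WKT2_2019")
```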

There are a mess of CRSes whose common interpretation has a flipped axis order compared to the official EPSG definition. The most common three have official OGC definitions with the axis order defined in accordance with the common interpretation (OGC:CRS84 == WGS84, OGC:CRS83 == NAD83, OGC:CRS27 == NAD27). I'm not an expert, but this came up when I was writing a PROJ wrapper for R (where I had to deal with the fact that the default in PROJ is to respect the authority's axis-order definition).

A note that you can write a prototype Parquet file for any GDAL-readable data source via the R geoarrow package. The actual files that it writes will change according to the discussions in the geopandas/geo-arrow-spec repo but the basic structure will probably stay the same. There's also a guide on how to loop over geoarrow geometry in C and C++ (again, will change according to the discussions but the basic nested list structure is likely to stay the same).

``` r
# remotes::install_github("paleolimbot/geoarrow")
library(geoarrow)
library(sf)
#> Linking to GEOS 3.9.1, GDAL 3.2.3, PROJ 7.2.1

file <- system.file("shape/nc.shp", package = "sf")
df <- read_sf(file)

parquet_output <- "nc.parquet"
write_geoarrow_parquet(df, parquet_output)

read_geoarrow_parquet_sf(parquet_output)
#> Simple feature collection with 100 features and 14 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> Geodetic CRS:  NAD27
#> # A tibble: 100 × 15
#>     AREA PERIMETER CNTY_ CNTY_ID NAME  FIPS  FIPSNO CRESS_ID BIR74 SID74 NWBIR74
#>  1 0.114      1.44  1825    1825 Ashe  37009  37009        5  1091     1      10
#>  2 0.061      1.23  1827    1827 Alle… 37005  37005        3   487     0      10
#>  3 0.143      1.63  1828    1828 Surry 37171  37171       86  3188     5     208
#>  4 0.07       2.97  1831    1831 Curr… 37053  37053       27   508     1     123
#>  5 0.153      2.21  1832    1832 Nort… 37131  37131       66  1421     9    1066
#>  6 0.097      1.67  1833    1833 Hert… 37091  37091       46  1452     7     954
#>  7 0.062      1.55  1834    1834 Camd… 37029  37029       15   286     0     115
#>  8 0.091      1.28  1835    1835 Gates 37073  37073       37   420     0     254
#>  9 0.118      1.42  1836    1836 Warr… 37185  37185       93   968     4     748
#> 10 0.124      1.43  1837    1837 Stok… 37169  37169       85  1612     1     160
#> # … with 90 more rows, and 4 more variables: BIR79, SID79,
#> #   NWBIR79, geometry
```

Created on 2022-02-20 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)

cholmes commented 2 years ago

It seems like the main thing is to keep users from putting longitude, latitude (which users are often handed in some way) into a parquet file without labelling it as such via a CRS.

Fully agree.

I think that may be easier to solve at the implementation level rather than the spec level though.

Definitely. Though I think it'd be good to have an early 'validator' that will raise errors if it encounters geo data that is not labeled with a CRS. That could help guide implementations, instead of us trying to 'catch' every possible implementation.

If there was no CRS present, a parquet/Arrow writer should probably error by default with a message that tells somebody what the CRS is for lon/lat since that's probably what they have?

Yeah, I like this. It could give them some hints to help verify that their data is lon/lat before they naively stick it in. Maybe it's not in the spec itself, but a 'best practice' recommending that implementations do this.

A note that you can write a prototype Parquet file for any GDAL-readable data source via the R geoarrow package

Oh awesome!

alasarr commented 2 years ago

Thanks to everybody for the work on moving this forward. I agree with the overall idea, and with solving the default CRS at the implementation level rather than the spec level.

In my experience with cloud vendors, I think it would be better to always include a CRS. Today it's a little bit messy, because some magic happens in the absence of a CRS, and because of the mix between geographies and geometries. BigQuery only supports spherical coordinates: if you upload data in WKT it assumes WGS84 coordinates with spherical edges, but if you load data in GeoJSON format it assumes planar edges and performs a conversion. Snowflake, on the other hand, treats GeoJSON as JSON-formatted WKT with spherical semantics. I've seen lots of issues around this problem during the last year. If the data includes the CRS, the conversion from planar edges to spherical edges can happen implicitly, and lots of those issues will disappear.

I think those warehouses will add support for geometry types beyond their current geography types, and CRS handling will become more important for them when they do.

I also agree with @jorisvandenbossche that it can be a performance boost, since knowing the CRS can save reprojection operations.

EFT-Defra commented 2 years ago

I think the easiest way to help out would be to make some data available for others in any version of a geoparquet spec.

Hi @cholmes, the problem we have there is that it's not possible to write geoparquet directly from pyspark, as pyspark doesn't let you manipulate schema-level metadata, only column-level metadata. We have a workaround that involves encoding geometries as WKB, writing to vanilla parquet, grabbing the first part of the partitioned dataset with pyarrow, updating the metadata, then re-writing that first part.
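Roughly, the pyarrow step looks like this (a simplified sketch; the path and the "geo" metadata layout are illustrative, not exactly what we run):

```python
# Patch schema-level "geo" metadata onto one Parquet part file that Spark has
# already written with WKB-encoded geometries, then rewrite it in place.
import json
import pyarrow.parquet as pq

part_path = "output/part-00000.parquet"  # hypothetical path to one Spark part file
table = pq.read_table(part_path)

geo_metadata = {
    "primary_column": "geometry",
    "columns": {"geometry": {"encoding": "WKB", "crs": "<WKT2 string here>"}},
}

# Merge the "geo" key into any existing schema-level metadata and rewrite.
existing = table.schema.metadata or {}
patched = {**existing, b"geo": json.dumps(geo_metadata).encode("utf-8")}
pq.write_table(table.replace_schema_metadata(patched), part_path)
```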

TBH, this is such a faff that we tend to just convert all datasets to British National Grid (EPSG:27700), as most of the major ones we use (such as OS MasterMap) are already in that projection, convert all geometries to WKB, and keep everything in vanilla parquet.

cholmes commented 2 years ago

@EFT-Defra - very interesting. Is it common that tools don't let you manipulate schema-level metadata? Is the same thing true for other Spark tools? It seems like the thing to do would be to improve pyspark so it can manipulate schema-level metadata. I'm also curious whether there are other tools you could use to write out geoparquet from Spark - those might be easier to improve to write out schema-level metadata.

cc @echeipesh @pomadchin - do you all have experience with pyspark? And thoughts on how to get it to play nicely with geoparquet?

Also, we're gathering a group to co-work on GeoParquet and hopefully get to a 0.1 release; we're meeting next Tuesday at 15:00 UTC. Send me an email at cholmes at planet dot com if you're interested in attending - all are welcome.

jorisvandenbossche commented 2 years ago

Notes from the call:

What format should we allow for the coordinate reference system? Just WKT? Authority codes?

alasarr commented 2 years ago

Closing after the merge of https://github.com/opengeospatial/geoparquet/pull/25