opengeospatial / geoparquet

Specification for storing geospatial vector data (point, line, polygon) in Parquet
https://geoparquet.org
Apache License 2.0

add support wkt or wkt2 formats for crs #221

Closed achapkowski closed 6 months ago

achapkowski commented 6 months ago

add support wkt or wkt2 formats for crs to provide more robustness for clients who get lots of varied data.

hobu commented 6 months ago

add support wkt or wkt2 formats

Which version? There are several (of each).

Do you mean any version? If so, then you've just imposed all of the (varied, inconsistent, and incompatible) history of WKT onto every implementer of the format.

The case for PROJJSON is very clear:

It is a huge deficiency that the geospatial standards community doesn't have a JSON-based CRS format. The impedance caused by the content of WKT not being expressed in any common grammar has been a huge gate-keeping industry deficiency for decades. The OGC CRS SWG is planning to start with PROJJSON to make a CRSJSON, but who knows what that will devolve into. PROJJSON, however, exists, can have its syntax validated with common tools, and can be conveniently parsed.
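As a minimal sketch of the "conveniently parsed" point, assuming Python and nothing beyond its standard `json` module: the document below is a hand-written, abbreviated PROJJSON-style fragment for illustration (not a complete CRS definition), and `parse_crs` is a hypothetical helper name. It only checks that the text is JSON and has a `type` member; full validation would need the published PROJJSON schema and a schema validator.

```python
import json

# Hand-written, abbreviated PROJJSON-style document (illustration only).
PROJJSON_TEXT = """
{
  "type": "GeographicCRS",
  "name": "WGS 84",
  "datum": {
    "type": "GeodeticReferenceFrame",
    "name": "World Geodetic System 1984",
    "ellipsoid": {
      "name": "WGS 84",
      "semi_major_axis": 6378137,
      "inverse_flattening": 298.257223563
    }
  },
  "id": {"authority": "EPSG", "code": 4326}
}
"""


def parse_crs(text):
    """Return the parsed document, or None if the text is not usable JSON."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError:
        return None
    # This sketch only accepts a JSON object that carries a "type" member.
    if not isinstance(doc, dict) or "type" not in doc:
        return None
    return doc


crs = parse_crs(PROJJSON_TEXT)
print(crs["name"])
# Nested values are reachable by key, with no CRS-specific grammar involved.
print(crs["datum"]["ellipsoid"]["semi_major_axis"])
# A WKT string is not JSON, so the same generic tooling rejects it.
print(parse_crs('GEOGCRS["WGS 84", ...]'))
```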

#wktrantoff

kylebarron commented 6 months ago

More discussion on PROJJSON was had in https://github.com/opengeospatial/geoparquet/discussions/90 and https://github.com/opengeospatial/geoparquet/pull/96

jiayuasu commented 6 months ago

PROJJSON has no Java implementation or Java binding. This is a blocker for Apache Sedona and for any big data system in the Java / Scala world, such as HBase, Trino, Hive and so on.

Currently, we have no way to parse or understand PROJJSON but we can understand CRS WKT using GeoTools.

rouault commented 6 months ago

PROJJSON has no Java implementation or Java binding

If it is not already available in it, it hopefully shouldn't be too hard to add to https://github.com/OSGeo/PROJ-JNI which is a JNI binding of PROJ.

Otherwise https://github.com/rouault/projjson_to_wkt could be quickly ported to Java to convert PROJJSON to WKT2 (@m-mohr ported it to JavaScript), but I'm not sure GeoTools understands WKT2. There might be in progress work regarding WKT2:2019 support in https://github.com/apache/sis

rouault commented 6 months ago

If it is not already available in it, it hopefully shouldn't be too hard to add to https://github.com/OSGeo/PROJ-JNI which is a JNI binding of PROJ.

well, I was forgetting that you could also use the GDAL JNI bindings to convert PROJJSON to WKT1 using https://gdal.org/java/org/gdal/osr/SpatialReference.html#SetFromUserInput(java.lang.String) to import PROJJSON and https://gdal.org/java/org/gdal/osr/SpatialReference.html#ExportToWkt() to export to WKT, using PROJ underneath. Of course that's a bit of a heavy dependency

paleolimbot commented 6 months ago

JNI is a non-starter for many Java libraries in the big data ecosystem, let alone PROJ via JNI. For PROJJSON to be a possibility in that ecosystem somebody would probably need to step up and do the implementation work in Java (as Even noted, it might not be too difficult and there is some readily available prior art to draw from).

In the absence of that, excluding an entire ecosystem seems worse than allowing a widely supported CRS representation into our metadata.

m-mohr commented 6 months ago

The conversion work from Python to JS was 1 hour of work with ChatGPT. It's likely not much more in Java. If that's too hard to do, then the ecosystem doesn't really seem to want it, I'd say?

Anyway, if we add other encodings, please only additive, not instead of PROJJSON. Otherwise you also exclude non-WKT2 supporting ecosystems again.

Also, can we clarify whether Java supports WKT1 or 2? That's quite a difference...

rouault commented 6 months ago

Also, can we clarify whether Java supports WKT1 or 2? That's quite a difference...

I believe GeoTools supports WKT1 only: https://docs.geotools.org/stable/javadocs/org/geotools/api/referencing/doc-files/WKT.html

Apache SIS supports WKT2:2015 (and WKT1), with in-progress work to add WKT2:2019.

jiayuasu commented 6 months ago

Thanks guys for the help. So I guess the solution for us is:

  1. Sedona side will implement a Java version of the https://github.com/rouault/projjson_to_wkt . It converts projjson string to WKT1/WKT2:2019
  2. We will use GeoTools WKT1 for now. When Apache SIS finishes WKT2:2019, we will migrate to WKT2:2019.

But this just solves the reading projjson problem. How about writing a WKT1 / WKT2 string to projjson?

rouault commented 6 months ago
  1. It converts projjson string <> WKT1/WKT2:2019

projjson_to_wkt has this important warning "Warning: while the export to WKT1 should be syntaxically correct, datum, projection method or parameter names will be the one of WKT2, and thus a number of implementations will in practice fail to understand such WKT1 strings."

achapkowski commented 6 months ago

Not everyone uses projjson or the associated tools. Many people are in the ArcGIS space.

rouault commented 6 months ago

Many people are in the ArcGIS space.

https://www.esri.com/content/dam/esrisites/en-us/media/legal/open-source-acknowledgements/arcgis-pro-3-3-open-source-disclosure.zip has an ArcGIS Pro 3_3 Open Source Disclosure.xlsx file mentioning a "proj_gdal_e.dll" file. Time to make active use of it ;-)

TomAugspurger commented 6 months ago

Anyway, if we add other encodings, please only additive, not instead of PROJJSON.

Concretely, would this mean that certain geoparquet readers couldn't read certain geoparquet files, if the reader doesn't happen to implement projjson support? I'd worry about that causing an (IMO unnecessary) schism and confusing users and data providers.

m-mohr commented 6 months ago

Yeah, if the other encodings are not additive. That makes it more difficult for writers though, but I feel like ease of reading is more important than ease of writing?

Ideally everyone would support PROJJSON though.

hobu commented 6 months ago

Not everyone uses projjson or the associated tools. Many people are in the ArcGIS space.

It is easy to install and use PROJ from an ArcPro Conda environment. It works quite well.

Concretely, would this mean that certain geoparquet readers couldn't read certain geoparquet files, if the reader doesn't happen to implement projjson support?

If the specification allows multiple flavors of CRS, most writers will choose vanilla – raw EPSG codes. That means readers will have to go somewhere else to get the parameters those codes describe. Or they will always use the one code that everyone knows and can describe by heart, 4326 😄

The case against PROJJSON so far is:

What's missing here is these languages don't have a complete open source implementation of the data model that describes WKT2, which is published in ISO 19162 and OGC 18-010. They're missing because writing one is a ton of detailed, thankless work to implement a complex and necessarily complicated data model. PROJJSON is a very faithful expression of that model in JSON, and @rouault found many interpretation nits and bugs in the specification as he built PROJJSON because of its complexity.

Maybe a transpile of the full PROJ engine to WASM is within reach. Maybe Apache SIS has a full 19162 model ready to go but just needs the PROJJSON i/o built for it. I don't have the answers here, but it seems to me users in those software ecosystems need to strengthen their capabilities to meet the requirement regardless of whether or not geoparquet requires PROJJSON or allows every flavor of WKT to describe the coordinate system of data.

PROJJSON is advantageous because it can meet data readers half way – if users have a full interpretation engine they can use it. If they don't, they can pluck the keys and codes that they know about without writing a custom parser and interpretation engine.
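The "pluck the keys and codes" idea could look roughly like the sketch below, assuming Python's standard `json` module. The document is a hand-written, abbreviated fragment for illustration, and `find_ids` and the `known` table are hypothetical names, not part of any library. It walks the parsed tree, collects every `id` object that has an `authority` and a `code`, and lets the reader decide which ones it understands.

```python
import json


def find_ids(node, path=""):
    """Yield (path, authority, code) for every "id" object found in the tree."""
    if isinstance(node, dict):
        ident = node.get("id")
        if isinstance(ident, dict) and "authority" in ident and "code" in ident:
            yield (path or "<root>", ident["authority"], ident["code"])
        for key, value in node.items():
            if key != "id":
                yield from find_ids(value, f"{path}/{key}")
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_ids(item, f"{path}/{index}")


# Hand-written, abbreviated PROJJSON-style document (illustration only).
doc = json.loads("""
{
  "type": "ProjectedCRS",
  "name": "WGS 84 / UTM zone 31N",
  "base_crs": {
    "name": "WGS 84",
    "id": {"authority": "EPSG", "code": 4326}
  },
  "conversion": {
    "name": "UTM zone 31N",
    "method": {
      "name": "Transverse Mercator",
      "id": {"authority": "EPSG", "code": 9807}
    }
  },
  "id": {"authority": "EPSG", "code": 32631}
}
""")

# The codes this hypothetical reader happens to understand.
known = {("EPSG", 32631): "UTM zone 31N on WGS 84"}

for path, authority, code in find_ids(doc):
    label = known.get((authority, code), "not recognised, ignored")
    print(f"{path}: {authority}:{code} -> {label}")
```

This only works when the writer included the identifiers in the first place; a CRS with no `id` members still needs a full interpretation engine.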

achapkowski commented 6 months ago

Not everyone uses projjson or the associated tools. Many people are in the ArcGIS space.

It is easy to install and use PROJ from an ArcPro Conda environment. It works quite well.

You obviously never worked in closed secure environments. Not everyone can pip or conda install stuff.

hobu commented 6 months ago

Not everyone uses projjson or the associated tools. Many people are in the ArcGIS space.

It is easy to install and use PROJ from an ArcPro Conda environment. It works quite well.

You obviously never worked in closed secure environments. Not everyone can pip or conda install stuff.

https://anaconda.org/esri/proj4 it seems like Esri is already explicitly supporting PROJ usage?

Anyway, I do not see "Esri doesn't support it (yet)" as a valid argument against it.

jorisvandenbossche commented 6 months ago

PROJJSON is advantageous because it can meet data readers half way – if users have a full interpretation engine they can use it. If they don't, they can pluck the keys and codes that they know about without writing a custom parser and interpretation engine.

I think this is an important point that @hobu makes. We actually have an example of that in the spec specifically for OGC:CRS84 (https://github.com/opengeospatial/geoparquet/blob/v1.0.0/format-specs/geoparquet.md#ogccrs84-details), but I think that should apply more in general (with the only requirement that the files were created by a writer that includes those codes).
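A rough sketch of how a reader without a CRS engine might apply this to GeoParquet file metadata, assuming Python's standard `json` module. It relies on the 1.0.0 rules that a missing `crs` key means OGC:CRS84 and an explicit `null` means the CRS is unknown; `classify_crs` and its return labels are hypothetical, and the metadata strings are hand-written examples with the `crs` objects abbreviated.

```python
import json


def classify_crs(geo_metadata_text, column=None):
    """Classify a column's CRS as 'lonlat', 'unknown', or 'needs-crs-engine'."""
    geo = json.loads(geo_metadata_text)
    column = column or geo["primary_column"]
    col_meta = geo["columns"][column]

    if "crs" not in col_meta:
        return "lonlat"          # key absent: the default is OGC:CRS84
    crs = col_meta["crs"]
    if crs is None:
        return "unknown"         # explicit null: CRS undefined or unknown
    ident = crs.get("id") or {}
    if (ident.get("authority"), ident.get("code")) == ("OGC", "CRS84"):
        return "lonlat"          # recognised from the identifier alone
    return "needs-crs-engine"    # anything else needs real CRS handling


# Hand-written "geo" metadata values (illustration only, crs objects abbreviated).
no_crs = '{"version": "1.0.0", "primary_column": "geometry", "columns": {"geometry": {"encoding": "WKB", "geometry_types": []}}}'
null_crs = '{"version": "1.0.0", "primary_column": "geometry", "columns": {"geometry": {"encoding": "WKB", "geometry_types": [], "crs": null}}}'
crs84 = '{"version": "1.0.0", "primary_column": "geometry", "columns": {"geometry": {"encoding": "WKB", "geometry_types": [], "crs": {"type": "GeographicCRS", "name": "WGS 84 (CRS84)", "id": {"authority": "OGC", "code": "CRS84"}}}}}'
utm = '{"version": "1.0.0", "primary_column": "geometry", "columns": {"geometry": {"encoding": "WKB", "geometry_types": [], "crs": {"type": "ProjectedCRS", "name": "WGS 84 / UTM zone 31N", "id": {"authority": "EPSG", "code": 32631}}}}}'

for label, text in [("no crs", no_crs), ("null crs", null_crs), ("CRS84", crs84), ("UTM", utm)]:
    print(label, "->", classify_crs(text))
```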

achapkowski commented 6 months ago

PROJ itself supports multiple formats: https://proj.org/en/9.4/faq.html

I don't understand why people are being stubborn about the format.

hobu commented 6 months ago

I don't understand why people are being stubborn about the format.

Because writing a specification that diverse implementation audiences can succeed with is very difficult. Most of the non-geo software world has no clue what WKT is or how to dereference an EPSG code into a coordinate system, and they don't ever care to. Geoparquet aspires to a much wider audience than the spatial-is-special crowd, and it needs implementation buy-in in these other communities to get traction beyond it. Larding up the specification with conveniences like allowing many different coordinate system description formats makes it harder to provide complete implementations and increases the interoperability leakage between those implementations.

I would argue that the spatial-is-special world's two most impactful specifications, Shapefile and GeoJSON, could attribute a lot of their market penetration to the fact they don't provide much guidance in regard to coordinate systems. By not imposing that complexity on implementers, they focused on the part of the interoperability that matters – the geometries. I argue the same thirst exists in the communities that would also implement geoparquet.

nyalldawson commented 6 months ago

@achapkowski while you're active in the open source community, mind getting someone at ESRI to comment on https://github.com/OSGeo/gdal/pull/9980 ? Having an open driver for this format benefits everyone, ESRI included. 👍

urschrei commented 6 months ago

Neither does Rust

@hobu We (the georust greater co-prosperity sphere) have good bindings to libproj if it can be used. And if it can't, we'll write a native implementation.

cholmes commented 6 months ago

Great discussion everyone - I think I'm going to close this issue soon as we discussed extensively before 1.0, and I think we've gone over most of the points again. I think we can all acknowledge that our choice of PROJJSON was our most 'controversial' choice in the specification, but I don't think we'll revisit that until a '2.0' version of GeoParquet.

And having 'multiple' options (PROJJSON plus WKT2 for example) that impose higher requirements on readers, forcing them to understand both dialects if they want to read any possible GeoParquet format, is not something desired for GeoParquet. Philosophically this is not in line with the choices we've made for this format - we want to make it as easy as possible for implementations to be created without a deep stack of geospatial software behind it.

I do think we should continue to work to encourage and even find funding for software that does not yet understand PROJJSON, especially open source implementations. And I will state that we actively want ESRI to implement GeoParquet fully, and the stubbornness on this particular issue is in service of greater interoperability. But until it's fully implemented it seems fine to me for ESRI to just support lat/long, or to use 'most' of GeoParquet and do their own crs metadata that is WKT2 as a bridge.

jiayuasu commented 6 months ago

Ping Apache SIS core developer @desruisseaux since he is much more knowledgeable than me on this 😁:

Is PROJJSON support on Apache SIS's roadmap?

Chris, please feel free to close the issue since this is off the topic :-)

desruisseaux commented 6 months ago

Even is correct, Apache SIS supports WKT 1 and WKT 2:2015 (it was the first open source software to support WKT 2 after the ESRI prototype) with work for WKT 2:2019 in progress right now. It also supports GML, which is currently the only format capable of fully supporting the ISO 19111:2007 model. If I understood correctly, PROJJSON doesn't fully cover the ISO 19111 model yet, which is one reason why OGC wants to review it before approving a JSON format. If we want CRSJSON to be a replacement for GML, then it should be at least as capable as GML.

I plan to support OGC CRSJSON in Apache SIS once the specification is advanced enough. Whether SIS will support PROJJSON will depend on whether there are a lot of differences. Note that the OGC CRS working group has explicitly stated in their charter that they will avoid any unnecessary difference with PROJJSON.

One correction to what has been said in a previous comment: WKT 2 is not a data model. The model is ISO 19111, and WKT is an encoding of that model. Libraries do not implement a WKT model. They implement ISO 19111, then establish a mapping from WKT elements to that model. This is what both Apache SIS and PROJ C++ API do. One reason for the WKT complexity is that its mapping to ISO 19111 is not straightforward, as WKT makes compromises in an attempt to be more compact and for backward compatibility. The consequence is that trying to understand WKT without prior knowledge of ISO 19111 is confusing. For understanding WKT, ISO 19111 must be read first. If a JSON encoding does a more direct mapping to ISO 19111 elements, it may help to reduce that confusion.

The CRS standardization effort at OGC is led mainly by Roger Lott. My experience in working with him for more than 10 years is that he is very reliable. When he says that he will do something, he really does it, and he is much, much better than me at following the roadmap.

jjimenezshaw commented 5 months ago

Maybe a transpile of the full PROJ engine to WASM is within reach

That would be great. There is already a version of GDAL, https://github.com/bugra9/gdal3.js, that includes PROJ, so it should be easy to "extract" only the needed PROJ part. The only missing part (but not completely mandatory) is the cURL integration to use the grid files from https://cdn.proj.org. Unfortunately this issue is not moving forward: https://github.com/emscripten-core/emscripten/issues/3270