stac-utils / pystac

Python library for working with any SpatioTemporal Asset Catalog (STAC)
https://pystac.readthedocs.io

Use a serialization library #1092

Open gadomski opened 1 year ago

gadomski commented 1 year ago

In PySTAC, we do a lot of work converting STAC objects to and from JSON (to_dict/from_dict), and that process can be error-prone and tricky to maintain. As discussed in https://github.com/stac-utils/pystac/issues/1047, the use of a (de)serialization library could:

We should explore what it would take to add a serialization library as a dependency and convert our data structures to use it if it makes sense.

Two options (there may be more):

We probably don't want to use pydantic unless it's gotten a lot faster since last checked.

Downsides

By adding a dependency on a serialization library, we move away from our "light, few/no dependencies" model that we've been operating in for v1. Hopefully we can reduce our own code complexity by offloading some work to the serialization library, but we should ensure that the juice is worth the squeeze.

TomAugspurger commented 1 year ago

FYI, I started on https://github.com/TomAugspurger/msgspec-stac/blob/main/msgspec_stac.py a few weeks back but haven't done anything with it since (using msgspec). It seemed pretty straightforward, but I had a few questions:

  1. I don't think we can reasonably do full STAC validation using a library like this. Something like validating that datetime, start_datetime, and end_datetime are valid together (i.e. start_datetime and end_datetime are provided if datetime is None) sounds hard and slow. I think we should view validation as a nice benefit, rather than the motivation for using one of these libraries.
  2. It wasn't clear to me what the relationship would be between pystac objects and (say) msgspec's objects. Would a pystac.Item be a msgspec.Struct subclass? Or would it internally use a struct (which would I think remove much of the performance benefit)?
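
To make the cross-field rule from point 1 concrete, here is a minimal stdlib-only sketch of the check as a plain function that runs after deserialization. The function name `check_item_datetimes` and the message wording are hypothetical; this is not pystac's or msgspec's API, and it only covers the single rule mentioned above, not full STAC validation.

```python
from typing import Any, List, Mapping


def check_item_datetimes(properties: Mapping[str, Any]) -> List[str]:
    """Return a list of problems with the datetime fields (empty if none).

    Hypothetical helper: it only checks that start_datetime and
    end_datetime are present when datetime is null or missing.
    """
    problems: List[str] = []
    if properties.get("datetime") is None:
        for key in ("start_datetime", "end_datetime"):
            if properties.get(key) is None:
                problems.append(f"{key} is required when datetime is null")
    return problems


# A range with both endpoints passes the check.
ok = {
    "datetime": None,
    "start_datetime": "2020-01-01T00:00:00Z",
    "end_datetime": "2020-01-02T00:00:00Z",
}
# A null datetime with only one endpoint is reported.
bad = {"datetime": None, "start_datetime": "2020-01-01T00:00:00Z"}

print(check_item_datetimes(ok))
print(check_item_datetimes(bad))
```

Keeping a rule like this in a separate, opt-in function is one way to treat validation as a benefit rather than a cost paid on every deserialization.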

> We probably don't want to use pydantic unless it's gotten a lot faster since last checked.

In theory pydantic v2 is much faster (though still slower than msgspec last I saw).

gadomski commented 1 year ago

I did a dead-simple msgspec implementation myself just now using class Item(Struct), and ran into issues around flattening dictionaries: https://github.com/jcrist/msgspec/issues/315. It's not uncommon to have extra fields at the top level of STAC objects, and those need to be captured by a deserialization library. Maybe there's a good way to do it, but it wasn't immediately obvious to me.

For more context, this is how it's done in Rust: https://github.com/gadomski/stac-rs/blob/dfaaabc00f581af3d6b948ee3de24f4b68e5acdd/stac/src/item.rs#L66-L68
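
The requirement itself, capturing unknown top-level keys and writing them back out, can be sketched with only the standard library. The trimmed `Item` dataclass below is a hypothetical stand-in, not pystac's class and not a msgspec solution; the `extra_fields` name is chosen to echo the attribute pystac objects already expose.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict


@dataclass
class Item:
    # Hypothetical, heavily trimmed stand-in for a STAC Item.
    id: str
    type: str = "Feature"
    properties: Dict[str, Any] = field(default_factory=dict)
    # Catch-all for top-level keys that are not declared above.
    extra_fields: Dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, d: Dict[str, Any]) -> "Item":
        # Split the input into declared fields and everything else.
        known = {f.name for f in fields(cls)} - {"extra_fields"}
        kwargs = {k: v for k, v in d.items() if k in known}
        extra = {k: v for k, v in d.items() if k not in known}
        return cls(**kwargs, extra_fields=extra)

    def to_dict(self) -> Dict[str, Any]:
        # Flatten the extras back into the top level on the way out.
        out = {"type": self.type, "id": self.id, "properties": self.properties}
        out.update(self.extra_fields)
        return out


source = {"type": "Feature", "id": "a", "properties": {}, "collection": "c"}
item = Item.from_dict(source)
print(item.extra_fields)
print(item.to_dict() == source)
```

Whether a given serialization library can express this split declaratively, rather than in a hand-written `from_dict`, is the open question raised above.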

huard commented 1 year ago

There's a pydantic-stac implementation here https://github.com/stac-utils/stac-pydantic which doesn't seem really active, but I still made a PR to migrate it to pydantic v2 in the hope it'd be useful at some point.

gadomski commented 1 year ago

@thomas-maschler did some ad-hoc benchmarking in https://github.com/radiantearth/stac-spec/discussions/1252#discussioncomment-7124517 and found pydantic to be slower than pystac in the deserialization case.

eseglem commented 10 months ago

I would be very curious about where the time differences are coming from. All things being the same, I would definitely expect Pydantic to be faster, since most of the work is happening in Rust and not Python. If I had to take a guess, it may be related to pystac validation being optional vs. being built in with pydantic.

There may be some ways around that as well as ways to improve performance vs those benchmarks. It could be very beneficial to go that route, if the performance is acceptable, as it would supersede stac-pydantic and help consolidate the ecosystem. I guess it really depends on how important that performance is vs maintenance effort and such.

rbavery commented 6 months ago

bumping this! the code complexity for pystac is pretty high and when implementing the MLModel extension, pydantic v2 has felt more approachable. is pydantic 2 or another serialization library an option for pystac v2?

gadomski commented 6 months ago

> is pydantic 2 or another serialization library an option for pystac v2?

Could be; however, no one (that I know of) is currently working on a pystac v2.