stac-utils / stac-geoparquet

Convert STAC items between JSON, GeoParquet, pgstac, and Delta Lake.
https://stac-utils.github.io/stac-geoparquet/
MIT License

ValueError: Invalid character while parsing year ('N', Index: 0) #79

Open scottyhq opened 3 hours ago

scottyhq commented 3 hours ago

For a heterogeneous collection of STAC Items in which some contain a timestamp property like updated and others do not, coercing to timestamps fails because the code seems to be trying to convert the string 'None' (a missing value that was cast to a string) to a timestamp: https://github.com/stac-utils/stac-geoparquet/blob/4b00f5be649609a896242f391d6e9c56377c7f25/stac_geoparquet/arrow/_to_arrow.py#L82

I think this scenario might be common for APIs that are returning metadata that changes over time. I came across this using this public endpoint https://docs.canopy.umbra.space/docs/archive-catalog-searching-via-stac-api

I tried a quick fix which seems to work, but not sure it's the best approach... I just removed ciso8601 and let Arrow handle the casting 😅.

Alternatively, using pandas to coerce timestamps is also mentioned here https://github.com/stac-utils/stac-geoparquet/pull/31#discussion_r1544730642

kylebarron commented 3 hours ago

I think you can fix it either way, such as by avoiding casting None to string here. But I also didn't know that pyarrow was able to cast the strings to dates, and so that's more appealing to me.

We shouldn't use pandas because this arrow module is intended to not have a dependency on pandas.

scottyhq commented 3 hours ago

> But I also didn't know that pyarrow was able to cast the strings to dates

I'm new to arrow, so I definitely fumbled around a bit!

I thought this would work: pa.scalar('2024-08-24T17:52:27.135933+00:00', type=pa.timestamp('us', tz='UTC')) but it raises ArrowTypeError: object of type <class 'str'> cannot be converted to int

But it works if you first go to a pyarrow string and then cast: pa.scalar(timestamp_str, type=pa.string()).cast(pa.timestamp('us', tz='UTC'))