vincentsarago / MAXAR_opendata_to_pgstac

Create STAC Collections/Items for some AWS OpenData
MIT License

Geoparquet source #3

drwelby opened this issue 10 months ago

drwelby commented 10 months ago

For a faster import Maxar now has all the STAC Items as geoparquet: https://maxar-opendata.s3.amazonaws.com/events/maxar-opendata.parquet

vincentsarago commented 10 months ago

😍

drwelby commented 10 months ago

The format isn't set in stone, so if it needs adjusting, let me know.

vincentsarago commented 10 months ago

@drwelby I've started working with the GeoParquet file, but it hasn't been updated since April. Do you know if it will be updated at some point? I would love to ingest the latest data for the Morocco earthquake.

drwelby commented 10 months ago

I'll regenerate it today, I don't have a great workflow yet to keep it updated automatically.

drwelby commented 10 months ago

Updated just now

vincentsarago commented 10 months ago

@drwelby what do you think about creating per-collection GeoParquet files and then adding a link to each file in the corresponding collection's assets?

I don't want to give you more work (well, I do), but I think it could be really handy (and maybe easier to manage).

drwelby commented 10 months ago

Certainly possible. I had some other related questions that I sent directly to your Mastodon account.

If the smaller parquet files could be referenced by the top-level one, that would make for an easier update workflow, but I'm not sure that's supported in the GeoParquet world. It's all experimental, so if you want me to experiment with any format that improves interop, just point me to the right docs!

We could also generate STAC Search Collections instead of STAC Static Collections, which would be much faster to ingest and readable by GDAL.

vincentsarago commented 10 months ago

👍 sorry, I don't log in to Mastodon often.

As for the GeoParquet format, I'm not an expert, but as far as I know there is no multi-level GeoParquet, so if you want to keep GDAL compatibility for your whole catalog, you'll have to continue doing what you're doing.

The issue with this approach is that there is no collection metadata, and it also means that all the items have to follow the same model (or you end up with a lot of null values).

We're talking internally about STAC + GeoParquet, trying to define what the combination should look like, and it really depends on how the GeoParquet file is used. There is no single schema that works well for storing both collections and items.

kylebarron commented 10 months ago

The issue with this approach is that there is no collection metadata, and it also means that all the items have to follow the same model (or you end up with a lot of null values).

Yeah, this is the case with your current maxar-opendata.parquet file. There are a lot of null values internally, because the schema of the included STAC data varies a lot. Here's an example: for each row in the assets column, we check how many of the included keys are null or not null. In the Arrow/Parquet representation, the assets column includes every key for every row, whether or not that row actually has a value for that key. Below we compute the number of null assets per row:

import pyarrow.parquet as pq

table = pq.read_table('maxar-opendata.parquet')

# For each row, count how many keys in the assets struct are null
null_counts = []
for i in range(len(table)):
    null_count = sum(
        1
        for _asset_name, asset_value in table['assets'][i].items()
        if not asset_value.is_valid
    )
    null_counts.append(null_count)

Plotting this, we can see that most rows have values for only a couple of asset names, and the rest are null.

[Figure: histogram of null asset counts per row]

Looking at a single row, you can inspect which keys are None:

table['assets'][0]
<pyarrow.StructScalar: [('building-centroids', None), ('building-footprints', None), ('cloud-mask', None), ('cloud-mask-raster', None), ('cloud-shadow-mask', None), ('data-mask', {'href': 'https://maxar-opendata.s3.amazonaws.com/events/Gambia-flooding-8-11-2022/ard/28/033133031212/2022-03-15/1040010073D77D00-data-mask.gpkg', 'roles': ['data-mask'], 'title': 'Data Mask', 'type': 'application/geopackage+sqlite3'}), ('healthy-vegetation-mask', None), ('healthy-vegetation-mask-raster', None), ('ms-saturation-mask', None), ('ms-saturation-mask-raster', None), ('ms_analytic', {'eo:bands': [{'common_name': 'coastal', 'description': 'Coastal Blue', 'name': 'BAND_C'}, {'common_name': 'blue', 'description': 'Blue', 'name': 'BAND_B'}, {'common_name': 'green', 'description': 'Green', 'name': 'BAND_G'}, {'common_name': 'yellow', 'description': 'Yellow', 'name': 'BAND_Y'}, {'common_name': 'red', 'description': 'Red', 'name': 'BAND_R'}, {'common_name': 'rededge', 'description': 'Red Edge 1', 'name': 'BAND_RE'}, {'common_name': 'nir08', 'description': 'Near Infrared 1', 'name': 'BAND_N'}, {'common_name': 'nir09', 'description': 'Near Infrared 2', 'name': 'BAND_N2'}], 'href': 'https://maxar-opendata.s3.amazonaws.com/events/Gambia-flooding-8-11-2022/ard/28/033133031212/2022-03-15/1040010073D77D00-ms.tif', 'proj:bbox': [309843.75, 1489843.75, 315156.25, 1495156.25], 'proj:shape': [3583, 3583], 'proj:transform': [1.4826960647502094, 0.0, 309843.75, 0.0, -1.4826960647502094, 1495156.25, 0.0, 0.0, 1.0], 'roles': ['data'], 'title': 'Multispectral Image', 'type': 'image/tiff; application=geotiff; profile=cloud-optimized'}), ('pan-flare-mask', None), ('pan-flare-mask-raster', None), ('pan_analytic', {'eo:bands': [{'description': 'Pan', 'name': 'BAND_P'}], 'href': 'https://maxar-opendata.s3.amazonaws.com/events/Gambia-flooding-8-11-2022/ard/28/033133031212/2022-03-15/1040010073D77D00-pan.tif', 'proj:bbox': [309843.75, 1489843.75, 315156.25, 1495156.25], 'proj:shape': [14332, 14332], 
'proj:transform': [0.37067401618755236, 0.0, 309843.75, 0.0, -0.37067401618755236, 1495156.25, 0.0, 0.0, 1.0], 'roles': ['data'], 'title': 'Panchromatic Image', 'type': 'image/tiff; application=geotiff; profile=cloud-optimized'}), ('terrain-shadow-mask', None), ('terrain-shadow-mask-raster', None), ('visual', {'eo:bands': [{'common_name': 'red', 'description': 'Red', 'name': 'BAND_R'}, {'common_name': 'green', 'description': 'Green', 'name': 'BAND_G'}, {'common_name': 'blue', 'description': 'Blue', 'name': 'BAND_B'}], 'href': 'https://maxar-opendata.s3.amazonaws.com/events/Gambia-flooding-8-11-2022/ard/28/033133031212/2022-03-15/1040010073D77D00-visual.tif', 'proj:bbox': [309843.75, 1489843.75, 315156.25, 1495156.25], 'proj:shape': [17408, 17408], 'proj:transform': [0.30517578125, 0.0, 309843.75, 0.0, -0.30517578125, 1495156.25, 0.0, 0.0, 1.0], 'roles': ['visual'], 'title': 'Visual Image', 'type': 'image/tiff; application=geotiff; profile=cloud-optimized'}), ('water-mask', None), ('water-mask-raster', None)]>

Now, in terms of the overhead of actually storing this, it isn't particularly inefficient: Parquet on disk and Arrow in memory can compress the null values quite well. But usability suffers, because every row might have a different data representation, so you never know in advance whether a given asset will exist in a row.

drwelby commented 10 months ago

We'll probably pare down the assets that were only briefly delivered and include just the ones we always deliver: visual, pan, and multispectral.

PostholerCom commented 9 months ago

If the end user of this data is a web browser/app, device memory/CPU/resources will be the limiting factor, not data transfer speed. I don't see a benefit to using Parquet in that case.

Developers often make claims like, "1 million points displayed in the web browser", which is never the case. It's a subset based on zoom level. Again, client resources are the limiting factor.

If you're one of the elite few whose Chrome browser can actually process a million vector geometries in a timely manner, then 'faster' data transfer will probably be a concern.

This has that "everybody needs big data" feel, when that is actually not the case. In reality, closer to about 5% need those features.

Maxar seems to be catering to its big-data customers, and I get that. For the rest of us, I don't see a clear benefit.

TomAugspurger commented 9 months ago

Just chiming in here to register some interest in a convention / standard for how to represent STAC items in geoparquet.

In https://github.com/stac-utils/stac-geoparquet I made some arbitrary decisions around things like lifting properties to the top level of the parquet file.

drwelby commented 9 months ago

This is not Maxar "catering to customers"; this is one Maxar developer (me) experimenting with a new cloud-native vector format. My interest in this format is that it can be read by DuckDB to provide queryability, since the ARD pipeline only delivers a system of static STAC catalogs.

drwelby commented 9 months ago

@TomAugspurger I used stac-geoparquet to generate the file (thanks!) and am happy to assist with changes/ideas/experiments using the ODP data.

PostholerCom commented 9 months ago

a new cloud-native vector format

FlatGeobuf (FGB) uses a spatial index and HTTP 206 range requests. Build on that.

A queryable format might consist of multiple indexes, i.e., (geom, col1, col2), etc., alongside the data. Your query conditional, such as a 'where' clause, could use only a column in the index. Like FGB, multiple requests would be made.

I still think you're putting the cart before the horse. Even with a queryable format, huge amounts of data can be returned, rendering your browser a boat anchor. Some of the examples I've seen with Parquet, DuckDB, WASM, etc. are pretty awesome, but functionally useless in terms of the web client, because of the amount of data returned.

Parquet is being touted as the miracle format that can transfer huge amounts of spatial data quickly. Great. But in terms of usefulness in the browser, it doesn't matter. Any cloud native vector format will do just fine.

vincentsarago commented 9 months ago

Thanks everyone for jumping into this issue. Just to be clear: in this issue we envision using the GeoParquet file format as an easy way to share a static catalog, and as the repo title suggests, we want to ingest that static catalog into any pgSTAC instance.

Maxar seems to be catering to its big-data customers, and I get that. For the rest of us, I don't see a clear benefit.

@PostholerCom, I don't think this is the kind of comment we are looking for in this issue. Thank you.

drwelby commented 9 months ago

This experiment is not aimed at browser use. I can make FGB files too if you like, but this is a volunteer effort that came out of a hackathon, and so far I only have the bandwidth to investigate one cloud-optimized vector format.

As we like to say, "PRs welcome and appreciated": jumping in, crawling the Open Data bucket, and making some FGBs for the project would go much further than unwarranted ranting.

drwelby commented 9 months ago

Whoops, I forgot that for browser use I did generate a .pmtiles file, because there was some discussion of having source.coop automatically preview this format in their UI.