drwelby opened this issue 1 year ago
😍
The format isn't set in stone, so if it needs adjusting, lmk
@drwelby I've started to work with the GeoParquet file, but it hasn't been updated since April. Do you know if it will be updated at some point? I would love to ingest the latest data for the Morocco earthquake.
I'll regenerate it today, I don't have a great workflow yet to keep it updated automatically.
Updated just now
@drwelby what do you think about creating per-collection GeoParquet files and then adding a link to the GeoParquet file in each collection's assets?
I don't want to give you more work (well, I do), but I think it could be really handy (and maybe easier to manage).
Certainly possible. I had some other related questions that I sent directly to your Mastodon account.
If the smaller parquet files can be referenced by the top-level one, that would be an easier update workflow, but I'm not sure that's supported in the GeoParquet world. It's all experimental, so if you want me to experiment with any format that improves interop, just point me to the right docs!
We could also generate STAC Search Collections instead of STAC Static Collections, which would be much faster to ingest and readable by GDAL.
👍 sorry, I do not log in often on Mastodon.
As for the GeoParquet format, I'm not an expert, but from what I know there is no multi-level GeoParquet, so if you want to keep GDAL compatibility for your whole catalog, you'll have to continue what you are doing.
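For anyone who wants to sanity-check that GDAL compatibility, here's a minimal sketch using the GDAL Python bindings. It assumes a GDAL build with the Parquet driver and a local copy of maxar-opendata.parquet; the field listing is just for inspection.

from osgeo import ogr

# Open the GeoParquet file with OGR (requires the GDAL Parquet driver).
ds = ogr.Open("maxar-opendata.parquet")
layer = ds.GetLayer(0)
print("item count:", layer.GetFeatureCount())

# List the column names OGR exposes for the layer.
defn = layer.GetLayerDefn()
print([defn.GetFieldDefn(i).GetName() for i in range(defn.GetFieldCount())])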
The issue with this approach is that there is no collection metadata, and it also means that all the items have to respect the same model (or you end up having a lot of None).
We're talking internally about STAC + GeoParquet, trying to define what the combination should look like, and it really depends on how the GeoParquet file is used. There is not one good schema to store both collections and items.
> The issue with this approach is that there is no collection metadata, and it also means that all the items have to respect the same model (or you end up having a lot of None).
Yeah, this is the case with your current maxar-opendata.parquet file. There are a lot of null values internally because it looks like the schema of the included STAC data changes a lot. Here's an example. For each row in the assets column, we check how many of the included keys are null or not null. In the Arrow/Parquet representation, the assets column includes all keys for every row, whether or not that row actually has a value for that key. Below we compute the number of null assets per row:
import pyarrow.parquet as pq

table = pq.read_table('maxar-opendata.parquet')

# Count the null entries in the `assets` struct for every row.
null_counts = []
for i in range(len(table)):
    null_count = sum(
        1
        for _asset_name, asset_value in table['assets'][i].items()
        if not asset_value.is_valid
    )
    null_counts.append(null_count)
Plotting this, we can see that most rows have only a couple of asset names populated and most are null.
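For reference, a rough sketch of how that plot could be reproduced with matplotlib, assuming the null_counts list from the snippet above:

import matplotlib.pyplot as plt

# Histogram of null-asset counts per row; most rows cluster at the high end
# since only a few asset keys are populated for any given item.
plt.hist(null_counts, bins=range(max(null_counts) + 2))
plt.xlabel("null assets per row")
plt.ylabel("number of rows")
plt.show()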
If we look at a single row, we can inspect which keys are None:

table['assets'][0]
<pyarrow.StructScalar: [('building-centroids', None), ('building-footprints', None), ('cloud-mask', None), ('cloud-mask-raster', None), ('cloud-shadow-mask', None), ('data-mask', {'href': 'https://maxar-opendata.s3.amazonaws.com/events/Gambia-flooding-8-11-2022/ard/28/033133031212/2022-03-15/1040010073D77D00-data-mask.gpkg', 'roles': ['data-mask'], 'title': 'Data Mask', 'type': 'application/geopackage+sqlite3'}), ('healthy-vegetation-mask', None), ('healthy-vegetation-mask-raster', None), ('ms-saturation-mask', None), ('ms-saturation-mask-raster', None), ('ms_analytic', {'eo:bands': [{'common_name': 'coastal', 'description': 'Coastal Blue', 'name': 'BAND_C'}, {'common_name': 'blue', 'description': 'Blue', 'name': 'BAND_B'}, {'common_name': 'green', 'description': 'Green', 'name': 'BAND_G'}, {'common_name': 'yellow', 'description': 'Yellow', 'name': 'BAND_Y'}, {'common_name': 'red', 'description': 'Red', 'name': 'BAND_R'}, {'common_name': 'rededge', 'description': 'Red Edge 1', 'name': 'BAND_RE'}, {'common_name': 'nir08', 'description': 'Near Infrared 1', 'name': 'BAND_N'}, {'common_name': 'nir09', 'description': 'Near Infrared 2', 'name': 'BAND_N2'}], 'href': 'https://maxar-opendata.s3.amazonaws.com/events/Gambia-flooding-8-11-2022/ard/28/033133031212/2022-03-15/1040010073D77D00-ms.tif', 'proj:bbox': [309843.75, 1489843.75, 315156.25, 1495156.25], 'proj:shape': [3583, 3583], 'proj:transform': [1.4826960647502094, 0.0, 309843.75, 0.0, -1.4826960647502094, 1495156.25, 0.0, 0.0, 1.0], 'roles': ['data'], 'title': 'Multispectral Image', 'type': 'image/tiff; application=geotiff; profile=cloud-optimized'}), ('pan-flare-mask', None), ('pan-flare-mask-raster', None), ('pan_analytic', {'eo:bands': [{'description': 'Pan', 'name': 'BAND_P'}], 'href': 'https://maxar-opendata.s3.amazonaws.com/events/Gambia-flooding-8-11-2022/ard/28/033133031212/2022-03-15/1040010073D77D00-pan.tif', 'proj:bbox': [309843.75, 1489843.75, 315156.25, 1495156.25], 'proj:shape': [14332, 14332], 'proj:transform': [0.37067401618755236, 0.0, 309843.75, 0.0, -0.37067401618755236, 1495156.25, 0.0, 0.0, 1.0], 'roles': ['data'], 'title': 'Panchromatic Image', 'type': 'image/tiff; application=geotiff; profile=cloud-optimized'}), ('terrain-shadow-mask', None), ('terrain-shadow-mask-raster', None), ('visual', {'eo:bands': [{'common_name': 'red', 'description': 'Red', 'name': 'BAND_R'}, {'common_name': 'green', 'description': 'Green', 'name': 'BAND_G'}, {'common_name': 'blue', 'description': 'Blue', 'name': 'BAND_B'}], 'href': 'https://maxar-opendata.s3.amazonaws.com/events/Gambia-flooding-8-11-2022/ard/28/033133031212/2022-03-15/1040010073D77D00-visual.tif', 'proj:bbox': [309843.75, 1489843.75, 315156.25, 1495156.25], 'proj:shape': [17408, 17408], 'proj:transform': [0.30517578125, 0.0, 309843.75, 0.0, -0.30517578125, 1495156.25, 0.0, 0.0, 1.0], 'roles': ['visual'], 'title': 'Visual Image', 'type': 'image/tiff; application=geotiff; profile=cloud-optimized'}), ('water-mask', None), ('water-mask-raster', None)]>
Now, in terms of the overhead of actually storing this, it isn't particularly inefficient: Parquet on disk and Arrow in memory can compress the null values pretty well. But usability suffers, because every row might have a different data representation, so you never know in advance whether a given asset will exist in a row.
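To make that concrete, here's the kind of defensive access pattern a consumer ends up writing. This uses the table from the snippet above; get_asset_href is a hypothetical helper, not part of any library.

# Any asset key may be null for a given row, so every access needs a guard.
def get_asset_href(row_assets, name):
    asset = row_assets.get(name)  # StructScalar supports mapping-style access
    if asset is None or not asset.is_valid:
        return None
    return asset["href"].as_py()

href = get_asset_href(table["assets"][0], "visual")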
We'll probably pare out some of the assets that were only briefly delivered and include just the ones we always deliver: visual, pan, and multispectral.
If the end user of this data is a web browser/app, device memory/CPU/resources will be the limiting factor, not data transfer speed. I don't see a benefit to using parquet in that case.
Developers often make claims like "1 million points displayed in the web browser", which is never the case. It's a subset based on zoom level. Again, client resources are the limiting factor.
If you're one of the elite few whose Chrome browser can actually process a million vector geometries in a timely manner, then 'faster' data transfer will probably be a concern.
This has that "everybody needs big data" feel, when that is actually not the case. Reality seems to be that only about 5% need those features.
Maxar seems to be catering to their big data customers, and I get that. For the rest of us, I don't see a clear benefit.
Just chiming in here to register some interest in a convention / standard for how to represent STAC items in geoparquet.
In https://github.com/stac-utils/stac-geoparquet I made some arbitrary decisions around things like lifting properties to the top level of the parquet file.
This is not Maxar "catering to customers"; this is one Maxar developer (me) experimenting with a new cloud-native vector format. My interest in this format is that it can be read by DuckDB to provide queryability, since the ARD pipeline only delivers a system of static STAC catalogs.
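As an illustration of that queryability, a query like this can run directly against the file with DuckDB. This is just a sketch; the column names are assumed to match the current stac-geoparquet output.

import duckdb

# Count items per collection straight from the GeoParquet file.
con = duckdb.connect()
print(con.sql("""
    SELECT collection, COUNT(*) AS n_items
    FROM 'maxar-opendata.parquet'
    GROUP BY collection
    ORDER BY n_items DESC
""").df())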
@TomAugspurger I used stac-geoparquet to generate the file (thanks!) and am happy to assist with changes/ideas/experiments using the ODP data.
> a new cloud-native vector format
FlatGeobuf (FGB) uses a spatial index and HTTP 206 (partial content) requests. Build on that.
A queryable format might consist of multiple indexes, i.e., (geom, col1, col2), etc., alongside the data. Your query conditional, such as a 'where' clause, could use only a column in the index. Like FGB, multiple requests would be made.
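For comparison, this is roughly what a bbox-filtered read of a remote FGB looks like from Python. The URL is a placeholder since no FGB has been published for this data yet, and the bbox is an arbitrary area of interest.

import geopandas as gpd

# GDAL's FlatGeobuf driver uses the file's spatial index plus HTTP range
# requests, so only the features intersecting the bbox are fetched.
url = "https://example.com/maxar-opendata.fgb"  # placeholder URL
gdf = gpd.read_file(url, bbox=(-9.3, 30.8, -7.9, 31.7))
print(len(gdf), "features in the bbox")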
I still think you're putting the cart before the horse. Even with a queryable format, huge amounts of data can be returned, rendering your browser a boat anchor. Some of the examples I've seen with parquet and DuckDB, WASM, etc. are pretty awesome, but functionally useless in terms of the web client because of the amount of data returned.
Parquet is being touted as the miracle format that can transfer huge amounts of spatial data quickly. Great. But in terms of usefulness in the browser, it doesn't matter. Any cloud-native vector format will do just fine.
Thanks everyone for jumping into this issue. Just to be clear: in this issue we envision using the GeoParquet file format as an easy way to share a static catalog. And as the title of the repo suggests, we want to ingest the static catalog into any pgSTAC instance.
> Maxar seems to be catering to their big data customers, and I get that. For the rest of us, I don't see a clear benefit.
@PostholerCom, I don't think this is the kind of comment we are looking for in this issue. Thank you.
This experiment is not aimed at browser use. I can make FGB too if you like, but this is a volunteer effort that came out of a hackathon, and so far I only have the bandwidth to investigate one cloud-optimized vector format.
As we like to say, "PRs welcome and appreciated": jumping in, crawling the Open Data bucket, and making some FGBs for the project would go much further than unwarranted ranting.
Whoops, I forgot that for browser use I did generate a .pmtiles file, because there was some discussion of having source.coop automatically preview that format in their UI.
For a faster import, Maxar now has all the STAC Items as GeoParquet: https://maxar-opendata.s3.amazonaws.com/events/maxar-opendata.parquet
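A quick way to try the file, as a sketch: download it over HTTPS and inspect it with pyarrow.

import urllib.request
import pyarrow.parquet as pq

# Fetch the published GeoParquet file and take a quick look at its contents.
url = "https://maxar-opendata.s3.amazonaws.com/events/maxar-opendata.parquet"
urllib.request.urlretrieve(url, "maxar-opendata.parquet")

table = pq.read_table("maxar-opendata.parquet")
print(table.num_rows, "items")
print(table.schema.names[:10])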