microsoft / GlobalMLBuildingFootprints

Worldwide building footprints derived from satellite imagery

How to access latest data and understand versions #88

Open tomalrussell opened 6 months ago

tomalrussell commented 6 months ago

First, thanks and much appreciation for making this open and accessible, both via the dataset-links.csv linked from this repository and on the Planetary Computer data catalog!

I can happily access either source, but I'm not clear which data releases are available in each location, or how best to pull updates.

Should I be aware of tools to help with bulk access or reading metadata for either location?

edkry commented 4 months ago

I am experiencing the same issue.

andwoi commented 3 months ago

The latest version of the data is always released here through an updated dataset-links.csv. There can be delays in other sources, such as Planetary Computer, where a handoff is involved. We use dataset-links.csv and the JSON format to try to maintain backward compatibility: older releases were smaller and could be shared by country name (or by state name, in the case of the US release). There is certainly a case to be made for 1) opening the storage account for bulk transfer and 2) using a more compressed format like GeoParquet (which could cause other headaches for users).

The dates in the README updates and in dataset-links.csv do not match because data deliveries are automated, while the README is updated manually. We can update the documentation process to ensure the dates align.
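For anyone wanting to check which releases a given copy of dataset-links.csv points at, here is a minimal sketch. It assumes the file has a URL column named `Url` and that each URL contains a date-named folder (YYYY-MM-DD); both are assumptions rather than anything confirmed in this thread, and the sample rows are made up for illustration.

```python
import csv
import io
import re
from collections import Counter

# Assumption: release folders in the URL path are named with an ISO date.
DATE_IN_PATH = re.compile(r"/(\d{4}-\d{2}-\d{2})/")


def release_dates(csv_text, url_column="Url"):
    """Count rows per release date found in the (assumed) URL column."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        match = DATE_IN_PATH.search(row.get(url_column) or "")
        # Rows without a recognisable date are grouped under "unknown".
        counts[match.group(1) if match else "unknown"] += 1
    return dict(counts)


# Made-up rows for illustration only; these are not real dataset links.
SAMPLE = """Location,QuadKey,Url
Examplia,0123,https://example.com/buildings/2024-01-15/part-0.csv.gz
Examplia,0132,https://example.com/buildings/2024-01-15/part-1.csv.gz
Otherland,0301,https://example.com/buildings/2023-09-01/part-0.csv.gz
"""

if __name__ == "__main__":
    for date, n_rows in sorted(release_dates(SAMPLE).items()):
        print(date, n_rows)
```

If the real column name or folder layout differs, adjust `url_column` and the regular expression accordingly.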

johnphilippowell commented 3 months ago

+1 for GeoParquet; it is super efficient, especially with tools like DuckDB. Good to know that dataset-links.csv is always up to date, as that process is very easy to use, even if slower than Parquet.

tomalrussell commented 3 months ago

Thanks @andwoi, that helps - I think having the dates in the folder structure align with the README note would have cleared up at least part of my confusion.

Am I right to understand that the whole dataset is updated or re-provided through your delivery pipeline with each release? Would it be possible/interesting to include a "last-updated" column and/or an "imagery dates" column as additional metadata in dataset-links.csv?
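Until such metadata exists, one rough way to see what a release changed is to keep the previous copy of dataset-links.csv and diff it against the new one. The sketch below assumes rows are identified by `Location` and `QuadKey` columns and carry a `Url` column; those names are assumptions, and the sample snapshots are made up for illustration.

```python
import csv
import io


def index_rows(csv_text, key_columns, url_column):
    """Map each row's key tuple to its URL."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {tuple(row[k] for k in key_columns): row[url_column] for row in reader}


def diff_snapshots(old_text, new_text,
                   key_columns=("Location", "QuadKey"), url_column="Url"):
    """Report keys that were added, removed, or now point at a different URL."""
    old = index_rows(old_text, key_columns, url_column)
    new = index_rows(new_text, key_columns, url_column)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in new.keys() & old.keys() if new[k] != old[k]),
    }


# Made-up snapshots for illustration only; these are not real dataset links.
OLD = """Location,QuadKey,Url
Examplia,0123,https://example.com/buildings/2023-09-01/part-0.csv.gz
Examplia,0132,https://example.com/buildings/2023-09-01/part-1.csv.gz
"""
NEW = """Location,QuadKey,Url
Examplia,0123,https://example.com/buildings/2024-01-15/part-0.csv.gz
Examplia,0132,https://example.com/buildings/2023-09-01/part-1.csv.gz
Otherland,0301,https://example.com/buildings/2024-01-15/part-0.csv.gz
"""

if __name__ == "__main__":
    print(diff_snapshots(OLD, NEW))
```

A changed URL only shows that the link moved; it does not by itself say whether the underlying footprints or imagery dates changed.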