Open tomalrussell opened 6 months ago
I am experiencing the same issue
The latest version of the data is always released here through an updated dataset-links.csv. There can be delays in other sources, such as Planetary Computer, where there is a handoff. We use dataset-links.csv and the JSON format to try to maintain backward compatibility: older releases were smaller and could be shared by country name, or by state name in the case of the US release. There is certainly a case to be made to 1) open the storage account for bulk transfer and 2) use a more compressed format like GeoParquet (which could cause other user headaches).
The readme updates and the dataset-links.csv dates do not match. Data deliveries are automated, but the readme is manual. We can update the documentation process to ensure dates align.
+1 for GeoParquet, which is super efficient, especially with tools like DuckDB. Good to know that dataset-links.csv is always up to date, as that process is very easy to use, even if slower than Parquet.
Thanks @andwoi, that helps - I think having the dates in the folder structure align with the README note would have cleared up at least part of my confusion.
Am I right to understand that the whole dataset is updated or re-provided through your delivery pipeline each release? Would it be possible/interesting to include a "last-updated" column and/or an "imagery dates" column as additional metadata in dataset-links.csv?
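In the meantime, the date that appears in each URL path can be tallied locally to see which dates a given copy of dataset-links.csv refers to. This is only a rough sketch: it assumes each row holds one URL with a `/YYYY-MM-DD/` path segment, and the sample rows, column names, and `example.invalid` host are made up for illustration, not taken from the real file.

```python
import csv
import io
import re
from collections import Counter

# Assumption: the date of interest appears as a /YYYY-MM-DD/ path segment in a URL field.
DATE_SEGMENT = re.compile(r"/(\d{4}-\d{2}-\d{2})/")


def count_release_dates(csv_text: str) -> Counter:
    """Count rows per date segment found in any field of each CSV row."""
    counts = Counter()
    for row in csv.reader(io.StringIO(csv_text)):
        for field in row:
            match = DATE_SEGMENT.search(field)
            if match:
                counts[match.group(1)] += 1
                break  # count each row at most once
    return counts


# Hypothetical sample; the real file's columns and paths may differ.
SAMPLE = """\
Location,Url
RegionA,https://example.invalid/global-buildings/2023-12-26/part-0.csv.gz
RegionB,https://example.invalid/global-buildings/2024-01-03/part-1.csv.gz
RegionC,https://example.invalid/global-buildings/2024-01-03/part-2.csv.gz
"""

if __name__ == "__main__":
    for date, n in sorted(count_release_dates(SAMPLE).items()):
        print(date, n)
```

A date in the path is not necessarily the date the file was last changed, so this does not replace an explicit "last-updated" column.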
First, thanks and much appreciation for making this both open and accessible, both via the dataset-links.csv linked from this repository and on the planetary computer data catalog!
I can happily access either source, but I'm not clear which data releases are available in each location, or how best to pull updates.
- `https://minedbuildings.blob.core.windows.net/global-buildings/dataset-links.csv` lists all URLs with `2023-12-26` in the path. Is `2023-12-26` the release date for all these files?
- The README.md here lists an update for `2024-01-03`, particularly for buildings in Brazil and Italy. Is this update included in files linked from `dataset-links.csv`?
- The Planetary Computer example notebook shows how to access the data as a Delta Table, which lists URIs under `2023-04-25/ml-buildings.parquet`, and `table.history()` gives a single WRITE operation at timestamp 1682774982678, around `2023-04-29`. Are any of the more recent updates listed in this repository present in that parquet dataset, or are there plans to push updates there?

Should I be aware of tools to help with bulk access or reading metadata for either location?
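For reference, this is how the `2023-04-29` date can be read from the `table.history()` timestamp quoted above. It is a minimal sketch that assumes the value is in milliseconds since the Unix epoch; the helper name `millis_to_utc` is my own.

```python
from datetime import datetime, timezone


def millis_to_utc(ms: int) -> datetime:
    """Convert a Unix timestamp in milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


# Timestamp of the single WRITE operation reported by table.history().
written_at = millis_to_utc(1682774982678)
print(written_at.date().isoformat())  # 2023-04-29
```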
I can request a signed URL with an SAS token for the delta table blob storage container from https://planetarycomputer.microsoft.com/api/sas/v1/sign?href=https://bingmlbuildings.blob.core.windows.net/footprints/delta and give that to `azcopy list`, though I don't discover any other versions or updates there.

I'm not sure how to directly access or list all files under `https://minedbuildings.blob.core.windows.net/global-buildings`, only directly accessing those listed in the CSV, or linked from the README in history here (e.g. Abyei). Are all versions (or all latest versions, with release/update metadata) intended to be accessible?
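To make the listing question concrete: Azure Blob Storage has a List Blobs REST operation (`restype=container&comp=list`) that returns an XML listing of a container. Below is a rough sketch of building such a request URL and parsing a response. Whether this container permits anonymous listing is exactly what I don't know, so the sketch does no network access; the helper names and the sample XML are hypothetical.

```python
import urllib.parse
import xml.etree.ElementTree as ET


def list_blobs_url(container_url: str, prefix: str = "", marker: str = "") -> str:
    """Add List Blobs query parameters to a container URL, keeping any existing query (e.g. a SAS token)."""
    parts = urllib.parse.urlsplit(container_url)
    query = urllib.parse.parse_qsl(parts.query, keep_blank_values=True)
    query += [("restype", "container"), ("comp", "list")]
    if prefix:
        query.append(("prefix", prefix))
    if marker:
        query.append(("marker", marker))  # continuation token from the previous page
    return urllib.parse.urlunsplit(parts._replace(query=urllib.parse.urlencode(query)))


def parse_blob_listing(xml_text: str) -> tuple[list[str], str]:
    """Return (blob names, next marker) from a List Blobs XML response body."""
    root = ET.fromstring(xml_text)
    names = [name.text or "" for name in root.iterfind("./Blobs/Blob/Name")]
    next_marker = root.findtext("NextMarker") or ""
    return names, next_marker


# Made-up response body in the shape of a List Blobs result, for illustration only.
SAMPLE_LISTING = """<?xml version="1.0"?>
<EnumerationResults>
  <Blobs>
    <Blob><Name>2023-12-26/part-0.csv.gz</Name></Blob>
    <Blob><Name>2023-12-26/part-1.csv.gz</Name></Blob>
  </Blobs>
  <NextMarker />
</EnumerationResults>"""

if __name__ == "__main__":
    print(list_blobs_url("https://example.invalid/container", prefix="2023-12-26/"))
    print(parse_blob_listing(SAMPLE_LISTING))
```

If the container does not allow listing, the request would simply fail, and the CSV would remain the only index.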