microsoft / PlanetaryComputer

Issues, discussions, and information about the Microsoft Planetary Computer
https://planetarycomputer.microsoft.com/
MIT License

Significant numbers of missing S1 RTC scenes #372

Open JackDunnNZ opened 3 months ago

JackDunnNZ commented 3 months ago

We are observing that a large number of S1 RTC scenes are not present in the catalog, but are present in the raw data.

The following code searches over an arbitrary AOI (in this case the African continent) and date range (January through July 2024), then compares the scenes in the RTC catalog to those in the Earth Search catalog (which is not the true raw data, but serves as a convenient proxy for it):

import planetary_computer
import pystac_client

bbox = [-17.578125000000004, -36.3151251474805, 54.84375000000001, 37.43997405227057]
datetime_end = "2024-08-01"
datetime_start = "2024-01-01"

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)
search = catalog.search(
    collections=["sentinel-1-rtc"],
    datetime=f"{datetime_start}/{datetime_end}",
    bbox=bbox,
    fields={"include": ["id"], "exclude": ["links", "assets", "properties", "bbox", "geometry"]},
    limit=1000,
)
items1 = list(search.items_as_dicts())
print(f"Found {len(items1)} items")
ids1 = set(item["id"] for item in items1)

# Run the same search against the Earth Search catalog for comparison
client = pystac_client.Client.open("https://earth-search.aws.element84.com/v1")
search = client.search(
    collections=["sentinel-1-grd"],
    datetime=f"{datetime_start}/{datetime_end}",
    bbox=bbox,
    fields={"include": ["id"], "exclude": ["links", "assets", "bbox", "geometry"]},
    limit=1000,
)
items2 = list(search.items_as_dicts())
print(f"Found {len(items2)} items")
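# Append "_rtc" so the GRD IDs can be compared against the RTC item IDs
ids2 = set(item["id"] + "_rtc" for item in items2)

# IDs expected from the GRD search that are absent from the RTC search
n_missing = len(ids2 - ids1)
print(f"{n_missing} items ({n_missing * 100 / len(items2):.1f}%) missing from RTC catalog")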

For this particular query, the RTC catalog is missing 3116 of 21290 scenes (~15%). At the risk of sounding ungrateful for such an excellent resource, these gaps make it hard for us to rely on the RTC catalog as a data source, and ideally we would like to avoid the effort of managing the data ourselves.
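To check whether the gaps cluster in time, the missing IDs could be grouped by acquisition month. The following is only a rough sketch: `missing_by_month` is a hypothetical helper, the sample IDs are made up for illustration, and it assumes each ID embeds the acquisition start as `YYYYMMDDTHHMMSS`, as in the standard Sentinel-1 product naming convention.

```python
import re
from collections import Counter

# Assumption: scene IDs embed the acquisition start time as _YYYYMMDDTHHMMSS_,
# following the standard Sentinel-1 product naming convention.
_TIMESTAMP = re.compile(r"_(\d{4})(\d{2})\d{2}T\d{6}_")


def missing_by_month(expected_ids, actual_ids):
    """Count IDs in expected_ids but not in actual_ids, keyed by 'YYYY-MM'."""
    counts = Counter()
    for scene_id in set(expected_ids) - set(actual_ids):
        match = _TIMESTAMP.search(scene_id)
        # IDs without a recognizable timestamp are grouped under "unknown"
        key = f"{match.group(1)}-{match.group(2)}" if match else "unknown"
        counts[key] += 1
    return dict(sorted(counts.items()))


# Hypothetical IDs for illustration only (not real scenes).
expected = {
    "S1A_IW_GRDH_1SDV_20240105T000000_20240105T000025_000001_00000A_rtc",
    "S1A_IW_GRDH_1SDV_20240117T000000_20240117T000025_000002_00000B_rtc",
    "S1A_IW_GRDH_1SDV_20240302T000000_20240302T000025_000003_00000C_rtc",
}
actual = {
    "S1A_IW_GRDH_1SDV_20240105T000000_20240105T000025_000001_00000A_rtc",
}

print(missing_by_month(expected, actual))
```

With the script above, this would be called as `missing_by_month(ids2, ids1)`.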

I see from older issues there was a plan for a validation process that would help prevent such gaps in the catalog. Has there been progress on that front, and is there any way we could help at all?