stac-utils / pgstac

Schema, functions and a python library for storing and accessing STAC collections and items in PostgreSQL
MIT License

Validation Error and Content Mismatch in Sentinel-2-L1C Item Definition JSON #238

Open MathewNWSH opened 10 months ago

MathewNWSH commented 10 months ago

Hello, I managed to create a collection definition json and a sample item json. The sample item was created using a script inspired by https://github.com/dlr-eoc/EOmetadataTool. Then I found the Sentinel-2-L2A collection definition JSON on the Microsoft Planetary Computer. Based on this, and the sample item, I created a version (in my opinion) compatible with the L1C level of Sentinel-2. Here they are:

https://s3.waw3-2.cloudferro.com/swift/v1/stac/collection.json https://s3.waw3-2.cloudferro.com/swift/v1/stac/s2l1c_item.json

I managed to upload it to pgSTAC using:

pypgstac load collections /home/eouser/Downloads/collection.json
pypgstac load items /home/eouser/Desktop/s2_item.json

Then I checked the content of collections table: https://s3.waw3-2.cloudferro.com/swift/v1/stac/collection_base_item.json https://s3.waw3-2.cloudferro.com/swift/v1/stac/collection_content.json

and items table: https://s3.waw3-2.cloudferro.com/swift/v1/stac/item_content.json

As you can see, in the items table, the column labeled "content" contains the "𒍟※" sign:

    "safe_manifest": {
      "href": "/eodata/Sentinel-2/MSI/L1C/2024/01/16/S2A_MSIL1C_20240116T000741_N0510_R130_T51CVP_20240116T010505.SAFE/manifest.safe",
      "title": "𒍟※"
    },
    "granule_metadata": {
      "href": "/eodata/Sentinel-2/MSI/L1C/2024/01/16/S2A_MSIL1C_20240116T000741_N0510_R130_T51CVP_20240116T010505.SAFE/GRANULE/L1C_T51CVP_A044743_20240116T000743/MTD_TL.xml",
      "title": "𒍟※"
    },
    "inspire_metadata": {
      "href": "/eodata/Sentinel-2/MSI/L1C/2024/01/16/S2A_MSIL1C_20240116T000741_N0510_R130_T51CVP_20240116T010505.SAFE/INSPIRE.xml",
      "title": "𒍟※"
    },
    "product_metadata": {
      "href": "/eodata/Sentinel-2/MSI/L1C/2024/01/16/S2A_MSIL1C_20240116T000741_N0510_R130_T51CVP_20240116T010505.SAFE/MTD_MSIL1C.xml",
      "title": "𒍟※"
    },
    "datastrip_metadata": {
      "href": "/eodata/Sentinel-2/MSI/L1C/2024/01/16/S2A_MSIL1C_20240116T000741_N0510_R130_T51CVP_20240116T010505.SAFE/DATASTRIP/DS_2APS_20240116T010505_S20240116T000743/MTD_DS.xml",
      "title": "𒍟※"

even though the content of the title was defined within the base item in collections table.

Then I tried to validate my initial JSON files using stac validate:

(python3.11) eouser@ubuntu-wms:~$ stac validate /home/eouser/Downloads/collection.json
[
    {
        "version": "1.0.0",
        "path": "/home/eouser/Downloads/collection.json",
        "schema": [
            "https://schemas.stacspec.org/v1.0.0/collection-spec/json-schema/collection.json"
        ],
        "valid_stac": true,
        "asset_type": "COLLECTION",
        "validation_method": "recursive"
    }
]
(python3.11) eouser@ubuntu-wms:~$ stac validate /home/eouser/Desktop/s2l1c_item.json
[
    {
        "version": "1.0.0",
        "path": "/home/eouser/Desktop/s2_item.json",
        "schema": [
            "https://schemas.stacspec.org/v1.0.0/item-spec/json-schema/item.json"
        ],
        "valid_stac": false,
        "asset_type": "ITEM",
        "validation_method": "recursive",
        "error_type": "JSONSchemaValidationError",
        "error_message": "'sentinel-2-l1c' should not be valid under {}. Error is in collection"
    }
]

But I can't quite work out what the error message means.

Could you please guide me on what is wrong with the item definition JSON and how to correctly create a corresponding collection with items?

m-mohr commented 10 months ago

"'sentinel-2-l1c' should not be valid under {}. Error is in collection"

This message says, in a very obscure way, that the Item is missing a link with the rel type collection pointing to the collection that the Item references in its collection property.

You need to add a link such as the following to the Item to pass validation:

    {
      "rel": "collection",
      "href": "./collection.json",
      "type": "application/json"
    }
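
As a minimal sketch of that fix, the link could also be added programmatically before loading the item. The helper name add_collection_link and the sample item below are hypothetical; only the link object itself comes from the comment above.

```python
import json


def add_collection_link(item: dict, href: str = "./collection.json") -> dict:
    """Return a copy of the item that carries a rel="collection" link.

    Hypothetical helper: it only edits the Item's "links" array and does
    not check that href really points at the collection document.
    """
    links = list(item.get("links", []))
    # Append the link only if no rel="collection" link exists yet.
    if not any(link.get("rel") == "collection" for link in links):
        links.append({"rel": "collection", "href": href, "type": "application/json"})
    return {**item, "links": links}


# Made-up, heavily shortened item for illustration.
item = {"type": "Feature", "id": "example-item", "collection": "sentinel-2-l1c", "links": []}
patched = add_collection_link(item)
print(json.dumps(patched["links"], indent=2))
```

The href should be whatever path or URL actually resolves to the collection JSON; "./collection.json" is just the value used in the example link.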

As you can see, in the items table, the column labeled "content" contains the "𒍟※" sign: [...] even though the content of the title was defined within the base item in collections table.

I think the validation error is not what leads to the 𒍟※. These characters form a "magic marker" that indicates that a key should not be rehydrated: https://github.com/search?q=repo%3Astac-utils%2Fpgstac%20%F0%92%8D%9F%E2%80%BB&type=code

As far as I understand it, it's an internal marker that is not exposed to the public via the API and is intentionally added to the database for deduplication purposes. I think it will be replaced with the value from the Item Assets Definition that is defined in the corresponding Collection (internally: a base item). Did you check how stac-fastapi makes the items available through the API? I think it should output the item as expected.

Disclaimer: That's how I read the code, I've seen this marker today for the first time ;-)

bitner commented 7 months ago

Yes, we use the item_assets property of a collection to reduce the size of the item JSON that is stored per item. This can make a huge difference in size for items that have large numbers of assets where every item has some number of the same properties. In instances where there is an asset type in the collection's item_assets that is not present in an item's assets, we use the "𒍟※" marker to indicate that we should not pull that property in from the item_assets on the collection (we use the base_item "view" of the collection to coerce the properties from the collection to look like an item's JSON).

Additionally, we strip the geometry, id, collection, and type from the item JSON to save disk space, since those fields are promoted to actual columns in Postgres.

In the code, you can see this process referred to as hydration/dehydration going between the external json representation of a STAC Item and the pgstac internal storage.
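
To make the idea concrete, here is a simplified sketch of dehydration and hydration against a base item. It is an illustration only and not pgstac's actual implementation: the function names, the recursion rules, and the sample data are assumptions, and the marker is written with escapes taken from the percent-encoded bytes in the search URL quoted earlier in this thread.

```python
# Marker string, spelled with escapes (bytes F0 92 8D 9F and E2 80 BB in UTF-8).
MARKER = "\U0001235F\u203B"


def dehydrate(item: dict, base: dict) -> dict:
    """Drop values that equal the base item; mark keys the item lacks."""
    out = {}
    for key, value in item.items():
        if key in base and base[key] == value:
            continue  # identical to the base item, so it need not be stored
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            out[key] = dehydrate(value, base[key])
        else:
            out[key] = value
    for key in base:
        if key not in item:
            out[key] = MARKER  # in the base but not in the item: do not merge back
    return out


def hydrate(stored: dict, base: dict) -> dict:
    """Merge the base item back in, skipping keys that carry the marker."""
    out = {}
    for key in list(base) + [k for k in stored if k not in base]:
        if key not in stored:
            out[key] = base[key]
        elif stored[key] == MARKER:
            continue  # the original item never had this key
        elif isinstance(stored[key], dict) and isinstance(base.get(key), dict):
            out[key] = hydrate(stored[key], base[key])
        else:
            out[key] = stored[key]
    return out


# Made-up data: the base item defines a title that the item's asset lacks.
base = {"assets": {"safe_manifest": {"title": "SAFE manifest", "type": "application/xml"}}}
item = {"assets": {"safe_manifest": {"href": "manifest.safe", "type": "application/xml"}}}

stored = dehydrate(item, base)
print(stored)
print(hydrate(stored, base) == item)
```

In this sketch the stored asset keeps its href, drops the type it shares with the base item, and gets the marker as its title, which resembles the content column shown in the original report. Hydrating the stored form gives back the original item.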

The only validation that happens in pgstac is:

  1. The item must be valid json
  2. The item must have an id
  3. The item must have a collection that is already present in the collections table
  4. The item must have either a properties.datetime OR both a properties.start_datetime AND properties.end_datetime
  5. The item must have a valid geojson geometry that can be converted by st_fromgeojson into a postgis geometry
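
The five checks above can be roughly approximated on the client side before loading. The sketch below is not pgstac code: precheck_item and known_collections are hypothetical names, and the geometry test is only a crude stand-in for what st_fromgeojson would accept.

```python
import json


def precheck_item(raw: str, known_collections: set[str]) -> list[str]:
    """Return a list of problems; an empty list means all checks passed."""
    try:
        item = json.loads(raw)  # 1. must be valid JSON
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(item, dict):
        return ["not a JSON object"]

    problems = []
    if not item.get("id"):  # 2. must have an id
        problems.append("missing id")
    if item.get("collection") not in known_collections:  # 3. collection must exist
        problems.append("collection not loaded")

    props = item.get("properties")
    props = props if isinstance(props, dict) else {}
    has_datetime = props.get("datetime") is not None
    has_range = (props.get("start_datetime") is not None
                 and props.get("end_datetime") is not None)
    if not (has_datetime or has_range):  # 4. datetime or a start/end pair
        problems.append("missing datetime or start_datetime/end_datetime")

    geom = item.get("geometry")
    # 5. crude shape check only; PostGIS does the real conversion
    if not (isinstance(geom, dict) and "type" in geom
            and ("coordinates" in geom or "geometries" in geom)):
        problems.append("geometry does not look like GeoJSON")
    return problems


# Made-up item for illustration.
raw = json.dumps({
    "id": "example-item",
    "collection": "sentinel-2-l1c",
    "properties": {"datetime": "2024-01-16T00:07:41Z"},
    "geometry": {"type": "Point", "coordinates": [121.0, -75.0]},
})
print(precheck_item(raw, {"sentinel-2-l1c"}))
```

Passing these checks says nothing about STAC schema validity; that is what a separate validator is for.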

zacdezgeo commented 7 months ago

@MathewNWSH, have you been able to resolve your issue? Is there anything to follow up on?