Open m-mohr opened 1 year ago
Thanks @m-mohr !
I'd be keen to see an alignment of Science on Schema guidelines from the Earth Science Information Partners Federation with the schema.org being generated from STAC JSON. (because it would be so nice if these two JSON metadata formats from two widely adopted earth science communities were more interoperable :blush: !)
@cboettig ~Is there an equivalent for STAC Collections in SOSO? I'm using a DataCatalog right now, but it seems not to be part of SOSO.~ Edit: Just found https://github.com/ESIPFed/science-on-schema.org/blob/master/guides/Dataset.md#collections-of-datasets-using-schemaorg-datacatalog - I'm currently struggling to get the DataCatalog to be included in Google Dataset Search though...
I'll have a look at the Dataset guideline on what we can improve in the Browser. I don't think there's a good equivalent for Data Repository? Maybe the root catalog? What do you think is the best way to align SOSO and STAC?
One issue with Dataset is that it is likely what STAC Items are, but STAC Items are often not very descriptive (just have an id, but no title or description). Or should STAC Collections be Datasets?
Thanks @m-mohr -- good questions! I should follow up with the ESIP devs, I'm mostly a data consumer working across products in both stac and ESIP and hoping to connect the dots!
I think there are analogous concepts for the collection / catalog / item levels of STAC, but I'm not sure which choices are best. My understanding is that schema.org/Dataset was based on the original W3C DCAT (Data Catalog) standard, now in its 3rd version, which I think has all these notions. I know the ESIP folks know the W3C standards well and I think their style of schema.org roughly parallels that, but I'm no expert here.
@mbjones or others probably have good advice here.
Thanks. So right now I'm mapping:

- STAC Collection (or Catalog) -> DataCatalog
- STAC Item -> Dataset
- STAC Asset -> DataDownload
I'm not sure whether that's ideal though due to the limited information in a STAC Item. What we can find now in GDS is just Datasets with limited information, but no DataCatalogs, which have much more information.
Any insights would be appreciated.
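For readers following along, the mapping described above can be sketched in a few lines. This is only an illustration and not the STAC Browser implementation: the function names are hypothetical, the input dicts are assumed to follow the usual STAC layout (`id`, `properties`, `assets`), and only a handful of schema.org properties are filled in.

```python
import json


def asset_to_data_download(key, asset):
    """STAC Asset -> schema.org DataDownload (hypothetical helper)."""
    download = {
        "@type": "DataDownload",
        "contentUrl": asset["href"],
        # Fall back to the asset key when the asset has no title.
        "name": asset.get("title", key),
    }
    if "type" in asset:
        download["encodingFormat"] = asset["type"]
    return download


def item_to_dataset(item):
    """STAC Item -> schema.org Dataset (hypothetical helper)."""
    props = item.get("properties", {})
    dataset = {
        "@type": "Dataset",
        "identifier": item["id"],
        # Items often have only an id, so the id doubles as the name.
        "name": props.get("title", item["id"]),
        "distribution": [
            asset_to_data_download(key, asset)
            for key, asset in item.get("assets", {}).items()
        ],
    }
    if "description" in props:
        dataset["description"] = props["description"]
    return dataset


def collection_to_data_catalog(collection, items):
    """STAC Collection (or Catalog) -> schema.org DataCatalog."""
    return {
        "@context": "https://schema.org/",
        "@type": "DataCatalog",
        "identifier": collection["id"],
        "name": collection.get("title", collection["id"]),
        "description": collection.get("description", ""),
        "dataset": [item_to_dataset(item) for item in items],
    }


if __name__ == "__main__":
    # Made-up example data, not taken from a real catalog.
    example_item = {
        "id": "scene-1",
        "properties": {},
        "assets": {
            "data": {"href": "https://example.com/scene-1.tif", "type": "image/tiff"}
        },
    }
    example_collection = {"id": "example", "description": "An example collection."}
    print(json.dumps(collection_to_data_catalog(example_collection, [example_item]), indent=2))
```

Note how the Dataset generated for the example item carries little more than an identifier and a download link, which is the limitation raised above.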
This seems reasonable to me at least. I'm also interested in the stac extensions or at least those extensions that have good parallels to science-on-schema (e.g. scientific citation, file info, table).
In some ways mapping such extensions to schema.org is particularly compelling where there are schema.org based dataset browsing tools that can already take advantage of indexing on such fields as "author" or "column name" that are not as first-class in stac search....
@m-mohr regarding your comment:
> I'm currently struggling to get the DataCatalog to be included in Google Dataset Search though...
yeah, I noticed that too. I got some great advice from @mbjones on possible culprits for this:
> Google tools do support SO markup in pages loaded with javascript generation, but there are timeouts and other issues to pay attention to. Records will fail the google ingest if some key metadata are outside google’s parameters; for example, schema:description must be > 50 and < 5000 characters or google will reject the record.
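A quick pre-flight check for the description limits mentioned in that advice could look like the sketch below. The bounds are copied from the quoted comment (more than 50 and fewer than 5000 characters), not from Google's documentation, and the function name is made up for illustration.

```python
# Bounds as stated in the advice quoted in this thread.
MIN_DESCRIPTION_CHARS = 50
MAX_DESCRIPTION_CHARS = 5000


def description_problem(record):
    """Return a short message if the record's description looks out of range, else None."""
    description = record.get("description")
    if not isinstance(description, str) or not description.strip():
        return "description is missing"
    length = len(description.strip())
    if length <= MIN_DESCRIPTION_CHARS:
        return f"description too short ({length} characters)"
    if length >= MAX_DESCRIPTION_CHARS:
        return f"description too long ({length} characters)"
    return None
```

Running this over the generated Dataset records would flag STAC Items that carry only an id and no usable description.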
> Do you have a sitemap.xml that directs google to these landing pages for crawling? Sometimes google doesn’t find stuff to crawl without a sitemap. See: https://github.com/ESIPFed/science-on-schema.org/blob/master/guides/GETTING-STARTED.md#sitemaps
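As a rough sketch of the sitemap suggestion, a minimal `sitemap.xml` listing landing pages can be generated with the Python standard library. The URLs are placeholders and `build_sitemap` is a hypothetical helper; how the landing-page URLs are collected from a catalog is left out.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Build a minimal sitemap.xml document with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Placeholder landing pages; replace with the real collection and item pages.
    print(build_sitemap([
        "https://example.com/browser/collections/example",
        "https://example.com/browser/collections/example/items/scene-1",
    ]))
```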
Also, while I think the mapping you have

> - STAC Collection (or Catalog) -> DataCatalog
> - STAC Item -> Dataset
> - STAC Asset -> DataDownload

makes sense from a literal/technical standpoint, it does look like a lot of metadata fields often found on a Dataset in ESIP wind up only on the DataCatalog for a STAC entry, e.g. using the Google Rich Results Test:
which will wind up with lots of useful stuff being missed (e.g. spatial coverage, temporal coverage, creator, licence, copyrightHolder, producer, provider, keywords, etc would I think all be cut off from the Dataset search since they aren't properties of the Dataset). Not sure if there's a good way to handle 'inheritance' in this context?
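One possible way to approximate "inheritance" is to copy selected properties from the parent DataCatalog onto each Dataset when the Dataset does not define them itself. The sketch below assumes both records are plain JSON-LD dicts; the property list mirrors the fields named above (using the schema.org spelling `license`), and `inherit_from_catalog` is a hypothetical name. Whether duplicating metadata like this is appropriate is exactly the open question here.

```python
# schema.org properties to copy down from the catalog when the dataset lacks them.
INHERITED_PROPERTIES = (
    "spatialCoverage",
    "temporalCoverage",
    "creator",
    "license",
    "copyrightHolder",
    "producer",
    "provider",
    "keywords",
)


def inherit_from_catalog(dataset, catalog, properties=INHERITED_PROPERTIES):
    """Return a copy of the dataset with missing properties filled in from the catalog."""
    enriched = dict(dataset)
    for name in properties:
        # Values set on the dataset itself always win over catalog values.
        if name not in enriched and name in catalog:
            enriched[name] = catalog[name]
    return enriched
```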
Down the road, it would be really nice if some of the common extensions could also be translated into schema.org. e.g. I think there's a really clean/simple mapping for the scientific citation extension and the table extension into schema.org / ESIP science-on-schema conventions which I'd love to see included. Please let me know if I should open a separate issue for that. Our community may be able to contribute a PR if interested (and I can find who knows javascript well...)
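To make the idea concrete, here is one rough way such a translation might look. The STAC field names (`sci:doi`, `sci:publications`, `table:columns`) come from the scientific citation and table extensions; the choice of schema.org targets (`identifier`, `citation`, `variableMeasured`) is an assumption for illustration and has not been checked against the science-on-schema.org guides. `sci:citation` is left unmapped because its best schema.org counterpart is unclear.

```python
def extensions_to_schema_org(properties):
    """Translate a few STAC extension fields into schema.org properties (illustrative only)."""
    result = {}

    doi = properties.get("sci:doi")
    if doi:
        result["identifier"] = f"https://doi.org/{doi}"

    citations = []
    for publication in properties.get("sci:publications", []):
        # Prefer the human-readable citation; fall back to the DOI URL.
        if publication.get("citation"):
            citations.append(publication["citation"])
        elif publication.get("doi"):
            citations.append(f"https://doi.org/{publication['doi']}")
    if citations:
        result["citation"] = citations

    variables = []
    for column in properties.get("table:columns", []):
        variable = {"@type": "PropertyValue", "name": column["name"]}
        if "description" in column:
            variable["description"] = column["description"]
        variables.append(variable)
    if variables:
        result["variableMeasured"] = variables

    return result
```

The returned dict could be merged into the Dataset (or DataCatalog) record generated for the same STAC entity.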
@cboettig Your comments are appreciated, thanks! I don't have the time right now to work on it, but I'll get back to it eventually.
Thanks for the heads up and no worries! Appreciate all the amazing work you're doing here.
Google Search makes good use of STAC Browser: https://www.google.de/search?q=site:mspc.lutana.de
Google DatasetSearch also picks it up, but the data is not ideal yet: https://datasetsearch.research.google.com/search?src=0&query=Planet%20NICFI&docid=L2cvMTF0dDk0bGd6aw%3D%3D
Especially the schema.org data should be improved if possible.