connect ZMT catalogue as ODIS node

jmckenna commented 1 year ago

Summary:

- existing PANGAEA catalogue
- ZMT team are mapping metadata properties through the ODIS Book examples
- possible JSON-LD templates to also use:
  - minimal Dataset template
  - thorough Dataset template
  - event time-series example (this was created by another partner)
- ZMT team will inform OIH team when sitemap.xml or JSON-LD is ready
- ZMT team should also create an entry inside the ODIS Catalogue
  - important fields are Startpoint URL for ODIS-Arch (the URL to your sitemap) and Type of the ODIS-Arch URL (choose "sitemap")

This issue will allow questions, updates, and discussions by both teams.

cc @fils @pbuttigieg

acwittmann commented 12 months ago

Hi @jmckenna @fils @pbuttigieg @fspreck my colleague @uschindler from PANGAEA tested implementing "Event" for the JSON-LDs of PANGAEA datasets as in the ODIS Book example, see
https://doi.pangaea.de/10.1594/PANGAEA.948712?format=metadata_jsonld&incubation=true
It seems Google Search is quite particular when it comes to using the term "Event", as he promptly received the following error message. We may have to stick with working with temporal and spatial coverage, unless we (ZMT & OIH) do not need to worry about Google.

Problems of type "Structured Data Events" detected on doi.pangaea.de

To the owner of doi.pangaea.de:

The Search Console has identified that your website is affected by 13 problem(s) of type "Structured Data Events". The following problems have been found on your website. We recommend addressing these issues, if possible, to ensure optimal functioning and high visibility in Google search results.

Most common critical issues*

Missing "startDate" field

Missing "location" field

*Critical issues prevent a page or feature from appearing in search results.

Most common non-critical issues‡

Missing "offers" field

Missing "performer" field

Missing "eventAttendanceMode" field

Missing "eventStatus" field

Missing "image" field

‡Non-critical issues are suggestions for improvement. They do not prevent a page or feature from appearing in Google search results. Some non-critical issues may negatively impact the display in search results, while others may be escalated to critical issues later on.

uschindler commented 12 months ago

To add more information: the problem comes from "Event" having more than one meaning in English. Schema.org uses it in the sense of the German word "Veranstaltung" (an artistic event), not the abstract "Ereignis" (a generic event, as in PANGAEA).

The problem with Google interpreting the "subjectOf" relation is that the dataset is now linked to an artistic event. Google extracts multiple events from the datasets and also wants to publish them separately from the dataset as "artistic events", so in the end it will work like "user searches for a movie name" and Google presents events related to that. They extract all events from a given page (in our case a dataset) because in most cases cinema homepages have a list of events for a specific cinema hall, so for datasets they also expect multiple events as separate entities.

As PANGAEA wants to prevent its events from being shown as artistic events in Google Search, we have to stop adding events to the Schema.org output, as it is the wrong entity type.

P.S.: I am in contact with Natasha Noy regarding this.
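
For illustration only, here is a minimal sketch of what "stop adding events" could look like on the export side: drop any "subjectOf" nodes typed as "Event" and rely on temporalCoverage and spatialCoverage instead. The function name `strip_events` and the example record are made up for this sketch; this is not the actual PANGAEA exporter.

```python
import copy


def strip_events(dataset: dict) -> dict:
    """Return a copy of a schema.org Dataset dict without subjectOf Event nodes."""
    cleaned = copy.deepcopy(dataset)
    subject_of = cleaned.get("subjectOf")
    if subject_of is None:
        return cleaned
    # subjectOf may be a single node or a list of nodes
    nodes = subject_of if isinstance(subject_of, list) else [subject_of]
    kept = [
        node for node in nodes
        if not (isinstance(node, dict) and node.get("@type") == "Event")
    ]
    if kept:
        cleaned["subjectOf"] = kept
    else:
        # nothing left, so remove the property entirely
        del cleaned["subjectOf"]
    return cleaned


# Hypothetical record, not taken from PANGAEA
example = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Hypothetical dataset",
    "temporalCoverage": "2010-01-01/2010-12-31",
    "subjectOf": [{"@type": "Event", "name": "Hypothetical sampling event"}],
}
cleaned = strip_events(example)
```

The sketch only checks for a plain string "@type"; a node whose "@type" is a list would need an extra check.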

TimmFitschen commented 9 months ago

@jmckenna Hi, I am preparing the sitemap and JSON-LD resources. The documentation says that the crawlers expect a script tag inside an HTML document.

<script type="application/ld+json">JSON_LD content</script>

Is it possible to direct the crawler to a JSON-LD file directly? I mean, I know how to do this in the sitemap. The question is, rather, will the crawler accept that as well? Or do we need the "detour" via the HTML document?

jmckenna commented 9 months ago

@TimmFitschen if you're asking just about Google and other search engines, they expect the JSON-LD to be inline only (see related StackExchange thread). But I believe ODIS itself will accept it (@fils can you confirm?).
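
To make "inline" concrete, here is a rough sketch of how a crawler might pull JSON-LD out of a script tag in an HTML page, using only the Python standard library. It is an illustration of the general idea, not a description of how the ODIS harvester or Google actually parse pages; the class and function names are made up.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._chunks)))
            self._in_jsonld = False


def extract_jsonld(html: str) -> list:
    """Return every JSON-LD document embedded in the given HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.documents


# Hypothetical page content
page = (
    '<html><head><script type="application/ld+json">'
    '{"@type": "Dataset", "name": "Hypothetical dataset"}'
    "</script></head><body></body></html>"
)
docs = extract_jsonld(page)
```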

uschindler commented 9 months ago

Hi, yes, the source must be inside the script tag and therefore in the HTML. Technically it would be correct to add a href attribute to link to a URL. This would be better for mobile browsers, as the transfer size gets smaller, but according to the documentation this is not allowed.

I have a contact at Google; maybe there has been a change. An easy way is to simply test it: after setting it up with a href link you can run Google's structured data analyzer.

P.S.: PANGAEA also delivers the Schema.org JSON-LD when you do content negotiation on the landing page using the Accept header (see signposting.org). The F-UJI FAIR checker also uses content negotiation, if available.
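
As a sketch of what content negotiation means on the client side: the request goes to the landing page URL, and the Accept header asks for JSON-LD instead of HTML. The media type `application/ld+json` is an assumption here (it is the registered JSON-LD type); check the server's documentation for the values it actually honours. The helper names are made up.

```python
from urllib.request import Request, urlopen


def build_jsonld_request(landing_page_url: str) -> Request:
    """Prepare a GET request that asks the server for JSON-LD via the Accept header."""
    return Request(landing_page_url, headers={"Accept": "application/ld+json"})


def fetch_jsonld(landing_page_url: str, timeout: float = 30.0) -> bytes:
    """Send the negotiated request and return the raw response body."""
    with urlopen(build_jsonld_request(landing_page_url), timeout=timeout) as response:
        return response.read()


# Landing page URL taken from earlier in this thread; nothing is sent until
# fetch_jsonld() is called.
request = build_jsonld_request("https://doi.pangaea.de/10.1594/PANGAEA.948712")
```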

jmckenna commented 9 months ago

@uschindler interesting, I'm curious about Google's updated view on this, keep us posted.

fspreck-indiscale commented 8 months ago

Hi,

we now have the sitemap with all public datasets online. Is it possible to do a crawler test run before we enter it into the ODIS catalogue?

fspreck-indiscale commented 7 months ago

Hi @jmckenna, we updated our JSONs according to the things we discussed last time. Can you run your tests again against the sitemap and check whether

- @id looks fine
- publisher is now the correct property and located correctly within the JSON
- spatialCoverage works in the form of an array of Places, each specified by a GeoCoordinates object, instead of the boxes we had before
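
For readers following along, the spatialCoverage shape described in the last point can be sketched like this. The coordinates, the "@id" and the helper name `places_from_points` are invented for the illustration; they are not taken from the ZMT export.

```python
import json


def places_from_points(points):
    """Build a schema.org spatialCoverage list from (latitude, longitude) pairs."""
    return [
        {
            "@type": "Place",
            "geo": {"@type": "GeoCoordinates", "latitude": lat, "longitude": lon},
        }
        for lat, lon in points
    ]


# Hypothetical record with two sampling locations
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "@id": "https://example.org/dataset/1",
    "name": "Hypothetical dataset",
    "spatialCoverage": places_from_points([(-6.16, 39.19), (-5.72, 39.30)]),
}
print(json.dumps(dataset, indent=2))
```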

Thank you!

jmckenna commented 7 months ago

updates since today's meeting:

we now handle the type:GeoCoordinates as points in the ODIS front-end spatial search (see screen capture below of the ZMT spatial records)

[screen capture: zmt-geocoordinates]

42 records are missing a spatialCoverage such as https://dataportal.leibniz-zmt.de/Entity/19378

uschindler commented 7 months ago

Hi,

> 42 records are missing a spatialCoverage such as https://dataportal.leibniz-zmt.de/Entity/19378

This PANGAEA one has no spatial coverage. That's not an issue in your portal.

jmckenna commented 7 months ago

@uschindler today in the meeting I had mentioned that some records in the ZMT sitemap do not have spatialCoverage and the reaction from the ZMT team I believe was "all records should have spatialCoverage", so, I am not understanding your response.

uschindler commented 7 months ago

> @uschindler today in the meeting I had mentioned that some records in the ZMT sitemap do not have spatialCoverage and the reaction from the ZMT team I believe was "all records should have spatialCoverage", so, I am not understanding your response.

The problem is that the link posted is about data harvested from PANGAEA: https://dataportal.leibniz-zmt.de/Entity/19378; this entry refers to this PANGAEA dataset: https://doi.org/10.1594/PANGAEA.890177

This one has no spatial coverage and will never have one. That is correct. If you harvest PANGAEA, you have to live with the fact that datasets may not have a coverage. I won't try to explain here why there's no coverage available, but in short: it is not mandatory, and for this dataset there's no way to provide a coverage. It has none.

fspreck-indiscale commented 7 months ago

@jmckenna @uschindler, sorry, that was too bold a claim, then. And it will be even more so in the future, unfortunately, once we have included more non-PANGAEA datasets in the portal -- they will most probably not have geo information at all.

fspreck-indiscale commented 7 months ago

@jmckenna How does your frontend treat entries like https://dataportal.leibniz-zmt.de/Entity/18288 (view-source:https://dataportal.leibniz-zmt.de/oih/dataset_18288.html for the json, respectively) where we have an array of places in the spatial coverage? There should be a lot more points than datasets if you show the full array on your map (~900 locations vs ~150 datasets).

jmckenna commented 7 months ago

@fspreck-indiscale good point, we don't handle a list of geocoordinates yet, but we should. (we only use the first point) Thanks for reporting this.
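
A possible way to handle this, sketched here under the assumption that records look like the ZMT JSON-LD discussed above (spatialCoverage is absent, a single Place, or a list of Places with a GeoCoordinates "geo"). The function name is made up and this is not the current ODIS front-end code.

```python
def extract_points(record):
    """Return every (latitude, longitude) pair found in a record's spatialCoverage."""
    coverage = record.get("spatialCoverage")
    if coverage is None:
        return []  # coverage is optional, so a missing one is not an error
    places = coverage if isinstance(coverage, list) else [coverage]
    points = []
    for place in places:
        geo = place.get("geo") if isinstance(place, dict) else None
        if isinstance(geo, dict) and geo.get("@type") == "GeoCoordinates":
            try:
                points.append((float(geo["latitude"]), float(geo["longitude"])))
            except (KeyError, TypeError, ValueError):
                continue  # skip incomplete or non-numeric coordinates
    return points


# Hypothetical record with two locations
record = {
    "spatialCoverage": [
        {"@type": "Place", "geo": {"@type": "GeoCoordinates", "latitude": -6.16, "longitude": 39.19}},
        {"@type": "Place", "geo": {"@type": "GeoCoordinates", "latitude": "-5.72", "longitude": "39.30"}},
    ]
}
all_points = extract_points(record)
```

With something like this, the map would show one marker per location and a record without spatialCoverage simply contributes none.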

fspreck-indiscale commented 7 months ago

@jmckenna Hi, I just updated the JSONs again; they now have sdPublisher and creditText.

jmckenna commented 7 months ago

thanks @fspreck-indiscale, will do another harvest here...

fspreck-indiscale commented 7 months ago

@jmckenna We added keywords (simple array of strings for now) to some of the datasets; do they look good after harvesting?

The schema.org validator passes.

jmckenna commented 5 months ago

updates from meeting on 2024-01-15:

- ODISCat entry made: https://catalogue.odis.org/view/3289
- keywords were fixed
- preference for frequency of harvesting into ODIS: monthly, as specified in the sitemap
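
A sketch of how a monthly preference can be expressed in a sitemap, assuming it is stated through the standard `changefreq` element of the sitemaps.org protocol (the thread does not say which element the ZMT sitemap uses). The lastmod date below is invented; the URL is one of the ZMT JSON-LD pages mentioned in this thread.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries, changefreq="monthly"):
    """Build a sitemap string from (url, lastmod) pairs; lastmod is YYYY-MM-DD."""
    ET.register_namespace("", SITEMAP_NS)  # serialize with a default namespace
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
        ET.SubElement(url, f"{{{SITEMAP_NS}}}changefreq").text = changefreq
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_sitemap(
    [("https://dataportal.leibniz-zmt.de/oih/dataset_19689.html", "2024-01-15")]
)
```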

jmckenna commented 3 months ago

@fspreck-indiscale thanks for updating the keywords syntax. I notice that some have odd characters inside the JSON-LD, such as this record:

landing page: https://dataportal.leibniz-zmt.de/Entity/19689
JSON-LD: https://dataportal.leibniz-zmt.de/oih/dataset_19689.html

 "keywords": [
  "coral climatology",
  "oxygen isotope",
  "trace elements ratio",
  "\u03b418Oseawater"
 ],

fspreck-indiscale commented 3 months ago

Hi @jmckenna, good point, we've not considered these characters so far. Escaping non-ASCII characters is the safe default of the exporter, but it is by no means necessary (on the landing page, it's the UTF-8 δ). May we use UTF-8 strings in the JSON-LD?
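
To show that both forms are equivalent JSON, here is a small sketch using Python's json module as a stand-in (the actual exporter may be something else entirely). The escaped form and the UTF-8 form decode to the same strings; only the serialized bytes differ.

```python
import json

keywords = [
    "coral climatology",
    "oxygen isotope",
    "trace elements ratio",
    "δ18Oseawater",
]

# ensure_ascii=True is the default: non-ASCII characters become \uXXXX escapes
escaped = json.dumps(keywords)
# ensure_ascii=False keeps the characters as they are
literal = json.dumps(keywords, ensure_ascii=False)

# Both serializations parse back to the same list
same_after_parsing = json.loads(escaped) == json.loads(literal) == keywords
```

A JSON parser on the harvesting side should therefore see the same keyword either way.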

jmckenna commented 3 months ago

Hi @fspreck-indiscale, in fact on the ODIS search front-end it appears as follows, so I think it is OK to use these unicode characters. (does that keyword look ok here in this screen capture to you?)

[screen capture: unicode]

jmckenna commented 3 months ago

@fspreck-indiscale the ZMT records (201) are now on the production server ( https://oceaninfohub.org/ ).

There is an issue however, on our side: the "Provider" facet lists 2 different providers for your records: "Leibniz Centre for Tropical Marine Research, Bremen, Germany" and then "Leibniz Center for Tropical Marine Research (ZMT)" (the second one comes from the name in the ODIS config). It seems the provider name in the JSON-LD and the prov:wasAttributedTo name are both being used here for some reason (again, this is a problem on our front-end/indexing side).

Here is the harvested JSON-LD example: https://api.search.oceaninfohub.org/source?id=https%3A%2F%2Fdataportal.leibniz-zmt.de%2Foih%2Fdataset_19754.html&_gl=1*1qbvbqk*_ga*NjkyMjg3NDkwLjE3MTIzNDMxMjM.*_ga_MQDK6BB0YQ*MTcxMjM0MzEyMy4xLjEuMTcxMjM0NzExNC4wLjAuMA..*_ga_QJ5XJMZFXW*MTcxMjM0MzEzNi4xLjEuMTcxMjM0NzExNC4wLjAuMA..

[screen capture: zmt1]

@pbuttigieg @fils can you see the source of the problem here?

jmckenna commented 3 months ago

More info: the records harvested inside Solr (search index) contain only one provider:

"txt_provider":["Leibniz Centre for Tropical Marine Research, Bremen, Germany"]

This is puzzling.

jmckenna commented 3 months ago

Ah, it could be that no other partner is setting "provider" to themselves. CIOOS uses "provider" to point to their regional partner who 'provides' the catalogue (such as CIOOS-Atlantic, or CIOOS-Pacific).

Example JSON-LD for CIOOS record: https://api.search.oceaninfohub.org/source?id=https%3A%2F%2Fcatalogue.cioos.ca%2Fdataset%2F777530f0-adaf-4ddb-86bb-6f1269dcb259.jsonld&_gl=1*15ety5n*_ga*MTQ3NjY4NzQyNy4xNzEyMzIxMTE5*_ga_MQDK6BB0YQ*MTcxMjM1MTk2Mi4zLjEuMTcxMjM1MjI2Ni4wLjAuMA..*_ga_QJ5XJMZFXW*MTcxMjM1MTk3MS4zLjEuMTcxMjM1MjI2Ni4wLjAuMA..

I'd need @pbuttigieg @fils to clarify what the correct use of "provider" is.

Maybe we should set up another ZMT-ODIS technical meeting in the next 2 weeks, to examine this together.

pbuttigieg commented 3 months ago

From the node perspective, the provider should be the entity that provided them with the data that the JSON-LD record is about.

They have sdPublisher for identifying the entity that created the JSON-LD.
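
As a sketch of that distinction (organization names are placeholders, not the real ZMT or PANGAEA values): "provider" names whoever supplied the data to the node, while "sdPublisher" names whoever produced the JSON-LD record itself. The helper `org_name` is made up for the illustration.

```python
# Hypothetical record; names are placeholders
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Hypothetical dataset",
    # the entity that provided the node with the data
    "provider": {"@type": "Organization", "name": "Hypothetical data provider"},
    # the entity that created this JSON-LD record
    "sdPublisher": {"@type": "Organization", "name": "Hypothetical node operator"},
}


def org_name(record, prop):
    """Return the name for an organization-valued property, or None if absent."""
    value = record.get(prop)
    if isinstance(value, dict):
        return value.get("name")
    if isinstance(value, str):
        return value  # some records give the name as plain text
    return None


provider_name = org_name(dataset, "provider")
sd_publisher_name = org_name(dataset, "sdPublisher")
```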

iodepo / odis-arch

connect ZMT catalogue as ODIS node #276