Open jmckenna opened 1 year ago
Hi @jmckenna @fils @pbuttigieg @fspreck
my colleague @uschindler from PANGAEA tested implementing "Event" for the JSON-LDs of PANGAEA datasets as in the ODIS Book example, see
https://doi.pangaea.de/10.1594/PANGAEA.948712?format=metadata_jsonld&incubation=true
It seems Google Search is quite particular when it comes to using the term "Event", as he promptly received the following error message. We may have to stick with temporal and spatial coverage, unless we (ZMT & OIH) decide that Google visibility is not a concern.
Problems of type "Structured Data Events" detected on doi.pangaea.de
To the owner of doi.pangaea.de:
The Search Console has identified that your website is affected by 13 problems of type "Structured Data Events". The following problems have been found on your website. We recommend addressing these issues, if possible, to ensure optimal functioning and high visibility in Google search results.
Most common critical issues*
Missing "startDate" field
Missing "location" field
*Critical issues prevent a page or feature from appearing in search results.
Most common non-critical issues‡
Missing "offers" field
Missing "performer" field
Missing "eventAttendanceMode" field
Missing "eventStatus" field
Missing "image" field
‡Non-critical issues are suggestions for improvement. They do not prevent a page or feature from appearing in Google search results. Some non-critical issues may negatively impact the display in search results, while others may be escalated to critical issues later on.
To add more information: the problem comes from the English word "Event" having more than one meaning. Schema.org uses it in the sense of the German "Veranstaltung" (a staged event, such as a performance), not the abstract "Ereignis" (a generic occurrence, as used in PANGAEA).
The problem with Google interpreting the "subjectOf" relation is that the dataset is then linked to an artistic event. Google extracts multiple events from each dataset and also wants to publish them separately from the dataset as "artistic events", so in the end it works like this: a user searches for a movie name and Google presents events related to it. Google extracts all events from a given page (in our case, a dataset) because cinema homepages typically list the events for a specific hall, so for datasets it likewise expects multiple events as separate entities.
As PANGAEA wants to prevent its events from being shown as artistic events in Google search, we have to stop adding events to our Schema.org markup, as Event is the wrong entity type.
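For illustration, a minimal sketch of the kind of markup under discussion (all names and values are hypothetical, not taken from an actual PANGAEA record): a scientific sampling event attached to a Dataset via "subjectOf", which Google then validates against its requirements for public happenings (startDate, location, offers, performer, and so on):

```json
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "Hypothetical dataset with an attached sampling event",
  "subjectOf": {
    "@type": "Event",
    "name": "Sampling event M123-4",
    "description": "A generic scientific occurrence, not a staged performance"
  }
}
```

Because the nested Event has no startDate or location, markup of this shape would trigger exactly the critical warnings listed above.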
P.S.: I am in contact with Natasha Noy regarding this.
@jmckenna Hi, I am preparing the sitemap and JSON-LD resources. The documentation says that the crawlers expect a script tag inside an HTML document.
<script type="application/ld+json">JSON_LD content</script>
Is it possible to direct the crawler to a JSON-LD file directly? I mean, I know how to do this in the sitemap; the question is rather whether the crawler will accept that as well, or whether we need the "detour" via the HTML document.
@TimmFitschen if you're asking just about Google and other search engines, they expect the JSON-LD to be inline only (see the related StackExchange thread). But I believe ODIS itself will accept it (@fils can you confirm?).
Hi, yes, the source must be inside the script tag and therefore in the HTML. Technically it would be correct to add an href attribute linking to a URL; this would be better for mobile browsers, as the transfer size would be smaller, but according to the documentation it is not allowed.
I am in contact with Google; maybe this will change. An easy way to find out is simply to test it: after setting it up with an href link, you can run the Google structured data analyzer.
P.S.: PANGAEA also delivers the Schema.org JSON-LD when you do content negotiation on the landing page using the Accept header (see signposting.org). The F-UJI FAIR checker also uses content negotiation, if available.
@uschindler interesting, I'm curious about Google's updated view on this; keep us posted.
Hi,
we now have the sitemap with all public datasets online. Is it possible to do a crawler test run before we enter it into the ODIS catalogue?
Hi @jmckenna, we updated our JSONs according to the things we discussed last time. Can you run your tests again against the sitemap and check whether
- @id looks fine
- publisher is now the correct property and located correctly within the JSON
- spatialCoverage works in the form of an array of Places, each specified by a GeoCoordinates object, instead of the boxes we had before
Thank you!
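For reference, a minimal sketch of that structure (the identifier, publisher name placement, and coordinates here are illustrative, not copied from an actual ZMT record):

```json
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "@id": "https://dataportal.leibniz-zmt.de/Entity/12345",
  "publisher": {
    "@type": "Organization",
    "name": "Leibniz Centre for Tropical Marine Research (ZMT)"
  },
  "spatialCoverage": [
    {
      "@type": "Place",
      "geo": {
        "@type": "GeoCoordinates",
        "latitude": -5.45,
        "longitude": 123.76
      }
    },
    {
      "@type": "Place",
      "geo": {
        "@type": "GeoCoordinates",
        "latitude": -5.50,
        "longitude": 123.80
      }
    }
  ]
}
```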
updates since today's meeting:
- records with spatialCoverage of type: GeoCoordinates now display as points in the ODIS front-end spatial search (see screen capture below of the ZMT spatial records)
- 42 records are missing a spatialCoverage, such as https://dataportal.leibniz-zmt.de/Entity/19378
Hi,
This PANGAEA one has no spatial coverage. That's not an issue in your portal.
@uschindler today in the meeting I had mentioned that some records in the ZMT sitemap do not have spatialCoverage and the reaction from the ZMT team I believe was "all records should have spatialCoverage", so, I am not understanding your response.
The problem is that the link posted is about data harvested from PANGAEA: https://dataportal.leibniz-zmt.de/Entity/19378; this entry refers to this PANGAEA dataset: https://doi.org/10.1594/PANGAEA.890177
This one has no spatial coverage and will never have one. That is correct. If you harvest PANGAEA, you have to live with the fact that datasets may not have a coverage. I won't try to explain here why there's no coverage available, but in short: it is not mandatory, and for this dataset there's no way to provide a coverage. It has none.
@jmckenna @uschindler, sorry, that was too bold a claim, then. And it will be even more so in the future, unfortunately, once we include more non-PANGAEA datasets in the portal; they will most probably not have geo information at all.
@jmckenna How does your frontend treat entries like https://dataportal.leibniz-zmt.de/Entity/18288 (view-source:https://dataportal.leibniz-zmt.de/oih/dataset_18288.html for the json, respectively) where we have an array of places in the spatial coverage? There should be a lot more points than datasets if you show the full array on your map (~900 locations vs ~150 datasets).
@fspreck-indiscale good point, we don't handle a list of geocoordinates yet (we only use the first point), but we should. Thanks for reporting this.
@jmckenna Hi, I just updated the JSONs again; they now have sdPublisher and creditText.
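A hedged sketch of how those two properties might appear in a record (the organization name and citation text are placeholders, not taken from the actual ZMT export):

```json
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "sdPublisher": {
    "@type": "Organization",
    "name": "Leibniz Centre for Tropical Marine Research (ZMT)"
  },
  "creditText": "Hypothetical citation text for the dataset"
}
```

Here sdPublisher describes who generated the structured-data record itself, while creditText carries the preferred citation string.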
thanks @fspreck-indiscale, will do another harvest here...
@jmckenna We added keywords (simple array of strings for now) to some of the datasets; do they look good after harvesting?
The schema.org validator passes.
updates from meeting on 2024-01-15:
@fspreck-indiscale thanks for updating the keywords syntax. I notice that some have odd characters inside the JSON-LD, such as this record:
"keywords": [
"coral climatology",
"oxygen isotope",
"trace elements ratio",
"\u03b418Oseawater"
],
Hi @jmckenna, good point, we've not considered these characters so far. Escaping non-ASCII is the safe default of the exporter, but it is by no means necessary (on the landing page it is rendered as UTF-8: δ). May we use UTF-8 strings in the JSON-LD?
Hi @fspreck-indiscale, in fact on the ODIS search front-end it appears as follows, so I think it is OK to use these unicode characters. (does that keyword look ok here in this screen capture to you?)
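For reference, both spellings decode to the same string under the JSON specification (RFC 8259), so either form is valid JSON-LD; the escaped form is just the ASCII-safe serialization. The two entries below are therefore equivalent after parsing:

```json
{
  "keywords": [
    "\u03b418Oseawater",
    "δ18Oseawater"
  ]
}
```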
@fspreck-indiscale the ZMT records (201) are now on the production server ( https://oceaninfohub.org/ ).
There is an issue, however, on our side: the "Provider" facet lists 2 different providers for your records: "Leibniz Centre for Tropical Marine Research, Bremen, Germany" and then "Leibniz Center for Tropical Marine Research (ZMT)" (the second one comes from the name in the ODIS config). It seems the provider name in the JSON-LD and the prov:wasAttributedTo name are both being used here for some reason (again, this is a problem on our front-end/indexing side).
Here is the harvested JSON-LD example: https://api.search.oceaninfohub.org/source?id=https%3A%2F%2Fdataportal.leibniz-zmt.de%2Foih%2Fdataset_19754.html&_gl=1*1qbvbqk*_ga*NjkyMjg3NDkwLjE3MTIzNDMxMjM.*_ga_MQDK6BB0YQ*MTcxMjM0MzEyMy4xLjEuMTcxMjM0NzExNC4wLjAuMA..*_ga_QJ5XJMZFXW*MTcxMjM0MzEzNi4xLjEuMTcxMjM0NzExNC4wLjAuMA..
@pbuttigieg @fils can you see the source of the problem here?
More info: the records harvested inside Solr (search index) contain only one provider:
"txt_provider":["Leibniz Centre for Tropical Marine Research, Bremen, Germany"]
This is puzzling.
Ah, it could be that no other partner is setting "provider" to themselves. CIOOS uses "provider" to point to the regional partner who 'provides' the catalogue (such as CIOOS-Atlantic or CIOOS-Pacific).
Example JSON-LD for CIOOS record: https://api.search.oceaninfohub.org/source?id=https%3A%2F%2Fcatalogue.cioos.ca%2Fdataset%2F777530f0-adaf-4ddb-86bb-6f1269dcb259.jsonld&_gl=1*15ety5n*_ga*MTQ3NjY4NzQyNy4xNzEyMzIxMTE5*_ga_MQDK6BB0YQ*MTcxMjM1MTk2Mi4zLjEuMTcxMjM1MjI2Ni4wLjAuMA..*_ga_QJ5XJMZFXW*MTcxMjM1MTk3MS4zLjEuMTcxMjM1MjI2Ni4wLjAuMA..
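A minimal sketch of that pattern, assuming a CIOOS-style record (the organization name is illustrative, not copied from the linked example):

```json
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "provider": {
    "@type": "Organization",
    "name": "CIOOS-Atlantic"
  }
}
```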
I'd need @pbuttigieg @fils to clarify what the correct use of "provider" should be.
Maybe we should setup another ZMT-ODIS technical meeting in the next 2 weeks, to examine this together.
From the node perspective, the provider should be the entity that provided them with the data that the JSON-LD record is about.
Schema.org has sdPublisher for identifying the entity that created the JSON-LD record.
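Under that reading, a record could carry both properties with distinct roles, roughly like this (the names are placeholders describing each role, not real organizations):

```json
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "provider": {
    "@type": "Organization",
    "name": "Organization that supplied the underlying data"
  },
  "sdPublisher": {
    "@type": "Organization",
    "name": "Organization that generated this JSON-LD record"
  }
}
```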
Summary:
- Startpoint URL for ODIS-Arch (the URL to your sitemap)
- Type of the ODIS-Arch URL (choose "sitemap")
This issue will allow questions, updates, and discussions by both teams.
cc @fils @pbuttigieg