ioos / ckanext-ioos-theme

IOOS Catalog as a CKAN extension
GNU Affero General Public License v3.0

Schema.org JSONLD duplication per-dataset & Google DSS #228

Closed mwengren closed 4 years ago

mwengren commented 4 years ago

While testing some results from Google DSS, I noticed that CKAN actually duplicates the Schema.org output for each dataset. I think Google is reading the inline content rather than what is written to the linked .jsonld file.

You can confirm this by running a dataset page through Google's Structured Data Testing Tool. As an example, take this SCCOOS Glider dataset, sp035-20200219T1918, that's currently active:

https://data.ioos.us/dataset/sp035-20200219t19182cb9a

This link shows the test results using the Google dataset testing tool: https://search.google.com/structured-data/testing-tool/u/0/#url=https%3A%2F%2Fdata.ioos.us%2Fdataset%2Fsp035-20200219t19182cb9a

The content on the left side of that page shows where it's actually sourcing the Schema.org JSONLD metadata: it's an inline script, rather than the full JSONLD file that is linked as an 'application/ld+json' embedded link (which for this dataset currently points to https://data.ioos.us/dataset/ad38783a-903b-479c-a8ec-2d6d9e2faa79.jsonld).
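
To see the two sources side by side without the testing tool, a rough sketch like the one below can list both the inline JSON-LD blocks and any linked JSON-LD URLs in a page. It uses only the Python standard library; the class name, the helper name, and the sample HTML (including the example.org URL) are made up for illustration and are not taken from the CKAN templates.

```python
import json
from html.parser import HTMLParser


class JsonLdFinder(HTMLParser):
    """Collect inline JSON-LD blocks and linked JSON-LD URLs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.inline_blocks = []  # parsed <script type="application/ld+json"> bodies
        self.linked_urls = []    # href values of <link type="application/ld+json">
        self._in_jsonld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("type") != "application/ld+json":
            return
        if tag == "script":
            self._in_jsonld = True
            self._buffer = []
        elif tag == "link" and attrs.get("href"):
            self.linked_urls.append(attrs["href"])

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.inline_blocks.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


def find_jsonld(html):
    """Return (inline_blocks, linked_urls) for one HTML document."""
    finder = JsonLdFinder()
    finder.feed(html)
    return finder.inline_blocks, finder.linked_urls


# Hypothetical page with both an inline block and a linked .jsonld file.
SAMPLE_HTML = """
<html><head>
<link rel="alternate" type="application/ld+json" href="https://example.org/dataset/abc.jsonld">
<script type="application/ld+json">{"@context": "https://schema.org", "@type": "Dataset", "name": "inline copy"}</script>
</head><body></body></html>
"""

inline, linked = find_jsonld(SAMPLE_HTML)
print("inline blocks:", inline)
print("linked files:", linked)
```

Feeding it the HTML of a real dataset page (fetched however you prefer) should make it obvious whether the inline block and the linked file carry the same content.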

It seems we'll need to tweak CKAN to update the inline HTML content rather than the JSONLD file if we want Google DSS results to show the new metadata we've been trying to add.

Here's a Google DSS search link that shows this dataset has been re-indexed within the last week, but still doesn't show our updated content:

https://datasetsearch.research.google.com/search?query=site%3A%20data.ioos.us%20sp035-20200219T1918%20&docid=bCi3f2vLfNtHFF2bAAAAAA%3D%3D

The good news is that Google seems to follow the internal @id references within the inline JSON-LD, so as long as we can update the relevant content, we should see better results. Look at the GeoShape element at the bottom of the lower right pane in the structured data testing tool link for a good example.
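
For anyone unfamiliar with what "following @id references" means here, the sketch below shows the idea on a small, invented graph: a Dataset node points to a Place by @id, which in turn points to a GeoShape. This is only an illustration of the linking pattern, not a description of how Google actually processes the markup; the node identifiers, the dataset name, and the box coordinates are all placeholders.

```python
def resolve_ids(graph):
    """Replace bare {"@id": ...} reference stubs with the node they point to."""
    by_id = {node["@id"]: node for node in graph if "@id" in node}

    def expand(value, seen):
        if isinstance(value, list):
            return [expand(item, seen) for item in value]
        if isinstance(value, dict):
            ref = value.get("@id")
            # A bare reference has only an "@id" key and names a known node.
            # The `seen` set stops us from looping on circular references.
            if set(value) == {"@id"} and ref in by_id and ref not in seen:
                return expand(by_id[ref], seen | {ref})
            return {key: expand(child, seen) for key, child in value.items()}
        return value

    return [expand(node, frozenset({node.get("@id")})) for node in graph]


# Invented example graph; values are placeholders, not real IOOS metadata.
GRAPH = [
    {"@id": "_:dataset", "@type": "Dataset", "name": "example glider deployment",
     "spatialCoverage": {"@id": "_:place"}},
    {"@id": "_:place", "@type": "Place", "geo": {"@id": "_:shape"}},
    {"@id": "_:shape", "@type": "GeoShape", "box": "32.0 -120.0 34.0 -117.0"},
]

resolved = resolve_ids(GRAPH)
print(resolved[0]["spatialCoverage"]["geo"])
```

After resolution, the GeoShape is reachable directly from the Dataset node, which is roughly what the lower right pane of the testing tool displays.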

mwengren commented 4 years ago

For reference, here's the equivalent Google structured testing tool results for the same glider dataset hosted in ERDDAP by Axiom:

https://search.google.com/structured-data/testing-tool/u/0/#url=http%3A%2F%2Ferddap.axiomdatascience.com%2Ferddap%2Finfo%2Fsp035_20200219t1918%2Findex.html

benjwadams commented 4 years ago

It's not duplicated per se: ckanext-dcat offers a .jsonld endpoint for each dataset by default, plus a copy embedded in the page if you include the structured_data plugin provided by ckanext-dcat. The two use different means to declare a custom schema, which is why they contained different results.

Somehow I missed this:

https://github.com/ckan/ckanext-dcat#structured-data-and-google-dataset-search-indexing

In any event, it's now in the embedded <script> tag due to changes made in 69153105ba310c01f99385941899519a6634984c. Hopefully this will fix some of the issues with Google Dataset Search, so I'm going to close.

mwengren commented 4 years ago

@benjwadams For some reason, adding the ckanext-dcat module output to the inline JSONLD content adds a whole bunch of validation errors in Google's tests. I think we may be better off seeing if we can insert our custom content elsewhere (perhaps alongside the code that outputs the inline Schema.org JSONLD, excluding the dcat: content), and abandoning the ckanext-dcat module. We can discuss.

Compare the output of these two tests. I saved the output from the same Glider DAC dataset before and after you made the changes above, and created gists to feed Google structured data testing tool.

Google structured data testing tool sp035-20200219T1918 (before ckanext-dcat content added to inline JSONLD)

Google structured data testing tool sp035-20200219T1918 (after ckanext-dcat content added to inline JSONLD)

benjwadams commented 4 years ago

Removed the euro_dcat_ap profile, which was adding non-schema.org elements, and removed the extra geometry types, which were unused. Changes are in ec04c103868366e2b19dec66cc62d4d716307a97..099ec91b5164ed2cc5efc49670aac0561c5d2f83
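
One way to spot-check for leftover non-schema.org elements is to walk the inline JSON-LD and list any property keys from other vocabularies. The sketch below is a guess at a useful check, not something from the commits above. It assumes the JSON-LD uses either bare terms (resolved by a schema.org @context), a `schema:` prefix, or full IRIs; the sample node and its `dcat:`, `locn:`, and Dublin Core keys are invented for illustration.

```python
SCHEMA_IRIS = ("http://schema.org/", "https://schema.org/")


def is_schema_org(term, schema_prefix="schema"):
    """Best-effort test of whether a JSON-LD key belongs to schema.org."""
    if term.startswith(SCHEMA_IRIS):
        return True
    if "://" in term:
        return False  # full IRI from some other vocabulary
    if ":" in term:
        return term.split(":", 1)[0] == schema_prefix
    return True  # bare term, assumed to resolve via a schema.org @context


def non_schema_org_keys(node, schema_prefix="schema"):
    """Return the sorted property keys in `node` that are not schema.org terms."""
    found = set()

    def visit(value):
        if isinstance(value, list):
            for item in value:
                visit(item)
        elif isinstance(value, dict):
            for key, child in value.items():
                # Skip JSON-LD keywords such as @id, @type, and @context.
                if not key.startswith("@") and not is_schema_org(key, schema_prefix):
                    found.add(key)
                visit(child)

    visit(node)
    return sorted(found)


# Invented node mixing schema.org terms with other vocabularies.
SAMPLE_NODE = {
    "@type": "schema:Dataset",
    "schema:name": "example dataset",
    "dcat:theme": ["oceans"],
    "http://purl.org/dc/terms/issued": "2020-02-19",
    "schema:spatialCoverage": {"schema:geo": {"locn:geometry": "POINT(0 0)"}},
}

print(non_schema_org_keys(SAMPLE_NODE))
```

An empty list would suggest the inline block contains only schema.org properties, under the prefix assumptions stated above.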

benjwadams commented 4 years ago

I ran some pages through the Structured Data Testing Tool and the results look much better. The sitemap is continuing to be updated as well. Unfortunately, Google Dataset Search's choice of when/what to harvest is a bit of a black box to me and I haven't seen any updated results yet. Let's wait a week or so and see if any of the desired changes are picked up.

mwengren commented 4 years ago

@benjwadams I circled back and took a look at Google DSS today. The good news is that it looks like it's working now, at least for some datasets. Here's one of the GCOOS datasets recently harvested:

https://datasetsearch.research.google.com/search?query=site%3A%20data.ioos.us%20LATEX%20CTD%20-%20d94j123.nc%20-%2027.68N%2C%2095.97W%20-%201994-11-10

Some Glider DAC datasets we've been testing with don't appear to generate the bounding box (at least this one anyway):

https://datasetsearch.research.google.com/search?query=site%3A%20data.ioos.us%20UW157-20190916T0000&docid=kbza4eKPBcU3xzn5AAAAAA%3D%3D

But, this OOI Endurance Array dataset does, hooray!

https://datasetsearch.research.google.com/search?query=site%3A%20data.ioos.us%20ce_383-20200220T2031&docid=jQ6tGXKdwqlTpGXOAAAAAA%3D%3D

I still need to find a few good examples to include in the release notes, so I'll check again next week, but hopefully we can close the book on this issue and get Release 1.5 out.