iodepo / odis-arch

Development of the Ocean Data and Information System (ODIS) architecture
https://book.oceaninfohub.org/

EMODNet: validate & index to graph (the new unified EMODNet service) #153

Open jmckenna opened 1 year ago

jmckenna commented 1 year ago

cc @tim-collart

fils commented 1 year ago

I processed this with Gleaner. The results are:

    SitemapCount: 2208
    SitemapStored: 1210
    SitemapIssues: 971

A review of some of the errors shows that many are like the following.

see also: https://tinyurl.com/y9zxpzqv for the graph in the JSON-LD playground.

ISSUE#1

The description property has a string literal value that includes unescaped quotation marks. This breaks the parsing of the JSON-LD. The quotation marks need to be removed or escaped.

For example, at https://emodnet.ec.europa.eu/geonetwork/srv/api/records/f895d2e2-1434-4118-8f4d-4351e3f63beb you can view the source and the JSON-LD. You will find a section like:

        {
        "@type":"DataDownload",
        "contentUrl":"https://www.emodnet-seabedhabitats.eu/access-data/launch-map-viewer/?zoom=8&center=7.67098,53.65892&layerIds=521&baseLayerId=1&activeFilters=NobwRANghgngpgJwJIBMwC4CsAmAjAGjADMBLCAF0VQ0wBZDSLEAZAe1YGsBXAB1QGcMwALoMylBABU4AD3IYwAEQCiABlUBmVbgBsYAL7CgA",
        "encodingFormat":"WWW:LINK-1.0-http--link",
        "name":"EMODnet Seabed Habitats Map Viewer",
        "description":"View map "DE003016" on the EMODnet Seabed Habitats Map Viewer"
        }

ISSUE#2

Related issue on return characters:

 invalid character '\\n' in string literal
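
Both errors look like the same root cause: text placed into the JSON-LD without JSON string escaping. A minimal Python sketch of the difference follows. The `naive` template and the multi-line `abstract` value are made-up illustrations, not the GeoNetwork formatter's actual code.

```python
import json

title = 'View map "DE003016" on the EMODnet Seabed Habitats Map Viewer'
abstract = "First line\nSecond line"  # hypothetical multi-line value


def naive(value):
    # Pastes the raw text between quotes, as a plain string template would.
    return '{"description":"' + value + '"}'


def escaped(value):
    # json.dumps escapes inner quotes (\") and control characters (\n).
    return '{"description":' + json.dumps(value) + "}"


def parses(text):
    # True when the text is well-formed JSON.
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


for value in (title, abstract):
    # Prints "False True" for both values.
    print(parses(naive(value)), parses(escaped(value)))
```

Running every value through a JSON serializer, rather than escaping characters one at a time, covers quotes, newlines, tabs and backslashes in one step.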

jmckenna commented 1 year ago

Here is the schema.org Validator, clearly breaking on that description : https://validator.schema.org/#url=https%3A%2F%2Femodnet.ec.europa.eu%2Fgeonetwork%2Fsrv%2Fapi%2Frecords%2Ff895d2e2-1434-4118-8f4d-4351e3f63beb


pbuttigieg commented 1 year ago

This is still in EMODnet's court, right? Those errors are hard blockers.

bart-v commented 11 months ago

The double quotes are now escaped by applying https://github.com/geonetwork/core-geonetwork/commit/50986d1823fff9bc2eb2b895be72a5d92e5875ac

jmckenna commented 11 months ago

Thanks @bart-v, I will test a fresh harvest from your unified service and index it into ODIS.

jmckenna commented 9 months ago

@bart-v additional feedback after re-harvesting the unified service:

example record: https://emodnet.ec.europa.eu/geonetwork/srv/api/records/18d9daa1-eee4-4380-a2d0-fd20e4b47081

bart-v commented 9 months ago

I'm sure the institute has 2 emails; feel free to ignore one. This is GeoNetwork, so we cannot simply change this structure.

jmckenna commented 9 months ago

@bart-v EMODnet Dataset records are now visible on the OIH live search results (direct link to your records).

I had to disable the indexing of your type:Organization instances due to an issue on our side.


pbuttigieg commented 2 months ago

@jmckenna @bart-v

the ODISCat entry for EMODnet doesn't tell us where the sitemap is.

J Beja says there should be 2000+ records shared rather than the 709

pbuttigieg commented 2 months ago

Is there no valid JSON-LD available for harvest without the patch in our GitHub space?

If so, then EMODnet is not a functional node yet. This is a major issue, especially as EMODnet is advertising its participation in ODIS.

There is valid JSON-LD/schema.org in the example entry here https://emodnet.ec.europa.eu/geonetwork/srv/api/records/18d9daa1-eee4-4380-a2d0-fd20e4b47081

So is this just an issue with completing the ODISCat entry?

bart-v commented 2 months ago

Few questions

There is a dynamic sitemap here https://emodnet.ec.europa.eu/geonetwork/srv/eng/portal.sitemap?format=rdf (we just fixed a bug in that sitemap). Indexing that should always give you all EMODnet entries.
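
A quick way to compare what a sitemap advertises with what a harvest stored is to count its `<loc>` entries. The sketch below assumes the sitemap follows the standard sitemaps.org XML schema (either a `sitemapindex` or a `urlset`); the sample XML and its example.org URLs are placeholders, not EMODnet output.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_locs(xml_text):
    """Return (kind, locs), where kind is 'sitemapindex' or 'urlset'."""
    root = ET.fromstring(xml_text)
    kind = root.tag.replace(SITEMAP_NS, "")
    locs = [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]
    return kind, locs


# Placeholder document standing in for a downloaded sitemap.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/records/a</loc></url>
  <url><loc>https://example.org/records/b</loc></url>
</urlset>"""

kind, locs = sitemap_locs(sample)
print(kind, len(locs))
```

For a `sitemapindex`, each returned location is itself a sitemap, so the same function would be applied to every child document and the counts summed before comparing with the expected number of records.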

jmckenna commented 2 months ago

@bart-v those old scripts were a proof-of-concept created a few years ago, but since your unified service we no longer use those scripts. We use your sitemap instead. As @pbuttigieg pointed out, can you (actually Nathalie Tonné) edit your ODISCat entry and add your sitemap link to the ODIS-Arch URL field?

Thanks for pointing to your dynamic sitemap. The sitemap link we usually use (that points to the record pages with embedded JSON-LD) is https://emodnet.ec.europa.eu/geonetwork/srv/eng/portal.sitemap

@bart-v we noticed this morning that your JSON-LD is broken (it does not validate), because the entries in the distribution property have an empty name, such as:

"distribution": [
    {
        "@type": "DataDownload",
        "contentUrl": "https://ows.emodnet.eu/geoserver/pace/ows?SERVICE=WMS&",
        "encodingFormat": "application/vnd.ogc.wms_xml",
        "name": ,
        "description": "https://ows.emodnet.eu/geoserver/pace/ows?SERVICE=WMS&"
    }
]

See this sample record, and see it fail in the schema.org validator
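
A syntax error like the empty `name` can be caught before harvesting by pulling the embedded JSON-LD out of a record page and trying to parse it. This is a minimal sketch using only the Python standard library; `JsonLdExtractor`, `check_page` and the sample page are illustrative names and data, and this is not how Gleaner itself works.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False


def check_page(html):
    """Return one entry per JSON-LD block: None if it parses, else the error."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    results = []
    for block in parser.blocks:
        try:
            json.loads(block)
            results.append(None)
        except json.JSONDecodeError as err:
            results.append(f"line {err.lineno}, column {err.colno}: {err.msg}")
    return results


# Made-up page reproducing the empty "name" value.
page = """<html><head><script type="application/ld+json">
{"@type": "DataDownload", "name": , "description": "x"}
</script></head></html>"""
print(check_page(page))
```

This only checks that the block is well-formed JSON; it says nothing about whether the schema.org types and properties are used correctly.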

jmckenna commented 2 months ago

> There is valid JSON-LD/schema.org in the example entry here https://emodnet.ec.europa.eu/geonetwork/srv/api/records/18d9daa1-eee4-4380-a2d0-fd20e4b47081

In fact, that is incorrect: that record and all other EMODnet records contain invalid JSON-LD that does not validate (because of the empty name property mentioned above).

jmckenna commented 2 months ago

@fils can you try a harvest using this sitemap? I am wondering if that RDF format works for ODIS harvest: https://emodnet.ec.europa.eu/geonetwork/srv/eng/portal.sitemap?format=rdf

bart-v commented 2 months ago

Empty Distribution.name was fixed

pbuttigieg commented 2 months ago

We should stick with the vanilla sitemap

pbuttigieg commented 2 months ago

Thanks for the fixes so far,

Looking better

https://validator.schema.org/#url=https%3A%2F%2Femodnet.ec.europa.eu%2Fgeonetwork%2Fsrv%2Fapi%2Frecords%2F847C10E349EFD10C710A1E3E8260AAC37A38D929

Content issues:

The record is double typed as a Dataset and an Organization. This is weird, and it leaves us with duplicate name properties, among others. Is this an error? If not, it will lead to very confusing results when dealing with EMODnet data. I think Dataset is right here; the organisation would probably be a value of a property therein.

The creator and author properties indicate EMODnet. Is this correct? Where's the credit for the original creator? If this was modified by EMODnet Biology, they should still be giving credit to the original creators with isBasedOn or similar properties.

The distribution values are mostly 'wrong': they're pointing to landing pages, not direct data downloads. Landing pages can be moved to url arrays; distributions should be reserved for direct download links (think machine-to-machine data transfer).
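
As a rough illustration of the three points above, here is one possible shape for such a record, written as a Python dict and serialized to JSON-LD. Every identifier, name and URL in it is a placeholder (example.org), and the choice of `publisher` for EMODnet Biology is an assumption made for the sketch, not a statement about the actual record.

```python
import json

# All identifiers and URLs below are placeholders, not real EMODnet values.
record = {
    "@context": "https://schema.org/",
    "@type": "Dataset",  # a single type; the organisation moves into a property
    "@id": "https://example.org/records/123",
    "name": "Example dataset title",
    # Landing pages and viewers go in url, not in distribution.
    "url": ["https://example.org/landing-page/123"],
    "publisher": {"@type": "Organization", "name": "EMODnet Biology"},
    "creator": [{"@type": "Organization", "name": "Original data creator"}],
    # Credit for the source when the record is a modified product.
    "isBasedOn": "https://example.org/original-dataset",
    # Only direct, machine-retrievable downloads are listed here.
    "distribution": [
        {
            "@type": "DataDownload",
            "contentUrl": "https://example.org/download/123.csv",
            "encodingFormat": "text/csv",
        }
    ],
}

print(json.dumps(record, indent=2))
```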

bart-v commented 2 months ago

Content issues are being fixed slowly, but that is rather long term.

jmckenna commented 2 months ago

@bart-v I checked the first record in your first sitemap index, but it failed the schema.org validator:

Below is your JSON-LD (I formatted it for readability). The critical issue is that your includedInDataCatalog is missing "@type": "DataCatalog", which breaks the validator. Also, many properties have no values. See below:

{
  "@context": "http://schema.org/",
  "@type": "schema:WebAPI",
  "@id": "https://emodnet.ec.europa.eu/geonetwork/srv/api/records/48ba841c-d06a-4c79-ab07-234d913eb975",
  "includedInDataCatalog": [
    {
      "url": "https://emodnet.ec.europa.eu/geonetwork/srv/search#",
      "name": ""
    }
  ],
  "inLanguage": "eng",
  "name": "GeoServer Web Map Service",
  "dateCreated": [],
  "dateModified": [
    "2024-04-22T03:44:00"
  ],
  "datePublished": [],
  "thumbnailUrl": [],
  "description": "A compliant implementation of WMS plus most of the SLD extension (dynamic styling). Can also generate PDF, SVG, KML, GeoRSS",
  "keywords": [
    "WFS",
    "WMS",
    "GEOSERVER",
    "Metadata GDI-Vl-conform"
  ],
  "author": [],
  "contributor": [],
  "creator": [
    {
      "@id": "frederic.leclercq@vliz.be",
      "@type": "Organization",
      "name": "VLIZ",
      "email": "frederic.leclercq@vliz.be",
      "contactPoint": {
        "@type": "PostalAddress",
        "addressCountry": "Belgium",
        "addressLocality": "Oosten",
        "postalCode": "8400"
      }
    }
  ],
  "provider": [
    {
      "@id": "frederic.leclercq@vliz.be",
      "@type": "Organization",
      "name": "VLIZ",
      "email": "frederic.leclercq@vliz.be",
      "contactPoint": {
        "@type": "PostalAddress",
        "addressCountry": "Belgium",
        "addressLocality": "Oosten",
        "postalCode": "8400"
      }
    }
  ],
  "copyrightHolder": [],
  "user": [],
  "sourceOrganization": [],
  "publisher": [],
  "distribution": [
    {
      "@type": "DataDownload",
      "contentUrl": "https://ows.emodnet.eu/geoserver/pace/ows?SERVICE=WMS&",
      "encodingFormat": "application/vnd.ogc.wms_xml",
      "description": "https://ows.emodnet.eu/geoserver/pace/ows?SERVICE=WMS&"
    }
  ],
  "encodingFormat": [
    ""
  ],
  "spatialCoverage": [],
  "license": [
    {
      "@type": "CreativeWork",
      "name": "no conditions apply"
    }
  ]
}

jmckenna commented 2 months ago

Sample working includedInDataCatalog snippet:

    "includedInDataCatalog": {
        "@type": "DataCatalog",
        "name": "MEDIN Discovery Metadata Portal",
        "url": "https://portal.medin.org.uk/portal/",
        "description": "The MEDIN portal contains information about more than 15000 marine datasets from over 400 UK organisations. Metadata are an enduring resource and contact details are publicly available for a long time. Please contact us if you find your contact details on the MEDIN portal and do not consent to this. (enquiries@medin.org.uk)",
        "image": "https://portal.medin.org.uk/grfx/logo.png"
    },
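
Both problems in the failing record (a catalog entry without `@type`, and properties with no values) can be detected mechanically once the JSON-LD has been parsed. The sketch below is one way to do it; `find_empty` and `catalogs_missing_type` are hypothetical helper names, and the sample record is a trimmed copy of the one quoted earlier.

```python
def find_empty(node, path=""):
    """Yield the path of every value that is '', [] or {}."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            if value in ("", [], {}):
                yield child
            else:
                yield from find_empty(value, child)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            child = f"{path}[{i}]"
            if item in ("", [], {}):
                yield child
            else:
                yield from find_empty(item, child)


def catalogs_missing_type(record):
    """Return the includedInDataCatalog entries that lack an @type."""
    catalogs = record.get("includedInDataCatalog", [])
    if isinstance(catalogs, dict):  # accept a single object or a list
        catalogs = [catalogs]
    return [c for c in catalogs if isinstance(c, dict) and "@type" not in c]


# Trimmed copy of the failing record from this thread.
record = {
    "@type": "schema:WebAPI",
    "includedInDataCatalog": [
        {"url": "https://emodnet.ec.europa.eu/geonetwork/srv/search#", "name": ""}
    ],
    "dateCreated": [],
    "encodingFormat": [""],
}

print(list(find_empty(record)))
# ['includedInDataCatalog[0].name', 'dateCreated', 'encodingFormat[0]']
print(len(catalogs_missing_type(record)))
# 1
```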

bart-v commented 2 months ago

includedInDataCatalog is fixed. All the rest will need to wait for the latest GeoNetwork and some content updates.