stream-project / WP4R

Work Package 4 (Lead: Infai)

Create DCAT endpoint for NOMAD #2

Closed simeonackermann closed 3 years ago

simeonackermann commented 3 years ago

See NOMAD DCAT Conceptual Mapping for mapping schema

See #24 for DCAT endpoint requirements

javadch commented 3 years ago

Some RML processors:

markus1978 commented 3 years ago

Thx @javadch. It appears that RML has not made its way into the Python world yet. Anyhow, for the beginning I simply use Python's rdflib to create the results. For now the mapping is quite straightforward.

markus1978 commented 3 years ago

I implemented a first version of the API. It's currently running on our development cluster. This one has only some example data, but otherwise it is the same thing.

Here is the API dashboard: https://nomad-lab.eu/dev/nomad/dcat/dcat/
Here is an example output: https://nomad-lab.eu/dev/nomad/dcat/dcat/datasets/xtmcIQCXIbY_PQAKDHSdr7Z9fjQO
The example output is for this entry: https://nomad-lab.eu/dev/nomad/dcat/gui/entry/id/N7KSXZ5OT6mcbhMe80LcbA/xtmcIQCXIbY_PQAKDHSdr7Z9fjQO

Here is the mapping that I have implemented so far: https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/blob/dcat/nomad/app/dcat/mapping.py It's pretty straightforward and it should be easy to extend.

@javadch can you give some feedback on whether this is what you had in mind so far?

@mospring let me know when you want to work on this and I'll explain how to do it.

javadch commented 3 years ago

@markus1978 it's a very good starting point.

  1. Is it possible to generate Turtle, or to accept an API parameter that declares the serialization format?
  2. Try to externalize the mappings in RML (or any other specification) to shield the API code from future changes in the mappings.

I noticed that my first point is already implemented 👍🏻

markus1978 commented 3 years ago

I looked through the available RML implementations. Unfortunately, there seems to be no Python implementation. Since the mapping is rather simple, I do not see enough benefits to warrant the added technical complexity that including any non-Python components would entail.

The current mapping implementation is externalised in its own module to provide some decoupling from the actual API implementation.
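
A minimal sketch of what such decoupling could look like, using only the standard library. This is not the actual NOMAD mapping module: the entry field names (`calc_id`, `formula`, `comment`, `upload_time`, `last_processing`) are hypothetical, and the real implementation builds an rdflib graph rather than plain tuples.

```python
# Hypothetical sketch: the mapping lives in its own table, apart from API code.
# All entry field names below are assumptions made for illustration.
DCT = "http://purl.org/dc/terms/"

# Maps an (assumed) entry field to the predicate IRI it is published under.
FIELD_TO_PREDICATE = {
    "calc_id": DCT + "identifier",
    "formula": DCT + "title",
    "comment": DCT + "description",
    "upload_time": DCT + "issued",
    "last_processing": DCT + "modified",
}


def map_entry(subject_iri, entry):
    """Return (subject, predicate, value) triples for the mapped fields in entry."""
    triples = []
    for field, predicate in FIELD_TO_PREDICATE.items():
        value = entry.get(field)
        if value is not None:
            # Fields missing from the entry are skipped rather than emitted empty.
            triples.append((subject_iri, predicate, value))
    return triples
```

With this layout, changing the mapping means editing `FIELD_TO_PREDICATE` only; the API code that calls `map_entry` stays untouched.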

TBoonX commented 3 years ago

Since Maja asked me: from our side, it is not that important exactly how the mapping is realized in the code. Of course, if a common library with (R2)RML mapping files could be used for this, it would be nice, not only for reusability in the DSMS but also as a way to adapt the mapping without touching the code. But making progress is important, and if everyone is OK with handling change requests here on GitHub, that's alright.

javadch commented 3 years ago

No problem there. R2RML came up because it was promised in the proposal. So I guess if we need to report progress on that task, or if there is any deliverable regarding it, we should consider building some RML mappings or justifying our decision. It may also be needed again when we map the data.

simeonackermann commented 3 years ago

See https://github.com/stream-project/WP4R/issues/24 for DCAT endpoint requirements

simeonackermann commented 3 years ago

Good work! Besides the /dataset endpoints, the CKAN harvester requires a single endpoint to fetch all dataset IDs from NOMAD (e.g. /catalog). It could support a limit and pagination and, if possible, a parameter like modified_since (see #24 for more info).

simeonackermann commented 3 years ago

@markus1978 are there any updates on implementing a /catalog endpoint to get a list of all datasets? (See also #24 for details.)

markus1978 commented 3 years ago

I don't know. @mospring I can certainly help with this. But you have to plan/moderate these activities, as I am not really part of this project. Some kind of due date for example?

Some limits/deviations from #24:

Do these pose any major problems?

simeonackermann commented 3 years ago

What do you mean by "after_value style pagination"? Datasets with unchanged DCAT values are OK (and I maybe quite right); they just increase the request/response size and lengthen the synchronization process.

Do you think it's possible to implement such an endpoint before the end of this year?

markus1978 commented 3 years ago

Instead of an integer value that identifies a page (page 1, page 2, ...), you provide the last value of the previous page to get the next page. Each request basically returns the next "page size" values "after" the given value. Usually an API would give you the respective value for the next page alongside the page data. I am not sure how to do this when the API responses are RDF documents, though. In typical JSON-based APIs, responses would contain a key "after_value" or "next_page_url" or something like this.

It should not take long to implement this.
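
The "after"-style pagination described above can be sketched as follows. This is an illustrative in-memory version, assuming a list of IDs sorted in ascending order; the function name `get_page` and its parameters are hypothetical and do not reflect how NOMAD queries its search index.

```python
import bisect


def get_page(sorted_ids, after=None, page_size=2):
    """Return (page, next_after) for "after"-style pagination.

    sorted_ids must be sorted ascending. `after` is the last value of the
    previous page; None or "" requests the first page.
    """
    # Start right behind the given value instead of at an integer page offset.
    start = bisect.bisect_right(sorted_ids, after) if after else 0
    page = sorted_ids[start:start + page_size]
    has_more = start + page_size < len(sorted_ids)
    # The client passes next_after back to fetch the following page.
    next_after = page[-1] if page and has_more else None
    return page, next_after
```

A client loops until `next_after` is `None`, passing each returned value as `after` in the next call.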

markus1978 commented 3 years ago

I added a first implementation of the catalog endpoint. Here is the API; and here an example request.

There is only a catalog instance in there that lists the datasets, which are also in the results. I don't know if you need more attributes on the catalog.

Within the confines of a dcat:Catalog, I am not sure how to communicate basic data about the next page (like a URL to the "next" catalog) or how many entries there are, i.e. how many pages to expect.

simeonackermann commented 3 years ago

Cool! First, I think it's not necessary to put all DCAT data of the datasets in the result. This increases the response size, and the CKAN sync plugin requests each dataset anyway. So for the /catalog endpoint I suggest keeping only the properties rdf:type, dct:description, dct:title, dct:modified, dct:identifier, and dct:issued to keep it slim.

CKAN solves the pagination problem with Hydra: each catalog page includes a paging entry like the following:

@prefix hydra: <http://www.w3.org/ns/hydra/core#> .

<http://example.com/catalog.ttl?page=1> a hydra:PagedCollection ;
    hydra:firstPage "http://example.com/catalog.ttl?page=1" ;
    hydra:itemsPerPage 100 ;
    hydra:lastPage "http://example.com/catalog.ttl?page=3" ;
    hydra:nextPage "http://example.com/catalog.ttl?page=2" ;
    hydra:totalItems 283 .

Maybe this can be a solution?

markus1978 commented 3 years ago

Sorry, I have not had time to fix this yet. I hope to find some time next week.

At first glance, I think I can use hydra:next (or hydra:nextPage), hydra:first (or hydra:firstPage), hydra:totalItems, and hydra:itemsPerPage to describe the "after"-style pagination. Without actual pages in Elasticsearch, I would rather use next than nextPage. Also, I cannot provide hydra:lastPage.
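
As a rough illustration of that idea, the sketch below renders a Hydra description for "after"-style pagination as a Turtle string. It is built by hand with the standard library; the function name `hydra_turtle` is hypothetical, and the choices to use the bare catalog URL as the first page and `hydra:Collection` as the class are assumptions, not the deployed behaviour.

```python
from urllib.parse import urlencode


def hydra_turtle(catalog_url, total_items, page_size, next_after=None):
    """Render a Hydra paging description in Turtle (illustrative only)."""
    lines = [
        "@prefix hydra: <http://www.w3.org/ns/hydra/core#> .",
        "",
        f"<{catalog_url}> a hydra:Collection ;",
        # Assumption: the first page is the catalog URL without "after".
        f"    hydra:first <{catalog_url}> ;",
    ]
    if next_after is not None:
        # The next page is addressed by the last value of the current page.
        next_url = f"{catalog_url}?{urlencode({'after': next_after})}"
        lines.append(f"    hydra:next <{next_url}> ;")
    lines.append(f"    hydra:itemsPerPage {page_size} ;")
    lines.append(f"    hydra:totalItems {total_items} .")
    return "\n".join(lines)
```

Note that no `hydra:lastPage` is emitted, since the last "after" value is not known without walking through all pages.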

markus1978 commented 3 years ago

I added a hydra object to the catalog and reduced the properties of the datasets. Here is the API; and here an example request.

I forgot to add the other parameters to the hydra URLs. Despite this, is this something you can work with?

Also, rdflib is outputting a flat list of objects that are linked via references. In many cases this feels unnecessary: datasets should be children of the catalog. Should the hydra collection also be a child of the catalog?

simeonackermann commented 3 years ago

Nice, looks good, much cleaner now. The Hydra instance should probably be a root element, as the W3 example suggests. It may also be good to be CKAN-like, since they do it the same way (here). For this we have to change the URL of dcat:Catalog, which could be the base URL of NOMAD. The Hydra instance then gets the catalog endpoint URL.

Maybe you can change the HTTP content type depending on the requested format. If it's n3, turtle, or nt, it should be text/turtle or text/n3, not application/xml.

markus1978 commented 3 years ago

Sorry for the late reply; holidays and all.

I fixed the content type issue; it now changes according to the requested format.

I am afraid I do not understand the other proposed changes. Could you exemplify this some more?

simeonackermann commented 3 years ago

Content-Type fix works like a charm.

Let me try to clarify my Hydra request. The best/easiest solution would be for the NOMAD DCAT API to behave the same as the CKAN API.

As you can see in the ckan-dcat doc, they implement Hydra with the hydra:PagedCollection class and not as a child node of dcat:Catalog. Here is an example catalog.

In your case it could look like:

<https://nomad-lab.eu/dev/nomad/dcat/dcat/catalog> a hydra:PagedCollection ;
    hydra:firstPage "https://nomad-lab.eu/dev/nomad/dcat/dcat/catalog" ;
    hydra:nextPage "https://nomad-lab.eu/dev/nomad/dcat/dcat/catalog?after=_ZwH__nv9nnoX-j0BtDHcEAuFUFU" ;
    hydra:totalItems 8350 .

<https://nomad-lab.eu/dev/nomad/dcat> a dcat:Catalog ;
    dct:dataset <https://nomad-lab.eu/dev/nomad/dcat/dcat/datasets/0C-M84ujAcLhotDqpl6caWz1XjE->,
        <https://nomad-lab.eu/dev/nomad/dcat/dcat/datasets/BPVe5K-TVJl1PKnrkajslFkI-Xw0>,
        ...
markus1978 commented 3 years ago

OK, I think I got it now. Since there is no page=1, I added an empty "after" value as the hydra reference for the first page to emulate the ckan-dcat behaviour without having to change the URL path. It looks like this now:

<?xml version="1.0" encoding="utf-8"?>
  <rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dct="http://purl.org/dc/terms/"
  xmlns:dcat="http://www.w3.org/ns/dcat#"
  xmlns:hydra="http://www.w3.org/ns/hydra/core#"
>
    <hydra:collection rdf:about="https://nomad-lab.eu/dev/nomad/dcat/dcat/catalog?after=">
      <hydra:totalItems rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">8350</hydra:totalItems>
      <hydra:first rdf:resource="https://nomad-lab.eu/dev/nomad/dcat/dcat/catalog?after="/>
      <hydra:next rdf:resource="https://nomad-lab.eu/dev/nomad/dcat/dcat/catalog?after=-4VfkBxzyky8ffhLXVbNUGK2F6ej"/>
      <rdf:type rdf:resource="http://www.w3.org/ns/hydra/core#Collection"/>
    </hydra:collection>
    <dcat:Catalog rdf:about="https://nomad-lab.eu/dev/nomad/dcat/dcat/catalog">
      <dct:dataset>
        <dcat:Dataset rdf:about="https://nomad-lab.eu/dev/nomad/dcat/dcat/datasets/-3eGrOFkZ5afzB5xR5Tn7HFvk2sl">
          <dct:identifier>-3eGrOFkZ5afzB5xR5Tn7HFvk2sl</dct:identifier>
          <dct:title>Cu32O16</dct:title>
          <dct:modified>2020-12-16T06:23:27.357756</dct:modified>
          <dct:description>Force calculations needed for thermal conductivity calculations</dct:description>
          <dct:issued>2018-09-19T14:07:44.437000+00:00</dct:issued>
        </dcat:Dataset>
      </dct:dataset>
      ...
    </dcat:Catalog>
  </rdf:RDF>

I feel we are almost there. If you agree, I will fix some remaining minor issues and merge this for the next actual NOMAD release.

simeonackermann commented 3 years ago

Yep, I think that's fine. I have to solve some issues with CKAN's deprecated Hydra implementation and test everything on Wednesday. But I think that's it.

simeonackermann commented 3 years ago

OK, I have a new point: the current format query parameter should maybe also be part of the Hydra prev/next/first pagination URLs.

My first test with CKAN to sync data from Nomad synced 4500 datasets without an error.

simeonackermann commented 3 years ago

I noticed a wrong behavior (?) with Hydra's next page. The catalog with all datasets modified since 2021-01-01 says there are 580 items (see endpoint here). But if you access the next page from there, it reports 8350 items with modified dates from last year.

markus1978 commented 3 years ago

Ok, I'll look into it.

Be aware that this is still running on our dev servers and it's not the real data.

markus1978 commented 3 years ago

I fixed the issue with the missing filter on the next page.
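
One way to avoid losing a filter on the next page is to derive the next URL from the current request URL and replace only the "after" value. The sketch below shows this with the standard library; it is not the actual fix, and the parameter name `modified_since` is taken from the earlier suggestion in this thread rather than from the deployed API.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def next_page_url(request_url, next_after):
    """Build the next page URL, keeping every query parameter except "after"."""
    parts = urlsplit(request_url)
    # keep_blank_values=True so that parameters with empty values survive.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "after"
    ]
    params.append(("after", next_after))
    return urlunsplit(parts._replace(query=urlencode(params)))
```

For example, a request for `catalog?modified_since=2021-01-01` yields a next URL that still carries `modified_since=2021-01-01`, so the second page is filtered the same way as the first.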

I am not sure if I should really add the format parameter to the hydra (and other) URLs. As far as I understand, the URLs identify RDF objects, but the format does not contribute to the respective identities. Should it not be the same object, no matter how it is formatted? I guess it would be more correct not to use a format parameter at all, but an HTTP request header with the expected content type instead. What do you think?

simeonackermann commented 3 years ago

Moving the requested content type into the request header would be very nice and is common practice (although it is not required for the CKAN harvester).

markus1978 commented 3 years ago

I would go with Content-type "text/turtle", "text/n3", etc. XML is different, of course. How would I distinguish xml and pretty-xml, which technically would both be "application/xml"? Or should pretty-xml (the XML with a hierarchy instead of lots of flat objects with references) be the default and only supported XML?

simeonackermann commented 3 years ago

Hm, I can't find anything on a pretty-XML content type ... Some RDF types are listed here; maybe just use application/rdf+prettyxml.

markus1978 commented 3 years ago

Thanks for the list of content types. I'll go ahead and add the necessary behaviour. Should I leave the old format parameter as a fallback?

simeonackermann commented 3 years ago

Yes, the format parameter simplifies quick API requests from the browser.

markus1978 commented 3 years ago

I finally added the header handling. It will now interpret the "Accept" header in the request and use the right format and set the response "Content-type" accordingly. If there is a format parameter, it will ignore the header.
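
The precedence described here (format parameter first, then the "Accept" header) could look roughly like the sketch below. It is a simplified illustration, not the NOMAD code: the function name `choose_format` is hypothetical, the content type table is an assumption, `application/rdf+prettyxml` is only the name floated in this thread and not a registered media type, and q-values in the header are ignored.

```python
# Assumed mapping from format names to response content types.
FORMAT_CONTENT_TYPES = {
    "xml": "application/rdf+xml",
    "pretty-xml": "application/rdf+prettyxml",  # name suggested in this thread
    "turtle": "text/turtle",
    "n3": "text/n3",
    "nt": "application/n-triples",
}


def choose_format(accept_header=None, format_param=None, default="xml"):
    """Return (format, content_type); an explicit format parameter wins."""
    if format_param in FORMAT_CONTENT_TYPES:
        return format_param, FORMAT_CONTENT_TYPES[format_param]
    # Simplified parsing: walk the media types in order and ignore q-values.
    for item in (accept_header or "").split(","):
        media_type = item.split(";")[0].strip().lower()
        for fmt, content_type in FORMAT_CONTENT_TYPES.items():
            if media_type == content_type:
                return fmt, content_type
    return default, FORMAT_CONTENT_TYPES[default]
```

A production version would honour q-values and wildcards such as `*/*`; this sketch only demonstrates the order of precedence.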

This NOMAD dcat API is now also part of our actual release and NOMAD data: https://nomad-lab.eu/prod/rae/dcat/

TBoonX commented 3 years ago

The API is working. Only some of the datasets are available, but for this issue it's absolutely fine!

TBoonX commented 3 years ago

@markus1978 Working on the DCAT extension revealed that only the /catalog route is used. At the moment, only the metadata from the /catalog route is imported. What do you think about combining the two routes into one? Would it be much work for you? I could also try to partly rewrite the DCAT extension to also call the /dataset route, but this would result in a lot of work. I would prefer the route combination by FHI if it could be realized in a short time.

markus1978 commented 3 years ago

You want the catalog to return the full dataset objects in the same way the datasets route does? That should not be too hard. Should this be configurable through a catalog route query parameter? Or should the catalog simply always return the full datasets?

TBoonX commented 3 years ago

Exactly, the catalog route should also contain everything from the /dataset route for each dataset. No need for a parameter.

markus1978 commented 3 years ago

Ok. I think I got it. It takes some time to build and deploy.

markus1978 commented 3 years ago

We have the fix on our beta deployment now: https://nomad-lab.eu/prod/rae/beta/dcat/ It is the same data as on the official nomad. The fix will go to the official nomad, when we do the next release. Probably next week.

TBoonX commented 3 years ago

Thank you! If I try to get the second file of the catalog route, I get an error:
error:1400410B:SSL routines:CONNECT_CR_SRVR_HELLO:wrong version number
The URL is: https://nomad-lab.eu:80/prod/rae/beta/dcat/catalog?after=k0owICwEGs68Uvjj7HjtYzzaFzgp

TBoonX commented 3 years ago

I also noticed that the /catalog route does not contain the full creator and author data. Is it possible to include them also?

TBoonX commented 3 years ago

Some URLs are wrong, because they include the port. E.g.: https://nomad-lab.eu:80/prod/rae/beta/gui/entry/id/E-SzjykkRgGaJEzRl0M5RQ/XdvgSDBZgMmep7uK33Ne-EvjbI6x
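
To illustrate the problem: an `https` URL with an explicit `:80` makes the client attempt a TLS handshake on the plain HTTP port. The sketch below shows one possible way to drop an explicit port from a generated URL, assuming the public deployment is reachable on the scheme's default port. The function name `strip_port` is hypothetical and this is not necessarily how the URLs were fixed on the server side.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_port(url):
    """Return url without an explicit port (userinfo is dropped as well)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if ":" in host:
        # IPv6 literals need their brackets back after hostname strips them.
        host = f"[{host}]"
    return urlunsplit(parts._replace(netloc=host))
```

Applied to the example above, this yields `https://nomad-lab.eu/prod/rae/beta/gui/entry/id/E-SzjykkRgGaJEzRl0M5RQ/XdvgSDBZgMmep7uK33Ne-EvjbI6x`.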

markus1978 commented 3 years ago

I fixed the URLs. I hope this was also the cause of the SSL errors you saw.

I don't know if this is an rdflib issue or if I am too stupid to use it. For some reason, the creators/publishers of nested entities (e.g. datasets in catalogs) are only included via ID and reference, without the actual user data. The only way I could help myself was to also add the creators (including the publishers) to the catalog. This way all the data is there. Does this work for you?

TBoonX commented 3 years ago

Thanks, the URLs are working and I am able to read the creator/publisher triples. I assume RDFlib is creating new nodes out of the blank nodes given to it and thus creators have their own "structure". This is correct. As long as all triples are in the /catalog response, the RDF tools are working I think.

I did encounter another error: 500 on https://nomad-lab.eu/prod/rae/beta/dcat/catalog/?format=pretty-xml&after=k0p8BEw8J4CCnVDwMoUsfJWQLqdN

TBoonX commented 3 years ago

I found a general typo: dcat:landing_page is wrong. It should be dcat:landingPage

TBoonX commented 3 years ago

Also vcard:organizationName does not exist. Please use vcard:organization or vcard:hasOrganizationName

markus1978 commented 3 years ago

I fixed the two spelling things (organizationName -> organization) and (landing_page -> landingPage). Not sure about the HTTP 500 yet. I have to do some more testing.

markus1978 commented 3 years ago

The HTTP 500 should also be fixed now. The latest fixes are only available under the beta deployment for now: https://nomad-lab.eu/prod/rae/beta/dcat/

The next release will be at the end of July.