opendatateam / udata

Customizable and skinnable social platform dedicated to open data.
http://udata.readthedocs.org
GNU Affero General Public License v3.0

Harvest dataservices #3029

ThibaudDauce closed 4 months ago

ThibaudDauce commented 6 months ago

Fix https://github.com/datagouv/data.gouv.fr/issues/1353

To harvest dataservices, we first need to harvest datasets, because dataservices reference datasets through the `dcat:servesDataset` property. Right now, datasets are harvested asynchronously (by saving a HarvestJob and then queuing these jobs independently). This means we have to wait until all datasets are done before we start harvesting dataservices. Several options:

  1. Harvest the datasets synchronously (previously discussed in https://github.com/datagouv/data.gouv.fr/issues/1046#issuecomment-1964697779). We can then loop over the datasets in the graph and then over the dataservices in the graph, all in the same function. This requires some changes in all backends, and we would keep the HarvestJob for debugging purposes only.
  2. Harvest the dataservices inside the finalize function that is called at the end of all the jobs. Not a big fan, because it adds one more class and a lot of code.
  3. Harvest the dataservices inside some HarvestJob (either the same model as for the datasets or a new one) and do some Celery magic to dispatch all jobs with dependency chains. Not a big fan, because it complicates the architecture a lot.
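
A rough sketch of what option 1 could look like, to make the ordering explicit. This is not udata code: `HarvestResult`, `harvest_graph`, and the dict keys (`uri`, `title`, `serves`) are hypothetical names made up for the illustration, and plain dicts stand in for nodes of the parsed graph.

```python
from dataclasses import dataclass, field


@dataclass
class HarvestResult:
    """Hypothetical container for this sketch, not a udata class."""
    datasets: dict = field(default_factory=dict)     # dataset URI -> harvested dataset
    dataservices: list = field(default_factory=list)
    unresolved: list = field(default_factory=list)   # (dataservice URI, missing dataset URI)


def harvest_graph(dataset_nodes, dataservice_nodes):
    """Process every dataset first, then every dataservice, in one function.

    Both arguments are lists of plain dicts standing in for graph nodes;
    the keys used here are assumptions made for this sketch.
    """
    result = HarvestResult()

    # Step 1: datasets, synchronously, so they all exist before step 2.
    for node in dataset_nodes:
        result.datasets[node["uri"]] = {"title": node["title"]}

    # Step 2: dataservices can now resolve the datasets they reference.
    for node in dataservice_nodes:
        linked = []
        for dataset_uri in node.get("serves", []):
            if dataset_uri in result.datasets:
                linked.append(dataset_uri)
            else:
                # Keep track of references to datasets that were not harvested.
                result.unresolved.append((node["uri"], dataset_uri))
        result.dataservices.append({"uri": node["uri"], "datasets": linked})

    return result
```

Because both loops run in the same function, there is no need to wait for queued jobs: by the time the second loop starts, every dataset in the graph has already been handled.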
ThibaudDauce commented 5 months ago

On https://www.sandre.eaufrance.fr/atlas/srv/fre/csw, the dataservices are nested inside distributions and don't contain an identifier.

    <dcat:distribution>
      <dcat:Distribution rdf:nodeID="N0ef9ba6862a24ea6add8e8f9616c4a33">
        <dct:title xml:lang="fr">Accéder aux métadonnées des stations de mesure de la quantité des eaux souterraines sous forme de métadonnées 19115</dct:title>
        <dct:description xml:lang="fr">Accéder aux métadonnées des stations de mesure de la quantité des eaux souterraines sous forme de métadonnées 19115</dct:description>
        <dcat:accessService>
          <dcat:DataService rdf:nodeID="Naf34a026e3e34680adb26205ca1df159">
            <dct:title xml:lang="fr">Accéder aux métadonnées des stations de mesure de la quantité des eaux souterraines sous forme de métadonnées 19115</dct:title>
            <dcat:endpointURL rdf:resource="http://services.ades.eaufrance.fr/metadata/"/>
            <dcat:endpointDescription rdf:resource="http://services.ades.eaufrance.fr/metadata/?version=1.0.0&amp;service=SANDRE:Metadata&amp;request=Getcapabilities"/>
          </dcat:DataService>
        </dcat:accessService>
        <dcat:accessURL rdf:resource="http://services.ades.eaufrance.fr/metadata/?version=1.0.0&amp;service=SANDRE:Metadata&amp;request=Getcapabilities"/>
        <dct:license>
          <dct:LicenseDocument rdf:nodeID="N9f1d42201f46426f9c77c68f776bf3cb">
            <rdfs:label xml:lang="fr">Licence Ouverte Etalab, https://www.etalab.gouv.fr/licence-ouverte-open-licence</rdfs:label>
          </dct:LicenseDocument>
        </dct:license>
        <dct:license>
          <dct:LicenseDocument rdf:nodeID="N8f182faf36574962b57454031aae48d7">
            <rdfs:label xml:lang="fr">Pas de restriction d'accès public</rdfs:label>
          </dct:LicenseDocument>
        </dct:license>
        <dct:accessRights>
          <dct:RightsStatement rdf:nodeID="Nb7d630e38ed0412f8d05052e44706698">
            <rdfs:label xml:lang="fr">Pas de restriction d'accès public</rdfs:label>
          </dct:RightsStatement>
        </dct:accessRights>
        <adms:representationTechnique rdf:resource="http://inspire.ec.europa.eu/metadata-codelist/SpatialRepresentationType/vector"/>
        <dct:format rdf:resource="http://publications.europa.eu/resource/authority/file-type/PNG"/>
        <cnt:characterEncoding rdf:datatype="http://www.w3.org/2001/XMLSchema#string">UTF-8</cnt:characterEncoding>
      </dcat:Distribution>
    </dcat:distribution>
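
To illustrate the shape above, here is a minimal sketch that pulls the nested `dcat:DataService` nodes out of a trimmed copy of this record. It uses only the standard library's `xml.etree.ElementTree` to stay self-contained; a real backend would more likely work on a parsed RDF graph. The function name `nested_dataservices` is hypothetical, and using `dcat:endpointURL` as a fallback key (since there is no identifier) is an assumption, not a decision made in this PR.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}
RDF_RESOURCE = "{%s}resource" % NS["rdf"]

# Trimmed copy of the record above, wrapped in rdf:RDF to declare namespaces.
SAMPLE = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcat="http://www.w3.org/ns/dcat#"
         xmlns:dct="http://purl.org/dc/terms/">
  <dcat:Distribution rdf:nodeID="N0ef9ba6862a24ea6add8e8f9616c4a33">
    <dcat:accessService>
      <dcat:DataService rdf:nodeID="Naf34a026e3e34680adb26205ca1df159">
        <dcat:endpointURL rdf:resource="http://services.ades.eaufrance.fr/metadata/"/>
        <dcat:endpointDescription rdf:resource="http://services.ades.eaufrance.fr/metadata/?version=1.0.0&amp;service=SANDRE:Metadata&amp;request=Getcapabilities"/>
      </dcat:DataService>
    </dcat:accessService>
  </dcat:Distribution>
</rdf:RDF>
"""


def nested_dataservices(rdf_xml):
    """Return one dict per DataService found under dcat:accessService.

    The nodes are blank nodes without dct:identifier, so the endpoint URL
    is used as a deduplication key (an assumption made for this sketch).
    """
    root = ET.fromstring(rdf_xml)
    services = {}
    for service in root.iterfind(".//dcat:accessService/dcat:DataService", NS):
        endpoint = service.find("dcat:endpointURL", NS)
        url = endpoint.get(RDF_RESOURCE) if endpoint is not None else None
        if not url:
            continue  # nothing stable to identify this service with
        title = service.findtext("dct:title", default="", namespaces=NS)
        services.setdefault(url, {"endpoint_url": url, "title": title})
    return list(services.values())


services = nested_dataservices(SAMPLE)
```

The `rdf:nodeID` values are local to one serialization of the graph, which is why the sketch does not use them as keys.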
ThibaudDauce commented 5 months ago

On https://data.naturefrance.fr/geonetwork/srv/eng/csw (csw-dcat), we have an rdf:Description containing links to datasets/distributions (?) inside dct:relation.

<rdf:Description rdf:about="https://data.naturefrance.fr/geonetwork/srv/resources/records/https://inpn-inspire.mnhn.fr/catalogue/srv/cdda_2019_view">
    <dct:identifier>https://inpn-inspire.mnhn.fr/catalogue/srv/cdda_2019_view</dct:identifier>
    <dcat:landingPage>https://data.naturefrance.fr/geonetwork/srv/resources/records/https://inpn-inspire.mnhn.fr/catalogue/srv/cdda_2019_view</dcat:landingPage>
    <dct:title>Géoservice WMS INPN</dct:title>
    <dct:abstract>Géoservice de visualisation des données rapportées
CDDA, Natura2000, DHFF habitats et espèces, DO, EEE

GEMET - INSPIRE themes, version 1.0 : Protected sites.
INSPIRE priority data set : Nationally designated areas - CDDA.</dct:abstract>
    <dcat:theme>
      <skos:Concept rdf:about="https://data.naturefrance.fr/geonetwork/srv/resources/records/registries/vocabularies/GEMET%20-%20INSPIRE%20themes%2C%20version%201.0/concepts/Protected%20sites">
        <skos:inScheme rdf:resource="https://data.naturefrance.fr/geonetwork/srv/resources/records/registries/vocabularies/GEMET%20-%20INSPIRE%20themes%2C%20version%201.0"/>
        <skos:prefLabel>Protected sites</skos:prefLabel>
      </skos:Concept>
    </dcat:theme>
    <dcat:theme>
      <skos:Concept rdf:about="https://data.naturefrance.fr/geonetwork/srv/resources/records/registries/vocabularies/INSPIRE%20priority%20data%20set/concepts/Nationally%20designated%20areas%20-%20CDDA">
        <skos:inScheme rdf:resource="https://data.naturefrance.fr/geonetwork/srv/resources/records/registries/vocabularies/INSPIRE%20priority%20data%20set"/>
        <skos:prefLabel>Nationally designated areas - CDDA</skos:prefLabel>
      </skos:Concept>
    </dcat:theme>
    <dct:issued>2019-06</dct:issued>
    <dct:publisher>
      <foaf:Organization rdf:about="https://data.naturefrance.fr/geonetwork/srv/resources/records/organizations/UMS%202006%20Patrimoine%20Naturel">
        <foaf:name>UMS 2006 Patrimoine Naturel</foaf:name>
        <foaf:member rdf:resource="https://data.naturefrance.fr/geonetwork/srv/resources/records/persons/ep_spn%40mnhn.fr"/>
      </foaf:Organization>
    </dct:publisher>
    <dct:accessRights rdf:resource="http://inspire.ec.europa.eu/registry/metadata-codelist/ConditionsApplyingToAccessAndUse/noConditionsApply"/>
    <dcat:distribution>
      <dcat:Distribution rdf:about="https://data.naturefrance.fr/geonetwork/records/a5763b0f-3c38-40ff-b078-e64b2a29a573#OGC%3AWMS-1.3.0-http-get-capabilities-">
        <dcat:accessURL>https://inpn-inspire.mnhn.fr/geoservices/INPN_INSPIRE/wms?service=WMS&amp;version=1.3.0&amp;request=GetCapabilities</dcat:accessURL>
        <dct:format>OGC:WMS-1.3.0-http-get-capabilities</dct:format>
      </dcat:Distribution>
    </dcat:distribution>
    <dct:relation rdf:resource="https://data.naturefrance.fr/geonetwork/records/924d3402-06e2-4916-8788-caba18b631cb"/>
    <dct:relation rdf:resource="https://data.naturefrance.fr/geonetwork/records/15f97f5e-2619-4889-badd-fd760cf9ee33"/>
    <dct:relation rdf:resource="https://data.naturefrance.fr/geonetwork/records/0fe01bae-811b-4692-8099-9f301b6792a5"/>
    <dct:relation rdf:resource="https://data.naturefrance.fr/geonetwork/records/9f41e4b2-9734-4ccd-ac61-275723c7811c"/>
    <dct:relation rdf:resource="https://data.naturefrance.fr/geonetwork/records/534b5e77-50bb-44b2-b3d0-430f908949b2"/>
    <dct:relation rdf:resource="https://data.naturefrance.fr/geonetwork/records/11663514-ddd0-4e66-bd58-3540d18c6784"/>
    <dct:relation rdf:resource="https://data.naturefrance.fr/geonetwork/records/9b05d1ea-f11f-4e3f-b8f2-8596c8ac1e75"/>
</rdf:Description>
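
For reference, a minimal sketch of how the `dct:relation` targets of such a record could be collected, again with the standard library's `xml.etree.ElementTree` on a trimmed copy of the record. The function name `related_records` is hypothetical, and whether these targets are datasets or distributions is still the open question above; the sketch only lists them.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dct": "http://purl.org/dc/terms/",
}
RDF_ABOUT = "{%s}about" % NS["rdf"]
RDF_RESOURCE = "{%s}resource" % NS["rdf"]

# Trimmed copy of the record above, wrapped in rdf:RDF to declare namespaces.
SAMPLE = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dct="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="https://data.naturefrance.fr/geonetwork/srv/resources/records/https://inpn-inspire.mnhn.fr/catalogue/srv/cdda_2019_view">
    <dct:title>Géoservice WMS INPN</dct:title>
    <dct:relation rdf:resource="https://data.naturefrance.fr/geonetwork/records/924d3402-06e2-4916-8788-caba18b631cb"/>
    <dct:relation rdf:resource="https://data.naturefrance.fr/geonetwork/records/15f97f5e-2619-4889-badd-fd760cf9ee33"/>
  </rdf:Description>
</rdf:RDF>
"""


def related_records(rdf_xml):
    """Map each rdf:Description URI to the URIs listed in its dct:relation."""
    root = ET.fromstring(rdf_xml)
    relations = {}
    for description in root.iterfind("rdf:Description", NS):
        about = description.get(RDF_ABOUT)
        targets = [
            rel.get(RDF_RESOURCE)
            for rel in description.iterfind("dct:relation", NS)
        ]
        # Drop relations that carry no rdf:resource attribute.
        relations[about] = [target for target in targets if target]
    return relations


relations = related_records(SAMPLE)
```

Each target URI would then have to be matched against the records harvested from the same catalog to find out what it points to.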