magda-io / magda

A federated, open-source data catalog for all your big data and small data
https://magda.io
Apache License 2.0

TERN CSW connector failed to capture distributions for some datasets #2009

Closed: t83714 closed this issue 5 years ago

t83714 commented 5 years ago

Problem description

Knowledge Network uses the CSW connector to crawl the TERN data source. Endpoint: https://geonetwork.tern.org.au/geonetwork/srv/eng/csw

The crawl results show that some datasets have 0 distributions.

An example is:

https://staging-test.knowledgenet.co/dataset/ds-tern-45e488c7-38ad-40a8-97e7-ad16a9c1c8f9

The dataset detail is:

{
  "id": "ds-tern-45e488c7-38ad-40a8-97e7-ad16a9c1c8f9",
  "name": "TERN OzFlux Dry River Tower Data Service",
  "aspects": {
    "source": {
      "id": "tern",
      "url": "https://geonetwork.tern.org.au/geonetwork/srv/eng/csw?service=CSW&version=2.0.2&request=GetRecordById&elementsetname=full&outputschema=http%3A%2F%2Fwww.isotc211.org%2F2005%2Fgmd&typeNames=gmd%3AMD_Metadata&id=45e488c7-38ad-40a8-97e7-ad16a9c1c8f9",
      "name": "Terrestrial Ecosystem Research Network",
      "type": "csw-dataset"
    },
    "dataset-linked-data-rating": {
      "stars": 0
    },
    "dcat-dataset-strings": {
      "spatial": "POLYGON((132.3706 -15.2588, 132.3706 -15.2588, 132.3706 -15.2588, 132.3706 -15.2588, 132.3706 -15.2588))",
      "description": "This service provides access to ecosystem flux data from the Dry River site, 89 km south of Katherine in the Northern Territory. The site is open forest savanna.",
      "modified": "2018-03-23T09:00:00",
      "issued": "2016-01-06T09:00:00",
      "contactPoint": "Terrestrial Ecosystem Research Network (TERN), esupport@tern.org.au",
      "languages": [
        "eng"
      ],
      "temporal": {
        "end": "2017-06-19",
        "start": "2008-08-31"
      },
      "publisher": "Terrestrial Ecosystem Research Network (TERN), esupport@tern.org.au",
      "keywords": [
        "Environmental Monitoring",
        "Ecosystem Function",
        "SOIL SCIENCES",
        "ATMOSPHERIC SCIENCES",
        "EARTH SCIENCES",
        "ENVIRONMENTAL SCIENCES",
        "ECOLOGICAL APPLICATIONS",
        "OpenDAP",
        "HEAT FLUX",
        "LAND PRODUCTIVITY",
        "AIR TEMPERATURE",
        "SHORTWAVE RADIATION",
        "PHOTOSYNTHETICALLY ACTIVE RADIATION",
        "LONGWAVE RADIATION",
        "SOIL MOISTURE/WATER CONTENT",
        "BIOGEOCHEMICAL PROCESSES",
        "TRACE GASES/TRACE SPECIES",
        "SOIL TEMPERATURE",
        "PRECIPITATION AMOUNT",
        "CARBON DIOXIDE",
        "ATMOSPHERIC PRESSURE MEASUREMENTS",
        "INCOMING SOLAR RADIATION",
        "TURBULENCE",
        "CARBON",
        "HUMIDITY",
        "WIND SPEED/WIND DIRECTION",
        "EVAPOTRANSPIRATION",
        "TERRESTRIAL ECOSYSTEMS"
      ],
      "title": "TERN OzFlux Dry River Tower Data Service",
      "themes": [
        "climatologyMeteorologyAtmosphere"
      ]
    },
    "temporal-coverage": {
      "intervals": [
        {
          "end": "2017-06-19",
          "start": "2008-08-31"
        }
      ]
    },
    "dataset-publisher": {
      "publisher": {
        "id": "org-tern-Terrestrial Ecosystem Research Network (TERN)",
        "name": "Terrestrial Ecosystem Research Network (TERN)",
        "aspects": {
          "source": {
            "id": "tern",
            "url": "https://geonetwork.tern.org.au/geonetwork/srv/eng/csw",
            "name": "Terrestrial Ecosystem Research Network",
            "type": "csw-organization"
          },
          "organization-details": {
            "name": "Terrestrial Ecosystem Research Network (TERN)",
            "email": "esupport@tern.org.au",
            "addrState": "QLD",
            "addrSuburb": "Brisbane",
            "addrStreet": "Building 8, University of Queensland",
            "title": "Terrestrial Ecosystem Research Network (TERN)",
            "addrPostCode": "4072",
            "phone": "+61 7 3365 9097",
            "addrCountry": "Australia"
          }
        }
      }
    },
    "dataset-distributions": {
      "distributions": []
    },
    "dataset-quality-rating": {
      "dataset-linked-data-rating": {
        "score": 0,
        "weighting": 1
      }
    }
  },
  "sourceTag": "50b245aa-0da6-4bef-8590-640938bd7b07"
}

The crawling URL for that dataset is:

https://geonetwork.tern.org.au/geonetwork/srv/eng/csw?service=CSW&version=2.0.2&request=GetRecordById&elementsetname=full&outputschema=http%3A%2F%2Fwww.isotc211.org%2F2005%2Fgmd&typeNames=gmd%3AMD_Metadata&id=45e488c7-38ad-40a8-97e7-ad16a9c1c8f9

Our CSW connector currently looks for distributions only at this JSON path: $.distributionInfo[*].MD_Distribution[*].transferOptions[*].MD_DigitalTransferOptions[*].onLine[*].CI_OnlineResource[*]

However, some datasets may have distribution info published at: $.distributionInfo[*].MD_Distribution[*].distributionFormat[*].MD_Format[*].formatDistributor[*].MD_Distributor[*].distributorTransferOptions[*].MD_DigitalTransferOptions[*].onLine[*].CI_OnlineResource[*]
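
To illustrate why a fixed path returns nothing for such records, here is a minimal Python sketch. It is not the connector's code: `walk`, `nest`, the `linkage` field, and the example.org URLs are hypothetical, and it assumes the XML has been converted to JSON with every element wrapped in a list. The distributor element is spelled `MD_Distributor` here, as in ISO 19139.

```python
from typing import Any, List


def walk(node: Any, keys: List[str]) -> List[Any]:
    """Follow `keys` through nested dicts, fanning out over lists (the `[*]` steps)."""
    if isinstance(node, list):
        return [hit for item in node for hit in walk(item, keys)]
    if not keys:
        return [node]
    if isinstance(node, dict) and keys[0] in node:
        return walk(node[keys[0]], keys[1:])
    return []  # the path does not exist in this record


def nest(keys: List[str], leaf: List[Any]) -> dict:
    """Build a sample record: each key wraps the next level in a one-item list."""
    node: Any = leaf
    for key in reversed(keys):
        node = [{key: node}]
    return node[0]


# The path the connector currently queries.
CURRENT_PATH = [
    "distributionInfo", "MD_Distribution", "transferOptions",
    "MD_DigitalTransferOptions", "onLine", "CI_OnlineResource",
]
# The path some TERN records use instead.
ALTERNATIVE_PATH = [
    "distributionInfo", "MD_Distribution", "distributionFormat", "MD_Format",
    "formatDistributor", "MD_Distributor", "distributorTransferOptions",
    "MD_DigitalTransferOptions", "onLine", "CI_OnlineResource",
]

# Hypothetical records: one per layout, each with a single online resource.
record_a = nest(CURRENT_PATH, [{"linkage": "https://example.org/a"}])
record_b = nest(ALTERNATIVE_PATH, [{"linkage": "https://example.org/b"}])

print(walk(record_a, CURRENT_PATH))      # resource found
print(walk(record_b, CURRENT_PATH))      # nothing found: wrong branch
print(walk(record_b, ALTERNATIVE_PATH))  # resource found via the other branch
```

The second call comes back empty because the record has no `transferOptions` key directly under `MD_Distribution`, which matches the empty `distributions` array in the dataset above.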

Proposed Fix

Have the CSW connector search for all MD_DigitalTransferOptions nodes anywhere in the record, rather than only under one particular JSON path.
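
A rough sketch of that idea in Python, to show the shape of the search rather than the actual connector change. `find_nodes`, `online_resources`, `nest`, the `linkage` field, and the example.org URLs are all hypothetical, and the records are assumed to be JSON conversions of the XML in which every element is wrapped in a list.

```python
from typing import Any, Iterator, List


def find_nodes(node: Any, key: str) -> Iterator[Any]:
    """Yield every node stored under `key`, at any depth of nested dicts and lists."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                # Values are usually lists of nodes; yield each node separately.
                yield from (v if isinstance(v, list) else [v])
            yield from find_nodes(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_nodes(item, key)


def online_resources(record: Any) -> List[Any]:
    """Collect CI_OnlineResource nodes under any MD_DigitalTransferOptions node."""
    return [
        resource
        for options in find_nodes(record, "MD_DigitalTransferOptions")
        for resource in find_nodes(options, "CI_OnlineResource")
    ]


def nest(keys: List[str], leaf: List[Any]) -> dict:
    """Build a sample record: each key wraps the next level in a one-item list."""
    node: Any = leaf
    for key in reversed(keys):
        node = [{key: node}]
    return node[0]


TAIL = ["MD_DigitalTransferOptions", "onLine", "CI_OnlineResource"]

# Hypothetical record using the layout the connector already handles.
direct = nest(
    ["distributionInfo", "MD_Distribution", "transferOptions"] + TAIL,
    [{"linkage": "https://example.org/direct"}],
)
# Hypothetical record using the distributor-based layout.
via_distributor = nest(
    ["distributionInfo", "MD_Distribution", "distributionFormat", "MD_Format",
     "formatDistributor", "MD_Distributor", "distributorTransferOptions"] + TAIL,
    [{"linkage": "https://example.org/via-distributor"}],
)

print(online_resources(direct))
print(online_resources(via_distributor))
```

Both layouts yield their resource because the search keys on the node name, not its position. One caveat of this sketch: if an MD_DigitalTransferOptions node were ever nested inside another one, its resources would be returned twice, so a real implementation may want to de-duplicate.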

jyucsiro commented 5 years ago

It looks like GA also follows a similar metadata publication format. See https://ecat.ga.gov.au/geonetwork/srv/api/records/a977e563-c6e3-4d17-bbba-b390d28ca0a7/formatters/xml