magda-io / magda

A federated, open-source data catalog for all your big data and small data
https://magda.io
Apache License 2.0

Indexer takes 5 minutes to index a page with bad spatial extent #2061

Closed AlexGilleran closed 5 years ago

AlexGilleran commented 5 years ago

Problem description

We recently found a problem where the data.gov.au dataset with record id caused webhooks to the indexer to completely lock up, on account of the indexer taking 5 minutes to index the page of events.

The event page had these records in it:

"dist-sdinsw-{02E30FD2-D967-4F25-BEEB-E2D9B025C5E2}-0"
"dist-sdinsw-{02E30FD2-D967-4F25-BEEB-E2D9B025C5E2}-1"
"dist-sdinsw-{B67E5266-0AAA-456E-A094-DA0DFB67DD32}-1"
"ds-sdinsw-{02E30FD2-D967-4F25-BEEB-E2D9B025C5E2}"
"ds-sdinsw-{B67E5266-0AAA-456E-A094-DA0DFB67DD32}"

The two datasets at the time of writing look like:

{
  "id": "ds-sdinsw-{02E30FD2-D967-4F25-BEEB-E2D9B025C5E2}",
  "name": "{02E30FD2-D967-4F25-BEEB-E2D9B025C5E2}",
  "aspects": {
    "source": {
      "id": "sdinsw",
      "url": "https://sdi.nsw.gov.au/csw?service=CSW&version=2.0.2&request=GetRecordById&elementsetname=full&outputschema=http%3A%2F%2Fwww.isotc211.org%2F2005%2Fgmd&typeNames=gmd%3AMD_Metadata&id=%7B02E30FD2-D967-4F25-BEEB-E2D9B025C5E2%7D",
      "name": "NSW Land and Property",
      "type": "csw-dataset"
    },
    "dataset-linked-data-rating": {
      "stars": 0
    },
    "dcat-dataset-strings": {
      "spatial": "POLYGON((-180 20106.198, -3009.0384 20106.198, -3009.0384 90, -180 90, -180 20106.198))",
      "description": "Native vegetaion mapping of the Barmah National Park (Vic), part of Barmah Forest area.This map was prepared to specifically show the spatial delineation of stem density and canopy condition categories of vegetation in the Barmah and Millewa Forest areas. This map was prepared using aerial photo interpretation (API) of ADS40 digital aerial photography captured in 2010.  The map was prepared by updating (linework and attribution) of existing vegetation mapping produced in 2005 by Doug Frood (Frood 2007). Barmah National Park and the Murray River Park collectively known as the Barmah Forest. VIS_ID 3870",
      "accrualPeriodicity": "notPlanned",
      "modified": "2012-09-11",
      "issued": "2012-09-11",
      "contactPoint": "Office of Environment and Heritage (OEH), data.broker@environment.nsw.gov.au",
      "languages": [
        "eng"
      ],
      "temporal": {
        "end": "2010-05",
        "start": "2009-12"
      },
      "publisher": "Office of Environment and Heritage (OEH), data.broker@environment.nsw.gov.au",
      "keywords": [
        "nsw"
      ],
      "title": "Vegetation community and river red gum canopy condition map of Barmah National Park. VIS_ID 3870",
      "themes": [
        "biota",
        "environment"
      ]
    },
    "temporal-coverage": {
      "intervals": [
        {
          "end": "2010-05",
          "start": "2009-12"
        }
      ]
    },
    "dataset-publisher": {
      "publisher": {
        "id": "org-sdinsw-Office of Environment and Heritage (OEH)",
        "name": "Office of Environment and Heritage (OEH)",
        "aspects": {
          "source": {
            "id": "sdinsw",
            "url": "https://sdi.nsw.gov.au/csw",
            "name": "NSW Land and Property",
            "type": "csw-organization"
          },
          "organization-details": {
            "name": "Office of Environment and Heritage (OEH)",
            "email": "data.broker@environment.nsw.gov.au",
            "addrState": "NSW",
            "addrSuburb": "Parramatta",
            "addrStreet": "PO Box 3720",
            "title": "Office of Environment and Heritage (OEH)",
            "addrPostCode": "2124",
            "phone": "02 6740 2349",
            "addrCountry": "Australia"
          }
        }
      }
    },
    "dataset-distributions": {
      "distributions": [
        {
          "id": "dist-sdinsw-{02E30FD2-D967-4F25-BEEB-E2D9B025C5E2}-0",
          "name": "{02E30FD2-D967-4F25-BEEB-E2D9B025C5E2}-0",
          "aspects": {
            "source": {
              "id": "sdinsw",
              "url": "https://sdi.nsw.gov.au/csw?service=CSW&version=2.0.2&request=GetRecordById&elementsetname=full&outputschema=http%3A%2F%2Fwww.isotc211.org%2F2005%2Fgmd&typeNames=gmd%3AMD_Metadata&id=%7B02E30FD2-D967-4F25-BEEB-E2D9B025C5E2%7D",
              "name": "NSW Land and Property",
              "type": "csw-distribution"
            },
            "csw-distribution": {},
            "source-link-status": {
              "status": "broken",
              "errorDetails": {
                "code": "UNABLE_TO_VERIFY_LEAF_SIGNATURE"
              }
            },
            "dcat-distribution-strings": {
              "issued": "2012-09-11",
              "rights": "Please refer to the file readme.txt contained in the data package for use restrictions and Licensing. For further detail or inquiries contact data.broker@environment.nsw.gov.au."
            }
          }
        },
        {
          "id": "dist-sdinsw-{02E30FD2-D967-4F25-BEEB-E2D9B025C5E2}-1",
          "name": "{02E30FD2-D967-4F25-BEEB-E2D9B025C5E2}-1",
          "aspects": {
            "source": {
              "id": "sdinsw",
              "url": "https://sdi.nsw.gov.au/csw?service=CSW&version=2.0.2&request=GetRecordById&elementsetname=full&outputschema=http%3A%2F%2Fwww.isotc211.org%2F2005%2Fgmd&typeNames=gmd%3AMD_Metadata&id=%7B02E30FD2-D967-4F25-BEEB-E2D9B025C5E2%7D",
              "name": "NSW Land and Property",
              "type": "csw-distribution"
            },
            "csw-distribution": {},
            "dcat-distribution-strings": {
              "issued": "2012-09-11",
              "rights": "Please refer to the file readme.txt contained in the data package for use restrictions and Licensing. For further detail or inquiries contact data.broker@environment.nsw.gov.au.",
              "accessURL": "http://mapdata.environment.nsw.gov.au/geonetwork/srv/en/metadata.show?uuid={02E30FD2-D967-4F25-BEEB-E2D9B025C5E2}"
            }
          }
        }
      ]
    },
    "dataset-quality-rating": {
      "dataset-linked-data-rating": {
        "score": 0,
        "weighting": 1
      }
    }
  },
  "sourceTag": "cc932ccd-79f7-4151-b6d7-94a029505005"
}

and

{
  "id": "ds-sdinsw-{B67E5266-0AAA-456E-A094-DA0DFB67DD32}",
  "name": "{B67E5266-0AAA-456E-A094-DA0DFB67DD32}",
  "aspects": {
    "source": {
      "id": "sdinsw",
      "url": "https://sdi.nsw.gov.au/csw?service=CSW&version=2.0.2&request=GetRecordById&elementsetname=full&outputschema=http%3A%2F%2Fwww.isotc211.org%2F2005%2Fgmd&typeNames=gmd%3AMD_Metadata&id=%7BB67E5266-0AAA-456E-A094-DA0DFB67DD32%7D",
      "name": "NSW Land and Property",
      "type": "csw-dataset"
    },
    "dataset-linked-data-rating": {
      "stars": 0
    },
    "dcat-dataset-strings": {
      "spatial": "POLYGON((147.001 -35.008, 148.519 -35.008, 148.519 -33.989, 147.001 -33.989, 147.001 -35.008))",
      "description": "This map is one of a series of soil landscape maps that are intended for all of central and eastern NSW, based on standard 1:100,000 and 1:250,000 topographic sheets. The map provides an inventory of soil and landscape properties of the area and identifies major soil and landscape qualities and constraints. It integrates soil and topographic features into single units with relatively uniform land management requirements. Soils are described in terms of soil materials in addition to the Australian Soil Classification and the Great Soil Group systems.",
      "accrualPeriodicity": "asNeeded",
      "modified": "2010-09-20",
      "issued": "2010-09-20",
      "contactPoint": "Office of Environment and Heritage (OEH), data.broker@environment.nsw.gov.au",
      "languages": [
        "eng"
      ],
      "temporal": {
        "end": "2010",
        "start": "1994"
      },
      "publisher": "Office of Environment and Heritage (OEH), data.broker@environment.nsw.gov.au",
      "keywords": [
        "nsw",
        "AGRICULTURE",
        "GEOSCIENCES-Geology",
        "GEOSCIENCES-Geomorphology",
        "HAZARDS-Flood",
        "HAZARDS-Landslip",
        "LAND-Topography",
        "SOIL",
        "SOIL-Chemistry",
        "SOIL-Erosion",
        "SOIL-Physics",
        "VEGETATION"
      ],
      "title": "Soil Landscapes of the Cootamundra 1:250,000 Sheet",
      "themes": [
        "environment"
      ]
    },
    "temporal-coverage": {
      "intervals": [
        {
          "end": "2010",
          "start": "1994"
        }
      ]
    },
    "dataset-publisher": {
      "publisher": {
        "id": "org-sdinsw-Office of Environment and Heritage (OEH)",
        "name": "Office of Environment and Heritage (OEH)",
        "aspects": {
          "source": {
            "id": "sdinsw",
            "url": "https://sdi.nsw.gov.au/csw",
            "name": "NSW Land and Property",
            "type": "csw-organization"
          },
          "organization-details": {
            "name": "Office of Environment and Heritage (OEH)",
            "email": "data.broker@environment.nsw.gov.au",
            "addrState": "NSW",
            "addrSuburb": "Parramatta",
            "addrStreet": "PO Box 3720",
            "title": "Office of Environment and Heritage (OEH)",
            "addrPostCode": "2124",
            "phone": "02 6740 2349",
            "addrCountry": "Australia"
          }
        }
      }
    },
    "dataset-distributions": {
      "distributions": [
        {
          "id": "dist-sdinsw-{B67E5266-0AAA-456E-A094-DA0DFB67DD32}-0",
          "name": "{B67E5266-0AAA-456E-A094-DA0DFB67DD32}-0",
          "aspects": {
            "source": {
              "id": "sdinsw",
              "url": "https://sdi.nsw.gov.au/csw?service=CSW&version=2.0.2&request=GetRecordById&elementsetname=full&outputschema=http%3A%2F%2Fwww.isotc211.org%2F2005%2Fgmd&typeNames=gmd%3AMD_Metadata&id=%7BB67E5266-0AAA-456E-A094-DA0DFB67DD32%7D",
              "name": "NSW Land and Property",
              "type": "csw-distribution"
            },
            "csw-distribution": {},
            "source-link-status": {
              "status": "broken",
              "errorDetails": {
                "code": "UNABLE_TO_VERIFY_LEAF_SIGNATURE"
              }
            },
            "dcat-distribution-strings": {
              "issued": "2010-09-20",
              "rights": "Please refer to the file readme.txt contained in the data package for use restrictions and Licensing. For further detail or inquiries contact data.broker@environment.nsw.gov.au."
            }
          }
        },
        {
          "id": "dist-sdinsw-{B67E5266-0AAA-456E-A094-DA0DFB67DD32}-1",
          "name": "{B67E5266-0AAA-456E-A094-DA0DFB67DD32}-1",
          "aspects": {
            "source": {
              "id": "sdinsw",
              "url": "https://sdi.nsw.gov.au/csw?service=CSW&version=2.0.2&request=GetRecordById&elementsetname=full&outputschema=http%3A%2F%2Fwww.isotc211.org%2F2005%2Fgmd&typeNames=gmd%3AMD_Metadata&id=%7BB67E5266-0AAA-456E-A094-DA0DFB67DD32%7D",
              "name": "NSW Land and Property",
              "type": "csw-distribution"
            },
            "csw-distribution": {},
            "dcat-distribution-strings": {
              "issued": "2010-09-20",
              "rights": "Please refer to the file readme.txt contained in the data package for use restrictions and Licensing. For further detail or inquiries contact data.broker@environment.nsw.gov.au.",
              "accessURL": "http://mapdata.environment.nsw.gov.au/geonetwork/srv/en/metadata.show?uuid={B67E5266-0AAA-456E-A094-DA0DFB67DD32}"
            }
          }
        }
      ]
    },
    "dataset-quality-rating": {
      "dataset-linked-data-rating": {
        "score": 0,
        "weighting": 1
      }
    }
  },
  "sourceTag": "cc932ccd-79f7-4151-b6d7-94a029505005"
}

I think the most likely cause is the wacky spatial extent in the first one: "POLYGON((-180 20106.198, -3009.0384 20106.198, -3009.0384 90, -180 90, -180 20106.198))". Several of those coordinates are far outside any valid longitude/latitude range. I've got no idea how Elasticsearch didn't just reject it outright like it usually does.

In any case, if that is the culprit, we should be rejecting it ourselves before it gets indexed. We should also take a look at how it's coming about: is the source actually declaring it with those boundaries, or is it a bad parse from us?
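To illustrate the kind of up-front rejection I mean, here is a minimal sketch of a bounds check on the WKT string. It is not Magda's indexer code: `out_of_range_points` and `is_plausible_extent` are hypothetical helpers, and the check assumes coordinates are written as "longitude latitude" pairs in degrees, which is what the second dataset's extent looks like.

```python
import re

# Hypothetical helpers, not part of Magda. Assumes "lon lat" pairs in degrees.
_NUMBER = r"[-+]?\d+(?:\.\d+)?"
_PAIR = re.compile(rf"({_NUMBER})\s+({_NUMBER})")


def out_of_range_points(wkt):
    """Return the (lon, lat) pairs in a WKT string that fall outside
    longitude [-180, 180] or latitude [-90, 90]."""
    bad = []
    for lon_text, lat_text in _PAIR.findall(wkt):
        lon, lat = float(lon_text), float(lat_text)
        if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
            bad.append((lon, lat))
    return bad


def is_plausible_extent(wkt):
    """True if the string is a POLYGON whose points are all within bounds."""
    return wkt.strip().upper().startswith("POLYGON") and not out_of_range_points(wkt)


# The two "spatial" values from the records above.
bad_extent = (
    "POLYGON((-180 20106.198, -3009.0384 20106.198, "
    "-3009.0384 90, -180 90, -180 20106.198))"
)
good_extent = (
    "POLYGON((147.001 -35.008, 148.519 -35.008, 148.519 -33.989, "
    "147.001 -33.989, 147.001 -35.008))"
)

if __name__ == "__main__":
    print(is_plausible_extent(bad_extent), out_of_range_points(bad_extent))
    print(is_plausible_extent(good_extent))
```

With the first dataset's extent, four of the five points fail the check (only "-180 90" is in range), so a guard like this would drop or flag the spatial field before the record reaches Elasticsearch. It only looks at coordinate ranges; it does not validate the geometry itself (ring closure, self-intersection and so on).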

t83714 commented 5 years ago

Closed via PR 2111