gleanerio / gleaner

Gleaner: JSON-LD and structured data on the web harvesting
https://gleaner.io
Apache License 2.0

proc.NormalizedTriples Catch Empty and Duplicate SHA's #131

Open valentinedwv opened 1 year ago

valentinedwv commented 1 year ago

common/jsoldProc needs a sanity-check wrapper function for JLDProc.Normalize().

Often, if there is a context issue, it generates an empty string or the same set of triples for many files. The empty string is caught in the calSha function, so unique identifiers are generated using the whole file. But if the same set of triples is generated for several files, the ID will be the same for all of them.
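The two failure modes described above (empty output and identical output for different files) could be caught with a small tracker like the sketch below. It is not Gleaner's code: `DupTracker`, `shaOf`, the choice of SHA-256, and the policy of falling back to the raw file on a collision are all assumptions made for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"strings"
)

// shaOf returns the hex-encoded SHA-256 digest of s.
// (Assumption: the real calSha function may use a different hash.)
func shaOf(s string) string {
	sum := sha256.Sum256([]byte(s))
	return hex.EncodeToString(sum[:])
}

// DupTracker (hypothetical) remembers which source first produced each triple hash.
type DupTracker struct {
	seen map[string]string // triple hash -> name of the first source that produced it
}

// NewDupTracker returns an empty tracker.
func NewDupTracker() *DupTracker {
	return &DupTracker{seen: make(map[string]string)}
}

// ID returns an identifier for one document. It hashes the normalized triples,
// unless they are empty or a different source already produced the same triples;
// in both of those cases it hashes the raw file instead and sets fellBack.
func (d *DupTracker) ID(source, triples, raw string) (id string, fellBack bool) {
	if strings.TrimSpace(triples) == "" {
		return shaOf(raw), true
	}
	h := shaOf(triples)
	if first, ok := d.seen[h]; ok && first != source {
		return shaOf(raw), true
	}
	d.seen[h] = source
	return h, false
}
```

A caller would create one tracker per harvest run and log every document for which `fellBack` is true, since a run with many fallbacks probably points at a context problem rather than at genuinely identical datasets.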

Can we check whether the number of triples is significantly less than the information in the JSON-LD file? Thoughts on this?
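One way such a check might look, as a rough sketch only: count the scalar values in the JSON-LD source (ignoring `@context`) and compare that with the number of lines in the normalized output. The names `CheckNormalized`, `countLeaves`, `countTriples`, and the `minRatio` threshold are hypothetical, and the sketch assumes the normalized form is N-Quads text with one statement per line.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// countTriples counts non-empty lines in N-Quads text (one statement per line).
func countTriples(nquads string) int {
	n := 0
	for _, line := range strings.Split(nquads, "\n") {
		if strings.TrimSpace(line) != "" {
			n++
		}
	}
	return n
}

// countLeaves counts scalar values in decoded JSON, skipping any "@context" key.
// It is only a rough proxy for how many statements the document should yield.
func countLeaves(v interface{}) int {
	switch t := v.(type) {
	case map[string]interface{}:
		n := 0
		for k, child := range t {
			if k == "@context" {
				continue
			}
			n += countLeaves(child)
		}
		return n
	case []interface{}:
		n := 0
		for _, child := range t {
			n += countLeaves(child)
		}
		return n
	case nil:
		return 0
	default:
		return 1 // string, number, or bool
	}
}

// CheckNormalized returns an error when the triple count is zero or falls below
// minRatio times the number of scalar values in the source JSON-LD.
func CheckNormalized(jsonld []byte, nquads string, minRatio float64) error {
	var doc interface{}
	if err := json.Unmarshal(jsonld, &doc); err != nil {
		return fmt.Errorf("parse JSON-LD: %w", err)
	}
	leaves, triples := countLeaves(doc), countTriples(nquads)
	if triples == 0 {
		return fmt.Errorf("no triples produced from %d source values", leaves)
	}
	if float64(triples) < minRatio*float64(leaves) {
		return fmt.Errorf("only %d triples from %d source values", triples, leaves)
	}
	return nil
}
```

The ratio is a heuristic: `@id` values and blank nodes mean the two counts never match exactly, so a loose threshold such as 0.5 is probably more useful than a strict one. A document with well over a hundred values that yields a single triple would fail any reasonable setting.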


The document below normalizes to a single triple: _:c14n0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <Dataset> .

{
  "@context": [
    "https://schema.org/",
    {
      "gsqtime": "https://vocabs.gsq.digital/object?uri=http://linked.data.gov.au/def/trs",
      "time": "http://www.w3.org/2006/time#",
      "xsd": "https://www.w3.org/TR/2004/REC-xmlschema-2-20041028/datatypes.html"
    }
  ],
  "@type": "Dataset",
  "identifier": "https://dx.doi.org/10.7288/V4/MAGIC/11858",
  "sameAs": [
    "https://earthref.org/MagIC/doi/10.1139/E79-201"
  ],
  "isAccessibleForFree": true,
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "provider": {
    "@id": "https://earthref.org/MagIC",
    "@type": "Organization",
    "identifier": "https://www.re3data.org/repository/r3d100011910",
    "legalName": "Magnetics Information Consortium (MagIC) Data Repository",
    "name": "MagIC",
    "url": "https://earthref.org/MagIC"
  },
  "publisher": {
    "@id": "https//earthref.org/MagIC"
  },
  "sdPublisher": "EarthRef.org",
  "sdLicense": "https://creativecommons.org/licenses/by/4.0/",
  "sdDatePublished": "2022-12-23T03:57:01.989Z",
  "distribution": {
    "@type": "DataDownload",
    "contentUrl": "https://earthref.org/MagIC/download/11858/magic_contribution_11858.txt",
    "encodingFormat": [
      "text/plain; application=earthref-tsv",
      "EarthRef-tsv-Multipart"
    ]
  },
  "version": 2,
  "contributor": "Luke Fairchild",
  "dateModified": "2017-01-29T19:25:52.348Z",
  "citation": "<b>Lauri J. Pesonen, Henry C. Halls (1979).</b> The paleomagnetism of Keweenawan dikes from Baraga and Marquette Counties, northern Michigan. <i>Canadian Journal of Earth Sciences 16 (11):2136-2149. doi:<a href='//dx.doi.org/10.1139/e79-201'>10.1139/e79-201</a>.</i>",
  "name": "<b>Lauri J. Pesonen, Henry C. Halls (1979).</b> The paleomagnetism of Keweenawan dikes from Baraga and Marquette Counties, northern Michigan. <i>Canadian Journal of Earth Sciences 16 (11):2136-2149. doi:<a href='//dx.doi.org/10.1139/e79-201'>10.1139/e79-201</a>.</i> (Dataset)",
  "description": "Paleomagnetic, rock magnetic, or geomagnetic data found in the MagIC data repository from a paper titled: <b>Lauri J. Pesonen, Henry C. Halls (1979).</b> The paleomagnetism of Keweenawan dikes from Baraga and Marquette Counties, northern Michigan. <i>Canadian Journal of Earth Sciences 16 (11):2136-2149. doi:<a href='//dx.doi.org/10.1139/e79-201'>10.1139/e79-201</a>.</i>",
  "keywords": [
    "General Earth and Planetary Sciences"
  ],
  "datePublished": "2017-01-29T19:25:52.348Z",
  "spatialCoverage": {
    "@type": "Place",
    "geo": [
      {
        "@type": "GeoCoordinates",
        "latitude": 46.55403282,
        "longitude": -87.37260384
      },
      {
        "@type": "GeoCoordinates",
        "latitude": 46.56024914,
        "longitude": -87.37415792
      },
      {
        "@type": "GeoCoordinates",
        "latitude": 46.56439335,
        "longitude": -87.37415792
      },
      {
        "@type": "GeoCoordinates",
        "latitude": 46.59858308,
        "longitude": -87.41352792
      },
      {
        "@type": "GeoCoordinates",
        "latitude": 46.50888215,
        "longitude": -87.42019832
      },
      {
        "@type": "GeoCoordinates",
        "latitude": 46.60169124,
        "longitude": -87.42751462
      },
      {
        "@type": "GeoCoordinates",
        "latitude": 46.6265565,
        "longitude": -87.45600607
      },
      {
        "@type": "GeoCoordinates",
        "latitude": 46.636399,
        "longitude": -87.4881237
      },
      {
        "@type": "GeoCoordinates",
        "latitude": 46.5638849,
        "longitude": -87.50093336
      },
      {
        "@type": "GeoCoordinates",
        "latitude": 46.63121874,
        "longitude": -87.51661514
      },
      {
        "@type": "GeoCoordinates",
        "latitude": 46.6068715,
        "longitude": -87.55598514
      },
      {
        "@type": "GeoCoordinates",
        "latitude": 46.60443615,
        "longitude": -87.66409178
      },
      {
        "@type": "GeoCoordinates",
        "latitude": 46.44823841,
        "longitude": -87.955328
      },
      {
        "@type": "GeoCoordinates",
        "latitude": 46.67591355,
        "longitude": -88.44231047
      },
      {
        "@type": "GeoCoordinates",
        "latitude": 46.67379063,
        "longitude": -88.4635396
      },
      {
        "@type": "GeoCoordinates",
        "latitude": 46.63802132,
        "longitude": -88.48360911
      },
      {
        "@type": "GeoCoordinates",
        "latitude": 46.56977917,
        "longitude": -88.52449947
      }
    ]
  },
  "variableMeasured": [
    {
      "@type": "PropertyValue",
      "name": "Direction N Samples",
      "description": "Number of samples included in directional calculations.",
      "minValue": 1,
      "maxValue": 8
    },
    {
      "@type": "PropertyValue",
      "name": "Pole Latitude",
      "description": "Pole latitude, average of site VGP latitudes, north pole",
      "minValue": 48.4,
      "maxValue": 48.9,
      "unitText": "Degrees"
    },
    {
      "@type": "PropertyValue",
      "name": "Pole N Sites",
      "description": "Number of sites included in pole calculations",
      "minValue": 4,
      "maxValue": 14
    },
    {
      "@type": "PropertyValue",
      "name": "Longitude",
      "description": "Sample geographic location, Longitude",
      "minValue": -88.52449947,
      "maxValue": -87.37260384,
      "unitText": "Degrees"
    },
    {
      "@type": "PropertyValue",
      "name": "Direction K",
      "description": "Specimen direction in coordinates specified by tilt correction, Fisher's dispersion parameter Kappa",
      "minValue": 11,
      "maxValue": 822,
      "unitText": "Dimensionless"
    },
    {
      "@type": "PropertyValue",
      "name": "Pole Longitude",
      "description": "Pole longitude, average of site VGP longitudes, north pole",
      "minValue": 213.5,
      "maxValue": 238.2,
      "unitText": "Degrees"
    },
    {
      "@type": "PropertyValue",
      "name": "Latitude",
      "description": "Sample geographic location, Latitude",
      "minValue": 46.44823841,
      "maxValue": 46.67591355,
      "unitText": "Degrees"
    },
    {
      "@type": "PropertyValue",
      "name": "Declination",
      "description": "Directions in specimen coordinates, Declination",
      "minValue": 64.5,
      "maxValue": 179.4,
      "unitText": "Degrees"
    },
    {
      "@type": "PropertyValue",
      "name": "Inclination",
      "description": "Directions in specimen coordinates, Inclination",
      "minValue": -84.1,
      "maxValue": -53.1,
      "unitText": "Degrees"
    },
    {
      "@type": "PropertyValue",
      "name": "Direction Alpha 95%",
      "description": "Specimen direction in coordinates specified by tilt correction, Fisher circle",
      "minValue": 2.8,
      "maxValue": 40.1,
      "unitText": "Degrees"
    },
    {
      "@type": "PropertyValue",
      "name": "Pole DM",
      "description": "Pole meridian uncertainty",
      "minValue": 6.2,
      "maxValue": 10.2,
      "unitText": "Degrees"
    },
    {
      "@type": "PropertyValue",
      "name": "Pole DP",
      "description": "Pole parallel latitude uncertainty",
      "minValue": 5.2,
      "maxValue": 9.6,
      "unitText": "Degrees"
    }
  ]
}

valentinedwv commented 1 year ago

If you hack the context in the debugger, you can get triples. Is there some sanity check we can make along the way to say: hey, the source is this long, the triples are this long, it seems like something did not convert?

{ "@vocab": "https://schema.org/", "gsqtime": "https://vocabs.gsq.digital/object?uri=http://linked.data.gov.au/def/trs", "time": "http://www.w3.org/2006/time#", "xsd": "https://www.w3.org/TR/2004/REC-xmlschema-2-20041028/datatypes.html" }
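The debugger hack could in principle be applied in code before normalization. The sketch below swaps a bare schema.org context URL for an inline `@vocab` entry, which is the same substitution as in the context shown above. `InlineSchemaOrg` and the list of URL variants are assumptions for illustration, not part of Gleaner. Note that `@vocab` alone is only an approximation: the published schema.org context also declares type coercions for some properties, so the resulting triples may differ from those a correctly loaded context would produce.

```go
package main

import "encoding/json"

// schemaOrgContexts lists the context strings to replace (an assumed set of variants).
var schemaOrgContexts = map[string]bool{
	"https://schema.org/": true,
	"https://schema.org":  true,
	"http://schema.org/":  true,
	"http://schema.org":   true,
}

// InlineSchemaOrg rewrites the top-level "@context" of a JSON-LD object so that a
// bare schema.org URL becomes {"@vocab": "https://schema.org/"}. Other context
// entries are left untouched. It returns an error if the input is not a JSON object.
func InlineSchemaOrg(jsonld []byte) ([]byte, error) {
	var doc map[string]interface{}
	if err := json.Unmarshal(jsonld, &doc); err != nil {
		return nil, err
	}
	vocab := map[string]interface{}{"@vocab": "https://schema.org/"}
	switch ctx := doc["@context"].(type) {
	case string:
		if schemaOrgContexts[ctx] {
			doc["@context"] = vocab
		}
	case []interface{}:
		// The slice shares storage with doc["@context"], so in-place edits stick.
		for i, entry := range ctx {
			if s, ok := entry.(string); ok && schemaOrgContexts[s] {
				ctx[i] = vocab
			}
		}
	}
	return json.Marshal(doc)
}
```

Because this changes what gets hashed, it is probably better used as a diagnostic (normalize both ways and compare triple counts) than as a silent rewrite of harvested documents.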

valentinedwv commented 1 year ago

Thoughts on this.