gleanerio / gleaner

Gleaner: harvesting JSON-LD and structured data on the web
https://gleaner.io
Apache License 2.0

getNormSHA produces a blank string for certain documents #33

Closed. nein09 closed this issue 1 year ago.

nein09 commented 2 years ago

This one is puzzling me. In my logs for crawling http://nsidc.org/, I have a bunch of non-identical JSON-LD objects that are all getting the same hash. I poked around and figured out that this is because proc.Normalize (line 38 in calcShaNorm.go) is generating an empty string, and when you calculate the SHA of a bunch of identical empty strings, the result is always the same.

logger: acquire.go:206: #4 Uploading Bucket:gleaner File:summoned/nsidc/da39a3ee5e6b4b0d3255bfef95601890afd80709.jsonld Size 2553 for http://nsidc.org/data/NSIDC-0051/versions/1
logger: acquire.go:219: #4 thread for http://nsidc.org/data/NSIDC-0051/versions/1 
logger: acquire.go:206: #14 Uploading Bucket:gleaner File:summoned/nsidc/da39a3ee5e6b4b0d3255bfef95601890afd80709.jsonld Size 2495 for http://nsidc.org/data/NSIDC-0076/versions/1
logger: acquire.go:219: #14 thread for http://nsidc.org/data/NSIDC-0076/versions/1 
logger: acquire.go:206: #31 Uploading Bucket:gleaner File:summoned/nsidc/da39a3ee5e6b4b0d3255bfef95601890afd80709.jsonld Size 3046 for http://nsidc.org/data/NSIDC-0037/versions/1
logger: acquire.go:219: #31 thread for http://nsidc.org/data/NSIDC-0037/versions/1 
logger: acquire.go:206: #15 Uploading Bucket:gleaner File:summoned/nsidc/da39a3ee5e6b4b0d3255bfef95601890afd80709.jsonld Size 3667 for http://nsidc.org/data/NSIDC-0042/versions/1
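
As a sanity check (a minimal sketch, not Gleaner code): the SHA-1 of the empty string is exactly the hash that shows up as the file name in every log line above, which confirms Normalize is returning "" for all of these documents.

    package main

    import (
        "crypto/sha1"
        "fmt"
    )

    func main() {
        // SHA-1 of the empty string; prints
        // da39a3ee5e6b4b0d3255bfef95601890afd80709, the file name in every
        // log line above.
        fmt.Printf("%x\n", sha1.Sum([]byte("")))
    }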

Here's the config to crawl that site:

- name: nsidc
  url: https://nsidc.org/sitemap.xml
  headless: false
  properName: National Snow and Ice Data Center
  domain: https://nsidc.org

Also, they specify their context with no trailing slash, and over http rather than https, so you need to add this to contextmaps:

- prefix: "http://schema.org"
  file: "./schemaorg-current-https.jsonld"

Is that a clue, there? Is json-gold not able to normalize a json-ld object that is set up this way?

I'm also finding that once I am able to get unique JSON-LD objects for each of the AADC sites in their sitemap*, it only generates 3 different SHAs for the whole set of them. I haven't looked into that much further.

fils commented 2 years ago

@nein09

so the

prefix: "http://schema.org"

is just a flat-out error. You need to inform the providers that they should update their records. Sadly (IMHO), Google will fix this error for people. I used to try to fix errors, but there can be so many combinations that at some point I felt it was a losing game. Better, again in my opinion, to get the providers to fix their implementation at the source.

It would also be better if they moved to https rather than http, but both are technically correct from many people's point of view. Though not mine.

That said, I cannot say for certain if that is the issue with the SHA, but it will put a spanner in the works for sure.

The multiple JSON-LD issue is totally valid and likely a failing of mine. I knew this day would come. :)

I'll look at the code and see, but I suspect we need to pull and store both, since trying to filter based on the JSON-LD during harvest is likely hard. Once they are in the data warehouse / object store, a person could filter there however they want, or decide whether to load them into the triplestore.

Hopefully there is an approach that fits with the existing pipeline well.

fils commented 2 years ago

@nein09

Add the "bad" prefix to the context mapping..

contextmaps:
- prefix: "https://schema.org/"
  file: "./jsonldcontext.json"  # wget http://schema.org/docs/jsonldcontext.jsonld
- prefix: "http://schema.org/"
  file: "./jsonldcontext.json"  # wget http://schema.org/docs/jsonldcontext.jsonld
- prefix: "http://schema.org"
  file: "./jsonldcontext.json"  # wget http://schema.org/docs/jsonldcontext.jsonld
summoner:

it will "sorta" work then.... though I need to explore this more

nein09 commented 2 years ago

Ah, thanks for this. I did work around this somewhat by adding the bad context in the way you describe - but I'm still getting the blank normalizations.

I have a branch pushed up for the multiple JSONs thing - but it doesn't store both of them, it just picks one and grabs it. It could store both if we want to, though - that'd be a smaller change, actually.

nein09 commented 2 years ago

Looking into this some more, I tried to crawl the AADC, which has well-formed JSON-LD. I dumped the normalized graph for

{"@context":"https://schema.org/","@type":"Dataset","creator":[{"@type":"Person","name":"REEVE, JONO"}],"sourceOrganization":{"@type":"Organization","name":"Australian Antarctic Data Centre"},"keywords":"EARTH SCIENCE,ATMOSPHERE,ATMOSPHERIC PRESSURE,SEA LEVEL PRESSURE,EARTH SCIENCE,ATMOSPHERE,ATMOSPHERIC TEMPERATURE,SURFACE TEMPERATURE,AIR TEMPERATURE,EARTH SCIENCE,ATMOSPHERE,ATMOSPHERIC WATER VAPOR,WATER VAPOR INDICATORS,HUMIDITY,EARTH SCIENCE,ATMOSPHERE,ATMOSPHERIC WINDS,WIND DYNAMICS,CONVECTION,EARTH SCIENCE,ATMOSPHERE,ATMOSPHERIC RADIATION,SOLAR RADIATION,OCEAN > SOUTHERN OCEAN,CONTINENT > ANTARCTICA,GEOGRAPHIC REGION > POLAR,ECHO SOUNDERS,[object Object],MARINE,METEOROLOGY","publisher":{"@type":"Organization","name":"Australian Antarctic Data Centre"},"inLanguage":"en","name":"MS Nella Dan Voyage V5 1980/81 (FIBEX) Track and Underway Data","description":"This dataset contains the underway data collected during the MS Nella Dan Voyage V5 1980/81 (FIBEX).\n\nVoyage name : First International BIOMASS Experiment \nVoyage leader: Knowles Ronald Kerry \n\nUnderway (meteorological) data are available online via the Australian Antarctic Division Data Centre web page (or via the Related URL section).","license":"http://creativecommons.org/licenses/by/4.0/","identifier":[{"@type":"PropertyValue","propertyID":"local","value":"198081050"},{"@type":"PropertyValue","propertyID":"URL","value":"https://data.aad.gov.au/metadata/records/198081050"},{"@type":"PropertyValue","propertyID":"URL","value":"https://data.aad.gov.au/metadata/records/198081050"},{"@type":"PropertyValue","propertyID":"global","value":"e23ff7d5-5002-4c09-9156-69012b72db01"}],"datePublished":"2010-02-15","spatialCoverage":[{"@type":"Place","geo":{"@type":"GeoShape","box":"-69.0 61.9 -43.2 147.3"}},{"@type":"Place","description":"text northlimit=-43.2; southlimit=-69.0; westlimit=61.9; eastLimit= 147.3; projection=WGS84"}],"temporalCoverage":"1981-01-19/1981-03-25"}

and it's just <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <Dataset> . Since that's all the normalization produces for each of the documents in this data repository, of course they all get the same SHA!

When I normalize that same document at https://json-ld.org/playground/, I get something more sensible:

_:c14n0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/GeoShape> .
_:c14n1 <http://schema.org/name> "Australian Antarctic Data Centre" .
_:c14n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Organization> .
_:c14n10 <http://schema.org/propertyID> "URL" .
_:c14n10 <http://schema.org/value> "https://data.aad.gov.au/metadata/records/199495020" .
_:c14n10 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/PropertyValue> .
_:c14n2 <http://schema.org/name> "Australian Antarctic Data Centre" .
_:c14n2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Organization> .
_:c14n3 <http://schema.org/propertyID> "global" .
_:c14n3 <http://schema.org/value> "1853a32e-b951-40e1-befb-547cd6cbebb0" .
_:c14n3 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/PropertyValue> .
_:c14n4 <http://schema.org/propertyID> "local" .
_:c14n4 <http://schema.org/value> "199495020" .
_:c14n4 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/PropertyValue> .
_:c14n5 <http://schema.org/geo> _:c14n0 .
_:c14n5 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Place> .
_:c14n6 <http://schema.org/creator> _:c14n7 .
_:c14n6 <http://schema.org/datePublished> "1999-10-07"^^<http://schema.org/Date> .
_:c14n6 <http://schema.org/description> "This dataset contains the underway data from Voyage 2 1994-95 of the Aurora Australis. This was an resupply cruise, but NoQalms data types were logged at 20-second intervals. The observations were taken between October and December 1994 en route from Hobart to Casey to Davis and back to Hobart. See the Marine Science Support Data Quality Report via the Related URL section." .
_:c14n6 <http://schema.org/identifier> _:c14n10 .
_:c14n6 <http://schema.org/identifier> _:c14n3 .
_:c14n6 <http://schema.org/identifier> _:c14n4 .
_:c14n6 <http://schema.org/identifier> _:c14n9 .
_:c14n6 <http://schema.org/inLanguage> "en" .
_:c14n6 <http://schema.org/keywords> "EARTH SCIENCE,ATMOSPHERE,ATMOSPHERIC TEMPERATURE,SURFACE TEMPERATURE,AIR TEMPERATURE,EARTH SCIENCE,ATMOSPHERE,ATMOSPHERIC WATER VAPOR,WATER VAPOR INDICATORS,HUMIDITY,EARTH SCIENCE,ATMOSPHERE,ATMOSPHERIC WINDS,SURFACE WINDS,EARTH SCIENCE,ATMOSPHERE,ATMOSPHERIC RADIATION,SOLAR RADIATION,EARTH SCIENCE,OCEANS,BATHYMETRY/SEAFLOOR TOPOGRAPHY,SEAFLOOR TOPOGRAPHY,EARTH SCIENCE,ATMOSPHERE,ATMOSPHERIC PRESSURE,OCEAN > SOUTHERN OCEAN,GEOGRAPHIC REGION > POLAR,ECHO SOUNDERS,SHIPS,R/V AA,R/V Aurora Australis,BATHYMETRY,MARINE,OCEANOGRAPHY" .
_:c14n6 <http://schema.org/license> <http://creativecommons.org/licenses/by/4.0/> .
_:c14n6 <http://schema.org/name> "Aurora Australis Voyage 2 1994-95 Underway Data" .
_:c14n6 <http://schema.org/publisher> _:c14n1 .
_:c14n6 <http://schema.org/sourceOrganization> _:c14n2 .
_:c14n6 <http://schema.org/spatialCoverage> _:c14n5 .
_:c14n6 <http://schema.org/spatialCoverage> _:c14n8 .
_:c14n6 <http://schema.org/temporalCoverage> "1994-10-22/1994-12-01" .
_:c14n6 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Dataset> .
_:c14n7 <http://schema.org/name> "REEVE, JONO" .
_:c14n7 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
_:c14n8 <http://schema.org/description> "text northlimit=-44.0; southlimit=-69.0; westlimit=79.0; eastLimit= 148.0; projection=WGS84" .
_:c14n8 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Place> .
_:c14n9 <http://schema.org/propertyID> "URL" .
_:c14n9 <http://schema.org/value> "https://data.aad.gov.au/metadata/records/199495020" .
_:c14n9 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/PropertyValue> .

@fils any insight you have would be helpful!

nein09 commented 2 years ago

For my own reference: we are normalizing this JSON-LD using https://github.com/piprate/json-gold/
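
A minimal, self-contained reproduction of that call path (the same json-gold entry points, though not Gleaner's exact code):

    package main

    import (
        "crypto/sha1"
        "encoding/json"
        "fmt"
        "log"

        "github.com/piprate/json-gold/ld"
    )

    func main() {
        // inline @vocab context so the example runs without a network fetch
        raw := `{"@context": {"@vocab": "https://schema.org/"}, "@type": "Dataset", "name": "example"}`

        var doc map[string]interface{}
        if err := json.Unmarshal([]byte(raw), &doc); err != nil {
            log.Fatal(err)
        }

        proc := ld.NewJsonLdProcessor()
        options := ld.NewJsonLdOptions("")
        options.Format = "application/n-quads" // emit canonical N-Quads text
        options.Algorithm = "URDNA2015"        // RDF dataset canonicalization

        normalized, err := proc.Normalize(doc, options)
        if err != nil {
            log.Fatal(err)
        }

        nquads := normalized.(string)
        fmt.Println(nquads)
        fmt.Printf("%x\n", sha1.Sum([]byte(nquads))) // the SHA in question
    }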

nein09 commented 2 years ago

Hmm: https://json-ld.org/playground/1.0/ says it's invalid, but json-gold specifically targets JSON-LD 1.1.

nein09 commented 2 years ago

Progress: @fils just pointed out that my example document had "@context": "https://schema.org/", which is valid for JSON-LD 1.0 but not for 1.1, which is what json-gold uses.

If I add the following before normalizing the document, I get the expected output:

    switch myInterface["@context"].(type) {
    case string:
        // wrap the bare string context in an @vocab map so terms still
        // expand against the schema.org vocabulary
        myInterface["@context"] = map[string]interface{}{"@vocab": myInterface["@context"]}
    }

We're looking into whether json-gold has a 1.0 switch or something similar.

nein09 commented 2 years ago

Or we could somehow use an older version of the library, from when it supported 1.0. The problem then becomes one of heuristics - how do we decide which to use? Maybe it could be a setting in the config YAML.
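
Something like this, say (hypothetical; jsonldVersion is not an existing Gleaner option):

- name: nsidc
  url: https://nsidc.org/sitemap.xml
  headless: false
  properName: National Snow and Ice Data Center
  domain: https://nsidc.org
  jsonldVersion: "1.0"  # hypothetical per-source JSON-LD processing mode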

nein09 commented 2 years ago

At any rate, I just ran a crawl of the AADC and NSIDC repositories, and this fixed the issue that I was seeing. So the next question is: what's the best way for us to support JSON-LD 1.0?

nein09 commented 2 years ago

A good clue: https://github.com/piprate/json-gold/blob/33b90c4ca86c41ea5c28cfd1ea0a5fd314c91848/ld/processor.go#L385

nein09 commented 2 years ago

Unfortunately, this gives me the same badly normalized graphs as processing with JSON-LD 1.1.


    // Sniff for JSON-LD 1.0; the default is 1.1
    switch myInterface["@context"].(type) {
    case string:
        fmt.Println("JSON-LD 1.0 detected; processing with that mode.")
        options.ProcessingMode = "json-ld-1.0"
    }
    normalizedTriples, err := proc.Normalize(myInterface, options)

nein09 commented 2 years ago

A hybrid approach seems to work for NSIDC, but generates duplicate documents for the AADC:

    switch myInterface["@context"].(type) {
    case string:
        options.ProcessingMode = "json-ld-1.0"
        myInterface["@context"] = map[string]interface{}{"@vocab": myInterface["@context"]}
    }

nein09 commented 2 years ago

I have https://github.com/gleanerio/gleaner/compare/dev...json-1.0?expand=1 going, but it seems to me that JLDProc will need to know which domain we're working in, somehow, in order to use the right JSON-LD processing option. So some work will need to be done to plumb that through.
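
For example, one shape this plumbing could take (purely hypothetical names, not the branch's actual code) is a per-source option passed down to the normalize step:

    package summoner // hypothetical package name

    import (
        "github.com/piprate/json-gold/ld"
    )

    // sourceOptions is a hypothetical per-source setting, plumbed in from
    // the config YAML entry for each source.
    type sourceOptions struct {
        Name          string
        JSONLDVersion string // "1.0" or "1.1"
    }

    // normalizeFor picks the json-gold processing mode per source.
    func normalizeFor(doc map[string]interface{}, src sourceOptions) (string, error) {
        proc := ld.NewJsonLdProcessor()
        options := ld.NewJsonLdOptions("")
        options.Format = "application/n-quads"
        if src.JSONLDVersion == "1.0" {
            options.ProcessingMode = "json-ld-1.0"
        }
        normalized, err := proc.Normalize(doc, options)
        if err != nil {
            return "", err
        }
        return normalized.(string), nil
    }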

valentinedwv commented 1 year ago

Context issues: #129 and #130 should help resolve this for the empty normalized triples. The new identifier approach will also help a bit with this. It will find an ID... if NormalizeTriples comes back empty, then it will SHA the whole JSON.
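
A minimal sketch of that fallback logic (assumed names, not the code as merged):

    package identifier // hypothetical package name

    import (
        "crypto/sha1"
        "fmt"
        "strings"
    )

    // identifierSHA hashes the normalized triples when present; otherwise it
    // falls back to hashing the raw JSON bytes, so distinct documents get
    // distinct SHAs even when normalization comes back empty.
    func identifierSHA(normalizedTriples string, rawJSON []byte) string {
        if strings.TrimSpace(normalizedTriples) != "" {
            return fmt.Sprintf("%x", sha1.Sum([]byte(normalizedTriples)))
        }
        return fmt.Sprintf("%x", sha1.Sum(rawJSON))
    }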

In #131, adding some checks to normalize triples will be needed. Note that if NormalizeTriples is not empty and the same set of triples is produced, then the same SHA will still be generated.