Closed nein09 closed 1 year ago
@nein09
So the
prefix: "http://schema.org"
is just a flat-out error. You need to inform the providers that they should update their records. Sadly (IMHO), Google will fix this error for people. I used to try to fix errors, but there can be so many combinations that at some point I felt it was a losing game. Better, again in my opinion, to try to get the providers to fix their implementation at the source.
It would also be better if they moved to https instead of http, but both are technically correct in many people's view. Though not mine.
That said, I cannot say for certain whether that is the issue with the SHA, but it will put a spanner in the works for sure.
The multiple JSON-LD issue is totally valid and likely a failing of mine. I knew this day would come. :)
I'll look at the code and see, but I suspect we need to pull and store both, since trying to filter based on the JSON-LD during harvest is likely hard. Once they are in the data warehouse / object store, a person could filter there however they want, or decide whether to load them into the triplestore.
Hopefully there is an approach that fits with the existing pipeline well.
@nein09
Add the "bad" prefix to the context mapping:
contextmaps:
  - prefix: "https://schema.org/"
    file: "./jsonldcontext.json" # wget http://schema.org/docs/jsonldcontext.jsonld
  - prefix: "http://schema.org/"
    file: "./jsonldcontext.json" # wget http://schema.org/docs/jsonldcontext.jsonld
  - prefix: "http://schema.org"
    file: "./jsonldcontext.json" # wget http://schema.org/docs/jsonldcontext.jsonld
summoner:
It will "sorta" work then... though I need to explore this more.
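A minimal sketch of why all three spellings are needed (the types and matching logic here are illustrative assumptions, not Gleaner's actual implementation): if a mapping is a literal prefix match against the document's `@context` string, the slash-less variant only resolves when it has its own entry.

```go
package main

import (
	"fmt"
	"strings"
)

// ContextMap mirrors one entry of the contextmaps config above
// (names are illustrative, not Gleaner's actual types).
type ContextMap struct {
	Prefix string
	File   string
}

// resolveContext returns the local file to substitute for a remote
// @context URL, if some mapping's prefix literally matches it.
func resolveContext(ctx string, maps []ContextMap) (string, bool) {
	for _, m := range maps {
		if strings.HasPrefix(ctx, m.Prefix) {
			return m.File, true
		}
	}
	return "", false
}

func main() {
	maps := []ContextMap{
		{Prefix: "https://schema.org/", File: "./jsonldcontext.json"},
		{Prefix: "http://schema.org/", File: "./jsonldcontext.json"},
		{Prefix: "http://schema.org", File: "./jsonldcontext.json"},
	}
	// "http://schema.org" is shorter than "http://schema.org/", so the
	// first two prefixes cannot match it; only the third entry does.
	file, ok := resolveContext("http://schema.org", maps)
	fmt.Println(file, ok) // ./jsonldcontext.json true
}
```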
Ah, thanks for this. I did work around this somewhat by adding the bad context in the way you describe - but I'm still getting the blank normalizations.
I have a branch pushed up for the multiple-JSON-LD issue - but it doesn't store both of them; it just picks one and grabs it. It could store both if we want to, though - that'd be a small change, actually.
Looking into this some more, I tried to crawl the AADC, which has well-formed json-ld. I dumped the normalized graph for
{"@context":"https://schema.org/","@type":"Dataset","creator":[{"@type":"Person","name":"REEVE, JONO"}],"sourceOrganization":{"@type":"Organization","name":"Australian Antarctic Data Centre"},"keywords":"EARTH SCIENCE,ATMOSPHERE,ATMOSPHERIC PRESSURE,SEA LEVEL PRESSURE,EARTH SCIENCE,ATMOSPHERE,ATMOSPHERIC TEMPERATURE,SURFACE TEMPERATURE,AIR TEMPERATURE,EARTH SCIENCE,ATMOSPHERE,ATMOSPHERIC WATER VAPOR,WATER VAPOR INDICATORS,HUMIDITY,EARTH SCIENCE,ATMOSPHERE,ATMOSPHERIC WINDS,WIND DYNAMICS,CONVECTION,EARTH SCIENCE,ATMOSPHERE,ATMOSPHERIC RADIATION,SOLAR RADIATION,OCEAN > SOUTHERN OCEAN,CONTINENT > ANTARCTICA,GEOGRAPHIC REGION > POLAR,ECHO SOUNDERS,[object Object],MARINE,METEOROLOGY","publisher":{"@type":"Organization","name":"Australian Antarctic Data Centre"},"inLanguage":"en","name":"MS Nella Dan Voyage V5 1980/81 (FIBEX) Track and Underway Data","description":"This dataset contains the underway data collected during the MS Nella Dan Voyage V5 1980/81 (FIBEX).\n\nVoyage name : First International BIOMASS Experiment \nVoyage leader: Knowles Ronald Kerry \n\nUnderway (meteorological) data are available online via the Australian Antarctic Division Data Centre web page (or via the Related URL section).","license":"http://creativecommons.org/licenses/by/4.0/","identifier":[{"@type":"PropertyValue","propertyID":"local","value":"198081050"},{"@type":"PropertyValue","propertyID":"URL","value":"https://data.aad.gov.au/metadata/records/198081050"},{"@type":"PropertyValue","propertyID":"URL","value":"https://data.aad.gov.au/metadata/records/198081050"},{"@type":"PropertyValue","propertyID":"global","value":"e23ff7d5-5002-4c09-9156-69012b72db01"}],"datePublished":"2010-02-15","spatialCoverage":[{"@type":"Place","geo":{"@type":"GeoShape","box":"-69.0 61.9 -43.2 147.3"}},{"@type":"Place","description":"text northlimit=-43.2; southlimit=-69.0; westlimit=61.9; eastLimit= 147.3; projection=WGS84"}],"temporalCoverage":"1981-01-19/1981-03-25"}
and it's just <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <Dataset> .
And because that's all the normalization produces for each of the documents in this data repository, of course they all have the same SHA!
When I normalize that same document at https://json-ld.org/playground/, I get something more sensible:
_:c14n0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/GeoShape> .
_:c14n1 <http://schema.org/name> "Australian Antarctic Data Centre" .
_:c14n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Organization> .
_:c14n10 <http://schema.org/propertyID> "URL" .
_:c14n10 <http://schema.org/value> "https://data.aad.gov.au/metadata/records/199495020" .
_:c14n10 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/PropertyValue> .
_:c14n2 <http://schema.org/name> "Australian Antarctic Data Centre" .
_:c14n2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Organization> .
_:c14n3 <http://schema.org/propertyID> "global" .
_:c14n3 <http://schema.org/value> "1853a32e-b951-40e1-befb-547cd6cbebb0" .
_:c14n3 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/PropertyValue> .
_:c14n4 <http://schema.org/propertyID> "local" .
_:c14n4 <http://schema.org/value> "199495020" .
_:c14n4 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/PropertyValue> .
_:c14n5 <http://schema.org/geo> _:c14n0 .
_:c14n5 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Place> .
_:c14n6 <http://schema.org/creator> _:c14n7 .
_:c14n6 <http://schema.org/datePublished> "1999-10-07"^^<http://schema.org/Date> .
_:c14n6 <http://schema.org/description> "This dataset contains the underway data from Voyage 2 1994-95 of the Aurora Australis. This was an resupply cruise, but NoQalms data types were logged at 20-second intervals. The observations were taken between October and December 1994 en route from Hobart to Casey to Davis and back to Hobart. See the Marine Science Support Data Quality Report via the Related URL section." .
_:c14n6 <http://schema.org/identifier> _:c14n10 .
_:c14n6 <http://schema.org/identifier> _:c14n3 .
_:c14n6 <http://schema.org/identifier> _:c14n4 .
_:c14n6 <http://schema.org/identifier> _:c14n9 .
_:c14n6 <http://schema.org/inLanguage> "en" .
_:c14n6 <http://schema.org/keywords> "EARTH SCIENCE,ATMOSPHERE,ATMOSPHERIC TEMPERATURE,SURFACE TEMPERATURE,AIR TEMPERATURE,EARTH SCIENCE,ATMOSPHERE,ATMOSPHERIC WATER VAPOR,WATER VAPOR INDICATORS,HUMIDITY,EARTH SCIENCE,ATMOSPHERE,ATMOSPHERIC WINDS,SURFACE WINDS,EARTH SCIENCE,ATMOSPHERE,ATMOSPHERIC RADIATION,SOLAR RADIATION,EARTH SCIENCE,OCEANS,BATHYMETRY/SEAFLOOR TOPOGRAPHY,SEAFLOOR TOPOGRAPHY,EARTH SCIENCE,ATMOSPHERE,ATMOSPHERIC PRESSURE,OCEAN > SOUTHERN OCEAN,GEOGRAPHIC REGION > POLAR,ECHO SOUNDERS,SHIPS,R/V AA,R/V Aurora Australis,BATHYMETRY,MARINE,OCEANOGRAPHY" .
_:c14n6 <http://schema.org/license> <http://creativecommons.org/licenses/by/4.0/> .
_:c14n6 <http://schema.org/name> "Aurora Australis Voyage 2 1994-95 Underway Data" .
_:c14n6 <http://schema.org/publisher> _:c14n1 .
_:c14n6 <http://schema.org/sourceOrganization> _:c14n2 .
_:c14n6 <http://schema.org/spatialCoverage> _:c14n5 .
_:c14n6 <http://schema.org/spatialCoverage> _:c14n8 .
_:c14n6 <http://schema.org/temporalCoverage> "1994-10-22/1994-12-01" .
_:c14n6 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Dataset> .
_:c14n7 <http://schema.org/name> "REEVE, JONO" .
_:c14n7 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
_:c14n8 <http://schema.org/description> "text northlimit=-44.0; southlimit=-69.0; westlimit=79.0; eastLimit= 148.0; projection=WGS84" .
_:c14n8 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Place> .
_:c14n9 <http://schema.org/propertyID> "URL" .
_:c14n9 <http://schema.org/value> "https://data.aad.gov.au/metadata/records/199495020" .
_:c14n9 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/PropertyValue> .
@fils any insight you have would be helpful!
For my own reference: we are normalizing this JSON-LD using https://github.com/piprate/json-gold/
Hmm, https://json-ld.org/playground/1.0/ says it's invalid - but json-gold specifically targets JSON-LD 1.1.
Progress: @fils just pointed out that my example document had "@context": "https://schema.org/", which is valid for JSON-LD 1.0 but not for 1.1, which is what json-gold uses.
If I add the following before normalizing the document, I get the expected output:
switch myInterface["@context"].(type) {
case string:
	myInterface["@context"] = map[string]interface{}{"@vocab": myInterface["@context"]}
}
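A self-contained, runnable version of that rewrite (the function name is mine, for illustration): a bare-string `@context` is wrapped in a `{"@vocab": ...}` map, which a 1.1 processor like json-gold can expand terms against.

```go
package main

import "fmt"

// wrapStringContext rewrites a bare-string @context (JSON-LD 1.0
// style) into a {"@vocab": ...} map. A non-string @context (already a
// map or an array) is left untouched by the type switch.
func wrapStringContext(doc map[string]interface{}) {
	switch doc["@context"].(type) {
	case string:
		doc["@context"] = map[string]interface{}{"@vocab": doc["@context"]}
	}
}

func main() {
	doc := map[string]interface{}{
		"@context": "https://schema.org/",
		"@type":    "Dataset",
	}
	wrapStringContext(doc)
	fmt.Println(doc["@context"]) // map[@vocab:https://schema.org/]
}
```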
We're looking into seeing whether there's a 1.0 switch for json-gold or something.
Or we could somehow use an older version of the library, from when it supported 1.0. The problem then becomes one of heuristics - how do we decide which to use? Maybe it could be a setting in the config YAML.
At any rate, I just ran a crawl of the AADC and NSIDC repositories, and this fixed the issue that I was seeing. So the next question is: what's the best way for us to support JSON-LD 1.0?
Unfortunately, the following gives me the same badly normalized graphs as processing with JSON-LD 1.1:
// Sniff for JSON-LD 1.0; the default is 1.1
switch myInterface["@context"].(type) {
case string:
	fmt.Println("JSON-LD 1.0 detected; processing with that mode.")
	options.ProcessingMode = "json-ld-1.0"
}
normalizedTriples, err := proc.Normalize(myInterface, options)
A hybrid approach seems to work for the nsidc, but generates duplicate documents for the AADC:
switch myInterface["@context"].(type) {
case string:
	options.ProcessingMode = "json-ld-1.0"
	myInterface["@context"] = map[string]interface{}{"@vocab": myInterface["@context"]}
}
I have https://github.com/gleanerio/gleaner/compare/dev...json-1.0?expand=1 going, but it seems to me that JLDProc will need to know somehow which domain we're working in, in order to use the right JSON-LD processing option. So some work will need to be done to plumb that through.
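One way that plumbing might look - a sketch only, with hypothetical names, not anything in the linked branch: a per-source override table (which could live in the config YAML's source entries) consulted before normalization, falling back to 1.1.

```go
package main

import "fmt"

// modeForSource looks up a per-source JSON-LD processing mode,
// defaulting to 1.1 (json-gold's target). The overrides map is a
// hypothetical stand-in for a setting in the config YAML.
func modeForSource(domain string, overrides map[string]string) string {
	if mode, ok := overrides[domain]; ok {
		return mode
	}
	return "json-ld-1.1"
}

func main() {
	overrides := map[string]string{
		"data.aad.gov.au": "json-ld-1.0", // hypothetical per-source setting
	}
	fmt.Println(modeForSource("data.aad.gov.au", overrides)) // json-ld-1.0
	fmt.Println(modeForSource("nsidc.org", overrides))       // json-ld-1.1
}
```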
Context issues. #129 and #130 should help resolve this for empty normalized triples. The new identifier approach will also help a bit with this: it will find an ID, and if NormalizeTriples comes back empty, it will SHA the whole JSON.
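That fallback can be sketched as follows (assumed behavior based on the comment above, not the actual implementation in those PRs; SHA-256 is assumed for illustration): hash the normalized triples when they exist, otherwise hash the raw JSON, so distinct documents no longer collapse onto the empty-string SHA.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// shaForDocument hashes the normalized triples when present and falls
// back to hashing the raw JSON when normalization came back empty.
func shaForDocument(normalizedTriples, rawJSON string) string {
	if normalizedTriples == "" {
		return fmt.Sprintf("%x", sha256.Sum256([]byte(rawJSON)))
	}
	return fmt.Sprintf("%x", sha256.Sum256([]byte(normalizedTriples)))
}

func main() {
	a := shaForDocument("", `{"@type":"Dataset","name":"A"}`)
	b := shaForDocument("", `{"@type":"Dataset","name":"B"}`)
	fmt.Println(a != b) // true: distinct docs get distinct SHAs even with empty normalization
}
```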
This one is puzzling me. So, in my logs for crawling http://nsidc.org/, I have a bunch of non-identical JSON-LD objects which are getting the same hash generated for them. I poked around and figured out that this is because proc.Normalize (line 38 in calcShaNorm.go) is generating an empty string. And when you calculate the SHA of a bunch of identical empty strings, it's going to be the same. Here's the config to crawl that site:
And, also, they have their context specified with no trailing slash, and not https, so you need to add this to contextmaps. Is that a clue? Is json-gold not able to normalize a JSON-LD object that is set up this way?
I'm also finding that once I am able to get unique JSON-LD objects for each of the AADC sites in their sitemap*, it only generates 3 different SHAs for the whole set of them. I haven't looked into that much further.