Closed oxinabox closed 6 years ago
DataDryad also supports this, though like DataVerse it doesn't have download links. Having this would bring us to supporting DataDryad via 4 methods.
@SebastinSanty to get you started, so you see what I mean by getting the script element out of the page, here is the core of the code to do that.
These sites are not exposing it as an HTTP content type. Rather, they are just sticking it in script blocks somewhere in their pages, so that Google etc. find it and index it well.
julia> using HTTP
julia> using Gumbo, Cascadia, AbstractTrees
julia> using JSON
julia> using DataDepsGenerators: getpage, text_only
julia> function get_linked_data(url)
           page = getpage(url)
           # XPath equivalent: '//script[@type="application/ld+json"]/text()'
           pattern = sel"script[type=\"application/ld+json\"]"
           jsonld_blocks = matchall(pattern, page.root)
           if length(jsonld_blocks) == 0
               error("No JSON-LD Linked Data Found")
           end
           # Expect exactly one JSON-LD block per page
           @assert length(jsonld_blocks) == 1
           script_block = text_only(first(jsonld_blocks))
           JSON.parse(script_block)
       end
get_linked_data (generic function with 1 method)
julia>
julia>
julia> get_linked_data("https://www.kaggle.com/stackoverflow/stack-overflow-2018-developer-survey")
Dict{String,Any} with 20 entries:
"isAccessibleForFree" => true
"keywords" => Any["technology and applied sciences > computing > internet", "technology and ap…
"discussionUrl" => "https://www.kaggle.com/stackoverflow/stack-overflow-2018-developer-survey/discussion"
"alternateName" => "Individual responses on the 2018 Developer Survey fielded by Stack Overflow"
"name" => "Stack Overflow 2018 Developer Survey"
"sameAs" => "https://staging.kaggle.com/stackoverflow/stack-overflow-2018-developer-survey"
"thumbnailUrl" => "https://kaggle2.blob.core.windows.net/datasets-images/26658/33968/fc9143cc6b0f883c51b…
"distribution" => Any[Dict{String,Any}(Pair{String,Any}("requiresSubscription", true),Pair{String,Any}("…
"version" => 2
"description" => "### Context\n\nEach year, we at [Stack Overflow](https://stackoverflow.com/) ask the …
"@context" => "http://schema.org/"
"creator" => Dict{String,Any}(Pair{String,Any}("name", "Stack Overflow"),Pair{String,Any}("image", …
"interactionStatistic" => Any[Dict{String,Any}(Pair{String,Any}("interactionType", "http://schema.org/CommentAct…
"url" => "https://www.kaggle.com/stackoverflow/stack-overflow-2018-developer-survey"
"includedInDataCatalog" => Dict{String,Any}(Pair{String,Any}("name", "Kaggle"),Pair{String,Any}("@type", "DataCat…
"commentCount" => 3
"identifier" => "26658"
"license" => Dict{String,Any}(Pair{String,Any}("name", "Database: Open Database, Contents: Database…
"@type" => "Dataset"
"dateModified" => "2018-05-15T16:59:54.437"
julia>
julia> get_linked_data("https://zenodo.org/record/1287281")
Dict{String,Any} with 14 entries:
"keywords" => Any["New Zealand", "2016 Earthquake", "Landslides"]
"name" => "Map of Co-Seismic Landslides for the M 7.8 Kaikoura, New Zealand Earthquake"
"distribution" => Any[Dict{String,Any}(Pair{String,Any}("fileFormat", "pdf"),Pair{String,Any}("contentUrl", "htt…
"description" => "<p>Prepared by the Research Group on Earthquake Geology in Greece (http://eqgeogr.weebly.com/…
"version" => "2"
"@context" => "https://schema.org/"
"@id" => "https://doi.org/10.5281/zenodo.1287281"
"creator" => Any[Dict{String,Any}(Pair{String,Any}("name", "Valkaniotis Sotiris"),Pair{String,Any}("@id", "…
"datePublished" => "2016-12-20"
"url" => "https://zenodo.org/record/1287281"
"inLanguage" => Dict{String,Any}(Pair{String,Any}("name", "English"),Pair{String,Any}("@type", "Language"),Pai…
"license" => "https://creativecommons.org/licenses/by/4.0/"
"identifier" => "https://doi.org/10.5281/zenodo.1287281"
"@type" => "Dataset"
julia>
julia> get_linked_data("http://dataverse.icrisat.org/dataset.xhtml?persistentId=doi:10.21421/D2/ZS6XX1")
Dict{String,Any} with 14 entries:
"keywords" => Any["Agricultural Sciences", "Pigeonpea", "Progenies", "Trials", "Agronomic data", "Da…
"schemaVersion" => "https://schema.org/version/3.3"
"name" => "Phenotypic evaluation data of medium duration Pigeonpea advanced varieties trial"
"author" => Any[Dict{String,Any}(Pair{String,Any}("name", "Sameer Kumar, CV"),Pair{String,Any}("af…
"description" => "This database includes the research work carried out on development of medium duratio…
"version" => "1"
"@context" => "http://schema.org"
"datePublished" => "2017-12-30"
"includedInDataCatalog" => Dict{String,Any}(Pair{String,Any}("name", "ICRISAT Dataverse"),Pair{String,Any}("@type…
"provider" => Dict{String,Any}(Pair{String,Any}("name", "Dataverse"),Pair{String,Any}("@type", "Orga…
"identifier" => "http://dx.doi.org/10.21421/D2/ZS6XX1"
"@type" => "Dataset"
"dateModified" => "2017-12-30"
"license" => Dict{String,Any}(Pair{String,Any}("text", "<img src =\"https://licensebuttons.net/l/by…
julia>
julia> get_linked_data("https://figshare.com/articles/_shows_examples_of_coordinated_and_uncoordinated_motion_for_dangerous_and_non_dangerous_crowd_behavior_/186003")
Dict{String,Any} with 11 entries:
"variablesMeasured" => "none"
"keywords" => "examples, coordinated, uncoordinated, non-dangerous"
"name" => "shows examples of coordinated and uncoordinated motion for dangerous and non-dangerous cr…
"sameAs" => "https://figshare.com/articles/_shows_examples_of_coordinated_and_uncoordinated_motion_for…
"distribution" => Any[Dict{String,Any}(Pair{String,Any}("contentUrl", "https://ndownloader.figshare.com/file…
"version" => "1"
"description" => "shows examples of coordinated and uncoordinated motion for dangerous and non-dangerous cr…
"@context" => "http://schema.org"
"creator" => Any[Dict{String,Any}(Pair{String,Any}("name", "Florian Raudies"),Pair{String,Any}("@type",…
"url" => "https://figshare.com/articles/_shows_examples_of_coordinated_and_uncoordinated_motion_for…
"@type" => "Dataset"
Then it is just a matter of dealing with missing fields, and with fields having different names, e.g. creator or author; either is acceptable, it appears.
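As a minimal sketch of handling such alternative names, assuming the parsed JSON-LD is a plain Dict as in the outputs above: `first_present` and `authors` are hypothetical helper names for illustration, not functions from DataDepsGenerators.

```julia
# Hypothetical helper (not part of DataDepsGenerators): return the value of the
# first key in `candidates` that is present in `data`, or `default` if none is.
function first_present(data::AbstractDict, candidates, default=nothing)
    for key in candidates
        if haskey(data, key)
            return data[key]
        end
    end
    return default
end

# "creator" and "author" both appear in the outputs above, so try both in turn.
authors(data) = first_present(data, ("creator", "author"))

# Tiny hand-written records shaped like the Zenodo and Dataverse outputs.
zenodo_like = Dict{String,Any}(
    "creator" => Any[Dict{String,Any}("name" => "Valkaniotis Sotiris")])
dataverse_like = Dict{String,Any}(
    "author" => Any[Dict{String,Any}("name" => "Sameer Kumar, CV")])
```

Calling `authors` on either record returns the list of people, and on a record with neither key it returns `nothing`, which the caller can then treat as a missing field.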
So while that code just rips it out of script elements, according to https://github.com/oxinabox/DataDepsGenerators.jl/issues/29#issuecomment-401778818 there is a service to content negotiate for JSON-LD for CrossRef and DataCite (with one interface).
So we can have a generator type hierarchy:
JSONLD <: Repo
JSONLD_Web <: JSONLD
JSONLD_DOI <: JSONLD
And then define all the methods on JSONLD, except for the generate function, which for Web will do the search I demo'ed above, and for DOI will content negotiate with data.datacite.org.
And all the actual code for dealing with the JSON once we have it can be shared.
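A rough sketch of how that layout could look with Julia's dispatch follows. Everything here is a stand-in for illustration: the `Repo` type is redefined locally rather than taken from the package, and `fetch_plan` and `dataset_name` are hypothetical names. The fetch step is stubbed out so no network access happens.

```julia
# Stand-in for the package's own `Repo` type, so this sketch is self-contained.
abstract type Repo end

abstract type JSONLD <: Repo end
struct JSONLD_Web <: JSONLD end   # JSON-LD scraped from a script block in a page
struct JSONLD_DOI <: JSONLD end   # JSON-LD obtained by content negotiation on a DOI

# The only step that differs per subtype: where the JSON-LD comes from.
# These hypothetical stubs just describe the step instead of performing it.
fetch_plan(::JSONLD_Web, id) = "scrape the ld+json script block from " * id
fetch_plan(::JSONLD_DOI, id) = "content negotiate with data.datacite.org for " * id

# Shared by every subtype: operates on the parsed JSON, however it was obtained.
dataset_name(::JSONLD, data) = get(data, "name", nothing)
```

The point of the design is that methods written against the abstract `JSONLD` type, like `dataset_name` here, are written once, and only the fetching step needs one method per concrete subtype.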
Closed in #43
I wonder if the deployment of linked data solutions is widespread enough that we can use that for arbitrary URLs. Linked data formats like JSON-LD are widely deployed because search engines etc. like them.
Not everything has it; GitHub, for example, doesn't have JSON-LD.
Figshare seems to include JSON-LD on every page:
This has everything we want including a download URL (except hashes), and
eg1
eg 2
Dataverse JSON-LD (no download)
Looking at the icrisat dataverse site: for http://dataverse.icrisat.org/dataset.xhtml?persistentId=doi:10.21421/D2/ZS6XX1
It seems like it has everything we want except the download URL (and hash).
Eg 1
e.g. 2
Same for the Harvard instance of DataVerse
Kaggle JSON-LD
Kaggle's JSON-LD is actually really detailed. Everything we want except hashes, I think. I was expecting Kaggle to be really difficult, since they are not part of something really big.
Zenodo
So if we were to make a JSON-LD based generator, it would need to be robust against things being missing, but JSON-LD is a well-defined schema, so they would be missing in predictable ways.
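As one hedged illustration of that robustness, here is a sketch that collects download URLs when they exist and returns an empty list otherwise. It assumes the URL lives under "distribution" entries with a "contentUrl" key, as in the Figshare and Zenodo records in this thread; `download_urls` is a hypothetical name, not an existing function.

```julia
# Hypothetical helper: collect download URLs from a parsed JSON-LD dataset record.
# Returns an empty vector when "distribution" or "contentUrl" is missing,
# which is the Dataverse case described above.
function download_urls(data::AbstractDict)
    dist = get(data, "distribution", [])
    # Assumption: a single distribution may appear as one object instead of a list.
    entries = dist isa AbstractDict ? [dist] : dist
    urls = String[]
    for entry in entries
        if entry isa AbstractDict && haskey(entry, "contentUrl")
            push!(urls, string(entry["contentUrl"]))
        end
    end
    return urls
end
```

A generator built this way can decide later what to do with an empty result, for example asking the user for the URL, rather than failing while parsing.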