oxinabox / DataDepsGenerators.jl

Utility for developers to help define DataDeps registration blocks, for reusing existing Data with DataDeps.jl

Use linked data to handle (nearly) arbitrary websites? #30

Closed oxinabox closed 6 years ago

oxinabox commented 6 years ago

I wonder if the deployment of linked data solutions is widespread enough that we can use that for arbitrary URLs. Linked data formats like JSON-LD are widely deployed because search engines etc. like them.

Not everything has it; GitHub, for example, doesn't have JSON-LD.

Figshare seems to include JSON-LD on every page:

This has everything we want, including a download URL, except hashes.

e.g. 1

lyndon@agent:~$ curl -LH "Accept: text/html" https://doi.org/10.6084/m9.figshare.5350216.v1   2> /dev/null | xmllint -nowarning --html --xpath '//script[@type="application/ld+json"]' - 2> /dev/null
<script type="application/ld+json">{
    "@context": "http://schema.org",
    "@type": "Dataset",
    "name": "figshare for Institutions - Information booklet",
    "description": "figshare for Institutions information booklet about managing and disseminating research data to make it more citable, shareable and discoverable.",
    "url": "https://figshare.com/articles/figshare_for_Institutions_-_Information_booklet/5350216",
    "sameAs": "https://figshare.com/articles/figshare_for_Institutions_-_Information_booklet/5350216",
    "version": "1",
    "keywords": "figshare, figshare for Institutions, data management, research management, research data, discoverability, FFI, Japan, September 2017, openscijapan2017",
    "variablesMeasured": "none",
    "creator": [

      {
        "@type": "Person",
        "name": "figshare figshare"
      }
    ],
    "distribution": [
      {
        "@type": "DataDownload",
        "contentUrl": "https://ndownloader.figshare.com/files/9194386",
        "license": "https://creativecommons.org/licenses/by/4.0/"
      }
    ]
  }

e.g. 2

lyndon@agent:~$ curl -LH "Accept: text/html" https://figshare.com/articles/_shows_examples_of_coordinated_and_uncoordinated_motion_for_dangerous_and_non_dangerous_crowd_behavior_/186003  2> /dev/null | xmllint -nowarning --html --xpath '//script[@type="application/ld+json"]' - 2> /dev/null

<script type="application/ld+json">{
    "@context": "http://schema.org",
    "@type": "Dataset",
    "name": "shows examples of coordinated and uncoordinated motion for dangerous and non-dangerous crowd behavior.",
    "description": "shows examples of coordinated and uncoordinated motion for dangerous and non-dangerous crowd behavior.\n",
    "url": "https://figshare.com/articles/_shows_examples_of_coordinated_and_uncoordinated_motion_for_dangerous_and_non_dangerous_crowd_behavior_/186003",
    "sameAs": "https://figshare.com/articles/_shows_examples_of_coordinated_and_uncoordinated_motion_for_dangerous_and_non_dangerous_crowd_behavior_/186003",
    "version": "1",
    "keywords": "examples, coordinated, uncoordinated, non-dangerous",
    "variablesMeasured": "none",
    "creator": [

      {
        "@type": "Person",
        "name": "Florian Raudies"
      }

      ,{
        "@type": "Person",
        "name": "Heiko Neumann"
      }

    ],
    "distribution": [

      {
        "@type": "DataDownload",
        "contentUrl": "https://ndownloader.figshare.com/files/515509",
        "license": "https://creativecommons.org/licenses/by/4.0/"
      }

    ]
  }</script>

Dataverse JSON-LD (no download)

Looking at the ICRISAT Dataverse site, for example http://dataverse.icrisat.org/dataset.xhtml?persistentId=doi:10.21421/D2/ZS6XX1

It seems like it has everything we want except the download URL (and hash).

e.g. 1

lyndon@agent:~$ curl -LH "Accept: text/html" http://dataverse.icrisat.org/dataset.xhtml?persistentId=doi:10.21421/D2/ZS6XX1  2> /dev/null | xmllint -nowarning --html --xpath '//script[@type="application/ld+json"]/text()' - 2> /dev/null | jq .
{
  "@context": "http://schema.org",
  "@type": "Dataset",
  "identifier": "http://dx.doi.org/10.21421/D2/ZS6XX1",
  "name": "Phenotypic evaluation data of medium duration Pigeonpea advanced varieties trial",
  "author": [
    {
      "name": "Sameer Kumar, CV",
      "affiliation": "ICRISAT"
    },
    {
      "name": "Anupama Hingane",
      "affiliation": "ICRISAT"
    }
  ],
  "datePublished": "2017-12-30",
  "dateModified": "2017-12-30",
  "version": "1",
  "description": "This database includes the research work carried out on development of medium duration pigeonpea cultivars including advanced varieties at ICRISAT Center, Patancheru (17°30'N 78°16'46E). Pigeon pea is a very important grain legume crop for food other uses in Asia and Africa. It is often cross-pollinated species with a diploid number of 2n= 2x22 and genome size of 858Mbp. Every year 50 to 100 and above new crosses (and also CMS hybrids) will be made evaluated in nurseries to develop new high yielding cultivars with adaptability to different climatic/agronomic zones. Based on their agronomic performance in nurseries for maturity time, branching pattern and number of branches, pod color, pod yield and other pest and diseases tolerance characters etc, the superior progenies will be selected and advanced to further generations (to F5s). The F5 progenies selected based on preliminary/nursery data will be evaluated along with controls in replicated (twice or thrice) trials every year for further agronomic evaluation and selection. The agronomic data (days to 50% flowering and/or maturity, plant height, grain yield, grain size and color etc) of the progenies evaluated in years 2015 were presented herewith. The trial details and plot sizes were given. This data helps us to select and advance further. Finally the few best progenies among them will be evaluated in on-farm trials (OFTs) and in multi-location trials. The best performed progenies will be considered to promote/release in respective agronomic zones.  Experiment location on Google Map",
  "keywords": [
    "Agricultural Sciences",
    "Pigeonpea",
    "Progenies",
    "Trials",
    "Agronomic data",
    "Days to 50% flowering",
    "Maturity",
    "Plant height",
    "Grain yield",
    "Grain size",
    "Plant stand",
    "100 seed weight",
    "Seed per pod",
    "Seed color"
  ],
  "schemaVersion": "https://schema.org/version/3.3",
  "license": {
    "@type": "Dataset",
    "text": "<img src =\"https://licensebuttons.net/l/by/4.0/88x31.png\">\r\nThese data and documents are licensed under a Creative Commons Attribution 4.0 International license. You may copy, distribute and transmit the data as long as you acknowledge the source through proper data citation. Disclaimer Whilst utmost care has been taken by ICRISAT and data authors while collecting and compiling the data, the data is however offered \"as is\" with no express or implied warranty. In no event shall the data authors, ICRISAT, or relevant funding agencies be liable for any actual, incidental or consequential damages arising from use of the data. By using the ICRISAT Dataverse, the user expressly acknowledges that the Data may contain some nonconformities, defects, or errors. No warranty is given that the data will meet the user's needs or expectations or that all nonconformities, defects, or errors can or will be corrected. The user should always verify actual data; therefore the user bears all responsibility in determining whether the data is fit for the user’s intended use. The user of the data should use the related publications as a baseline for their analysis whenever possible. Doing so will be an added safeguard against misinterpretation of the data. Related publications are listed in the metadata section of the Dataverse study."
  },
  "includedInDataCatalog": {
    "@type": "DataCatalog",
    "name": "ICRISAT Dataverse",
    "url": "http://dataverse.icrisat.org"
  },
  "provider": {
    "@type": "Organization",
    "name": "Dataverse"
  }
}

e.g. 2

Same for the Texas Data Repository instance of Dataverse

lyndon@agent:~$ curl -LH "Accept: text/html" https://dataverse.tdl.org/dataset.xhtml?persistentId=doi:10.18738/T8/4QKVBO  2> /dev/null | xmllint -nowarning --html --xpath '//script[@type="application/ld+json"]/text()' - 2> /dev/null | jq .
{
  "@context": "http://schema.org",
  "@type": "Dataset",
  "identifier": "http://dx.doi.org/10.18738/T8/4QKVBO",
  "name": "X-ray CT Scans of Gymnotus carapo (banded ) (TNHC 17122)",
  "author": [
    {
      "name": "The University of Texas High-Resolution X-ray CT Facility (UTCT)",
      "affiliation": "University of Texas at Austin"
    }
  ],
  "datePublished": "2018-03-30",
  "dateModified": "2018-03-30",
  "version": "1",
  "description": "X-ray CT Scans of the head of Gymnotus carapo (TNHC 17122; Venezuela, Portuguesa, Co., Cano Maraca at Urriola's Ranch, 35 km SE Guanare, 9 January 1989) for Dr. Julian Humphries of The University of Texas at Austin, Dr. Timothy Rowe of the Department of Geological Sciences, The University of Texas at Austin, and Digimorph. Specimen scanned by Matthew Colbert 25 August 2003. Voxel size XandY=0.01914mm;Z=0.041mm. Total slices = 465. Please acknowledge The University of Texas High Resolution X-ray CT Facility (UTCT), and NSF grant IIS-0208695 when using these data.",
  "keywords": [
    "Medicine, Health and Life Sciences",
    "X-ray CT Scan Data Computed Tomography"
  ],
  "schemaVersion": "https://schema.org/version/3.3",
  "license": {
    "@type": "Dataset",
    "text": "CC0",
    "url": "https://creativecommons.org/publicdomain/zero/1.0/"
  },
  "includedInDataCatalog": {
    "@type": "DataCatalog",
    "name": "Texas Data Repository Dataverse",
    "url": "https://dataverse.tdl.org"
  },
  "provider": {
    "@type": "Organization",
    "name": "Dataverse"
  }
}

Kaggle JSON-LD

Kaggle's JSON-LD is actually really detailed. It has everything we want except hashes, I think. I was expecting Kaggle to be really difficult, since they are not part of something really big.

lyndon@agent:~$ curl -LH "Accept: text/html" https://www.kaggle.com/stackoverflow/stack-overflow-2018-developer-survey 2> /dev/null | xmllint -nowarning --html --xpath '//script[@type="application/ld+json"]/text()' - 2> /dev/null | jq .

{
  "@context": "http://schema.org/",
  "@type": "Dataset",
  "name": "Stack Overflow 2018 Developer Survey",
  "description": "## Context\n\nEach year, we at [Stack Overflow](https://stackoverflow.com/) ask the developer community about everything from their favorite technologies to their job preferences. This year marks the eighth year we’ve published our Annual Developer Survey results—with the largest number of respondents yet. Over 100,000 developers took the 30-minute survey in January 2018.\n\nThis year, we covered a few new topics ranging from artificial intelligence to ethics in coding. We also found that underrepresented groups in tech responded to our survey at even lower rates than we would expect from their participation in the workforce. Want to dive into the results yourself and see what you can learn about salaries or machine learning or diversity in tech? We look forward to seeing what you find!\n\n## Content\n\nThis 2018 Developer Survey results are organized on Kaggle in two tables:\n\n**survey_results_public** contains the main survey results, one respondent per row and one column per question\n\n**survey_results_schema** contains each column name from the main results along with the question text corresponding to that column\n\nThere are 98,855 responses in this public data release. These responses are what we consider “qualified” for analytical purposes based on completion and time spent on the survey and included at least one non-PII question. Approximately 20,000 responses were started but not included here because respondents did not answer enough questions, or only answered questions with personally identifying information. Of the qualified responses, 67,441 completed the entire survey.\n\n## Acknowledgements\n\nMassive, heartfelt thanks to all Stack Overflow contributors and lurking developers of the world who took part in the survey this year. 
We value your generous participation more than you know.\n\n## Inspiration\n\nAt Stack Overflow, we put developers first and want [all developers to feel welcome and included on our site](https://stackoverflow.blog/2018/04/26/stack-overflow-isnt-very-welcoming-its-time-for-that-to-change/). Can we use our annual survey to understand what kinds of users are less likely to identify as part of our community, participate, or feel kinship with fellow developers? Check out [our blog post](https://stackoverflow.blog/2018/05/30/public-data-release-of-stack-overflows-2018-developer-survey) for more details.",
  "url": "https://www.kaggle.com/stackoverflow/stack-overflow-2018-developer-survey",
  "sameAs": "https://staging.kaggle.com/stackoverflow/stack-overflow-2018-developer-survey",
  "version": 2,
  "keywords": [
    "technology and applied sciences &gt; computing &gt; internet",
    "technology and applied sciences &gt; computing &gt; programming languages",
    "medium",
    "featured"
  ],
  "license": {
    "@type": "CreativeWork",
    "name": "Database: Open Database, Contents: Database Contents",
    "url": "http://opendatacommons.org/licenses/dbcl/1.0/"
  },
  "identifier": "26658",
  "includedInDataCatalog": {
    "@type": "DataCatalog",
    "name": "Kaggle",
    "url": "https://www.kaggle.com"
  },
  "creator": {
    "@type": "Organization",
    "name": "Stack Overflow",
    "url": "https://www.kaggle.com/stackoverflow",
    "image": "https://kaggle2.blob.core.windows.net/organizations/66/thumbnail.png%3Fr=403"
  },
  "distribution": [
    {
      "@type": "DataDownload",
      "requiresSubscription": true,
      "encodingFormat": "zip",
      "fileFormat": "zip",
      "contentUrl": "https://www.kaggle.com/stackoverflow/stack-overflow-2018-developer-survey/downloads/stack-overflow-2018-developer-survey.zip/2",
      "contentSize": "20557379 bytes"
    },
    {
      "@type": "DataDownload",
      "requiresSubscription": true,
      "encodingFormat": "csv",
      "fileFormat": "csv",
      "contentUrl": "https://www.kaggle.com/stackoverflow/stack-overflow-2018-developer-survey/downloads/survey_results_public.csv/2",
      "contentSize": "195595827 bytes"
    },
    {
      "@type": "DataDownload",
      "requiresSubscription": true,
      "encodingFormat": "csv",
      "fileFormat": "csv",
      "contentUrl": "https://www.kaggle.com/stackoverflow/stack-overflow-2018-developer-survey/downloads/survey_results_schema.csv/2",
      "contentSize": "23898 bytes"
    }
  ],
  "commentCount": 3,
  "dateModified": "2018-05-15T16:59:54.437",
  "discussionUrl": "https://www.kaggle.com/stackoverflow/stack-overflow-2018-developer-survey/discussion",
  "alternateName": "Individual responses on the 2018 Developer Survey fielded by Stack Overflow",
  "isAccessibleForFree": true,
  "thumbnailUrl": "https://kaggle2.blob.core.windows.net/datasets-images/26658/33968/fc9143cc6b0f883c51bd80e16888d0e6/dataset-card.png?t=2018-05-15-16-36-02",
  "interactionStatistic": [
    {
      "@type": "InteractionCounter",
      "interactionType": "http://schema.org/CommentAction",
      "userInteractionCount": 3
    },
    {
      "@type": "InteractionCounter",
      "interactionType": "http://schema.org/DownloadAction",
      "userInteractionCount": 1719
    },
    {
      "@type": "InteractionCounter",
      "interactionType": "http://schema.org/ViewAction",
      "userInteractionCount": 16898
    },
    {
      "@type": "InteractionCounter",
      "interactionType": "http://schema.org/LikeAction",
      "userInteractionCount": 171
    }
  ]
}

Zenodo

lyndon@agent:~$ curl -LH "Accept: text/html" https://zenodo.org/record/1287281 2> /dev/null | xmllint -nowarning --html --xpath '//script[@type="application/ld+json"]/text()' - 2> /dev/null | jq .
{
  "@context": "https://schema.org/",
  "@id": "https://doi.org/10.5281/zenodo.1287281",
  "@type": "Dataset",
  "creator": [
    {
      "@id": "https://orcid.org/0000-0003-0003-2902",
      "@type": "Person",
      "affiliation": "Dr.",
      "name": "Valkaniotis Sotiris"
    },
    {
      "@id": "https://orcid.org/0000-0001-6720-9683",
      "@type": "Person",
      "affiliation": "Dr.",
      "name": "Papathanassiou George"
    },
    {
      "@type": "Person",
      "affiliation": "Prof.",
      "name": "Pavlides Spyros"
    }
  ],
  "datePublished": "2016-12-20",
  "description": "<p>Prepared by the Research Group on Earthquake Geology in Greece (http://eqgeogr.weebly.com/)\n\n<p>Version 2 (updated)\n\n<p>With the release of new Sentinel-2 images, and other available resources for the M7.8 Kaikoura earthquake, we present an update of the Map of Co-Seismic Landslides and Surfaces Ruptures (As of 27/11/2016). Landslides were mapped using Sentinel-2 satellite images from Copernicus, European Space Agency, dated November and December 2016. Images were visually compared with previous last available S2A images without cloud cover (13 September and 26 October) and landslides and large slope failures were manually mapped. Areas covered by cloud are omitted and shown on map. 5875 landslide sites are shown in the map. A small number of landslides could have been mis-identified due to insufficient resolution of the images, small gaps of cloud cover or for other reasons. Also, re-activated landslides on the central mountainous area were unabled to identify due to imagery restrictions (medium resolution, relief shadows etc). Some local gaps in Sentinel imagery still exist due to cloud cover, but we believe the current map is very close to the major distribution of mass movement effects. Surface ruptures were mapped using Sentinel-2 imagery and approximate position from photos of the post-earthquake aerial surveys of Environment Canterbury Regional Council (http://ecan.govt.nz)\n\n<p>KML file contains7355 landslide spots.",
  "distribution": [
    {
      "@type": "DataDownload",
      "contentUrl": "https://zenodo.org/api/files/5a311c7a-bd5e-4df7-be61-341d03ec9a9b/Landslide_Map_V2_A2.pdf",
      "fileFormat": "pdf"
    },
    {
      "@type": "DataDownload",
      "contentUrl": "https://zenodo.org/api/files/5a311c7a-bd5e-4df7-be61-341d03ec9a9b/Landslides_Kaikoura_2016.kmz",
      "fileFormat": "kmz"
    },
    {
      "@type": "DataDownload",
      "contentUrl": "https://zenodo.org/api/files/5a311c7a-bd5e-4df7-be61-341d03ec9a9b/Prelim_Landslide_Map_A2.jpg",
      "fileFormat": "jpg"
    }
  ],
  "identifier": "https://doi.org/10.5281/zenodo.1287281",
  "inLanguage": {
    "@type": "Language",
    "alternateName": "eng",
    "name": "English"
  },
  "keywords": [
    "New Zealand",
    "2016 Earthquake",
    "Landslides"
  ],
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "name": "Map of Co-Seismic Landslides  for the M 7.8 Kaikoura, New Zealand Earthquake",
  "url": "https://zenodo.org/record/1287281",
  "version": "2"
}

So if we were to make a JSON-LD based generator, it would need to be robust against things being missing. But JSON-LD (here with the schema.org vocabulary) is well defined, so they would be missing in predictable ways.
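A minimal sketch of what "robust against things being missing" could look like for download URLs, assuming the page's JSON-LD has already been parsed into a Dict. The helper name `download_urls` and the two sample Dicts are hypothetical, cut down from the Figshare and Dataverse outputs above; this is not the package's implementation.

```julia
# Hypothetical helper: collect download URLs from a parsed JSON-LD Dataset.
# Returns an empty vector when "distribution" is absent (as on Dataverse).
function download_urls(linked_data::AbstractDict)
    dist = get(linked_data, "distribution", [])
    # Tolerate a single object where a list would normally appear
    dist isa AbstractDict && (dist = [dist])
    # Skip entries that carry no "contentUrl"
    return [d["contentUrl"] for d in dist if d isa AbstractDict && haskey(d, "contentUrl")]
end

# Sample data trimmed from the outputs shown earlier in this thread
figshare_like = Dict{String,Any}(
    "name" => "figshare for Institutions - Information booklet",
    "distribution" => Any[Dict{String,Any}(
        "@type" => "DataDownload",
        "contentUrl" => "https://ndownloader.figshare.com/files/9194386")],
)
dataverse_like = Dict{String,Any}(
    "name" => "Phenotypic evaluation data of medium duration Pigeonpea advanced varieties trial",
)

download_urls(figshare_like)   # one URL
download_urls(dataverse_like)  # empty, since there is no "distribution" field
```

Returning an empty vector rather than throwing leaves it to the caller to decide whether a dataset without download links is an error.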

oxinabox commented 6 years ago

DataDryad also supports this, though like Dataverse it doesn't have download links. Adding this would give us 4 ways of supporting DataDryad.

oxinabox commented 6 years ago

@SebastinSanty
To get you started, and so you can see what I mean by getting the script element out of the page, here is the core of the code to do that.

These sites are not exposing it as an HTTP content type. Rather, they are just sticking it in script blocks somewhere in their pages, so that Google etc. find it and index it well.

julia> using HTTP
julia> using Gumbo, Cascadia, AbstractTrees
julia> using JSON
julia> using DataDepsGenerators: getpage, text_only

julia> function get_linked_data(url)
           page = getpage(url)
           # Equivalent XPath: '//script[@type="application/ld+json"]/text()'
           pattern = sel"script[type=\"application/ld+json\"]"
           # `matchall` is the Cascadia name used here; later Cascadia releases call it `eachmatch`
           jsonld_blocks = matchall(pattern, page.root)
           if isempty(jsonld_blocks)
               error("No JSON-LD Linked Data Found")
           end
           # Assumes exactly one JSON-LD block per page
           @assert length(jsonld_blocks) == 1
           script_block = text_only(first(jsonld_blocks))
           JSON.parse(script_block)
       end
get_linked_data (generic function with 1 method)

julia> get_linked_data("https://www.kaggle.com/stackoverflow/stack-overflow-2018-developer-survey")
Dict{String,Any} with 20 entries:
  "isAccessibleForFree"   => true
  "keywords"              => Any["technology and applied sciences &gt; computing &gt; internet", "technology and ap…
  "discussionUrl"         => "https://www.kaggle.com/stackoverflow/stack-overflow-2018-developer-survey/discussion"
  "alternateName"         => "Individual responses on the 2018 Developer Survey fielded by Stack Overflow"
  "name"                  => "Stack Overflow 2018 Developer Survey"
  "sameAs"                => "https://staging.kaggle.com/stackoverflow/stack-overflow-2018-developer-survey"
  "thumbnailUrl"          => "https://kaggle2.blob.core.windows.net/datasets-images/26658/33968/fc9143cc6b0f883c51b…
  "distribution"          => Any[Dict{String,Any}(Pair{String,Any}("requiresSubscription", true),Pair{String,Any}("…
  "version"               => 2
  "description"           => "### Context\n\nEach year, we at [Stack Overflow](https://stackoverflow.com/) ask the …
  "@context"              => "http://schema.org/"
  "creator"               => Dict{String,Any}(Pair{String,Any}("name", "Stack Overflow"),Pair{String,Any}("image", …
  "interactionStatistic"  => Any[Dict{String,Any}(Pair{String,Any}("interactionType", "http://schema.org/CommentAct…
  "url"                   => "https://www.kaggle.com/stackoverflow/stack-overflow-2018-developer-survey"
  "includedInDataCatalog" => Dict{String,Any}(Pair{String,Any}("name", "Kaggle"),Pair{String,Any}("@type", "DataCat…
  "commentCount"          => 3
  "identifier"            => "26658"
  "license"               => Dict{String,Any}(Pair{String,Any}("name", "Database: Open Database, Contents: Database…
  "@type"                 => "Dataset"
  "dateModified"          => "2018-05-15T16:59:54.437"

julia> get_linked_data("https://zenodo.org/record/1287281")
Dict{String,Any} with 14 entries:
  "keywords"      => Any["New Zealand", "2016 Earthquake", "Landslides"]
  "name"          => "Map of Co-Seismic Landslides  for the M 7.8 Kaikoura, New Zealand Earthquake"
  "distribution"  => Any[Dict{String,Any}(Pair{String,Any}("fileFormat", "pdf"),Pair{String,Any}("contentUrl", "htt…
  "description"   => "<p>Prepared by the Research Group on Earthquake Geology in Greece (http://eqgeogr.weebly.com/…
  "version"       => "2"
  "@context"      => "https://schema.org/"
  "@id"           => "https://doi.org/10.5281/zenodo.1287281"
  "creator"       => Any[Dict{String,Any}(Pair{String,Any}("name", "Valkaniotis Sotiris"),Pair{String,Any}("@id", "…
  "datePublished" => "2016-12-20"
  "url"           => "https://zenodo.org/record/1287281"
  "inLanguage"    => Dict{String,Any}(Pair{String,Any}("name", "English"),Pair{String,Any}("@type", "Language"),Pai…
  "license"       => "https://creativecommons.org/licenses/by/4.0/"
  "identifier"    => "https://doi.org/10.5281/zenodo.1287281"
  "@type"         => "Dataset"

julia> get_linked_data("http://dataverse.icrisat.org/dataset.xhtml?persistentId=doi:10.21421/D2/ZS6XX1")
Dict{String,Any} with 14 entries:
  "keywords"              => Any["Agricultural Sciences", "Pigeonpea", "Progenies", "Trials", "Agronomic data", "Da…
  "schemaVersion"         => "https://schema.org/version/3.3"
  "name"                  => "Phenotypic evaluation data of medium duration Pigeonpea advanced varieties trial"
  "author"                => Any[Dict{String,Any}(Pair{String,Any}("name", "Sameer Kumar, CV"),Pair{String,Any}("af…
  "description"           => "This database includes the research work carried out on development of medium duratio…
  "version"               => "1"
  "@context"              => "http://schema.org"
  "datePublished"         => "2017-12-30"
  "includedInDataCatalog" => Dict{String,Any}(Pair{String,Any}("name", "ICRISAT Dataverse"),Pair{String,Any}("@type…
  "provider"              => Dict{String,Any}(Pair{String,Any}("name", "Dataverse"),Pair{String,Any}("@type", "Orga…
  "identifier"            => "http://dx.doi.org/10.21421/D2/ZS6XX1"
  "@type"                 => "Dataset"
  "dateModified"          => "2017-12-30"
  "license"               => Dict{String,Any}(Pair{String,Any}("text", "<img src =\"https://licensebuttons.net/l/by…

julia> get_linked_data("https://figshare.com/articles/_shows_examples_of_coordinated_and_uncoordinated_motion_for_dangerous_and_non_dangerous_crowd_behavior_/186003")
Dict{String,Any} with 11 entries:
  "variablesMeasured" => "none"
  "keywords"          => "examples, coordinated, uncoordinated, non-dangerous"
  "name"              => "shows examples of coordinated and uncoordinated motion for dangerous and non-dangerous cr…
  "sameAs"            => "https://figshare.com/articles/_shows_examples_of_coordinated_and_uncoordinated_motion_for…
  "distribution"      => Any[Dict{String,Any}(Pair{String,Any}("contentUrl", "https://ndownloader.figshare.com/file…
  "version"           => "1"
  "description"       => "shows examples of coordinated and uncoordinated motion for dangerous and non-dangerous cr…
  "@context"          => "http://schema.org"
  "creator"           => Any[Dict{String,Any}(Pair{String,Any}("name", "Florian Raudies"),Pair{String,Any}("@type",…
  "url"               => "https://figshare.com/articles/_shows_examples_of_coordinated_and_uncoordinated_motion_for…
  "@type"             => "Dataset"

Then it is just a matter of dealing with missing fields, and with fields having different names, e.g. creator or author; it appears either is acceptable.
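As a rough illustration of handling differently named fields, here is a sketch that reads the credited names from either `creator` or `author`. It also copes with the field being a single object (as in the Kaggle output) rather than a list. The helper names `first_present` and `author_names` are hypothetical, and the sample Dicts are trimmed from the outputs above.

```julia
# Hypothetical helper: value of the first candidate key that is present.
function first_present(linked_data::AbstractDict, candidates...; default=nothing)
    for key in candidates
        haskey(linked_data, key) && return linked_data[key]
    end
    return default
end

# Names of the credited people or organisations, whichever field was used.
function author_names(linked_data::AbstractDict)
    people = first_present(linked_data, "creator", "author"; default=[])
    # Kaggle gives one object here; the other sites give a list
    people isa AbstractDict && (people = [people])
    return [p["name"] for p in people if haskey(p, "name")]
end

kaggle_like = Dict("creator" => Dict("@type" => "Organization", "name" => "Stack Overflow"))
dataverse_like = Dict("author" => [
    Dict("name" => "Sameer Kumar, CV", "affiliation" => "ICRISAT"),
    Dict("name" => "Anupama Hingane", "affiliation" => "ICRISAT"),
])

author_names(kaggle_like)     # names taken from "creator"
author_names(dataverse_like)  # names taken from "author"
```

The same `first_present` pattern would presumably extend to other fields that vary between sites, such as `license` being a plain URL string on Zenodo but an object on Kaggle and Dataverse.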

oxinabox commented 6 years ago

So while that code is for just ripping it out of script elements, according to https://github.com/oxinabox/DataDepsGenerators.jl/issues/29#issuecomment-401778818 there is a service to content negotiate for JSON-LD for CrossRef and DataCite (with one interface).

So we can have a generator type

JSONLD <: Repo
JSONLD_Web <: JSONLD 
JSONLD_DOI <: JSONLD

Then define all the methods on JSONLD, except for the generate function. For Web, generate will do the search I demoed above; for DOI, it will content negotiate with data.datacite.org. All the actual code for dealing with the JSON once we have it can be shared.
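The dispatch structure described here could be sketched roughly as follows. Every name is hypothetical: `Repo` stands in for the package's existing abstract repository type, and `fetch_linked_data`, `dataset_name` and `describe` are illustrative only. The two fetch methods are stubs that return placeholder data, so the sketch runs offline. A real version would scrape the script block for Web and content negotiate for DOI.

```julia
# Hypothetical type hierarchy mirroring the one proposed above
abstract type Repo end
abstract type JSONLD <: Repo end
struct JSONLD_Web <: JSONLD end
struct JSONLD_DOI <: JSONLD end

# Only the fetching step differs per subtype.
# Both methods are placeholders standing in for the real network calls.
fetch_linked_data(::JSONLD_Web, url) = Dict{String,Any}("name" => "from script block at $url")
fetch_linked_data(::JSONLD_DOI, doi) = Dict{String,Any}("name" => "from content negotiation for $doi")

# Shared code, written once and used by every JSONLD subtype
dataset_name(linked_data) = get(linked_data, "name", "unnamed dataset")

function describe(repo::JSONLD, id)
    linked_data = fetch_linked_data(repo, id)  # dispatches on the concrete subtype
    return dataset_name(linked_data)
end

describe(JSONLD_Web(), "https://zenodo.org/record/1287281")
describe(JSONLD_DOI(), "10.5281/zenodo.1287281")
```

Because `describe` is defined on the abstract `JSONLD` type, adding another way of obtaining the JSON would only need a new subtype and one new `fetch_linked_data` method.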

oxinabox commented 6 years ago

Closed in #43