Figshare records with external resources are kinda broken

oxinabox commented 6 years ago

This is a pathological case: http://doi.org/10.6084/m9.figshare.5557801.v1. It is a Document on Figshare with an external file.

I do not think this is worth fixing any time soon. It is a fairly rare corner case, and fiddly to fix.

I am just noting it down for record keeping.

Wrong Outputs:

Figshare generator:

julia> generate(Figshare(), "http://doi.org/10.6084/m9.figshare.5557801.v1") |> println
WARNING: Generated registration block uses MD5 hashes, the MD5.jl package will be required.
register(DataDep(
    "Practices and documentation in the Open Source community",
    """
        Dataset: Practices and documentation in the Open Source community
        Website: https://figshare.com/articles/Practices_and_documentation_in_the_Open_Source_community/5557801
        Author: Chris Holdgraf, 0000-0002-8748-6546
        License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
        Date: 2017-10-31T20:44:31Z

        Responses and analysis code to a questionnaire asking open source developers about their practices in open source software, and their beliefs about documentation's role in the community.

        Please cite this work:
        Holdgraf, Chris; 0000-0002-8748-6546 (2017): Practices and documentation in the Open Source community. figshare. Dataset.
        if you use this in your research.
    """,
    Any["https://github.com/choldgraf/blog-documentation_questionnaire"],
    [(md5, )]
))

Broken hash, and the URL does not point to a downloadable file.
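
A cheap way to flag this kind of output would be a heuristic on the URL itself. The sketch below is not part of DataDepsGenerators.jl; `looks_like_file` is a hypothetical helper, and it assumes that a downloadable file usually ends in a file extension. That assumption is neither necessary nor sufficient: dotfiles such as `.gitignore` are missed, and a version suffix such as `.v1` on a DOI is mistaken for an extension.

```julia
# Hypothetical helper (not part of DataDepsGenerators.jl).
# Heuristic only: treat a URL as a probable file if the last segment of its
# path carries a file extension.
function looks_like_file(url::AbstractString)
    # Strip the scheme and host, keeping only what follows them.
    m = match(r"^[a-z][a-z0-9+.-]*://[^/]+(.*)$"i, url)
    m === nothing && return false
    # Drop any query string or fragment before inspecting the path.
    path = first(split(m[1], ['?', '#']))
    last_segment = last(split(path, '/'))
    _, ext = splitext(last_segment)
    return !isempty(ext)
end

# The repository link from the Figshare output above is flagged:
looks_like_file("https://github.com/choldgraf/blog-documentation_questionnaire")  # false
```

A generator could warn when this returns `false` for a URL it is about to emit, rather than trying to hash whatever the URL happens to return.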

JSONLD_Web generator

julia> generate(JSONLD_Web(), "http://doi.org/10.6084/m9.figshare.5557801.v1") |> println
register(DataDep(
    "Practices and documentation in the Open Source community",
    """
        Dataset: Practices and documentation in the Open Source community
        Website: http://doi.org/10.6084/m9.figshare.5557801.v1
        Author: Chris Holdgraf, 0000-0002-8748-6546
        Date: missing
        License: missing

        Responses and analysis code to a questionnaire asking open source developers about their practices in open source software, and their beliefs about documentation&#39;s role in the community.
    """,
    String["https://github.com/choldgraf/blog-documentation_questionnaire"],
))

The URL is still wrong.

(Normal) incomplete outputs:

DataCite generator

julia> generate(DataCite(), "http://doi.org/10.6084/m9.figshare.5557801.v1") |> println
INFO: DataCite based generation can only generate partial registration blocks, as DataCite metadata does not (currently) include the URL to the resource. You will have to edit in the URL after generation.
register(DataDep(
    "Practices and documentation in the Open Source community",
    """
        Dataset: Practices and documentation in the Open Source community
        Website: https://doi.org/10.6084/m9.figshare.5557801.v1
        Author: Chris Holdgraf, 0000-0002-8748-6546
        License: https://creativecommons.org/licenses/by/4.0/
        Date: 2017

        Responses and analysis code to a questionnaire asking open source developers about their practices in open source software, and their beliefs about documentation's role in the community.

        Please cite this dataset:
        Holdgraf, C., & 0000-0002-8748-6546. (2017). Practices and documentation in the Open Source community [Data set]. Figshare. https://doi.org/10.6084/m9.figshare.5557801.v1

        if you use this in your research.
    """,
    String["PUT DOWNLOAD URL HERE"],

))

This is actually as good as DataCite ever is.

JSONLD_DOI generator

julia> generate(JSONLD_DOI(), "http://doi.org/10.6084/m9.figshare.5557801.v1") |> println
register(DataDep(
    "Practices and documentation in the Open Source community",
    """
        Dataset: Practices and documentation in the Open Source community
        Website: http://doi.org/10.6084/m9.figshare.5557801.v1
        Author: Chris Holdgraf, 0000-0002-8748-6546
        Date: 2017
        License: https://creativecommons.org/licenses/by/4.0

        Responses and analysis code to a questionnaire asking open source developers about their practices in open source software, and their beliefs about documentation's role in the community.
    """,
    missing,
))

This is fine. Like DataCite, it is missing the URLs, as usual.

SebastinSanty commented 6 years ago

Maybe implement some sort of recursive download for GitHub-based repos (or any folder-based format, for that matter) in DataDeps.jl?

oxinabox commented 6 years ago

Maybe, yes. Something like an (opt-in?) post-processing step that tries to generate metadata for the URLs that are being downloaded, and then takes the URLs from that, or something.

The GitHub generator gets the files right, but has inferior metadata on the creator etc.:

julia> generate(GitHub(), "https://github.com/choldgraf/blog-documentation_questionnaire") |> println
register(DataDep(
    "blog-documentation_questionnaire",
    """
        Dataset: blog-documentation_questionnaire
        Website: https://github.com/choldgraf/blog-documentation_questionnaire
        License: Unknown

        # blog-documentation_questionnaire
        A public repository for data + analyses for a blog post on documentation
    """,
    Any[Any["https://cdn.rawgit.com/choldgraf/blog-documentation_questionnaire/1e145ef3d167d7fe8fd48434433069ae3d3f0193/data/contribs.csv", "https://cdn.rawgit.com/choldgraf/blog-documentation_questionnaire/1e145ef3d167d7fe8fd48434433069ae3d3f0193/data/credit_enjoyment.csv"], Any["https://cdn.rawgit.com/choldgraf/blog-documentation_questionnaire/1e145ef3d167d7fe8fd48434433069ae3d3f0193/figures/plot_contrib_type_bar.png", "https://cdn.rawgit.com/choldgraf/blog-documentation_questionnaire/1e145ef3d167d7fe8fd48434433069ae3d3f0193/figures/plot_credit_enjoyment.png", "https://cdn.rawgit.com/choldgraf/blog-documentation_questionnaire/1e145ef3d167d7fe8fd48434433069ae3d3f0193/figures/plot_diff_hist.png", "https://cdn.rawgit.com/choldgraf/blog-documentation_questionnaire/1e145ef3d167d7fe8fd48434433069ae3d3f0193/figures/plot_docs_diff_compare.png", "https://cdn.rawgit.com/choldgraf/blog-documentation_questionnaire/1e145ef3d167d7fe8fd48434433069ae3d3f0193/figures/plot_docs_usual_should.png"], "https://cdn.rawgit.com/choldgraf/blog-documentation_questionnaire/1e145ef3d167d7fe8fd48434433069ae3d3f0193/.gitignore", "https://cdn.rawgit.com/choldgraf/blog-documentation_questionnaire/1e145ef3d167d7fe8fd48434433069ae3d3f0193/README.md", "https://cdn.rawgit.com/choldgraf/blog-documentation_questionnaire/1e145ef3d167d7fe8fd48434433069ae3d3f0193/analysis.py", "https://cdn.rawgit.com/choldgraf/blog-documentation_questionnaire/1e145ef3d167d7fe8fd48434433069ae3d3f0193/plot_figs.py"],

))
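
The post-processing idea could be sketched roughly as follows. Everything here is an assumption for illustration: `BlockMeta` and `expand_urls` are hypothetical names, not types or functions from DataDepsGenerators.jl, and `resolve` stands in for "run a second generator on this URL". The point is only that the outer metadata (title, author) is kept, while any URL that a second generator can expand is replaced by the file URLs it found.

```julia
# Hypothetical container for the parts of a registration block that matter
# here. NOT a type from DataDepsGenerators.jl.
struct BlockMeta
    name::String
    author::Union{String,Missing}
    urls::Vector{String}
end

# `resolve(url)` is supplied by the caller. It returns `nothing` when the URL
# should be kept as is, or a BlockMeta generated for that URL otherwise.
function expand_urls(meta::BlockMeta, resolve)
    urls = String[]
    for url in meta.urls
        inner = resolve(url)
        if inner === nothing
            push!(urls, url)           # nothing better known, keep the URL
        else
            append!(urls, inner.urls)  # take the files from the inner block
        end
    end
    # Keep the richer outer metadata, swap in the expanded URLs.
    return BlockMeta(meta.name, meta.author, urls)
end

# Usage with a stand-in resolver that only knows the repository above.
repo = "https://github.com/choldgraf/blog-documentation_questionnaire"
csv = "https://cdn.rawgit.com/choldgraf/blog-documentation_questionnaire/1e145ef3d167d7fe8fd48434433069ae3d3f0193/data/contribs.csv"
figshare_meta = BlockMeta("Practices and documentation in the Open Source community",
                          "Chris Holdgraf", [repo])
resolver = url -> url == repo ? BlockMeta("blog-documentation_questionnaire", missing, [csv]) : nothing
merged = expand_urls(figshare_meta, resolver)
```

Passing the resolver in as an argument keeps the step opt-in: a caller that does not want recursion can pass `url -> nothing` and get the block back unchanged.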

oxinabox commented 5 years ago

While I remember: Figshare is actually breaking the spec, as per https://schema.org/DataDownload.

contentUrl is only for linking to "Actual bytes of the media object".

They should be using url or mainEntityOfPage when linking to external sites like that.
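
To make the distinction concrete, here is a small sketch of how a consumer that trusts the spec would treat the two properties. The dictionaries are illustrative records written for this example, not Figshare's actual JSON-LD, and `download_url` is a hypothetical helper.

```julia
# Illustrative records only; these are not copied from Figshare's metadata.
repo = "https://github.com/choldgraf/blog-documentation_questionnaire"

# External page linked via contentUrl: a consumer will try to download it.
with_content_url = Dict("@type" => "DataDownload", "contentUrl" => repo)

# Same link via url: it is presented as a page, not as file bytes.
with_url = Dict("@type" => "DataDownload", "url" => repo)

# Hypothetical helper: only contentUrl is read as a downloadable file.
download_url(record::AbstractDict) = get(record, "contentUrl", nothing)

download_url(with_content_url)  # the repository page, wrongly treated as a file
download_url(with_url)          # nothing, so no bogus download is attempted
```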

oxinabox / DataDepsGenerators.jl