oxinabox / DataDepsGenerators.jl

Utility for developers to help define DataDeps registration blocks, for reusing existing Data with DataDeps.jl

Add Figshare API #34

Closed SebastinSanty closed 6 years ago

SebastinSanty commented 6 years ago

Yes, this is being done using the Figshare API. I thought of doing OAI-PMH afterwards, since you were keener to get Figshare and Dataverse integrated first.

My thoughts are: #30 generally gives amazing results but has various attributes missing. However, all of the results mention the source of the data explicitly. What if we have individual types like Figshare() and Dataverse(), plus a general JSONLD() which fetches the JSON-LD, parses the source, and passes it to one of the types we have already integrated?
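A minimal sketch of that hand-off, assuming the source can be recognised from a URL substring. Only the names Figshare, Dataverse, and JSONLD come from this thread; `KNOWN_SOURCES`, `specialised_repo`, and the URL patterns are hypothetical.

```julia
# Hypothetical repo types; the real ones in the package may carry fields.
abstract type DataRepo end
struct Figshare <: DataRepo end
struct Dataverse <: DataRepo end
struct JSONLD <: DataRepo end

# Assumed mapping from a substring of the source URL to an integrated repo type.
const KNOWN_SOURCES = [
    "figshare.com" => Figshare,
    "dataverse" => Dataverse,
]

# Return an instance of the specialised repo for a source URL,
# or `nothing` if no integrated repo matches.
function specialised_repo(source_url::AbstractString)
    url = lowercase(source_url)
    for (pattern, repotype) in KNOWN_SOURCES
        occursin(pattern, url) && return repotype()
    end
    return nothing
end
```

With this, JSONLD() would call `specialised_repo` on the source it parsed and fall back to its own (less complete) metadata when the result is `nothing`.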

oxinabox commented 6 years ago

> My thoughts are: #30 generally gives amazing results but has various attributes missing. However, all of the results mention the source of the data explicitly. What if we have individual types like Figshare() and Dataverse(), plus a general JSONLD() which fetches the JSON-LD, parses the source, and passes it to one of the types we have already integrated?

I had a perhaps slightly more ambitious thought along those lines. Or perhaps just a bit more silly.

The notion is to define a version of generate that doesn't take a repo and instead tries all of them (in parallel). It assesses the returned Metadata** from those that don't throw exceptions, favouring the fewest missing fields and the presence of the important fields (like the download URL), and selects the best. Or maybe merges them. Then it generates the code.

Pseudocode follows:

using DataStructures: PriorityQueue  # PriorityQueue is not in Base

function generate(dataname)
    failures = Vector{Any}()
    # Maps metadata => score. Assuming a higher score is better,
    # the Reverse ordering puts the highest score first.
    results = PriorityQueue{Any,Float64}(Base.Order.Reverse)

    asyncmap(all_subtypes(DataRepo)) do repotype
        try
            metadata = find_metadata(repotype(), dataname)
            score = evaluate_metadata(metadata)
            results[metadata] = score
        catch err
            # We want to remember the failure,
            # so if we can't find any that work, we can explain what we tried,
            # and people can make good bug reports if one of those should have worked

            push!(failures, (repotype, err))
        end
    end

    if !isempty(results)
        # We got (at least) one
        metadata, score = first(results) # get the best
        generate_code(metadata)
    else
        # Got none
        println("We were unable to find usable metadata for $dataname")
        println("We tried the following methods, and they failed for the reasons listed")
        println()
        for (repo, err) in failures
            println("# $repo")
            println(err)
            println()
        end
    end
end

I think it is more general to just try everything and see what works and how well it does, rather than rely on one API (which may or may not exist or be complete) to discover a better one. Running them asynchronously should make it pretty fast, as the lookups wait on their page downloads concurrently rather than one after another.

I guess you could call this the "Throw everything at the wall and see what sticks" method of finding the metadata.
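As a rough sketch of the scoring and merging steps, assuming each repo returns its metadata as a NamedTuple with `missing` for anything it could not find. The field names, the weight of 10, and `merge_metadata` are all made up for illustration.

```julia
# Hypothetical field names; a field counts as absent if it is not there or is `missing`.
const IMPORTANT_FIELDS = (:download_url,)
const OPTIONAL_FIELDS = (:title, :description, :author, :license)

# Higher is better: one point per optional field present,
# and an arbitrary weight of 10 per important field present.
function evaluate_metadata(metadata::NamedTuple)
    present(field) = haskey(metadata, field) && !ismissing(metadata[field])
    return 10 * count(present, IMPORTANT_FIELDS) + count(present, OPTIONAL_FIELDS)
end

# Alternative to picking one winner: combine field by field,
# keeping the first non-missing value (candidates ordered most trusted first).
function merge_metadata(candidates)
    merged = Dict{Symbol,Any}()
    for candidate in candidates, (field, value) in pairs(candidate)
        if !ismissing(value) && !haskey(merged, field)
            merged[field] = value
        end
    end
    return merged
end

candidates = [
    (title = "Example", description = missing, download_url = "https://example.org/data.csv"),
    (title = "Example", description = "Longer text", author = "A. Person", download_url = missing),
]
scores = map(evaluate_metadata, candidates)
best = candidates[argmax(scores)]
merged = merge_metadata(candidates)
```

Here the first candidate wins on score because it has the download URL, while the merged result keeps that URL and also picks up the description and author from the second.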

(**We'd probably need to add more optional fields to the Metadata type, such as the fairly standard items that currently live in the free-text description.)
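Something like the following is what I have in mind for the optional fields. This is only a sketch: the field names are guesses and the real Metadata type in the package may look quite different.

```julia
# Hypothetical layout; optional fields default to `missing`.
Base.@kwdef struct Metadata
    dataset_url::String                                  # assumed to always be known
    title::Union{Missing,String} = missing
    license::Union{Missing,String} = missing
    authors::Union{Missing,Vector{String}} = missing
    published_date::Union{Missing,String} = missing
end

# Names of the fields a repo left unfilled, in declaration order.
missing_fields(m::Metadata) =
    [f for f in fieldnames(Metadata) if ismissing(getfield(m, f))]

example = Metadata(dataset_url = "https://example.org/dataset", title = "Example")
unfilled = missing_fields(example)
```

Having the gaps represented as `missing` makes "fewest missing fields" a simple count over `missing_fields`.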

codecov-io commented 6 years ago

Codecov Report

Merging #34 into master will increase coverage by 0.13%. The diff coverage is 94.44%.

@@            Coverage Diff             @@
##           master      #34      +/-   ##
==========================================
+ Coverage   93.93%   94.07%   +0.13%     
==========================================
  Files          14       15       +1     
  Lines         264      287      +23     
==========================================
+ Hits          248      270      +22     
- Misses         16       17       +1

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/DataOneV2/KNB.jl | 80% <ø> (ø) | :arrow_up: |
| src/CKAN.jl | 95% <ø> (ø) | :arrow_up: |
| src/DataOneV1.jl | 100% <ø> (ø) | :arrow_up: |
| src/DataDepsGenerators.jl | 94.59% <100%> (+3.16%) | :arrow_up: |
| src/DataCite.jl | 95.23% <100%> (-0.92%) | :arrow_down: |
| src/Figshare.jl | 93.33% <93.33%> (ø) | |
| src/generic_extractors.jl | 100% <0%> (ø) | :arrow_up: |
| ... and 1 more | | |

SebastinSanty commented 6 years ago

Yes, yours sounds more ambitious with the asynchronous part. It sounds interesting and I am up for it.

> tries all of them (in parallel)

But even to build that, wouldn't I need the individual parts (like DataCite and Figshare) to be working first? Wouldn't implementing each of them separately and then clubbing them together, as you have described, be the better way to proceed?

oxinabox commented 6 years ago

> Wouldn't implementing all of them separately and then clubbing them together as you have noted be the better way to proceed?

Yes, that is what I was meaning. Make them all separately and club together the ones that don't throw exceptions for a given identifier.
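A small runnable toy of that pattern, with made-up lookups standing in for the real repos (the `lookup_*` functions, `try_all`, and the identifier are hypothetical). Each lookup sleeps to imitate a download, and `asyncmap` lets those waits overlap.

```julia
# Toy stand-ins for per-repo lookups; one of them always fails.
function lookup_a(id)
    sleep(0.1)
    return (source = "A", id = id)
end
function lookup_b(id)
    sleep(0.1)
    error("B does not know $id")
end
function lookup_c(id)
    sleep(0.1)
    return (source = "C", id = id)
end

# Run every lookup concurrently and split the outcomes into
# results that worked and the exceptions from those that did not.
function try_all(lookups, id)
    outcomes = asyncmap(lookups) do lookup
        try
            (ok = true, value = lookup(id))
        catch err
            (ok = false, value = err)
        end
    end
    successes = [o.value for o in outcomes if o.ok]
    failures = [o.value for o in outcomes if !o.ok]
    return successes, failures
end

successes, failures = try_all([lookup_a, lookup_b, lookup_c], "10.1234/example")
```

The successes are what would get scored (or merged), and the failures are kept only so they can be reported if nothing worked.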