oxinabox / DataDepsGenerators.jl

Utility for developers to help define DataDeps registration blocks, for reusing existing Data with DataDeps.jl
Other
18 stars 6 forks source link

Multisource data acquisition #47

Closed SebastinSanty closed 5 years ago

SebastinSanty commented 5 years ago

@oxinabox I am not sure, but the datatypes are not matching for some reason. Can you have a look?

PS: I am running only include("test/CKAN.jl") and only one generator for now without any async part, so as to debug and localize errors easily.

Also, your reviews on the implementation will be really helpful.

SebastinSanty commented 5 years ago

For your quick reference, this is the datatypes I am returning:

String
String
String
String
Array{AbstractString,1}
String
String
String
String
String
Array{Any,1}
String

And these are the metadata datatypes:

shortname::Opt{AbstractString}
fullname::Opt{AbstractString}
website::Opt{AbstractString}
description::Opt{AbstractString}
author::Opt{Vector{AbstractString}}
maintainer::Opt{AbstractString}
license::Opt{AbstractString}
published_date::Opt{Union{Date, DateTime,AbstractString}}
create_date::Opt{Union{Date, DateTime,AbstractString}}
modified_date::Opt{Union{Date, DateTime,AbstractString}}
paper_cite::Opt{AbstractString}
dataset_cite::Opt{AbstractString}
dataurls::Opt{Vector}
datachecksums::Any
SebastinSanty commented 5 years ago

Hi @oxinabox , I think this missed to hit your mailbox. A small review will be appreciated :)

oxinabox commented 5 years ago

This is not great.

You need to make sure your code is consistent and logical for how things of different types are combined. This means that merging 2 missings should give a missing, not an empty string. Similar for what happens with merging vectors. This is why you are getting type errors.

In general you really want to be constructing a function that takes in something like combine(afield, bfield, Val{fieldname}()) Which you could then invoke in a loop (or via reduce). You only need pairwise rules for combining each field.

That way you can write dispatches, like combine(::Missing, bfield, ::Val)=bfield, or

function combine(aauthors::Vector, bauthors::Vector, ::Val{:authors})
    if length(aauthors) > length(bauthors)
        aauthors
    else 
        bauthors
    end
end

Or a rule that says the shorter (nonzero) license text is better. Etc. To express the rules concisely and clearly. You want to consider how the code should be structured?

Global variables are also bad. Which I feel you should be well aware of. They are not being useful here at all; and are a serious code-smell.

I think maybe you want to take a few steps back, and take another swing at this PR. With a fresh branch from master. Think about what exactly the goals are; in a precise way. And then break it down a bit more; before you start coding. You could start you new PR, just with the changes to the Readme that will be required to explain to the user how to use the new feature you'll be implementing. Then in that PR you can outline your plans for what to change, and how, before you code it. Then I can give feedback on that.

SebastinSanty commented 5 years ago

I had made some very silly mistakes. Anyways, I have solved them. Tests are passing now. I have accounted to what you have said above. But I wasn't sure of how we can work with pairwise combining as that will lead to O(n^2), which can be easily be avoided. Please let me know your thoughts on the current version. Also the create_generators() will create Array of generators suitable for the purpose. Eg: If it is a DOI it makes sense to work with certain generators like DataCite, Figshare. There is no need to scan through all of the generators.

I'll continue this for multiple generators (try catch for different fields in generators is in a different branch), and the async part after your feedback.

oxinabox commented 5 years ago

Consider this: While it is pairwise, the result is made by comparing (combinging) the current result to the next in a sequence/

This the functional reduce (or fold) operation. (You can look it up on wikipedia or similar no doubt. Or in the julia docs)

julia> combine(::Missing, ::Missing) = missing
combine (generic function with 1 methods)

julia> combine(::Missing, x) = x
combine (generic function with 2 methods)

julia> combine(x, ::Missing) = x
combine (generic function with 3 methods)

julia> combine(x,y) = length(x) > length(y) ? x : y
combine (generic function with 4 methods)

julia> function combine_all(values)
               ret = missing
               for value in values
                       print("merging $ret \t with $value")
                       ret = combine(ret, value)
                       println("\t\t-------------> ", ret)
               end
               ret
       end
combine_all (generic function with 1 method)

julia> combine_all([missing, missing, "abcde", missing, "ab", "longest", "short"])
merging missing          with missing           -------------> missing
merging missing          with missing           -------------> missing
merging missing          with abcde             -------------> abcde
merging abcde            with missing           -------------> abcde
merging abcde            with ab                -------------> abcde
merging abcde            with longest           -------------> longest
merging longest          with short             -------------> longest
"longest"
oxinabox commented 5 years ago

Re: remove_cite_version That needs some comments explaining why, and I kind of feel we shouldn't do this -- that information is useful

I am guessing it is because the same generator, on different days, includes that information or doesn't include that information (I feel like I've seen that happen before). So not removing it breaks are test. If that is the case, then we should raise a bug upstream, that they are not returning consistent results. and we should do a workaround in the testing script only, to strip that string. Like how the github tests strips out all URLS

oxinabox commented 5 years ago

re: rule_dispatcher I think you are thinking on the right lines but take a good look at what documentation and Stackoverflow posts, you can find on "Dispatch On Value" in julia, -- that is things like Val{:authors), and maybe play around with solving a toy problem using it, to get a good feel for it as a tool to solve problems. It will make the code a fair bit cleaner.

oxinabox commented 5 years ago

Also the create_generators() will create Array of generators suitable for the purpose. Eg: If it is a DOI it makes sense to work with certain generators like DataCite, Figshare. There is no need to scan through all of the generators.

But why not scan through all the generators?

For some generators (e.g. JSON-LD) it is impossible to know if it will work without downloading the page. At which point it will throw an exception.

For others like all the ones that only eat DOIs, it will throw an exception if it doesn't match_doi, in which case it is basically takes the same amount of time as create_generators().

And we should be be able to use Async (or if it comes to it outright multiprocessing parallelism), to make checking extras take approximately the same wall-time as checking just one.

codecov-io commented 5 years ago

Codecov Report

Merging #47 into master will increase coverage by 0.18%. The diff coverage is 81.42%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #47      +/-   ##
==========================================
+ Coverage   93.24%   93.43%   +0.18%     
==========================================
  Files          18       18              
  Lines         370      411      +41     
==========================================
+ Hits          345      384      +39     
- Misses         25       27       +2
Impacted Files Coverage Δ
src/CKAN.jl 100% <100%> (+5.88%) :arrow_up:
src/utils.jl 80.95% <100%> (+4.48%) :arrow_up:
src/DataCite.jl 100% <100%> (+4.34%) :arrow_up:
src/JSONLD/JSONLD_Web.jl 100% <100%> (+10%) :arrow_up:
src/Figshare.jl 100% <100%> (+3.7%) :arrow_up:
src/JSONLD/JSONLD_DOI.jl 100% <100%> (ø) :arrow_up:
src/DataOneV2/DataOneV2.jl 46.66% <46.66%> (-7.18%) :arrow_down:
src/DataDepsGenerators.jl 90.74% <87.17%> (-0.93%) :arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 558fd00...a852b15. Read the comment docs.

oxinabox commented 5 years ago

Reposting this in the main thread so it doesn't get vanished by github:

Stick into utils.jl

"""
    leaf_subtypes(T)
Returns all the nonabstract types decedent from `T`.
"""
function leaf_subtypes(T)
       if isleaftype(T)
           T
       else
           vcat(leaf_subtypes.(subtypes(T))...)
       end
end

Into the main file for generation stick something like

function generate(repo::DataRepo, dataname; kwargs...)
    generate([repo], dataname; kwargs...)
end

function generate(dataname; kwargs...)
    all_repos = [T() for T in leaf_subtypes(DataRepo)]
    generate(all_repos , dataname; kwargs...)
end

function generate(repos, dataname, shortname = nothing, show_failures=false)
    retrieved_metadatas = MetaData[]
    failures = Tuple{DataRepo, Exception}[]

    # Get all the metadata we can
    @sync for repo in repos
        @async try
            metadata = find_metadata(repo, dataname, shortname)
            push!(retrieved_metadatas, metadata)
        catch err
            push!(failures, (repo, err))
        end
    end

    # Displace errors if required
    if length(retrieved_metadatas) == 0 || show_failures
        for (repo, err) in failures
            println(repo, " failed due to")
            println(err)
            println()
        end
    end

    # merge them
    aggregate(retrieved_metadatas)
end

I haven't tested this code. But hopefully you can get the idea.


Please consider and incorperate something like this. I believe this is the best way to handle calling multiple generators and allowing some to fail.

yes, it involved catching all exceptions. But that is ok at the top-level. It is an UX thing, since we are capturing them, and only displaying them if they matter (i.e. if all generators failed.)

oxinabox commented 5 years ago

Right the version I just pushed works. I disabled the Async. That is not the important part of this PR anyway. It may be an actual julia bug. If so, solving it is better to do after upgrading to julia 0.7. So lets leave it for now

Every thing else I said still need to be done still need to be done.

Also I know what is wrong with the JSON-LD for figshare and I've risen an issue. it can stay commented out for now.

SebastinSanty commented 5 years ago

Oh great! I was a bit confused too. Still I was reading up on https://docs.julialang.org/en/stable/manual/methods/index.html#man-methods-orthogonalize-1.

oxinabox commented 5 years ago

An excellent thing to be reading.