Closed SebastinSanty closed 5 years ago
For your quick reference, these are the datatypes I am returning:
```julia
String
String
String
String
Array{AbstractString,1}
String
String
String
String
String
Array{Any,1}
String
```
And these are the metadata datatypes:
```julia
shortname::Opt{AbstractString}
fullname::Opt{AbstractString}
website::Opt{AbstractString}
description::Opt{AbstractString}
author::Opt{Vector{AbstractString}}
maintainer::Opt{AbstractString}
license::Opt{AbstractString}
published_date::Opt{Union{Date, DateTime, AbstractString}}
create_date::Opt{Union{Date, DateTime, AbstractString}}
modified_date::Opt{Union{Date, DateTime, AbstractString}}
paper_cite::Opt{AbstractString}
dataset_cite::Opt{AbstractString}
dataurls::Opt{Vector}
datachecksums::Any
```
Hi @oxinabox, I think this may have missed your mailbox. A small review would be appreciated :)
This is not great.
You need to make sure your code is consistent and logical about how things of different types are combined. This means that merging two `missing`s should give a `missing`, not an empty string. Similarly for merging vectors. This is why you are getting type errors.
In general you really want to be constructing a function that takes in something like `combine(afield, bfield, Val{fieldname}())`, which you could then invoke in a loop (or via `reduce`).
You only need pairwise rules for combining each field.
That way you can write dispatches like `combine(::Missing, bfield, ::Val) = bfield`, or:
```julia
function combine(aauthors::Vector, bauthors::Vector, ::Val{:authors})
    if length(aauthors) > length(bauthors)
        aauthors
    else
        bauthors
    end
end
```
Or a rule that says the shorter (nonzero-length) license text is better, etc. This lets you express the rules concisely and clearly. You should consider how the code ought to be structured.
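Such a license rule could be sketched like this (a hypothetical example following the `combine(afield, bfield, Val{fieldname}())` convention above; the empty-string guards are my guess at what "nonzero" means):

```julia
# Hypothetical sketch: prefer the shorter nonempty license text.
# Follows the combine(afield, bfield, Val{fieldname}()) convention.
function combine(alicense::AbstractString, blicense::AbstractString, ::Val{:license})
    isempty(alicense) && return blicense   # empty text never wins
    isempty(blicense) && return alicense
    length(alicense) <= length(blicense) ? alicense : blicense
end
```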
Global variables are also bad, which I feel you should be well aware of. They are not useful here at all, and are a serious code smell.
I think maybe you want to take a few steps back and take another swing at this PR, with a fresh branch from master. Think about what exactly the goals are, in a precise way, and then break it down a bit more before you start coding. You could start your new PR with just the changes to the README that will be required to explain to the user how to use the new feature you'll be implementing. Then, in that PR, you can outline your plans for what to change, and how, before you code it. Then I can give feedback on that.
I had made some very silly mistakes. Anyway, I have solved them. Tests are passing now. I have accounted for what you said above, but I wasn't sure how we can work with pairwise combining, as that will lead to O(n^2), which can easily be avoided. Please let me know your thoughts on the current version.

Also, `create_generators()` will create an Array of generators suitable for the purpose. E.g. if it is a DOI, it makes sense to work with certain generators like DataCite and Figshare; there is no need to scan through all of the generators.

I'll continue this for multiple generators (`try`/`catch` for different fields in generators is in a different branch), and the async part, after your feedback.
Consider this: while it is pairwise, the result is made by comparing (combining) the current result to the next in a sequence. This is the functional `reduce` (or `fold`) operation. (You can look it up on Wikipedia or similar, no doubt, or in the Julia docs.)
```julia
julia> combine(::Missing, ::Missing) = missing
combine (generic function with 1 method)

julia> combine(::Missing, x) = x
combine (generic function with 2 methods)

julia> combine(x, ::Missing) = x
combine (generic function with 3 methods)

julia> combine(x, y) = length(x) > length(y) ? x : y
combine (generic function with 4 methods)

julia> function combine_all(values)
           ret = missing
           for value in values
               print("merging $ret \t with $value")
               ret = combine(ret, value)
               println("\t\t-------------> ", ret)
           end
           ret
       end
combine_all (generic function with 1 method)

julia> combine_all([missing, missing, "abcde", missing, "ab", "longest", "short"])
merging missing 	 with missing		-------------> missing
merging missing 	 with missing		-------------> missing
merging missing 	 with abcde		-------------> abcde
merging abcde 	 with missing		-------------> abcde
merging abcde 	 with ab		-------------> abcde
merging abcde 	 with longest		-------------> longest
merging longest 	 with short		-------------> longest
"longest"
```
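The explicit loop above is exactly a left fold, so on Julia ≥ 0.7 `combine_all` could be collapsed to a one-liner (a sketch; the `init` keyword spelling is the modern one, and the `combine` methods are repeated here to keep it self-contained):

```julia
combine(::Missing, ::Missing) = missing
combine(::Missing, x) = x
combine(x, ::Missing) = x
combine(x, y) = length(x) > length(y) ? x : y

# A left fold seeded with `missing` does the same job as the loop above.
combine_all(values) = foldl(combine, values; init=missing)

combine_all([missing, missing, "abcde", missing, "ab", "longest", "short"])  # "longest"
```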
Re: `remove_cite_version`

That needs some comments explaining why, and I kind of feel we shouldn't do this -- that information is useful.

I am guessing it is because the same generator, on different days, includes that information or doesn't (I feel like I've seen that happen before), so not removing it breaks our test. If that is the case, then we should raise a bug upstream, that they are not returning consistent results, and we should do a workaround in the testing script only, to strip that string -- like how the GitHub tests strip out all URLs.
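Such a testing-script workaround might look something like this (entirely hypothetical; the helper name and regex are illustrative, not the actual test-suite code):

```julia
# Hypothetical: normalize the generated text before comparing in tests,
# analogous to how the GitHub tests strip out all URLs.
strip_cite_version(text::AbstractString) =
    replace(text, r"Version \d+(\.\d+)*\.?\s*" => "")
```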
Re: `rule_dispatcher`

I think you are thinking along the right lines, but take a good look at what documentation and Stack Overflow posts you can find on "dispatch on value" in Julia -- that is, things like `Val{:authors}` -- and maybe play around with solving a toy problem using it, to get a good feel for it as a tool to solve problems. It will make the code a fair bit cleaner.
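For instance, a toy problem with dispatch on value, unrelated to this PR's actual fields, might look like:

```julia
# Toy example of "dispatch on value": the Val{:symbol}() argument selects
# a method at dispatch time, so each field rule is its own small method.
describe(x, ::Val{:authors}) = "by " * join(x, ", ")
describe(x, ::Val{:license}) = "license: " * x
describe(x, ::Val) = string(x)   # fallback for any other field name
```

Calling `describe(["Ada", "Grace"], Val{:authors}())` then picks the authors rule without any if/else chain.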
> Also the `create_generators()` will create an Array of generators suitable for the purpose. E.g. if it is a DOI, it makes sense to work with certain generators like DataCite and Figshare; there is no need to scan through all of the generators.
But why not scan through all the generators? For some generators (e.g. JSON-LD) it is impossible to know whether they will work without downloading the page, at which point they will throw an exception. For others, like all the ones that only eat DOIs, it will throw an exception if it doesn't `match_doi`, in which case it basically takes the same amount of time as `create_generators()`. And we should be able to use async (or, if it comes to it, outright multiprocessing parallelism) to make checking the extras take approximately the same wall time as checking just one.
Merging #47 into master will increase coverage by 0.18%. The diff coverage is 81.42%.
```diff
@@            Coverage Diff             @@
##           master      #47      +/-   ##
==========================================
+ Coverage   93.24%   93.43%   +0.18%
==========================================
  Files          18       18
  Lines         370      411      +41
==========================================
+ Hits          345      384      +39
- Misses         25       27       +2
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/CKAN.jl | 100% <100%> (+5.88%) | :arrow_up: |
| src/utils.jl | 80.95% <100%> (+4.48%) | :arrow_up: |
| src/DataCite.jl | 100% <100%> (+4.34%) | :arrow_up: |
| src/JSONLD/JSONLD_Web.jl | 100% <100%> (+10%) | :arrow_up: |
| src/Figshare.jl | 100% <100%> (+3.7%) | :arrow_up: |
| src/JSONLD/JSONLD_DOI.jl | 100% <100%> (ø) | :arrow_up: |
| src/DataOneV2/DataOneV2.jl | 46.66% <46.66%> (-7.18%) | :arrow_down: |
| src/DataDepsGenerators.jl | 90.74% <87.17%> (-0.93%) | :arrow_down: |
Continue to review the full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 558fd00...a852b15. Read the comment docs.
Stick this into utils.jl:
```julia
"""
    leaf_subtypes(T)

Returns all the nonabstract types descended from `T`.
"""
function leaf_subtypes(T)
    if isleaftype(T)
        [T]   # wrap, so the result is always a Vector
    else
        vcat(leaf_subtypes.(subtypes(T))...)
    end
end
```
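On a toy hierarchy it would behave like this (a sketch; on Julia ≥ 0.7, `isleaftype` became `isconcretetype` and `subtypes` lives in `InteractiveUtils`, so those names are used here):

```julia
using InteractiveUtils: subtypes

abstract type Animal end
abstract type Bird <: Animal end
struct Sparrow <: Bird end
struct Dog <: Animal end

# Same recursion as above, spelled with the post-0.7 names:
# recurse through abstract types, collecting the concrete leaves.
leaf_subtypes(T) = isconcretetype(T) ? [T] : vcat(leaf_subtypes.(subtypes(T))...)

leaf_subtypes(Animal)  # the concrete leaves: Sparrow and Dog
```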
Into the main file for generation, stick something like:
```julia
function generate(repo::DataRepo, dataname; kwargs...)
    generate([repo], dataname; kwargs...)
end

function generate(dataname; kwargs...)
    all_repos = [T() for T in leaf_subtypes(DataRepo)]
    generate(all_repos, dataname; kwargs...)
end

# Note the semicolon: shortname and show_failures must be keyword
# arguments, since the methods above forward them via `kwargs...`.
function generate(repos, dataname; shortname = nothing, show_failures = false)
    retrieved_metadatas = MetaData[]
    failures = Tuple{DataRepo, Exception}[]
    # Get all the metadata we can
    @sync for repo in repos
        @async try
            metadata = find_metadata(repo, dataname, shortname)
            push!(retrieved_metadatas, metadata)
        catch err
            push!(failures, (repo, err))
        end
    end
    # Display errors if required
    if length(retrieved_metadatas) == 0 || show_failures
        for (repo, err) in failures
            println(repo, " failed due to")
            println(err)
            println()
        end
    end
    # Merge them
    aggregate(retrieved_metadatas)
end
```
I haven't tested this code. But hopefully you can get the idea.
Please consider and incorporate something like this. I believe this is the best way to handle calling multiple generators and allowing some to fail.

Yes, it involves catching all exceptions, but that is OK at the top level. It is a UX thing: since we are capturing them, we only display them if they matter (i.e. if all generators failed).
Right, the version I just pushed works. I disabled the async; that is not the important part of this PR anyway. It may be an actual Julia bug. If so, solving it is better done after upgrading to Julia 0.7, so let's leave it for now.
Everything else I said still needs to be done.
Also, I know what is wrong with the JSON-LD for Figshare, and I've raised an issue. It can stay commented out for now.
Oh great! I was a bit confused too. Still, I was reading up on https://docs.julialang.org/en/stable/manual/methods/index.html#man-methods-orthogonalize-1.
An excellent thing to be reading.
@oxinabox I am not sure, but the datatypes are not matching for some reason. Can you have a look?
PS: I am running only `include("test/CKAN.jl")` and only one generator for now, without any async part, so as to debug and localize errors easily. Also, your reviews on the implementation will be really helpful.