Closed SebastinSanty closed 6 years ago
My thoughts are: #30 in general gives amazing results, but has various attributes missing. However, all of them mention the source of the data explicitly. What if we have individual types like `Figshare()` and `Dataverse()`, and a general `JSONLD()` which gets the JSON-LD, parses the source, and passes it to one of the types we have already integrated?
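A rough sketch of that dispatch idea, under some loud assumptions: the JSON-LD is already parsed into a dictionary, the source can be read from a `"url"` field, and matching on a host-name fragment is good enough. `detect_repo`, `KNOWN_SOURCES`, and the empty stand-in types are hypothetical names for illustration, not the package's actual API.

```julia
# Stand-in types; the real package defines its own repo types.
abstract type DataRepo end
struct Figshare <: DataRepo end
struct Dataverse <: DataRepo end
struct JSONLD <: DataRepo end

# Assumed mapping from a fragment of the source URL to a repo type.
const KNOWN_SOURCES = ["figshare" => Figshare, "dataverse" => Dataverse]

# Pick the specialised repo type named by the JSON-LD source,
# falling back to the generic JSONLD() extractor when nothing matches.
function detect_repo(jsonld::AbstractDict)
    # Assumption: the source is recorded under the "url" key.
    source = lowercase(string(get(jsonld, "url", "")))
    for (fragment, repotype) in KNOWN_SOURCES
        occursin(fragment, source) && return repotype()
    end
    return JSONLD()
end
```

With this, `detect_repo(Dict("url" => "https://figshare.com/articles/x"))` would hand the work to `Figshare()`, and anything unrecognised stays with the generic `JSONLD()` path.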
I had a perhaps slightly more ambitious thought along those lines. Or perhaps just a bit more silly.
The notion is to define a version of `generate` that doesn't take a repo and instead tries all of them (in parallel), assesses the returned Metadata** of those that don't throw exceptions for having the least missing fields and for having the important fields present (like the download URL), and selects the best. Or maybe merges them. Then generates the code.
Pseudocode follows:

    function generate(dataname)
        failures = []
        results = []
        asyncmap(all_subtypes(DataRepo)) do repotype
            try
                metadata = find_metadata(repotype(), dataname)
                score = evaluate_metadata(metadata)
                push!(results, (score, metadata))
            catch err
                # We want to remember the failure,
                # so if we can't find any that work, we can explain what we tried,
                # and people can make good bug reports if one of those should have worked
                push!(failures, (repotype, err))
            end
        end
        if length(results) > 0
            # We got (at least) one
            sort!(results, by = first, rev = true) # highest score first
            _, metadata = first(results) # get the best
            generate_code(metadata)
        else
            # Got none
            println("We were unable to find usable metadata for $dataname")
            println("We tried the following methods, and they failed for the reason listed")
            println()
            for (repo, err) in failures
                println("# $repo")
                println(err)
                println()
            end
        end
    end

I think it is more general to just try everything and see what works and how well it does, rather than rely on using one API (which may or may not exist or be complete) to discover a better one. Running them asynchronously should be pretty fast, as it won't sit idle waiting for each page to download in turn.
I guess you could call this the "Throw everything at the wall and see what sticks" method of finding the metadata.
(**We'd probably need to add more optional fields to the Metadata type, like all the fairly standard stuff that is currently in the free-text description.)
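To make the scoring step a bit more concrete, here is a minimal sketch of what `evaluate_metadata` could look like once the fields are optional. The `Metadata` struct below is a simplified, hypothetical stand-in (its field names are assumptions, not the package's real definition), and the weights of 10 and 1 are arbitrary placeholders.

```julia
# Hypothetical, simplified stand-in for the package's Metadata type:
# every field is optional and defaults to `missing`.
Base.@kwdef struct Metadata
    fulltitle::Union{String, Missing} = missing
    website::Union{String, Missing} = missing
    description::Union{String, Missing} = missing
    license::Union{String, Missing} = missing
    dataurls::Union{Vector{String}, Missing} = missing
end

# Assumption: the download URLs are the only "important" field for now.
const IMPORTANT_FIELDS = (:dataurls,)

# Higher is better: 1 point per field that is present,
# 10 points if that field is one of the important ones.
function evaluate_metadata(metadata::Metadata)
    score = 0
    for name in fieldnames(Metadata)
        ismissing(getfield(metadata, name)) && continue
        score += (name in IMPORTANT_FIELDS) ? 10 : 1
    end
    return score
end
```

The point of the heavy weight is that a result with a title and a download URL should beat one with a title, website, and description but nothing to download.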
Merging #34 into master will increase coverage by 0.13%. The diff coverage is 94.44%.
@@ Coverage Diff @@
## master #34 +/- ##
==========================================
+ Coverage 93.93% 94.07% +0.13%
==========================================
Files 14 15 +1
Lines 264 287 +23
==========================================
+ Hits 248 270 +22
- Misses 16 17 +1
Impacted Files | Coverage Δ | |
---|---|---|
src/DataOneV2/KNB.jl | 80% <ø> (ø) | :arrow_up: |
src/CKAN.jl | 95% <ø> (ø) | :arrow_up: |
src/DataOneV1.jl | 100% <ø> (ø) | :arrow_up: |
src/DataDepsGenerators.jl | 94.59% <100%> (+3.16%) | :arrow_up: |
src/DataCite.jl | 95.23% <100%> (-0.92%) | :arrow_down: |
src/Figshare.jl | 93.33% <93.33%> (ø) | |
src/generic_extractors.jl | 100% <0%> (ø) | :arrow_up: |
... and 1 more | | |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 28d4559...561ef72.
Yes, yours sounds more ambitious with the asynchronous part. Sounds interesting and I am up for it.
> tries all of them (in parallel)
But even to create that, wouldn't I need the individual parts (like `DataCite`, `Figshare`) to be working? Wouldn't implementing all of them separately and then clubbing them together, as you have noted, be the better way to proceed?
> Wouldn't implementing all of them separately and then clubbing them together as you have noted be the better way to proceed?
Yes, that is what I meant. Make them all separately, and club together the ones that don't throw exceptions for a given identifier.
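If we went with merging rather than picking a single winner, the clubbing-together could be sketched roughly as below. This assumes each successful repo's metadata is represented as a dictionary from field name to value (with `missing` for absent fields) and that the candidates arrive ordered best first; `merge_metadata` is a hypothetical helper, not something that exists in the package.

```julia
# Merge metadata from all repos that did not throw.
# For each field, keep the first non-missing value, so earlier
# (better-scored) candidates win and later ones only fill gaps.
function merge_metadata(candidates)
    merged = Dict{Symbol, Any}()
    for candidate in candidates
        for (field, value) in candidate
            ismissing(value) && continue
            get!(merged, field, value) # only stored if the field is not set yet
        end
    end
    return merged
end
```

So if the best candidate has a title but no download URLs, and a lower-scored one has the URLs, the merged result would carry the title from the first and the URLs from the second.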
Yes, this is being done using the Figshare API. I thought of doing OAI-PMH afterwards, as you were more keen to get Figshare and Dataverse integrated.