tecosaur / DataToolkit.jl

Reproducible, flexible, and convenient data management
https://tecosaur.github.io/DataToolkit.jl

Convenience for a script processing datasets into new dataset #44

Open Datseris opened 3 months ago

Datseris commented 3 months ago

Hi there,

Inspired by the existing functionality of "generating a new dataset from a given pipeline" described here: https://tecosaur.github.io/DataToolkit.jl/main/tutorial/#Cleaning-the-data, and also by the DrWatson.produce_or_load functionality, I have a proposal for something that is essentially a merger of the two.

In my work I produce a "derived" dataset, similar to what is done in the DataToolkit.jl tutorial, and I am searching for a minimally invasive way to transform the following script into something DataToolkit.jl-compatible. Let's say I have this:

```julia
using PkgA, PkgB, ...

X = load(dataset1_path)
Y = load(dataset2_path)
...

W = produce_new_dataset_from_others(X, Y, ...)

save(datasetW_path, W)
```

Given this script, how do I leverage DataToolkit.jl so that the dataset W is re-created on demand, but only when any of the input datasets is modified? Let's assume that I have already transformed X, Y, ... into DataToolkit.jl data entries, as the docs make clear how to do that.
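In other words, the behaviour I am after is roughly the following, written as plain-Julia pseudocode with the same placeholder names as above (no DataToolkit API implied):

```julia
# Re-run the expensive step only when W is missing or any input file
# is newer than the saved W; otherwise just keep the existing file.
if !isfile(datasetW_path) || any(mtime(p) > mtime(datasetW_path)
                                 for p in (dataset1_path, dataset2_path))
    X = load(dataset1_path)
    Y = load(dataset2_path)
    W = produce_new_dataset_from_others(X, Y)
    save(datasetW_path, W)
end
```

That is essentially what DrWatson.produce_or_load does, except that here I would like DataToolkit.jl to track the inputs for me rather than comparing file timestamps by hand.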

tecosaur commented 5 days ago

It's not exactly what we talked about, but this seems like a good place to note that the API in v0.10 is beginning to make programmatic DataSet creation feasible without being an ugly mess.

Sample

```julia
using DataToolkit  # the programmatic API shown here is from v0.10

const REGISTRY_URL = "https://pkg.julialang.org"

# Read the list of package sources from an existing collection.
const regdata = loadcollection!(joinpath(@__DIR__, "RegistryData.toml"))
const pkgsources = dataset(regdata, "PackageSources") |> read

# Build a new collection programmatically.
const pkgdata = DataCollection("PackageData", plugins = ["defaults", "store"])

for (name, uuid, url, hash) in pkgsources
    # One DataSet per package: the gzipped tarball fetched from the registry,
    # run through chained gzip + tar loaders to obtain the untarred source files.
    pkgfiles = create!(pkgdata, DataSet, name,
                       "description" => "The source files of the package $name.")
    storage!(pkgfiles, :web, "url" => "$REGISTRY_URL/package/$uuid/$hash")
    loader!(pkgfiles, :chain, "loaders" => ["gzip", "tar"])
    # A second DataSet per package: the strings extracted from those files,
    # produced by a small Julia script acting on the files dataset.
    pkgstrs = create!(pkgdata, DataSet, name * " strings",
                      "description" => "The strings extracted from the source files of the package $name.")
    storage!(pkgstrs, :passthrough,
             "source" => string(Identifier(pkgfiles)),
             "type" => Dict{String, IO})
    loader!(pkgstrs, :julia,
            "input" => Dict{String, IO},
            "path" => "Data.d/extract_string.jl",
            "type" => Vector{String})
end

# Serialise the generated collection to a Data TOML file.
write(joinpath(@__DIR__, "PackageData.toml"), pkgdata)
```
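For completeness, reading one of the generated datasets back should then just be a matter of loading the written collection and asking for the dataset by name, something like:

```julia
# Hypothetical follow-up: load the freshly written collection and read one of
# the ~20k generated datasets, reusing the loadcollection!/dataset calls above.
pkgdata_reloaded = loadcollection!(joinpath(@__DIR__, "PackageData.toml"))
drwatson_strings = dataset(pkgdata_reloaded, "DrWatson strings") |> read
```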
Datseris commented 5 days ago

thanks, perhaps you can attach a text description of what the script does?

tecosaur commented 4 days ago

Sure! It takes a list of pkgsources (gzip'd Julia package source tarballs) and, for each one, generates a DataSet for the untarred content and another DataSet for all the strings in that package. For example:

```
(RegistryData) data> stack list
 #  Name          Datasets  Writable  Plugins
 ─────────────────────────────────────────────────────────────────────
 1  RegistryData  2         yes       cache, defaults, memorise, store
 2  PackageData   19849     yes       defaults, store

julia> d"DrWatson strings"
1049-element Vector{String}:
 "<NAME-PLACEHOLDER>"
 "dummy_src_file.jl"
 "\nCurrently active project is: \$" ⋯ 192 bytes ⋯ "ening your own Pull Requests!\n"
 "double"
 "a=0.1535_b=5_mode=double"
 "n_a=0.153_b=5_mode=double"
 "n"
 ⋮
 "."
 ""
 "jld2"
 "tmp"
 "_research"
 "\n    tmpsave(dicts::Vector{Dict" ⋯ 635 bytes ⋯ " to wsave (e.g. compression).\n"
```

Sample of the generated Data TOML:

```toml
[[DrWatson]]
uuid = "c36fd30f-9fa2-469d-8eb2-3a5f86ad49a6"
description = "The source files of the package DrWatson."

[[DrWatson.storage]]
driver = "web"
url = "https://pkg.julialang.org/package/634d3b9d-ee7a-5ddf-bec9-22491ea816e1/32704fb48e1ecd3739d5018df35282237b823f0a"

[[DrWatson.loader]]
driver = "chain"
loaders = ["gzip", "tar"]

[["DrWatson strings"]]
uuid = "9d4121b5-e042-40bd-839e-631dfb4f7a31"
description = "The strings extracted from the source files of the package DrWatson."

[["DrWatson strings".storage]]
driver = "passthrough"
source = "PackageData:DrWatson"
type = "Dict{String,IO}"

[["DrWatson strings".loader]]
driver = "julia"
input = "Dict{String,IO}"
path = "Data.d/extract_string.jl"
type = "Array{String,1}"
```

I think a layer of convenience on top of this that gets us closer to produce_or_load might be something like:

@jldataset "Name" function(a = d"input1", b = d"input2")::Int
   a * b # (pure) code that produces the result
end

Allowing d"Name" to be used in subsequent code.