tecosaur / DataToolkit.jl

Reproducible, flexible, and convenient data management
https://tecosaur.github.io/DataToolkit.jl

Documentation on how to add a new transformer #17

Open ansaardollie opened 3 months ago

ansaardollie commented 3 months ago

Hi there,

Just wondering if you could provide documentation on how to go about creating custom storage, loaders, and writers. The documentation just says to implement the required functions; however, when I use the following code

function storage(provider::DataStorage{:rijl}, ::Type{IO}; write::Bool)
    println("Inside storage(provider::DataStorage{:rijl}, ::Type{IO}; write::Bool)")
    qry = getquery(provider)

    fn = createname(qry)

    # cache_dir = DataToolkitBase.config_get("cache_dir")
    cache_dir = "./cache"

    tpath = joinpath(cache_dir, fn)

    if isfile(tpath) && write
        cached_bytes = open(tpath, "w")
        return cached_bytes
    elseif isfile(tpath) && !(write)
        cached_bytes = open(tpath, "r")
        return cached_bytes
    else
        return run_query_and_cache(qry, tpath; write=write)
    end

end
function load(loader::DataLoader{:rijl}, source::Type{IO}, as::Type{DataFrame})
    println("Inside load(loader::DataLoader{:rijl}, source::Type{IO}, as::Type{DataFrame})")
    pds = Parquet2.Dataset(source)
    abf = IOBuffer()

    Arrow.write(abf, pds)

    abb = take!(abf)

    df = Arrow.Table(abb) |> DataFrame

    return df
end

And then use the following dataset in the Data.toml file

data_config_version = 0
uuid = "f812338f-4069-46dc-8bb8-dba7cb5e1ae5"
name = "RiData"
plugins = ["store", "defaults", "memorise"]

[[Test]]
uuid = "3fb5d56a-63d2-4474-b4ae-4d824a2d6b2a"

[[Test.storage]]
driver = "rijl"
type = "DataStorage{:rijl}"
query = "SELECT * FROM TABLE"

[[Test.loader]]
driver = "rijl"
type = "DataStorage{:rijl}"

I get the following error


ERROR: UnsatisfyableTransformer: There are no storages for "Test" that can provide a .
 The defined storages are as follows:
   DataStorage{rijl}(DataStorage{:rijl})

Could you please help me? I really love the idea of this package and want to incorporate it into a few different data pipelines, but I cannot seem to get the basics down.

tecosaur commented 2 months ago

Hi @ansaardollie, sorry for the delay but I'd be happy to help!

The main problem I see with the code you've shared is the value you've given the type parameter in the TOML. It should be set to the Julia type of the information produced by the loader/storage backend, e.g. IO, String, DataFrame. When there's only one option, you can also just omit it entirely.
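
For illustration, here is a sketch of how the dataset entry from the question might look with that change applied. It assumes the storage method above is meant to produce an IO and the load method a DataFrame, as their signatures suggest; the uuid, driver, and query values are copied unchanged from the original post, and nothing else about the setup has been verified.

```toml
[[Test]]
uuid = "3fb5d56a-63d2-4474-b4ae-4d824a2d6b2a"

[[Test.storage]]
driver = "rijl"
# Assumption: the custom storage method yields an IO, so that is the type named here.
type = "IO"
query = "SELECT * FROM TABLE"

[[Test.loader]]
driver = "rijl"
# Assumption: the custom load method returns a DataFrame.
type = "DataFrame"
```

Since each transformer here has only one plausible output type, the two type lines could probably be dropped altogether, per the note above.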