tecosaur / DataToolkit.jl

Reproducible, flexible, and convenient data management
https://tecosaur.github.io/DataToolkit.jl

REPL not identifying custom transformers #43

Open ansaardollie opened 3 weeks ago

ansaardollie commented 3 weeks ago

Hi there,

I am trying to implement my own custom transformer named customtransformer. However, when I run the ?: command in the REPL, it doesn't pick these transformers up.

I have a file called custom_transformers.jl which has the following content

using DataToolkitBase

function storage(provider::DataStorage{:customtransformer}, ::Type{IO}; write::Bool)
    ...
end

function load(provider::DataLoader{:customtransformer}, io::IO, sink::Type)
    ...
end

function save(provider::DataStorage{:customtransformer}, io::IO, tbl)
    ...
end

I execute this file in the current Julia session and then run

using DataToolkit

Then when trying to list the transformers (using the ?: command in the DataRepl) my custom transformer never shows up.

What is the process for letting DataToolkit.jl know about these custom transformers?

tecosaur commented 3 weeks ago

Currently the transformer list command only knows about transformers it's been explicitly told about, see what's currently done in DataToolkitCommon:

https://github.com/tecosaur/DataToolkit.jl/blob/dc8280f7b3aa35f7b3b2264441ac1ba2952cebe1/Common/src/DataToolkitCommon.jl#L101-L113

(NB: DataToolkitBase has been renamed to DataToolkitCore in the development version)

I don't currently see a nicer way of fetching the documentation, but I think I could probably check for undocumented transformers and mention them at the end of ?:. How does that sound?

tecosaur commented 3 weeks ago

I'm also planning on improving the docs a bit to make this a bit easier/soften the learning curve :slightly_smiling_face:

ansaardollie commented 3 weeks ago

Hi

Completely understand regarding the documentation for the repl. No worries, I've realized I've mis-explained the real issue.

Out of interest, have there been any major changes between v0.9.x and v0.10? I ask because my initial thought, in trying to get a handle on how everything works, was just to get dummy transformers working and see if the toolkit could recognize them. I've since at least got the system to pick up the driver names in Data.toml; however, I keep getting errors along the lines of:

UnsatisfyableTransformer: There are no storages for "cars" that can provide a .
 The defined storages are as follows:
   DataStorage{web}(IO)

I am trying to implement a Parquet driver, but I get issues as above. My basic approach has thus far been to create a Julia package and define all the loader logic inside it (a loader is the only transformer I've actually needed, since I can get the parquet files through https).

I've tried following the approach of the example on this page

My package file src/dtk_data.jl

module dtk_data
using DataToolkit, DataToolkitBase, DataToolkitCommon, DataFrames

export load, supportedtypes, create

function __init__()
    @addpkg Parquet2 "98572fba-bba0-415d-956f-fa77e587d26d"
    @addpkg DataFrames "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
end

function load(loader::DataLoader{:parquet}, io::IO, ::Type{DataFrame})
    @import Parquet2
    @import DataFrames
    return Parquet2.Dataset(io) |> DataFrames.DataFrame
end

supportedtypes(::Type{DataLoader{:parquet}}) =
    [QualifiedType(:DataFrames, :DataFrame)]

create(::Type{DataLoader{:parquet}}, source::String) =
    !isnothing(match(r"\.parquet$"i, source))

end # module dtk_data

Then I open a Julia session in the root directory of this package and run the following code

include("src/dtk_data.jl")

using .dtk_data

using DataToolkit

loadcollection!("Data.toml")

d"cars"

And my Data.toml has the following setup

data_config_version = 0
uuid = "74641622-11fb-438b-b7be-4626639b8eac"
name = "dtk_data"
plugins = ["store", "defaults", "memorise"]

[[cars]]
uuid = "a6cee431-bfa1-4690-b8f3-51de93d970f5"

    [[cars.storage]]
    url = "https://github.com/ansaardollie/dtk_data/raw/main/MT%20cars.parquet"
    type = "Base.IO"
    driver = "web"

    [[cars.loader]]
    driver = "parquet"
    type = "DataFrames.DataFrame"  

Then I get the following error

ERROR: UnsatisfyableTransformer: There are no storages for "cars" that can provide a .
 The defined storages are as follows:
   DataStorage{web}(IO)
Stacktrace:
  [1] _read(dataset::DataToolkitBase.DataSet, as::Type)
    @ DataToolkitBase ~\.julia\packages\DataToolkitBase\LJn9B\src\interaction\externals.jl:253
  [2] invokelatest(::Any, ::Any, ::Vararg{Any}; kwargs::@Kwargs{})   
    @ Base .\essentials.jl:887
  [3] invokelatest(::Any, ::Any, ::Vararg{Any})
    @ Base .\essentials.jl:884
  [4] invokepkglatest(::Any, ::Any, ::Vararg{Any}; kwargs::@Kwargs{})
    @ DataToolkitBase ~\.julia\packages\DataToolkitBase\LJn9B\src\model\usepkg.jl:101
  [5] invokepkglatest(::Any, ::Any, ::Vararg{Any})
    @ DataToolkitBase ~\.julia\packages\DataToolkitBase\LJn9B\src\model\usepkg.jl:100
  [6] (::DataToolkitBase.AdviceAmalgamation)(::Function, ::Any, ::Vararg{Any}; kwargs...)
    @ DataToolkitBase ~\.julia\packages\DataToolkitBase\LJn9B\src\model\advice.jl:102
  [7] (::DataToolkitBase.AdviceAmalgamation)(::Function, ::Any, ::Vararg{Any})
    @ DataToolkitBase ~\.julia\packages\DataToolkitBase\LJn9B\src\model\advice.jl:98
  [8] macro expansion
    @ ~\.julia\packages\DataToolkitBase\LJn9B\src\model\advice.jl:131 [inlined]
  [9] _dataadvisecall(::typeof(DataToolkitBase._read), ::DataToolkitBase.DataSet, ::Type{…}; kwargs::@Kwargs{})
    @ DataToolkitBase ~\.julia\packages\DataToolkitBase\LJn9B\src\model\advice.jl:131
 [10] read(dataset::DataToolkitBase.DataSet)
    @ DataToolkitBase ~\.julia\packages\DataToolkitBase\LJn9B\src\interaction\externals.jl:160
 [11] macro expansion
    @ ~\.julia\packages\DataToolkit\VObGv\src\DataToolkit.jl:48 [inlined]
 [12] top-level scope
    @ REPL[5]:1
Some type information was truncated. Use `show(err)` to see complete types.

Any help would be appreciated. I would love to be able to get a parquet driver working so I can hopefully contribute, if you'd like.

tecosaur commented 3 weeks ago

> Out of interest, have there been any major changes between v0.9.x to v0.10?

Yup! I'm making a few major changes (a changelog probably wouldn't hurt :sweat_smile:), such as:


Regarding the problem you've run into, it looks like you've given enough info for it to be a MWE. I'll see if I can give it a look in the next day or two, otherwise I'll probably get to it on the weekend :slightly_smiling_face:.

tecosaur commented 3 weeks ago

> however I keep getting errors along the lines of:

Good news, this error message is improved in 0.10-dev :slightly_smiling_face:

UnsatisfyableTransformer: There are no loaders for "cars" that can provide a DataFrames.DataFrame.

More good news, I think you'll find this works if you actually import the functions you want to overload

- export load, supportedtypes, create
+ import DataToolkitBase: load, supportedtypes, create
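
To illustrate why the import matters, here is a minimal sketch in plain Julia with no DataToolkit dependency. The modules Toolkit, WithoutImport and WithImport, and the dispatch function, are hypothetical stand-ins invented for this example and are not part of any DataToolkit API. The sketch assumes only standard Julia semantics: a function defined in a module without being imported is a new function, so the library never sees its methods.

```julia
# Hypothetical stand-in for a library such as DataToolkitBase.
module Toolkit
    # Fallback method, used when nobody has added a more specific one.
    load(x) = "no loader for $(nameof(typeof(x)))"
    # The library always calls its *own* `load` function.
    dispatch(x) = load(x)
end

# Without an import, this defines a brand new function WithoutImport.load.
module WithoutImport
    using ..Toolkit
    struct Parquet end
    load(::Parquet) = "parquet loaded"
end

# With an explicit import, the same definition adds a method to Toolkit.load.
module WithImport
    using ..Toolkit
    import ..Toolkit: load
    struct Parquet end
    load(::Parquet) = "parquet loaded"
end

println(Toolkit.dispatch(WithoutImport.Parquet()))  # falls back to the default method
println(Toolkit.dispatch(WithImport.Parquet()))     # reaches the new method
println(WithImport.load === Toolkit.load)           # same function object
println(WithoutImport.load === Toolkit.load)        # different function object
```

Exporting a name only affects what other modules see after using your module; it does not attach your methods to a function owned by another package. Writing the definition with a qualified name (for example `Toolkit.load(::Parquet) = ...` in this sketch) is an equivalent alternative to the import.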

It would be great to see a Parquet driver. I should have some docs on adding a loader to DataToolkitCommon in the next week or so.

ansaardollie commented 3 weeks ago

Awesome, thank you so much for the update.

Out of interest, how would one add the v0.10-dev versions of the packages to my Julia environment using the monorepo link?