tecosaur / DataToolkit.jl

Reproducible, flexible, and convenient data management
https://tecosaur.github.io/DataToolkit.jl

Tutorial for adding a new data loader #37

Open jfb-h opened 5 months ago

jfb-h commented 5 months ago

As discussed in tecosaur/DataToolkitCommon.jl#10, here is a short docs writeup of the process of creating the Arrow loader as an example of how to add a loader to the package. Let me know what you think and of course feel free to adapt, extend or rephrase! (I would have made a PR but don't understand how the docs work).

Tutorial: Adding a new loader/writer

In case your favourite data format is not yet supported by DataToolkit, fret not! It is relatively straightforward to add a new loader to the package, and PRs adding new loaders are welcome. The following briefly outlines the process, using the loader/writer for the Arrow format as an example.

Step 1: Adding the main loader / writer functions

Each loader has its own file in the src/transformers/saveload directory. So as a first step, we add a new file arrow.jl there and make sure to include("transformers/saveload/arrow.jl") in src/DataToolkitCommon.jl, next to the other loader files.

We're now ready to add methods to the main package functions responsible for loading and saving data, which are aptly called load and save. Starting with the loader, we add a method to load which dispatches on DataLoader{:arrow}, takes an IO and allows the specification of a sink type to read data into, e.g. a DataFrame. The final function looks as follows:

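function load(loader::DataLoader{:arrow}, io::IO, sink::Type)
    @import Arrow
    # Read the "convert" parameter from Data.toml, defaulting to true.
    convert = @getparam loader."convert"::Bool true
    # Read the table, then pipe it through the conversion chosen by `sink`.
    result = Arrow.Table(io; convert) |>
        if sink == Any || sink == Arrow.Table
            identity  # return the Arrow.Table as-is
        elseif QualifiedType(sink) == QualifiedType(:DataFrames, :DataFrame)
            sink      # call the sink type on the table to convert it
        end
    result
end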

This function includes four things:

  1. An @import statement for the Arrow package, which we use for reading a .arrow file.
  2. Use of the @getparam macro to obtain arguments to the wrapped loader function (Arrow.Table, in our case) from the Data.toml file and to set their defaults. Here, we just need to specify the single convert argument, but in principle, there can be many.
  3. Reading the data from io, most likely using a package and including the arguments obtained in item 2 (here: Arrow.Table(io; convert)).
  4. Conversion to the specified sink type. Note the use of QualifiedType, which names a type by module and type name (here :DataFrames and :DataFrame) rather than referring to the type directly. The supported sink types themselves need to be specified separately.
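The dispatch pattern used by load can be tried in isolation. The following is a minimal sketch and not DataToolkit code: ToyLoader, toyload, and RowTable are hypothetical stand-ins, chosen so that the snippet runs without DataToolkit or Arrow installed. It only illustrates how a Symbol type parameter selects a method and how the sink argument decides what is returned.

```julia
# Hypothetical stand-in for DataLoader: the Symbol type parameter names the format.
struct ToyLoader{format} end

# Hypothetical stand-in for the table type a loader returns natively.
struct RowTable
    rows::Vector{Vector{String}}
end

# This method is only selected for ToyLoader{:csvish};
# every other format would get its own method.
function toyload(::ToyLoader{:csvish}, io::IO, sink::Type)
    # Read the data from io into the "native" representation.
    native = RowTable([String.(split(line, ";")) for line in eachline(io)])
    # Convert to the requested sink type, or return the native result.
    if sink == Any || sink == RowTable
        native
    elseif sink == Vector{Vector{String}}
        native.rows
    else
        throw(ArgumentError("unsupported sink type: $sink"))
    end
end

loader = ToyLoader{:csvish}()
table = toyload(loader, IOBuffer("a;b\n1;2\n"), RowTable)
rows = toyload(loader, IOBuffer("a;b\n1;2\n"), Vector{Vector{String}})
```

Unlike the Arrow loader, this sketch throws an explicit error for an unsupported sink; that is a choice made for the example, not something the original function does.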

The sink types supported by the loader and resolved in item 4 are declared by adding a method to the supportedtypes function. Here, we specify two possible return types: Arrow.Table, which is returned natively by the Arrow.jl package, and DataFrame from the DataFrames.jl package:

supportedtypes(::Type{DataLoader{:arrow}}) =
    [QualifiedType(:DataFrames, :DataFrame),
     QualifiedType(:Arrow, :Table)]

The writer follows a similar overall structure: @import the necessary packages, obtain the writer arguments using @getparam, and then write the data in tbl to io. Here's the save method for the arrow writer:

function save(writer::DataWriter{:arrow}, io::IO, tbl)
    @import Arrow
    compress         = @getparam writer."compress"::Union{Symbol, Nothing} nothing
    alignment        = @getparam writer."alignment"::Int 8
    dictencode       = @getparam writer."dictencode"::Bool false
    dictencodenested = @getparam writer."dictencodenested"::Bool false
    denseunions      = @getparam writer."denseunions"::Bool true
    largelists       = @getparam writer."largelists"::Bool false
    maxdepth         = @getparam writer."maxdepth"::Int 6
    ntasks           = @getparam writer."ntasks"::Int Int(typemax(Int32))
    Arrow.write(
        io, tbl;
        compress, alignment,
        dictencode, dictencodenested,
        denseunions, largelists,
        maxdepth, ntasks)
end
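The repeated @getparam lines all follow one pattern: look up a named parameter, fall back to a default, and insist on a type. A rough sketch of that pattern in plain Julia is shown here. The getparam function below is a hypothetical helper written for illustration, and it is an assumption that the real macro behaves similarly; its actual expansion and error type are not shown in this writeup.

```julia
# Hypothetical helper mimicking a lookup-with-default and type check.
# This is NOT the @getparam macro, only an illustration of the pattern.
function getparam(parameters::AbstractDict, name::AbstractString, ::Type{T}, default) where {T}
    value = get(parameters, name, default)
    value isa T || throw(ArgumentError(
        "parameter \"$name\" must be a $T, got $(typeof(value))"))
    value
end

# Parameters as they might be read from a configuration file (made-up values).
params = Dict{String, Any}("alignment" => 16)

alignment = getparam(params, "alignment", Int, 8)   # present, so the default is ignored
maxdepth  = getparam(params, "maxdepth", Int, 6)    # absent, so the default is used
compress  = getparam(params, "compress", Union{Symbol, Nothing}, nothing)
```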

We also need to add a method to the create function for our loader, with a regex that recognizes files of our data format:

create(::Type{DataLoader{:arrow}}, source::String) =
    !isnothing(match(r"\.arrow$"i, source))
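To see which file names this regex accepts, the same check can be run on its own. In this small sketch, is_arrow is a hypothetical name used only for the example; the pattern is the one from the create method above.

```julia
# Same pattern as in the create method; the trailing `i` flag makes it case-insensitive,
# and `$` anchors the match to the end of the string.
is_arrow(source::String) = !isnothing(match(r"\.arrow$"i, source))

is_arrow("data/table.arrow")  # true
is_arrow("TABLE.ARROW")       # true, because of the `i` flag
is_arrow("table.arrow.gz")    # false, the extension must come last
is_arrow("table.csv")         # false
```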

...and a method to createpriority specifying... TODO: what exactly?

createpriority(::Type{DataLoader{:arrow}}) = 10

Finally, we add a docstring specifying how to use our loader/writer:

const ARROW_DOC = md"""
[...]
"""

That's the full content of the new arrow.jl file!

Step 2: Adding the new loader to package initialization

To make things work, we now just need to add two more things to the __init__() function in src/DataToolkitCommon.jl:

  1. A line declaring each package used by our loader together with its UUID (which you can find in that package's Project.toml). In our case that is @addpkg Arrow "69666777-d1a9-59fb-9406-91d4454c9d45"
  2. A line adding our docstring to the package documentation. In our case, we just add (:loader, :arrow) => ARROW_DOC, to the list of docstrings in the append!(DataToolkitBase.TRANSFORMER_DOCUMENTATION, ...) call further below.