As discussed in tecosaur/DataToolkitCommon.jl#10, here is a short docs writeup of the process of creating the Arrow loader as an example of how to add a loader to the package. Let me know what you think and of course feel free to adapt, extend or rephrase! (I would have made a PR but don't understand how the docs work).
Tutorial: Adding a new loader/writer
In case your favourite data format is not supported yet by DataToolkit, fret not! It is relatively straightforward to add a new loader to the package, and PRs adding new loaders are welcome. The following briefly outlines the process, using the loader/writer for the Arrow format as an example.
Step 1: Adding the main loader / writer functions
Each loader has its own file in the src/transformers/saveload directory. So as a first step, we add a new file arrow.jl there and make sure to include("transformers/saveload/arrow.jl") in src/DataToolkitCommon.jl, next to the other loader files.
We're now ready to add methods to the main package functions responsible for loading and saving data, which are aptly called load and save. Starting with the loader, we add a method to load which dispatches on DataLoader{:arrow}, takes an IO and allows the specification of a sink type to read data into, e.g. a DataFrame. The final function looks as follows:
    function load(loader::DataLoader{:arrow}, io::IO, sink::Type)
        @import Arrow
        # Read the `convert` parameter from Data.toml, defaulting to true
        convert = @getparam loader."convert"::Bool true
        # Read the table, then pipe it through the conversion for the sink type
        result = Arrow.Table(io; convert) |>
            if sink == Any || sink == Arrow.Table
                identity
            elseif QualifiedType(sink) == QualifiedType(:DataFrames, :DataFrame)
                sink
            end
        result
    end
This function includes four things:
1. An @import statement for the Arrow package, which we use for reading a .arrow file.
2. Use of the @getparam macro to obtain arguments to the wrapped loader function (Arrow.Table, in our case) from the Data.toml file and to set their defaults. Here, we just need to specify the single convert argument, but in principle, there can be many.
3. Reading the data from io, most likely using a package and including the arguments obtained in step 2 (here: Arrow.Table(io; convert)).
4. Conversion to the specified sink type. Note the use of QualifiedType, which needs to be specified separately.
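The pattern in this method (dispatch on the loader's driver symbol, read a parameter with a default, then convert to the requested sink) can be tried without any packages. The sketch below is only an illustration under stated assumptions: MiniLoader, getparam and miniload are hypothetical stand-ins, not DataToolkit's DataLoader, @getparam or load, and a single delimited line stands in for an Arrow file.

```julia
# Hypothetical stand-in for a loader type. The driver name is a type
# parameter, so methods can be selected per format through dispatch.
struct MiniLoader{driver}
    parameters::Dict{String,Any}
end

# Hypothetical helper: look up a parameter, falling back to a default.
getparam(l::MiniLoader, key::String, default) = get(l.parameters, key, default)

# This method applies to the made-up :csvline driver only.
function miniload(loader::MiniLoader{:csvline}, io::IO, sink::Type)
    delim = getparam(loader, "delim", ",")       # parameter with a default
    fields = String.(split(readline(io), delim)) # read from the IO
    if sink == Any || sink == Vector{String}     # convert to the requested sink
        fields
    elseif sink == Tuple
        Tuple(fields)
    else
        error("unsupported sink type: $sink")
    end
end

loader = MiniLoader{:csvline}(Dict{String,Any}("delim" => ";"))
result = miniload(loader, IOBuffer("a;b;c"), Vector{String})
```

Unlike the load method above, this sketch raises an error for an unknown sink; that is a choice made for the illustration, not something taken from the package.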
The sink types supported by the loader and resolved in step 4 are specified by adding a method for the supportedtypes function. Here, we specify two possible return types: Arrow.Table, which is returned natively by the Arrow.jl package, and DataFrame from the DataFrames.jl package:
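As a rough illustration of this idea (not the package's own method), the sketch below assumes a hypothetical TypeName record in place of QualifiedType and a hypothetical supported function in place of supportedtypes. In this sketch a type is listed by module name and type name, so the package that defines it does not have to be loaded just to declare support for it.

```julia
# Hypothetical stand-in for a qualified type: module name plus type name.
struct TypeName
    mod::Symbol
    name::Symbol
end

# Build the same record from a type that is actually loaded.
TypeName(T::Type) = TypeName(nameof(parentmodule(T)), nameof(T))

# Hypothetical loader type with the driver as a type parameter.
struct MiniLoader{driver} end

# Declare which sink types the made-up :demo loader can produce.
supported(::Type{MiniLoader{:demo}}) =
    [TypeName(:DataFrames, :DataFrame),  # named without loading DataFrames
     TypeName(:Base, :Dict)]

wanted = TypeName(Dict)
is_supported = wanted in supported(MiniLoader{:demo})
```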
The writer follows an overall similar structure: @import the necessary packages, obtain writer arguments using @getparam and then write the data in tbl to io. Here's the save method for the arrow writer:
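We also need to add a method to the create function for our loader with a regex to recognize files of our data format:
...and a method to createpriority specifying... TODO: what exactly?
    createpriority(::Type{DataLoader{:arrow}}) = 10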
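For recognizing files of a format by name, a regex on the file extension is one option. The sketch below shows only that check; the pattern and the looks_like_arrow helper are assumptions made for illustration, not the package's create method or its signature.

```julia
# Hypothetical helper: true when the name ends in ".arrow".
# The trailing `i` flag makes the match case-insensitive (an assumption).
looks_like_arrow(source::AbstractString) = occursin(r"\.arrow$"i, source)

checks = [looks_like_arrow("data/table.arrow"),
          looks_like_arrow("TABLE.ARROW"),
          looks_like_arrow("table.arrow.gz"),
          looks_like_arrow("arrow.csv")]
```

Anchoring the pattern with `$` keeps names such as table.arrow.gz from matching, which may or may not be what a real loader wants.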
Finally, we add a docstring specifying how to use our loader/writer:
    const ARROW_DOC = md"""
    [...]
    """
That's the full content of the new arrow.jl file!
Step 2: Adding the new loader to package initialization
To make things work, we now just need to add two more things to the __init__() function in src/DataToolkitCommon.jl:
1. A line specifying the necessary packages used by our loader with their respective UUIDs (which you can obtain from their respective Project.toml). In our case that is @addpkg Arrow "69666777-d1a9-59fb-9406-91d4454c9d45"
2. A line adding our docstring to the package documentation. In our case, we just add (:loader, :arrow) => ARROW_DOC, to the list of docstrings in the append!(DataToolkitBase.TRANSFORMER_DOCUMENTATION, ...) call further below.
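If you prefer to read the UUID programmatically rather than copying it by hand, the TOML standard library can parse a Project.toml. This is a minimal sketch: it writes a two-line stand-in Project.toml to a temporary directory so that it runs on its own, whereas in practice you would point it at the file in the package's source.

```julia
using TOML  # standard library

# Stand-in Project.toml so the example is self-contained; the name and
# UUID are the ones quoted for Arrow above.
dir = mktempdir()
project_file = joinpath(dir, "Project.toml")
write(project_file, """
name = "Arrow"
uuid = "69666777-d1a9-59fb-9406-91d4454c9d45"
""")

project = TOML.parsefile(project_file)
pkg_name = project["name"]
pkg_uuid = project["uuid"]

# Print a line in the shape used in __init__().
println("@addpkg $pkg_name \"$pkg_uuid\"")
```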