Dataset (data frame) manipulation API for the tech.ml.dataset library
[scicloj/tablecloth "4.04"]
tech.ml.dataset is a great and fast library that brings columnar datasets to Clojure. Chris Nuernberger has been working on this library for the last year as part of the bigger tech.ml stack.
I started testing the library and helping to fix the bugs this uncovered. My main goal was to compare its functionality with the standards on other platforms. I focused on R solutions: dplyr, tidyr and data.table.
While converting the examples, I came up with a way to reorganize the existing tech.ml.dataset functions into a simple-to-use API. The main goals were:
tech.ml like pipelines, datatypes, readers, ML, etc.
group-by results with a special kind of dataset: a dataset containing the subsets created by grouping as a column.
Important! This library is neither a replacement for tech.ml.dataset nor a separate library. It should be considered an addition on top of tech.ml.dataset.
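The grouping goal above can be illustrated with a small in-memory dataset. This is a minimal sketch, assuming a recent tablecloth release; the column names :k and :v and the sample values are made up for illustration.

```clojure
(require '[tablecloth.api :as tc])

;; Hypothetical sample data; :k and :v are illustrative column names.
(def ds (tc/dataset {:k [:a :b :a :b :a]
                     :v [1 2 3 4 5]}))

;; Grouping returns a dataset that is marked as grouped.
(def grouped (tc/group-by ds [:k]))

(tc/grouped? grouped)

;; Viewed as a regular dataset, each row describes one group and the
;; :data column holds the sub-dataset for that group.
(-> grouped
    tc/as-regular-dataset
    tc/column-names)

;; The same functions accept grouped datasets, so a thread-first
;; pipeline can aggregate per group without special handling.
(def totals
  (-> ds
      (tc/group-by [:k])
      (tc/aggregate {:total #(reduce + (% :v))})
      (tc/order-by :k)))
```

Here tc/as-regular-dataset is used only to inspect the grouped structure; in ordinary pipelines the grouped dataset is passed straight to the next operation.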
If you want to know more about tech.ml.dataset and dtype-next, please refer to their documentation:
Join the discussion on Zulip
Please refer to the detailed documentation with examples.
The old documentation (till the end of 2023) is here.
(require '[tablecloth.api :as tc]
         '[tech.v3.datatype.datetime :as dtype-dt]
         '[tech.v3.datatype.functional :as dfn])
(-> "https://raw.githubusercontent.com/techascent/tech.ml.dataset/master/test/data/stocks.csv"
    (tc/dataset {:key-fn keyword})
    ;; group rows by symbol and by the year extracted from the date
    (tc/group-by (fn [row]
                   {:symbol (:symbol row)
                    :year (dtype-dt/long-temporal-field :years (:date row))}))
    ;; mean price within each group
    (tc/aggregate #(dfn/mean (% :price)))
    (tc/order-by [:symbol :year])
    (tc/head 10))
_unnamed [10 3]:
| :symbol | :year |      summary |
|---------|------:|-------------:|
|    AAPL |  2000 |  21.74833333 |
|    AAPL |  2001 |  10.17583333 |
|    AAPL |  2002 |   9.40833333 |
|    AAPL |  2003 |   9.34750000 |
|    AAPL |  2004 |  18.72333333 |
|    AAPL |  2005 |  48.17166667 |
|    AAPL |  2006 |  72.04333333 |
|    AAPL |  2007 | 133.35333333 |
|    AAPL |  2008 | 138.48083333 |
|    AAPL |  2009 | 150.39333333 |
Tablecloth is open for contributions. The best way to start is a discussion on Zulip.
Documentation is written in the Kindly convention and is rendered using Clay composed with Quarto.
The old documentation was written in RMarkdown and is kept under docs/old/.
Documentation contains around 600 code snippets which are run during build. There are three relevant source files:
(notebooks/index.clj was generated by dev/conversion.clj from the earlier RMarkdown-based index.Rmd with some additional manual editing. Starting in 2024, it will diverge from that source, which will no longer be maintained.)
To generate README.md, run the generate! function in the dev/readme_generation.clj script.
To generate the detailed documentation, call the following. You will need the Quarto CLI installed on your system.
Currently (April 2024), we use Quarto's v1.5.10 pre-release (specifically this version, not the later ones) due to some Quarto bugs.
(require '[scicloj.clay.v2.api :as clay])
(clay/make! {:format [:quarto :html]
             :source-path "notebooks/index.clj"})
To build this project fully we need to perform some code generation operations. These are listed below:
Build the tablecloth.api.operators namespace
The tablecloth.api.operators namespace is generated by tablecloth.api.lift_operators. To build that namespace, load the tablecloth.api.lift_operators namespace and then execute the code inside the comment form at the bottom of the file.
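Because the generation code sits inside a comment form, loading the namespace does not run it; the form has to be evaluated by hand from the REPL. A minimal sketch of that pattern in plain Clojure, where regenerate! and the generated atom are hypothetical stand-ins for the real generation code:

```clojure
;; Hypothetical stand-in for the real code generation step.
(def generated (atom nil))

(defn regenerate! []
  (reset! generated :done))

;; The body of a comment form is read but never evaluated,
;; so loading this file leaves `generated` untouched.
(comment
  (regenerate!))

;; Evaluating the inner form manually (as you would from an editor
;; connected to a REPL) is what actually triggers the generation.
(def before @generated)
(regenerate!)
(def after @generated)
```

The same pattern applies to each of the generated namespaces described in this section.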
Build the tablecloth.api (aka the Dataset API)
The tablecloth.api namespace is generated out of api-template. To build that namespace, load the tablecloth.api.api-template namespace and then evaluate the code contained in the comment section at the bottom of the file. This will re-generate the tablecloth.api namespace.
Build the tablecloth.column.api.operators namespace
The tablecloth.column.api.operators namespace is generated by tablecloth.column.api.lift_operators. To build that namespace, load the tablecloth.column.api.lift_operators namespace and then execute the code inside the comment form at the bottom of the file.
Build the tablecloth.column.api (aka the Column API)
The tablecloth.column.api namespace is generated out of api-template. To build that namespace, load the tablecloth.column.api.api-template namespace and then evaluate the code contained in the comment section at the bottom of the file. This will re-generate the tablecloth.column.api namespace.
lein do clean, check, test and build documentation as described above (which also tests the whole library).
parallel? argument then (if applied).
potemkin pattern and import functions to the API namespace using the tech.v3.datatype.export-symbols/export-symbols function
tablecloth.utils namespace.
README-source.md, CHANGELOG.md, notebooks/index.clj, tests and function docs are highly welcomed.
Tests are written and run using midje. To run a test, evaluate a midje form. If it passes, it will return true; if it fails, details will be printed to the REPL.
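As an illustration of evaluating a midje form at the REPL, here is a minimal sketch. The fact itself is a made-up example and is not taken from the project's test suite; it assumes midje and tablecloth are both on the classpath.

```clojure
(require '[midje.sweet :refer :all]
         '[tablecloth.api :as tc])

;; Hypothetical check: a dataset built from a three-element column
;; should report three rows. Evaluating this form runs the check.
(fact "a dataset built from a three-element column has three rows"
  (tc/row-count (tc/dataset {:a [1 2 3]})) => 3)
```

Changing the right-hand side to a wrong value is a quick way to see what a failure report looks like in your REPL.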
Copyright (c) 2020 Scicloj
The MIT Licence