scicloj / tablecloth

Dataset manipulation library built on top of tech.ml.dataset
https://scicloj.github.io/tablecloth
MIT License

Tablecloth

Dataset (data frame) manipulation API for the tech.ml.dataset library

Versions

tech.ml.dataset 7.x (master branch)

tech.ml.dataset 4.x (4.0 branch)

[scicloj/tablecloth "4.04"]

Introduction

tech.ml.dataset is a great, fast library that brings columnar datasets to Clojure. Chris Nuernberger has been working on it for the last year as part of the bigger tech.ml stack.

I started by testing the library and helping to fix the bugs that this uncovered. My main goal was to compare its functionality with established tools on other platforms. I focused on the R solutions: dplyr, tidyr and data.table.

While converting the examples, I worked out how to reorganize the existing tech.ml.dataset functions into a simple-to-use API. The main goals were:

Important! This library is neither a replacement for tech.ml.dataset nor a separate library. It should be considered an addition on top of tech.ml.dataset.

If you want to know more about tech.ml.dataset and dtype-next, please refer to their documentation:
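
As a minimal sketch of what this layering means in practice (assuming both libraries are on the classpath; the `ds` alias and the sample data are illustrative only), a dataset created through Tablecloth can be passed directly to tech.v3.dataset functions:

```clojure
(require '[tablecloth.api :as tc]
         '[tech.v3.dataset :as ds])

;; Illustrative data, not taken from the project.
(def stocks
  (tc/dataset {:symbol ["AAPL" "MSFT"]
               :price  [21.7 39.8]}))

;; Functions from the underlying library accept the same object.
(ds/row-count stocks)
(ds/column-names stocks)
```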

Join the discussion on Zulip

Documentation

Please refer to the detailed documentation with examples.

The old documentation (maintained until the end of 2023) is here.

Usage example

(require '[tablecloth.api :as tc]
         '[tech.v3.datatype.datetime :as dtype-dt]
         '[tech.v3.datatype.functional :as dfn])

;; Load the stocks CSV, group rows by symbol and year,
;; and compute the mean price of each group.
(-> "https://raw.githubusercontent.com/techascent/tech.ml.dataset/master/test/data/stocks.csv"
    (tc/dataset {:key-fn keyword})
    (tc/group-by (fn [row]
                   {:symbol (:symbol row)
                    :year (dtype-dt/long-temporal-field :years (:date row))}))
    (tc/aggregate #(dfn/mean (% :price)))
    (tc/order-by [:symbol :year])
    (tc/head 10))

_unnamed [10 3]:

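| :symbol | :year |      summary |
|---------|------:|-------------:|
|    AAPL |  2000 |  21.74833333 |
|    AAPL |  2001 |  10.17583333 |
|    AAPL |  2002 |   9.40833333 |
|    AAPL |  2003 |   9.34750000 |
|    AAPL |  2004 |  18.72333333 |
|    AAPL |  2005 |  48.17166667 |
|    AAPL |  2006 |  72.04333333 |
|    AAPL |  2007 | 133.35333333 |
|    AAPL |  2008 | 138.48083333 |
|    AAPL |  2009 | 150.39333333 |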

Contributing

Tablecloth is open to contributions. The best way to start is with a discussion on Zulip.

Development tools for documentation

Documentation is written in the Kindly convention and is rendered using Clay composed with Quarto.

The old documentation was written in RMarkdown and is kept under docs/old/.

The documentation contains around 600 code snippets, which are run during the build. There are three relevant source files:

(notebooks/index.clj was generated by dev/conversion.clj from the earlier RMarkdown-based index.Rmd, with some additional manual editing. Starting in 2024, it will diverge from that source, which will no longer be maintained.)

README generation

To generate README.md, run the generate! function in the dev/readme_generation.clj script.

Detailed documentation generation

To generate the detailed documentation, evaluate the following. You will need the Quarto CLI installed on your system.

Currently (April 2024), we use the Quarto v1.5.10 pre-release (specifically this version, not a later one) because of some Quarto bugs.

(require '[scicloj.clay.v2.api :as clay])
(clay/make! {:format [:quarto :html]
             :source-path "notebooks/index.clj"})

Code Generation

To build this project fully we need to perform some code generation operations. These are listed below:

  1. Build the tablecloth.api.operators namespace

    The tablecloth.api.operators namespace is generated by tablecloth.api.lift_operators. To build that namespace, you need to load the tablecloth.api.lift_operators namespace, and then execute the code surrounded by a comment at the bottom of the file.

  2. Build the tablecloth.api (aka the Dataset API)

    The tablecloth.api namespace is generated out of api-template. To build that namespace you need to load the tablecloth.api.api-template namespace, and then evaluate the code contained in the comment section at the bottom of the file. This will re-generate the tablecloth.api namespace.

  3. Build the tablecloth.column.api.operators namespace

    The tablecloth.column.api.operators namespace is generated by tablecloth.column.api.lift_operators. To build that namespace, you need to load the tablecloth.column.api.lift_operators namespace, and then execute the code surrounded by a comment at the bottom of the file.

  4. Build the tablecloth.column.api (aka the Column API)

    The tablecloth.column.api namespace is generated out of api-template. To build that namespace you need to load the tablecloth.column.api.api-template namespace, and then evaluate the code contained in the comment section at the bottom of the file. This will re-generate the tablecloth.column.api namespace.

Guideline

  1. Before committing changes, please run the tests. I usually run lein do clean, check, test and then build the documentation as described above (which also exercises the whole library).
  2. Keep the API as simple as possible:
    • the first argument should be a dataset
    • if parametrization is complex, the last argument should accept a map of optional arguments
    • avoid variadic associative destructuring for function arguments
    • functions should usually work on grouped datasets as well; where applicable, accept a parallel? argument for that case.
  3. Follow the potemkin pattern and import functions into the API namespace using tech.v3.datatype.export-symbols/export-symbols.
  4. Functions composed of API functions to cover specific cases should go in the tablecloth.utils namespace.
  5. Always update README-source.md, CHANGELOG.md and notebooks/index.clj; tests and function docs are highly welcome.
  6. Always discuss changes and PRs first.
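
The argument conventions from guideline 2 can be sketched roughly as follows. This assumes tablecloth.api is available as tc; the function top-rows and its :n and :order options are hypothetical names invented for this illustration and are not part of the library.

```clojure
(require '[tablecloth.api :as tc])

;; Hypothetical helper, shown only to illustrate the argument conventions.
(defn top-rows
  "Sorts ds by column and keeps the first rows.
  The dataset comes first; optional settings go in a trailing map."
  ([ds column] (top-rows ds column nil))
  ([ds column {:keys [n order] :or {n 5 order :desc}}]
   (-> ds
       (tc/order-by column order)
       (tc/head n))))

;; The form to avoid would be variadic associative destructuring:
;; (defn top-rows [ds column & {:keys [n order]}] ...)

(top-rows (tc/dataset {:a [1 3 2]}) :a {:n 2})
```

Because the dataset is the first argument, such a function threads naturally with -> alongside the existing API, and the trailing map can be built or merged programmatically by the caller.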

Tests

Tests are written and run using midje. To run a test, evaluate a midje form. If it passes, it returns true; if it fails, the details are printed to the REPL.
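
For illustration, a minimal midje form might look like the one below. The dataset and the checked value are made up for this sketch, and it assumes midje and tablecloth are both on the classpath.

```clojure
(require '[midje.sweet :refer :all]
         '[tablecloth.api :as tc])

;; Evaluating this form at the REPL runs the check immediately.
(fact "head keeps only the requested number of rows"
  (-> (tc/dataset {:a [1 2 3 4]})
      (tc/head 2)
      (tc/row-count))
  => 2)
```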

TODO

Licence

Copyright (c) 2020 Scicloj

The MIT Licence