ghollisjr / cl-ana

Free (GPL) Common Lisp data analysis library with emphasis on modularity and conceptual clarity.
GNU General Public License v3.0

New areas in cl-ana: in-memory analysis/summarizing, and data pre-processing #31

Open kat-co opened 5 years ago

kat-co commented 5 years ago

I am preparing to propose some functionality I have copied from numpy and pandas, but I'm unsure which packages it belongs in. I believe these may be new areas of functionality for cl-ana, and I need the advice of someone much more familiar with cl-ana, data science, and machine learning than I currently am.

New Area: Analysis & summarization

From afar (I have yet to sit down and fully process it), DOP appears to be great at minimizing the passes over the data needed to compute the results the user has declared. However, in theory, a user could be exploring data in such a way that a columnar view of the data would minimize the number of passes. In such a case, the user may not know enough about the data to write all the declarations up front, so DOP might make a minimal number of passes locally, but not globally, as the user adds declarations. A columnar table would transpose the data so that column-oriented operations remain globally minimal as the user thinks of new ways to poke at the data.

Some such operations I've brought over from pandas are: summarize (populated counts and types for all fields), value-counts (counts of distinct values for a field), and correlation-matrix (a matrix of correlation coefficients between all pairs of columns). There are other useful summarizing functions we can take from pandas.
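As a rough illustration of the kind of operation I mean, here is a minimal sketch of value-counts over a plain list of row plists; the function name and the row representation are assumptions for illustration, not existing cl-ana API:

```lisp
;; Hypothetical sketch: value-counts over a list of row plists.
;; Neither the name nor the representation is existing cl-ana API.
(defun value-counts (rows field)
  "Count the distinct values of FIELD across ROWS, a list of plists,
and return an alist of (value . count) sorted by descending count."
  (let ((counts (make-hash-table :test #'equal))
        (result nil))
    (dolist (row rows)
      (incf (gethash (getf row field) counts 0)))
    (maphash (lambda (value count) (push (cons value count) result))
             counts)
    (sort result #'> :key #'cdr)))

;; (value-counts '((:color "red") (:color "blue") (:color "red")) :color)
;; => (("red" . 2) ("blue" . 1))
```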

The thing these functions seem to have in common is that they summarize all fields at a high level to allow users to get a "feel" for the data before doing proper analysis. From what I gather, most users expect these operations to be very fast.

Should these live in cl-ana.summarization?

A more performant in-memory representation?

Several of the summarization operations would best be done on tables that are in memory when possible, and I think there are quite a few data sets out there that could be held in memory. We currently have plist-table, and that might be good enough. However, we might be able to come up with a more performant version based on multi-dimensional arrays, with the customary current-row accessor simply taking a (row col) tuple. This might be much faster and still easy to understand. I'm not sure if this is warranted yet, but it's an idea.
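To make the idea concrete, a minimal sketch of an array-backed table, assuming a 2-D array of rows by columns; the struct and accessor names are illustrative only and not part of cl-ana:

```lisp
;; Hypothetical sketch of an array-backed in-memory table.
;; None of these names exist in cl-ana; this only illustrates the idea.
(defstruct array-table
  (field-names #() :type simple-vector)     ; column names, in column order
  (data nil :type (or null (array t 2))))   ; rows x columns

(defun table-ref (table row field)
  "Return the value at ROW for FIELD, a symbol naming a column."
  (let ((col (position field (array-table-field-names table))))
    (aref (array-table-data table) row col)))

(defun table-column (table field)
  "Return FIELD as a fresh vector, which makes column-wise operations cheap."
  (let* ((data (array-table-data table))
         (col (position field (array-table-field-names table)))
         (n (array-dimension data 0))
         (column (make-array n)))
    (dotimes (i n column)
      (setf (aref column i) (aref data i col)))))
```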

New Area: Preprocessing & Data Munging

I haven't yet written any functions to mirror pandas functions here. From what I understand, the step prior to training ML models is coercing the data into a shape and corpus conducive to the ML model you'd like to use. This involves dropping columns, transforming column values from strings to numeric values, etc. I don't think this is actual machine learning, so it might be a good fit for living in cl-ana.
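For example, one munging step might encode a string-valued field as integers. A minimal sketch, assuming rows are plists; the function name is hypothetical, not existing cl-ana API:

```lisp
;; Hypothetical sketch of one pre-processing step: encoding a string
;; field as integer codes. Not existing cl-ana API.
(defun encode-categorical (rows field)
  "Return a copy of ROWS (a list of plists) with FIELD replaced by
integer codes, plus a hash table mapping each distinct value to its code."
  (let ((codes (make-hash-table :test #'equal))
        (next 0))
    (values
     (mapcar (lambda (row)
               (let* ((value (getf row field))
                      (code (or (gethash value codes)
                                (setf (gethash value codes)
                                      (prog1 next (incf next)))))
                      (copy (copy-list row)))
                 (setf (getf copy field) code)
                 copy))
             rows)
     codes)))
```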

Should these live in cl-ana.transform?

Mirroring popular python data science namespaces?

TensorFlow has a namespace for Keras which mirrors the Keras project's API, but all of the operations are implemented in terms of TensorFlow primitives. This allows communities that are very familiar with Keras to work seamlessly with TensorFlow. It would be a bit strange to follow suit since we would be crossing a language boundary as well, but would we want a facade that expresses popular data science libraries' APIs in terms of cl-ana operations? E.g. cl-ana.pandas?

What are the boundaries of cl-ana?

cl-ana has cl-ana.statistical-learning, but it is not my impression that it is trying to be a machine learning library unto itself. But what are its boundaries? And how could it best interoperate with other ecosystems?

My current understanding is that using machine learning involves ~5 stages:

  1. Data Retrieval
  2. Data Exploration
  3. Data Preprocessing/Munging
  4. Model Training
  5. Operationalizing the Model

I was planning on using Common Lisp and cl-ana for steps 1-3, and then feeding the data to other ecosystems, e.g. TensorFlow, after that.

When planning out cl-ana's packages and functionality, it might be helpful to have a clear idea of how cl-ana might interoperate with other tooling.

EDIT:

I forgot this! I did some analysis of popular machine learning packages to see what kinds of namespaces they expose. It may be helpful to consider the shape of these other packages when deciding where to place things in cl-ana.

ghollisjr commented 5 years ago

On the performance argument: as long as an entire column can fit into memory, it is true that there is a significant performance gain from allowing column-wise operations and transposing the data set. So I think the column-wise functionality might as well assume that the data fits into memory; if it doesn't, one would need even more sophistication than what is already built into the table classes implemented in cl-ana. This is yet another reason to make sure there is an efficient in-memory table type along with utility functions for pivoting/transposing, random access, subset selection, etc.
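A minimal sketch of what the transposing utility could look like, again assuming rows are plists and the data fits in memory; the name is hypothetical:

```lisp
;; Hypothetical sketch of pivoting row-oriented data into columns so that
;; column-wise operations touch one contiguous vector per field.
(defun rows->columns (rows field-names)
  "Transpose ROWS (a list of plists) into a hash table mapping each
symbol in FIELD-NAMES to a vector of that field's values."
  (let ((columns (make-hash-table)))
    (dolist (field field-names columns)
      (setf (gethash field columns)
            (map 'vector (lambda (row) (getf row field)) rows)))))
```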

ghollisjr commented 5 years ago

On the jurisdiction of cl-ana: I think it would be good to have utility functions that generate whatever is convenient for sending results to as many other frameworks as possible. I think that will necessarily be a process of getting requests from people who are using those frameworks and then adding the utility functions to cl-ana over time.
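One such utility might be as simple as dumping results to CSV, which most external frameworks can ingest. A minimal sketch under the same plist-rows assumption; the function name is hypothetical:

```lisp
;; Hypothetical sketch of an export utility: write rows as CSV with a
;; header line. Not existing cl-ana API.
;; (Real CSV output would also need quoting of values containing commas;
;; omitted here for brevity.)
(defun write-rows-as-csv (rows field-names path)
  "Write ROWS (a list of plists) to PATH as comma-separated values."
  (with-open-file (out path :direction :output :if-exists :supersede)
    (format out "~{~A~^,~}~%" field-names)
    (dolist (row rows)
      (format out "~{~A~^,~}~%"
              (mapcar (lambda (field) (getf row field)) field-names)))))
```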

ghollisjr commented 5 years ago

As for cl-ana.statistical-learning, my naive idea was that common statistical learning procedures could be added over time; I started with linear least squares and a handful of other commonly used fitting and clustering techniques, and I would be happy to have more added.
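For reference, the simplest case mentioned above, ordinary least squares with a single predictor, fits in a few lines of plain Common Lisp; this sketch does not use or mirror cl-ana.statistical-learning's actual interface:

```lisp
;; Closed-form simple linear regression: y ~ slope * x + intercept.
;; Illustrative only; not cl-ana.statistical-learning's interface.
(defun linear-fit (xs ys)
  "Return (VALUES SLOPE INTERCEPT) minimizing squared error over the
points (x_i, y_i) given as the lists XS and YS."
  (let* ((n (length xs))
         (mean-x (/ (reduce #'+ xs) n))
         (mean-y (/ (reduce #'+ ys) n))
         (sxy (reduce #'+ (mapcar (lambda (x y)
                                    (* (- x mean-x) (- y mean-y)))
                                  xs ys)))
         (sxx (reduce #'+ (mapcar (lambda (x) (expt (- x mean-x) 2)) xs)))
         (slope (/ sxy sxx)))
    (values slope (- mean-y (* slope mean-x)))))
```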

ghollisjr commented 5 years ago

On pre-processing and data munging: I have actually been using cl-ana to do this for my physics research, and I think there will be a conflict between using DOP and a standard function/macro approach. Using DOP, removing columns isn't necessary so long as the underlying table type doesn't read data from disk inefficiently, and if the data is in memory then there isn't a real need to create a new table object that duplicates some subset of the data.

I can imagine implementing in-memory table types such that they either contain source data or contain row indices to source data from other tables along with new fields or logical fields for each row.
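A minimal sketch of the row-index idea, with entirely hypothetical names: a view is just the source rows plus a vector of selected indices, so no row data is copied.

```lisp
;; Hypothetical sketch: a table "view" that stores row indices into a
;; source table rather than copying rows. None of these names are cl-ana API.
(defstruct table-view
  source                                   ; a vector of rows
  (indices #() :type simple-vector))       ; selected row indices into SOURCE

(defun view-row (view i)
  "Return the I-th row of VIEW, resolved through its index vector."
  (aref (table-view-source view)
        (aref (table-view-indices view) i)))

(defun select-rows (source predicate)
  "Build a TABLE-VIEW over SOURCE (a vector of rows) keeping only the
rows satisfying PREDICATE, without duplicating any row data."
  (make-table-view
   :source source
   :indices (coerce (loop for i below (length source)
                          when (funcall predicate (aref source i))
                            collect i)
                    'simple-vector)))
```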

From what I've seen in statistical learning material, dropping fields is often done conceptually, and there is a difference in the way data is treated depending on whether it is maintained in a database system or just stored in files directly. If the data is stored in a file, then dropping columns is a conceptual operation that might only affect the model portion of the code. If it's maintained by a database management system, like MySQL, then new tables might be created in that system, although what that means in terms of how files are managed is necessarily mysterious.

I have thought about trying to extend cl-ana to allow convenient access to SQL tables, but I haven't been able to think of a way to integrate it with the rest of the table functionality.

kat-co commented 5 years ago

On pre-processing and data munging: I have actually been using cl-ana to do this for my physics research, and I think there will be a conflict between using DOP and a standard function/macro approach.

I think I've arrived at an insight while trying to understand how DOP works.

DSLs can be powerful abstractions for performing operations within a domain. The downside is that newcomers need to learn the DSL to get anything done, and if the DSL is broad enough, or different enough from the host language, it is usually labelled "difficult". The exceptions to this rule are libraries that have benefited from such widespread adoption that, despite these idiosyncrasies, the DSL is successful; the DSL itself becomes its own domain which lots of people have chosen to master.

So far what I see in DOP is not wildly different from Common Lisp, or difficult to understand.

I was thinking maybe cl-ana could support two modes of operation: incremental, imperative-style work, and declarative, SAT-solver-style work (i.e. DOP). Then the promise cl-ana could make to its users is that the imperative style (using functions and methods) and the declarative style (using macros and its DSL) are joined by a common set of symbols; if you learn one style, you can use the other style without relearning anything.

For example, as far as I can tell, there is no imperative version of ltab. It would be nice if users could learn cl-ana, play with ltab on a small, in-memory data set, and then, once they're familiar with cl-ana's vocabulary and need the power of DOP, "graduate" to that style of doing things and reap the benefits.

I'm not sure if this is actually insightful or not :) What do you think?

kat-co commented 5 years ago

Ruminating on this further: is there a way to unify the two styles? It looks like DOP might be looking for key symbols to determine when table reductions are being performed. Could those key symbols be the same symbols for the imperative style of programming?

E.g.: I have defined a function table-split-randomly to help split a table into multiple data sets: one for training and one for testing. This function uses another function, iloc, which returns a subset of a table.

I wanted to use the logical table concept from DOP so that I could create a super-set of the data, so I defined a DOP project and my defres blocks. But when it came time to define a defres for my split data, I couldn't figure out a way to use my imperative-style table-split-randomly. I had to redefine table-split-randomly and iloc in terms of DOP, thus duplicating the work and code -- but the form is essentially the same; I am just using cl-ana.makeres-table:dotab instead of cl-ana.table:do-table.
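To make the example concrete, here is one possible shape of such a splitter, sketched over a plain list of rows; it is not the actual table-split-randomly referred to above, just an illustration of the pattern that ends up duplicated across do-table and dotab:

```lisp
;; One possible shape of such a splitter, over a plain list of rows.
;; This is an illustration, not the actual function discussed above.
(defun table-split-randomly (rows &key (train-fraction 0.8))
  "Shuffle ROWS and return two values: a training list holding roughly
TRAIN-FRACTION of the rows and a test list holding the rest."
  (let* ((keyed (mapcar (lambda (row) (cons (random 1.0) row)) rows))
         (shuffled (mapcar #'cdr (sort keyed #'< :key #'car)))
         (n-train (round (* train-fraction (length shuffled)))))
    (values (subseq shuffled 0 n-train)
            (subseq shuffled n-train))))
```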

ghollisjr commented 5 years ago

The good news is that using DOP does allow for the use of a normal Lisp function/macro approach, and you can choose which added features you want by specifying the pipeline of table transformations in the required defproject form. So, you can gradually introduce parts of the DSL as you want to make use of it.

The main problem with not using makeres-table while trying to analyze data in tables is defining targets. E.g., if you want to perform a single loop over a table and get a number of different results afterwards, then you need to store those results together in some kind of container, usually a list. But you will most likely want to refer to those results by meaningful names, not by remembering which index of the list to use; that is what makeres-table allows you to do. If you are willing to define targets that manually take care of those issues, then you can already avoid the makeres-table DSL and use DOP to store results, or you can avoid DOP altogether and store results however you want.
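A minimal sketch of the manual alternative described above, assuming rows are plists: one pass over the data, several results collected under meaningful names in a plist rather than by list position.

```lisp
;; Sketch of the "manual target" approach: one loop, several named results.
;; The field names :x and :y are placeholders.
(defun summarize-in-one-pass (rows)
  "Traverse ROWS once and return several results as a plist keyed by
meaningful names instead of list positions."
  (let ((count 0)
        (sum-x 0)
        (max-y nil))
    (dolist (row rows)
      (incf count)
      (incf sum-x (getf row :x))
      (let ((y (getf row :y)))
        (when (or (null max-y) (> y max-y))
          (setf max-y y))))
    (list :count count
          :mean-x (unless (zerop count) (/ sum-x count))
          :max-y max-y)))
```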

ghollisjr commented 5 years ago

There are also some concepts introduced by various table transformations, such as makeres-table, which don't have an analog outside of the DSL. ltab is a good example: ltab defines a target which always has a value of NIL, but it conceptually defines a new table which will exist during a call to makeres.

ltab is a way to avoid creating a real table that would take space on disk or in memory; makeres-table modifies the context of the code looping over the real source table so that code written against the ltab behaves as though it were iterating over a real table.

ghollisjr commented 5 years ago

So I think that ltab is an example of a feature that is almost impossible to replicate outside of makeres-table and DOP, or something with a similar learning curve.

jcguu95 commented 3 years ago

How is this project going?

ghollisjr commented 3 years ago

Lately I have been busy finishing other things, but I should have more time to dedicate to this project in the near future. I may import a lot of code that proved useful during the particle physics data analysis in which I used cl-ana.

I think there could be two simultaneous development efforts, one that adds conventional data analysis functions and another that focuses on the dependency-oriented paradigm, because they're orthogonal and can be used together or separately.

As it stands, cl-ana is ready to use for data analysis work, including very large scale analysis. In my own research, I added a custom module for connecting it to a remote computing cluster so that it could generate and run C++ programs to analyze data on the cluster. That may be part of the code import in some way or another, but I'd like it to be more generally useful than what I wrote for my own use. It would be entirely possible to create a general package for a custom cluster interface with cl-ana, though. The end result would be that you could write code that, to the programmer, looks as though it runs locally, but behind the scenes it would transfer parameters and files back and forth with the cluster, finally resulting in data subsets, histograms, plots, or other things being stored on the local disk.