mratsim / Arraymancer

A fast, ergonomic and portable tensor library in Nim with a deep learning focus for CPU, GPU and embedded devices via OpenMP, Cuda and OpenCL backends
https://mratsim.github.io/Arraymancer/
Apache License 2.0

contribution: scikit-learn style machine learning estimators? #238

[Open] metasyn opened this issue 6 years ago

metasyn commented 6 years ago

Hi All (& @bluenote10 / @mratsim specifically),

I just started looking into Nim more seriously, and this project really caught my eye. So to start, I wanted to say nice work so far :)

I was curious as to whether or not the maintainers would be open to contributions of scikit-learn style estimators? It seems like there is a lot of room for arraymancer to grow. At first, I considered making a separate repo, since arraymancer seemed more akin to numpy than scikit-learn, and there is value in separation. However, after looking through more examples (e.g. PCA, or a least squares solver) I figured I shouldn't make so many assumptions, and that it might make more sense to add to this repo directly rather than making a separate project?

So, to summarize, is there openness or willingness to expand this library to support higher level and varied machine learning estimators similar to those in torch/scikit-learn/spark specifically in this repo? Or would a separate repo/project be better?

Best, Xander

mratsim commented 6 years ago

Hello Xander and welcome here,

If you mean adding random forests, Naive Bayes and co, then yes, I plan to add them in the future.

Regarding the API for estimators (random_forests, ...), I am still not sure which way to go between a procedural API and an object API (fit + transform + predict). The latter could also be added to the current neural network to give it an even more Keras-like feel.

For encoders and pipelines (binarizers, label encoder, one-hot encoder, ...), I will definitely not use the Scikit-Learn API, as it's not suitable for competitive data science.

bluenote10 commented 6 years ago

+1 would be nice to have a few standard algorithms implemented on top of Arraymancer.

For encoders and pipelines (binarizers, label encoder, one-hot encoder, ...), I will definitely not use Scikit-Learn API as it's not suitable for competitive data science.

That is exactly my experience as well, and I've moved away from fit/predict/transform to a simpler, but more flexible interface. This interface is somewhat specific to Python (or rather dynamic typing), so I would have to think if this makes sense in Nim as well.

Another aspect of pipelines: in general there might be a need for working with heterogeneous data -- in Python terms: where X isn't a numpy array, but rather a pandas dataframe. I started working on dataframes on top of Arraymancer here, but I haven't gotten far yet (mainly lack of time). Probably a topic for a different RFC...

In general, I think the choice of pipeline structure is somewhat independent of implementing the algorithms. Or rather: It should be possible to wrap a procedural API easily into an object API later.
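
As a rough illustration of that last point (a sketch, not an existing Arraymancer API: it uses plain seq[float] and the Nim standard library to stay library-agnostic, and the names Scaler, fit and transform are made up):

import std/[sequtils, stats]

# Low-level, stateless layer: all parameters are passed in explicitly.
proc scale(x: seq[float], mean, std: float): seq[float] =
  x.mapIt((it - mean) / std)

# Thin object layer added afterwards: it only stores the fitted
# parameters and delegates the actual work to the stateless proc.
type Scaler = object
  mean, std: float

proc fit(s: var Scaler, x: seq[float]) =
  s.mean = x.mean
  s.std = x.standardDeviation

proc transform(s: Scaler, x: seq[float]): seq[float] =
  scale(x, s.mean, s.std)

when isMainModule:
  var s: Scaler
  s.fit(@[1.0, 2.0, 3.0, 4.0])
  echo s.transform(@[2.0, 5.0])

The object layer here is purely a convenience over the stateless layer, which is why the wrapping can happen at any later point without touching the algorithms themselves.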

mratsim commented 6 years ago

For the pipelines, I have a Python implementation up-and-running with a procedural API.

It works quite well and is flexible (it can deal with dataframes, Numpy, databases), scalable, and very fast even in Python, with proper experiment logging for reproducibility.

Teaser for the current HomeCredit-Default-Risk competition. ~350k training observations, multiple CSVs including some with over 10M rows and hundreds of columns.

[screenshot attachment: 2018-05-22_14-38-21]

Anyway @metasyn: go ahead for the algorithms part.

metasyn commented 6 years ago

I see - sounds great, and thanks for the clarification. I do have a few more questions regarding design before I really dig into anything (especially being brand new to nim)...

re: bluenote10's comment-

That is exactly my experience as well, and I've moved away from fit/predict/transform to a simpler, but more flexible interface. This interface is somewhat specific to Python (or rather dynamic typing), so I would have to think if this makes sense in Nim as well.

Could you expand on what you mean by a more flexible interface?

another:

In general, I think the choice of pipeline structure is somewhat independent of implementing the algorithms.

That makes sense to me. When I was thinking through how I would add a very simple algorithm, e.g. kmeans clustering, I started wondering whether there would be an object estimator akin to sklearn/mllib, or whether the model itself would merely be an object containing the relevant information about the model (in the kmeans example: k, the centroids, the distance metric, etc.)...

Reading your comments, though, I wasn't completely sure what kind of design you're really interested in having, and I want to make sure I'm following along...

For example, when I look at the PCA implementation in arraymancer, it's a single proc, meaning there is no state to save and no side effects, which precludes using a PCA "model" to transform new data without recomputing the principal components. It seems useful, at least at some point, to have the components saved (or be able to save them).

So, to go back to the kmeans example: if I copied the pca proc, it would be the equivalent of sklearn's fit_transform, while throwing away the estimator.
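
For what it's worth, the "model is merely an object holding the fitted state" idea could look roughly like this (a sketch with made-up names, over plain seqs rather than Tensor, showing only the assignment step of k-means):

import std/math

type KMeansModel = object
  k: int
  centroids: seq[seq[float]]   # k centroids, each of dimension d

proc squaredDist(a, b: seq[float]): float =
  for i in 0 ..< a.len:
    let d = a[i] - b[i]
    result += d * d

# The equivalent of "predict": only the saved model is needed,
# nothing gets refitted.
proc assign(model: KMeansModel, point: seq[float]): int =
  var best = Inf
  for i, c in model.centroids:
    let d = squaredDist(point, c)
    if d < best:
      best = d
      result = i

A fit proc would then simply return a KMeansModel, which keeps the low-level code procedural while still letting callers hold on to the fitted state.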

Thanks for helping me understand how to contribute! :)

mratsim commented 6 years ago

It's not undesirable; I just went for the lowest-hanging fruit.

I could see the following in the future:

proc pca*[T: SomeReal](x: Tensor[T], nb_components = 2, transform: static[bool] = true): Tensor[T] =
  ## If transform: output the dimensionality reduced tensor
  ## Else output the first n-principal components feature plane
  ...

proc pca*[T: SomeReal](x, principal_axes: Tensor[T]): Tensor[T] =
  ## Project a tensor on its principal axes
  ...

Basically, I prefer to have a low-level layer using stateless procs.
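
To make the layering concrete, here is an editor's sketch of a stateful wrapper that could sit on top of the two stateless pca procs sketched above. PcaModel, fit and transform are hypothetical names and, like the procs above, this is pseudocode rather than Arraymancer's actual API:

# Assumes Arraymancer's Tensor type and the sketched pca overloads above.
type PcaModel[T] = object
  principal_axes: Tensor[T]

proc fit[T: SomeReal](m: var PcaModel[T], x: Tensor[T], nb_components = 2) =
  ## Delegate to the stateless proc, keeping only the fitted axes.
  m.principal_axes = pca(x, nb_components, transform = false)

proc transform[T: SomeReal](m: PcaModel[T], x: Tensor[T]): Tensor[T] =
  ## Project new data by reusing the stateless two-tensor overload.
  pca(x, m.principal_axes)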

mratsim commented 6 years ago

FYI, I've published my Python ML architecture for competitive data science: https://github.com/mratsim/home-credit-default-risk.

It focuses on processing and iteration speed, maintainability, and scalability.

Highlights: