mratsim / Arraymancer

A fast, ergonomic and portable tensor library in Nim with a deep learning focus for CPU, GPU and embedded devices via OpenMP, Cuda and OpenCL backends
https://mratsim.github.io/Arraymancer/
Apache License 2.0

contribution: scikit-learn style machine learning estimators? #238

[Open] metasyn opened this issue 6 years ago

metasyn commented 6 years ago

Hi All (& @bluenote10 / @mratsim specifically),

I just started looking into Nim more seriously, and this project really caught my eye. So to start, I wanted to say nice work so far :)

I was curious as to whether or not the maintainers would be open to contributions of scikit-learn style estimators? It seems like there is a lot of room for arraymancer to grow. At first, I considered making a separate repo, since arraymancer seemed more akin to numpy than scikit-learn, and there is value in separation. However, after looking through more examples (e.g. PCA, or a least squares solver) I figured I shouldn't make so many assumptions, and that it might make more sense to add to this repo directly rather than making a separate project?

So, to summarize, is there openness or willingness to expand this library to support higher level and varied machine learning estimators similar to those in torch/scikit-learn/spark specifically in this repo? Or would a separate repo/project be better?

Best, Xander

mratsim commented 6 years ago

Hello Xander and welcome here,

If you mean adding random forests, Naive Bayes and co, then yes, I plan to add them in the future.

Regarding the API for estimators (random_forests, ...), I am still not sure which way to go between a procedural API and an object API (fit + transform + predict). The latter could also be added to the current neural network to give it an even more Keras-like feel.

For encoders and pipelines (binarizers, label encoder, one-hot encoder, ...), I will definitely not use the Scikit-Learn API, as it's not suitable for competitive data science.

bluenote10 commented 6 years ago

+1 would be nice to have a few standard algorithms implemented on top of Arraymancer.

For encoders and pipelines (binarizers, label encoder, one-hot encoder, ...), I will definitely not use Scikit-Learn API as it's not suitable for competitive data science.

That is exactly my experience as well, and I've moved away from fit/predict/transform to a simpler, but more flexible interface. This interface is somewhat specific to Python (or rather dynamic typing), so I would have to think if this makes sense in Nim as well.

Another aspect of pipelines: in general there might be a need for working with heterogeneous data -- in Python terms: where X isn't a numpy array, but rather a pandas dataframe. I started working on dataframes on top of Arraymancer here, but I haven't gotten far yet (mainly lack of time). Probably a topic for a different RFC...

In general, I think the choice of pipeline structure is somewhat independent of implementing the algorithms. Or rather: It should be possible to wrap a procedural API easily into an object API later.
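
As a rough illustration of that last point (a sketch, not an existing Arraymancer API: it uses plain seq[float] and the Nim standard library to stay library-agnostic, and the names Scaler, fit and transform are made up):

import std/[sequtils, stats]

# Low-level, stateless layer: all parameters are passed in explicitly.
proc scale(x: seq[float], mean, std: float): seq[float] =
  x.mapIt((it - mean) / std)

# Thin object layer added afterwards: it only stores the fitted
# parameters and delegates the actual work to the stateless proc.
type Scaler = object
  mean, std: float

proc fit(s: var Scaler, x: seq[float]) =
  s.mean = x.mean
  s.std = x.standardDeviation

proc transform(s: Scaler, x: seq[float]): seq[float] =
  scale(x, s.mean, s.std)

when isMainModule:
  var s: Scaler
  s.fit(@[1.0, 2.0, 3.0, 4.0])
  echo s.transform(@[2.0, 5.0])

The object layer here is purely a convenience over the stateless layer, which is why the wrapping can happen at any later point without touching the algorithms themselves.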

mratsim commented 6 years ago

For the pipelines, I have a Python implementation up-and-running with a procedural API.

It works quite well and is flexible (it can deal with dataframes, Numpy, databases), scalable, and very fast even in Python, with proper experiment logging for reproducibility.

Teaser for the current HomeCredit-Default-Risk competition. ~350k training observations, multiple CSVs including some with over 10M rows and hundreds of columns.

[screenshot attachment: 2018-05-22_14-38-21]

Anyway @metasyn: go ahead for the algorithms part.

metasyn commented 6 years ago

I see - sounds great, and thanks for the clarification. I do have a few more questions regarding design before I really dig into anything (especially being brand new to nim)...

re: bluenote10's comment-

That is exactly my experience as well, and I've moved away from fit/predict/transform to a simpler, but more flexible interface. This interface is somewhat specific to Python (or rather dynamic typing), so I would have to think if this makes sense in Nim as well.

Could you expand on what you mean by a more flexible interface?

another:

In general, I think the choice of pipeline structure is somewhat independent of implementing the algorithms.

That makes sense to me. When I was thinking through how I would add a very simple algorithm, e.g. kmeans clustering, I started wondering whether there would be an object estimator akin to sklearn/mllib, or whether the model itself would merely be an object containing the relevant information about the model (in the kmeans example: k, the centroids, the distance metric, etc.)...

Reading your comments, though, I wasn't completely sure what kind of design you're really interested in having, and I want to make sure I'm following along...

For example, when I look at the PCA implementation in arraymancer, it's a single proc, meaning there is no state to save and no side effects, which precludes using a PCA "model" to transform new data without recomputing the principal components. It seems useful, at least at some point, to have the components saved (or be able to save them).

So, to go back to the kmeans example: if I copied the pca proc, it would be the equivalent of sklearn's fit_transform, while throwing away the estimator.
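
For what it's worth, the "model is merely an object holding the fitted state" idea could look roughly like this (a sketch with made-up names, over plain seqs rather than Tensor, showing only the assignment step of k-means):

import std/math

type KMeansModel = object
  k: int
  centroids: seq[seq[float]]   # k centroids, each of dimension d

proc squaredDist(a, b: seq[float]): float =
  for i in 0 ..< a.len:
    let d = a[i] - b[i]
    result += d * d

# The equivalent of "predict": only the saved model is needed,
# nothing gets refitted.
proc assign(model: KMeansModel, point: seq[float]): int =
  var best = Inf
  for i, c in model.centroids:
    let d = squaredDist(point, c)
    if d < best:
      best = d
      result = i

A fit proc would then simply return a KMeansModel, which keeps the low-level code procedural while still letting callers hold on to the fitted state.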

Thanks for helping me understand how to contribute! :)

mratsim commented 6 years ago

It's not undesirable; I just went for the lowest-hanging fruit.

I could see the following in the future:

proc pca*[T: SomeReal](x: Tensor[T], nb_components = 2, transform: static[bool] = true): Tensor[T] =
  ## If transform: output the dimensionality reduced tensor
  ## Else output the first n-principal components feature plane
  ...

proc pca*[T: SomeReal](x, principal_axes: Tensor[T]): Tensor[T] =
  ## Project a tensor on its principal axes
  ...

Basically, I prefer to have a low-level layer using stateless procs.
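
To make the layering concrete, here is an editor's sketch of a stateful wrapper that could sit on top of the two stateless pca procs sketched above. PcaModel, fit and transform are hypothetical names and, like the procs above, this is pseudocode rather than Arraymancer's actual API:

# Assumes Arraymancer's Tensor type and the sketched pca overloads above.
type PcaModel[T] = object
  principal_axes: Tensor[T]

proc fit[T: SomeReal](m: var PcaModel[T], x: Tensor[T], nb_components = 2) =
  ## Delegate to the stateless proc, keeping only the fitted axes.
  m.principal_axes = pca(x, nb_components, transform = false)

proc transform[T: SomeReal](m: PcaModel[T], x: Tensor[T]): Tensor[T] =
  ## Project new data by reusing the stateless two-tensor overload.
  pca(x, m.principal_axes)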

mratsim commented 6 years ago

FYI, I've published my Python ML architecture for competitive data science: https://github.com/mratsim/home-credit-default-risk.

It focuses on processing and iteration speed, maintainability, and scalability.

Highlights: