Strided containers - Githubissues

superbobry commented 8 years ago

Have you considered making Vector.t strided (c.f. NumPy)?

The major benefits of this approach are:

Memory efficiency. No pointer chasing for whole-matrix operations.

Code reuse between Vectors and Matrices, e.g.

(** Converts a matrix to a row-major vector in O(1) time. *)
let flatten : t -> Vector.t = ...

let add m1 m2 = Vector.add (flatten m1) (flatten m2)

Update: I probably should've added this as a comment to #27.

rleonid commented 8 years ago

The short answer is not Vector.t specifically. Vector.t (and Matrices) is a stop gap that is woefully insufficient for what I actually want for a linear algebra and statistical analysis supporting set of data structures.

The long answer is yes and I was working on implementing this in Bau. It wasn't a show stopper but I quickly arrived at the conclusion that one needed ppx rewriters to achieve fast performance on BigArray iteration. My next step was going to be in creating a variant layer that would manage data representation with minimal copying. But that is a very big task that I didn't (and unfortunately still don't) have the bandwith to tackle.

Perhaps this is a good spot as any to write down my thoughts and experiences on the subject. I think there are several issues in this space, some of which interact in various ways:

Data representations layer. I chose float arrays out of convenience, but depending upon the situation (size, convenience) one can make arguments for BigArrays or float lists. I'm still not certain whether the library should remain flexible (operate over different inputs), strict (one type and everything is converted into that type) or intelligent (take as input lots of different types, and copy when/if needed). My inclination is for an "intelligent" (my naming of that approach exposes my bias) approach, but how to balance the features (and speed) found in a vector based language (ex. J) versus one that is expressive of statistical problems (ex. R) is still unknown to me. The other complications in this area are:
- Modular Implicits - Would obviate functorizing operations, but unknown time frame.
- Flambda - Seems like the way of the future, how should it effect the design?
- Oml's target audience - there is some interest in separating the capabilities in the library away from the low level Bigarray, Lacaml stuff. To me this seems pressing and perhaps a good argument for sticking with float array's at the moment.
Data operations. Linear Algebra and Statistical analysis (which are usually just folds over a set of floats. I want these things to have more of an OCaml/EDSL style (transposing a matrix 2x should be a no-op) and they should be agnostic to the actual representation (computing the mean should be an operation that can take an float array, float list ... etc).

All of this shouldn't discourage you from submitting a PR; working code is infinitely more valuable than no code. I just want to share my thinking on the subject.

dbuenzli commented 8 years ago

A bit OT but note that things like flambda are not available in bytecode which is what js_of_ocaml starts from.

Which leads me to something I wanted to ask for a long time, how much of oml can be compiled to the browser with js_of_ocaml ? E.g. bigarrays are available there, but I doubt lacaml stuff can be brought there.

rleonid commented 8 years ago

I did not realize that bigarrays are available on js_of_ocaml! But you're also pointing at another subtle issue of bytecode vs native.

I've reserved a bit of time this and next week to see how much of oml I can built without the lacaml, lbfgs; An OCaml only oml. This will precipitate even more build system hacking, but hopefully an elegant solution will materialize.

rleonid commented 8 years ago

@dbuenzli If you have a chance take a look at oml-lite. Some of the big stuff (ex. multivariate and logistic regression) is missing. But it is viable, and hopefully the framework for iteration has been laid.

dbuenzli commented 8 years ago

@rleonid Caveat, I have little knowledge on how oml is structured (I never used it, yet). However I do have the impression that by extracting oml-lite the way you did through cppo means 1) You are going to live a miserable build life in the long term 2) Other third-party libraries wanting to build on top of oml will have to choose between oml-lite and oml and the final user of the third-party library may not agree with the choice the latter did.

My approach would be to rather try to reorganize the API so that oml-lite shows up naturally as a single library with others libraries gradually mixing in the C and fortran dependencies. The whole could then be distributed through a single package, the sub-libraries letting end-user control the amount of C or fortran they want to bring in their code base.

rleonid commented 8 years ago

@dbuenzli Thank you for your feedback. I've replied here https://github.com/hammerlab/oml/issues/173, so maybe we can keep these discussions organized.

rleonid / oml

Strided containers #166