libmir / mir-algorithm

Dlang Core Library
http://mir-algorithm.libmir.org

Numpy bindings #43

Closed EelcoHoogendoorn closed 6 years ago

EelcoHoogendoorn commented 7 years ago

Just found this package, and I must say I love the concept of the ndslice; it really raises the bar for functional numerical computing, I'd say. Best concept since the ndarray itself!

One thing I was wondering about; I found this project while searching for numpy-D bindings; and I don't see any projects to that end yet; nor do I see, for example, matplotlib getting ported to D anytime soon. As much as I like D and the ndslice, I personally struggle to think of any projects that I work on that would get a net productivity gain from doing them in D instead of python.

So my question is; would you consider such functionality to fit within the scope of this project? It certainly would generate a lot of interest for this repo, if all the pieces were in place to make D-ndslice extensions to python code a frictionless affair.

9il commented 7 years ago

Sure. I thought about this before. This would be an awesome addition! You can count on my support on the D side. I can write basic D functions if you describe their API. BTW, you may be interested in PyD and its related issue https://github.com/ariovistus/pyd/issues/53 .

mir-algorithm is a source library; it contains no code that needs to be compiled separately. I can create a mir-numpy project, or we could use the PyD repo. Both variants are good.

9il commented 7 years ago

All generic algorithmic code can be placed in mir-algorithm

EelcoHoogendoorn commented 7 years ago

Glad to hear you share that interest. I used to work with D extensively, but that was about 10 years ago, and I got lured away by languages with a better ecosystem. I'm just looking to get back into it, so I'm not well-versed enough to comment on the D side of things at present, but glad to see you have that covered. Feel free to point out where I make faulty assumptions.

Here are some possible sources of inspiration:

Boost recently upgraded its numpy support, for instance: https://github.com/boostorg/python/blob/master/example/numpy/gaussian.cpp

Then there is xtensor-python, which has a more ndslice-like intent, and a binding system derived from (but independent of) boost-python: https://github.com/QuantStack/xtensor-python

xtensor-python uses setuptools to build things; there is also https://pypi.python.org/pypi/setuptools-rust. It seems pyd has us covered with the same type of functionality, so no need to reinvent that wheel: http://pyd.readthedocs.io/en/latest/distutils.html

wrt memory management: I've rolled my own numpy-C++ solution in the past, and back then it was most convenient to do all allocations through the numpy memory pool; but the discussion on your pyd issue seems to indicate otherwise, and I can imagine the D GC making it a different affair indeed.

API-wise there is a thing or two to consider... ndslice is itself not a container, and all examples I have seen that write a slice do so to a regular contiguous array. This may be alright for passing data from numpy to D; just get an ndslice that views the numpy array by copying the strides and such. But it is not so clear-cut how to pass ndslices back to numpy.
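The numpy side of that view-wrapping is visible from Python itself: `__array_interface__` exposes exactly the pointer, shape, and byte strides an ndslice wrapper would copy. A minimal illustration (nothing here is D- or mir-specific):

```python
import numpy as np

# A strided, non-contiguous view: every second column.
a = np.arange(12, dtype=np.float64).reshape(3, 4)[:, ::2]

iface = a.__array_interface__   # the C-level description numpy exposes
ptr, readonly = iface['data']   # raw address + writability flag

assert iface['shape'] == (3, 2)
assert iface['strides'] == (32, 16)  # strides are in bytes (itemsize is 8)
assert not readonly
```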

Mapping a D-ndarray container to a numpy-ndarray is a much more well-defined problem. Passing ndslices between languages, by contrast... I wouldn't know where to start with that. And having implicit conversions at the language interface strikes me as a can of worms as well. As someone once put it: the real power of numpy does not lie in any of the algorithms it contains, but in the uniform interface it provides through the ndarray. That is what really makes the ecosystem fly.

So that pleads in favor of implementing a simple ndarray container in D. Passing that to and from numpy should be easy. And I imagine it shouldn't be too much work in D; it really does not need to implement any algorithms, or even methods; it shouldn't be seen as an attempt to rebuild numpy in D. It just needs to store a pointer, strides, and shape, and needs to handle ownership of memory. By explicitly converting from ndslice to that format before passing to numpy, things should be quite clear.

Alternatively, one could enforce a fixed layout at the numpy-D boundary: check at runtime for C-contiguity when passing into D, and always return contiguous arrays. That wouldn't be a huge restriction; most of the time you can still cross the language boundary without making a copy, and if you are super hung up on avoiding unnecessary traversal of memory you shouldn't be using numpy in the first place. But arbitrary striding is about more than just performance; it can also have functional uses (needing to modify a strided view in D, for instance). And I think implementing the numpy-ndarray interface on the D side really shouldn't be that hard.
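That contiguity rule can be sketched in numpy terms; `to_c_contiguous` is a hypothetical helper name, not part of any existing binding:

```python
import numpy as np

def to_c_contiguous(arr):
    # Enforce the proposed boundary rule: C-contiguous layout only,
    # copying just when the input is actually strided.
    if not arr.flags['C_CONTIGUOUS']:
        return np.ascontiguousarray(arr)
    return arr

a = np.arange(12).reshape(3, 4)
assert to_c_contiguous(a) is a            # contiguous input: no copy

v = a[:, ::2]                             # strided view
c = to_c_contiguous(v)
assert c is not v and c.flags['C_CONTIGUOUS']
assert (c == v).all()
```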

The design (not the execution per se) of the declarative interface to map numpy-ndarrays to D-ndarrays should then be a trivial affair as well; if boost can do it, so can D. It is just a matter of copying the pointer, shape, and strides (and mapping between numpy's runtime mutability flags and D's type system?), and making sure we don't do anything dumb with refcounts.

Do you agree that such a barebones D-ndarray makes sense as an intermediary? Chances are I am missing a ton of subtleties about the design / intent of D and ndslice that make this a terrible idea.

EelcoHoogendoorn commented 7 years ago

Note that my thinking about this is motivated exclusively by a desire to write D extensions to python programs; I haven't really considered the other direction yet. But I think the same arguments apply, just in reverse.

When passing from D to numpy, we may simply write ndslices to contiguous storage and construct a numpy view around it; but we lose any functionality that relies on a nontrivial pattern of aliasing between the input arguments. Accepting the arbitrary-layout numpy return values as ndslices should be easy again.

Again, having a barebones D-ndarray would solve these issues, giving us the liberty to pass exactly the kind of aliasing we intend to the numpy function. (Not exactly, actually; numpy strides are in bytes, not elements, so unless ndslice goes that way as well, those kinds of conversions should be a runtime error when they don't match; but I haven't found a use for overlapping array elements yet, so I'd be content with the runtime error.)
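The bytes-versus-elements mismatch is easy to demonstrate from the numpy side; `element_strides` here is a hypothetical conversion that fails at runtime in exactly the case described (byte strides not divisible by the item size, i.e. overlapping elements):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def element_strides(arr):
    # numpy strides are in bytes; convert to element strides,
    # raising at runtime when they do not divide evenly.
    item = arr.itemsize
    if any(s % item for s in arr.strides):
        raise ValueError("byte strides are not a multiple of itemsize")
    return tuple(s // item for s in arr.strides)

a = np.zeros((3, 4), dtype=np.float64)  # itemsize 8, C order
assert a.strides == (32, 8)             # byte strides
assert element_strides(a) == (4, 1)     # element strides

# A view whose elements overlap cannot be expressed in element strides.
overlap = as_strided(a, shape=(3, 4), strides=(32, 4))
try:
    element_strides(overlap)
except ValueError:
    pass  # exactly the runtime error argued for above
```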

Not sure I fully grok the memory management yet. The issue you linked makes it sound trivial, but I can't say I have fully convinced myself that it is.

Another argument in favor of an ndarray container type that springs to mind is use cases such as serialization of arrays. For instance, I added this functionality to jsonpickle: https://github.com/jsonpickle/jsonpickle/pull/145

Marshaling and serialization are pretty closely related, and correctly serializing/deserializing ndarrays is a similar problem to crossing a language boundary. Any kind of forced or implicit copy to accommodate the limitations of your mapping will impose limits on functionality.
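As an illustration of that marshaling/serialization parallel, here is a minimal round-trip in the spirit of the jsonpickle handler linked above (a sketch, not the actual jsonpickle code); note the forced `ascontiguousarray` copy, which is exactly the kind of layout loss being discussed:

```python
import json
import numpy as np

def dump(arr):
    # The serialized form is always contiguous: a strided view is
    # flattened here, losing its aliasing with the base array.
    return json.dumps({
        "shape": list(arr.shape),
        "dtype": str(arr.dtype),
        "data": np.ascontiguousarray(arr).ravel().tolist(),
    })

def load(s):
    d = json.loads(s)
    return np.array(d["data"], dtype=d["dtype"]).reshape(d["shape"])

a = np.arange(6, dtype=np.int32).reshape(2, 3)
b = load(dump(a))
assert (a == b).all() and b.dtype == a.dtype and b.shape == a.shape
```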

Anyway, what I suppose I am trying to get at: just as a range/slice does not replace all functions of the array, so the ndslice does not replace all functions of the ndarray. Or perhaps we can say that the ndslice replaces the ndarray that does not own its own memory (views). And ndarrays that do own their own memory are indeed always contiguous in numpy. So there already is a complement to all numpy functionality in D, in between its regular multidimensional arrays and ndslices.

However, the semantics of numpy fundamentally hide that distinction (owning vs non-owning ndarray), whereas that currently is not the case in D. I think that is probably the most concise way of stating the crux of mapping between numpy and D. Or maybe not; the analogy between a view and an ndslice as an interface is incomplete: an ndarray may not own its memory, but it always at least refers to it, whereas a lazy D ndslice need not. That's why we can't pass those to a numpy-ndarray; the latter fundamentally always points to memory.

That's enough rambling for now, I suppose... Regardless, I am pretty sure that mimicking the basic ndarray struct on the D side (shape, strides, and memory ownership/referral), such that the language barrier itself is really just the simplest possible direct mapping, makes sense in terms of separation of concerns.

9il commented 7 years ago

So that pleads in favor of implementing a simple ndarray container in D. Passing that to and from numpy should be easy. And I imagine it shouldn't be too much work in D; it really does not need to implement any algorithms, or even methods; it shouldn't be seen as an attempt to rebuild numpy in D. It just needs to store a pointer, strides, and shape, and needs to handle ownership of memory. By explicitly converting from ndslice to that format before passing to numpy, things should be quite clear.

Alternatively, one could enforce a fixed layout at the numpy-D boundary: check at runtime for C-contiguity when passing into D, and always return contiguous arrays. That wouldn't be a huge restriction; most of the time you can still cross the language boundary without making a copy, and if you are super hung up on avoiding unnecessary traversal of memory you shouldn't be using numpy in the first place. But arbitrary striding is about more than just performance; it can also have functional uses (needing to modify a strided view in D, for instance). And I think implementing the numpy-ndarray interface on the D side really shouldn't be that hard.

  1. If one wants to use ndslice, there is no alternative to a runtime error. The number of dimensions is a compile-time argument. Whether we implement an ndarray view or not does not matter for this issue.

  2. Universal ndslices have arbitrary strides for all dimensions. What should we modify in a strided view (is it the Slice structure?)?

9il commented 7 years ago

@EelcoHoogendoorn what do you think about the betterC way for D plugins for numpy?

Then we would not have any DRuntime-related problems: no problems with the GC, and no problems with DRuntime linking. So D plugins would be as simple to distribute as C plugins and could be packed into official Linux repositories.

EelcoHoogendoorn commented 7 years ago

If one wants to use ndslice, there is no alternative to a runtime error. The number of dimensions is a compile-time argument. Whether we implement an ndarray view or not does not matter for this issue.

I am not sure what you mean by 'if one wants to use ndslice'; indeed, I think passing an ndslice from D to python is out of the question; numpy would have no idea what to do with it. They'd have to be mapped either to standard D arrays first, or to a numpy-like ndarray. numpy would have no idea what to do with an 'iota', for instance.

Universal ndslices have arbitrary strides for all dimensions. What should we modify in a strided view (is it the Slice structure?)?

Again, I am not quite sure I follow. Indeed, going from python to D should be easy; wrapping an ndarray in an ndslice should be no harder than wrapping a D array in an ndslice; no need to modify anything about the strides, they can just be copied straight away. But I do not think this direct approach is ideal.

That is, since we have to use a container type (not a range or view) for passing ndslices from D to numpy, I think the cleanest solution is to use that same container type when going from numpy to D as well. The only inconvenience is having to call .sliced on the object you get from numpy, but this makes the interface more symmetric, and also more general for users who do not care for ndslice semantics. If pyd did the mapping from numpy ndarray to a basic D-ndarray (just copying the pointer, shape, and strides over), then the only thing mir would have to concern itself with is writing ndslices to such pyd-ndarrays, and making sure .sliced worked on them as intended.

EelcoHoogendoorn commented 7 years ago

@EelcoHoogendoorn what do you think about the betterC way for D plugins for numpy?

Then we would not have any DRuntime-related problems: no problems with the GC, and no problems with DRuntime linking. So D plugins would be as simple to distribute as C plugins and could be packed into official Linux repositories.

I am not familiar with betterC; but as far as integrating with python is concerned, I was thinking about creating a cross-platform conda package for the D compiler and for pyd; much the same as you would build and link a cpp extension using a cpp compiler and boost-python.

What specific problems do you foresee for DRuntime linking?

9il commented 7 years ago

What specific problems do you foresee for DRuntime linking?

They are described here in point number 2 https://gist.github.com/ximion/fe6264481319dd94c8308b1ea4e8207a

9il commented 7 years ago

BetterC is a D subset that does not require DRuntime to be linked.

9il commented 7 years ago

If one wants to use ndslice, there is no alternative to a runtime error. The number of dimensions is a compile-time argument. Whether we implement an ndarray view or not does not matter for this issue.

I am not sure what you mean by 'if one wants to use ndslice'; indeed, I think passing an ndslice from D to python is out of the question; numpy would have no idea what to do with it. They'd have to be mapped either to standard D arrays first, or to a numpy-like ndarray. numpy would have no idea what to do with an 'iota', for instance.

Yes, we may introduce a special type for interaction. BTW, I do not think we should provide special logic for iota, for example. Let's restrict the API to the types that are supported in numpy.

9il commented 7 years ago

That is, since we have to use a container type (not a range or view) for passing ndslices from D to numpy, I think the cleanest solution is to use that same container type when going from numpy to D as well.

LGTM

If pyd did the mapping from numpy ndarray to a basic D-ndarray (just copying the pointer, shape, and strides over), then the only thing mir would have to concern itself with is writing ndslices to such pyd-ndarrays, and making sure .sliced worked on them as intended.

It copies the data, if I am not wrong.

A low-level API from scratch, without PyD, is preferable. You wrote that you would like to use D plugins in Python; let's consider only this direction. PyD has a very high-level API. I am not against a high-level API, but a low-level API should exist first. And this should be a small betterC (D subset) library for low-level interaction with python.

EelcoHoogendoorn commented 7 years ago

Yes, we may introduce a special type for interaction. BTW, I do not think we should provide special logic for iota, for example. Let's restrict the API to the types that are supported in numpy.

Exactly; the only thing going in and out of python (and thus coming in and out of D) should be a struct with a pointer and shape and stride tuples, because that's the only thing numpy knows how to deal with.

I think the main question on the D side is: do we wish to implicitly convert this format to and from ndslice, or do we want the conversion between ndarray and ndslice to be explicit on the D side? I favor the latter.

EelcoHoogendoorn commented 7 years ago

It copies the data, if I am not wrong.

Indeed, plain array types are copied, as I understand it; but when I say D-ndarray I do not mean a D multidimensional array, but rather a mimic of the numpy-C-ndarray datatype. Sorry for the miscommunication.

What I mean is something like this:

struct ndarray {
    void* data;      // borrowed pointer into the numpy buffer
    int[] shape;     // extent of each dimension
    int[] strides;   // per-dimension strides (numpy counts these in bytes)
}

Just a mirror of the numpy-C-struct, which does not require any copying of the data being pointed to.

Or actually, there might be a difference; one of the questions that pops up here is where to place the boundary between numpy's dynamically typed ndarrays and D's statically typed ndslices. Probably best to do the remapping to static types here too: make the type of the data pointer generic, perform the necessary runtime checks to verify that the dtype passed in from numpy is indeed the same type as demanded by the static type in the D interface declaration, and then map the static type back to a dtype attribute when going back to python.
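From the Python side, that dtype check would look something like the sketch below (`check_dtype` is a hypothetical helper; a real binding would do the equivalent on the D side, derived from the static element type of the wrapped function):

```python
import numpy as np

def check_dtype(arr, expected):
    # Runtime guard at the boundary: the statically typed D side accepts
    # exactly one dtype, so anything else is rejected rather than
    # silently cast.
    if arr.dtype != np.dtype(expected):
        raise TypeError(f"expected {np.dtype(expected)}, got {arr.dtype}")
    return arr

check_dtype(np.zeros(3), np.float64)        # float64 in, float64 expected: OK

try:
    check_dtype(np.zeros(3, dtype=np.int32), np.float64)
except TypeError:
    pass  # mismatch rejected at the boundary
```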

9il commented 7 years ago

I think the main question on the D side is: do we wish to implicitly convert this format to and from ndslice, or do we want the conversion between ndarray and ndslice to be explicit on the D side? I favor the latter.

LGTM, let's create the latter case first.

struct NumpyArray {
    size_t n;          // number of dimensions
    size_t* shape;
    size_t* strides;
    void* data;
}

(we could use the same field order as numpy's own struct; I do not know it offhand)

Or actually, there might be a difference; one of the questions that pops up here is where to place the boundary between numpy's dynamically typed ndarrays and D's statically typed ndslices. Probably best to do the remapping to static types here too: make the type of the data pointer generic, perform the necessary runtime checks to verify that the dtype passed in from numpy is indeed the same type as demanded by the static type in the D interface declaration, and then map the static type back to a dtype attribute when going back to python.

We can place the boundary and all checks on the D side. D compile-time reflection is very powerful now, and we can start from the lowest possible API and write library functions to automate the bindings.

Laeeth commented 7 years ago

"One thing I was wondering about; I found this project while searching for numpy-D bindings; and I don't see any projects to that end yet; nor do I see, for example, matplotlib getting ported to D anytime soon. As much as I like D and the ndslice, I personally struggle to think of any projects that I work on that would get a net productivity gain from doing them in D instead of python."

Just so you know, it's pretty easy to call matplotlib from D. And you can even write Python in one cell of an IPython notebook, D in another, and have them call each other. A certain Mr Yaroshenko wrote about it here (second link): https://github.com/DlangScience/PydMagic https://d.readthedocs.io/en/latest/examples.html#plotting-with-matplotlib-python

In some of my work I embed Python script code in D as strings. Then I replace them later with pure D when I have more time.

I recognise this is at a tangent to your main point, but perhaps it is interesting.

wilzbach commented 7 years ago

"One thing I was wondering about; I found this project while searching for numpy-D bindings; and I don't see any projects to that end yet; nor do I see, for example, matplotlib getting ported to D anytime soon. As much as I like D and the ndslice, I personally struggle to think of any projects that I work on that would get a net productivity gain from doing them in D instead of python."

For examples using Matplotlib in D, you might be interested in e.g. this very crude histogram wrapper that I wrote last summer:

https://github.com/libmir/mir-random/blob/master/examples/flex_plot/flex_common_pack/flex_common/hist.d

It's nothing special, but I hope it does show that using Python/Matplotlib in D is quite easy.

nor do I see, for example, matplotlib getting ported to D anytime soon.

There have been some small-scale approaches in the past, e.g. matplotlib-d or my silly matplotd. There's also ggplotd, a pure D plotting library with a ggplot2-inspired API. FYI: I do intend to do a proper, general-purpose version of matplotd during my next semester holidays (this August & September).

Laeeth commented 7 years ago

@wilzbach seb if you get this please see email. if you don't see an email from me then some routing problem and let's speak on phone. Laeeth.

Laeeth commented 7 years ago

"FYI: I do intend to do a proper, general-purpose version of matplotd during my next semester holidays (this August & September)" +1 :)

9il commented 6 years ago

Low-level solution: http://docs.algorithm.dlang.io/latest/mir_ndslice_connect_cpython.html

High-level solution: https://github.com/ShigekiKarita/mir-pybuffer