Closed: MichaelRinger closed this issue 10 months ago.

The module is useful for machine-learning related applications. I would be glad if someone could add the `np.random.randn` function, since I don't know how to do it in C.
Do you need only the `random.randn` function? I would actually go for `random.standard_normal`, and wrap it with `random.randn`, as is done in numpy. See also the comments in Quick start.
Atm I only need the `random.randn` function. Yeah, creating `random.standard_normal` and adding the alias `random.randn`, like it's done in numpy, is better.
I would vote for the new nomenclature in https://numpy.org/doc/stable/reference/random/generator.html#numpy.random.Generator.
However, a more fundamental question is what we should do with the `None` seed:

> If None, then fresh, unpredictable entropy will be pulled from the OS.

Where should we pull this information from, if there is no underlying OS?
There already is a function `random.seed(n=None)` in micropython's random standard library that generates a seed by getting a random number from the hardware. When no argument (or `None`) is passed in, it will (if supported by the port) initialise the PRNG with a true random number (usually a hardware-generated random number). The `None` case only works if `MICROPY_PY_URANDOM_SEED_INIT_FUNC` is enabled by the port; otherwise it raises `ValueError`.
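
For illustration, this is how the seeding described above behaves (a sketch based on the description; `random.seed()` with no argument only works on ports that enable `MICROPY_PY_URANDOM_SEED_INIT_FUNC`):

```python
import random

random.seed()    # seed the PRNG from the hardware RNG, if the port supports it;
                 # raises ValueError otherwise
random.seed(42)  # deterministic seeding, available on every port
```
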
OK, but if that is the case, then you can already use this via the `vectorize` method. Is that prohibitively expensive?
Another question is, if we were to re-implement this in ulab, what would happen if a particular port does not support a hardware random number generator?
@MichaelRinger Do you have comments on the second question?
> @MichaelRinger Do you have comments on the second question?

I am using the vectorize method of numpy in combination with the base random func in micropython to solve this. Thanks for the help.
@MichaelRinger Did you close this by accident? I was actually going to add the module...
Hi @v923z, sorry to bump this, did you end up adding the module? Can't seem to find it.
@ammar-bitar No, I haven't added it, because the OP closed the issue, and there was no response to the question after that. Do you need the module? If so, which methods? It would not be too much work to add, but it was not clear to me whether there is interest.
I'd like to upvote this. I think numpy is half crippled without a vectorized RNG.
The micropython `random` module is a usable workaround, only when speed is not an issue. Not sure, though, which RNG has decent entropy while being small enough in code and working-memory size.
Specifically the ESP32 that I'm interested in has a hardware RNG that according to the docs yields 32-bit integers in <100 CPU cycles.
micropython uses yasmarang, which fares quite well in random number tests, and is cheap.
I opened a discussion some time ago here: https://github.com/orgs/micropython/discussions/10550, and I think I could pick up the thread from there, and implement the module in a reasonable time.
In the meantime, as suggested above, you can use the `vectorize` function. That could easily save you a factor of two in execution time.
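
Something along these lines should work (a sketch, assuming ulab's `vectorize` mirrors numpy's elementwise application; the wrapped function ignores its input and just draws one number per element):

```python
from ulab import numpy as np
import random

# wrap the scalar PRNG so it can be applied elementwise
fill_random = np.vectorize(lambda _: random.random())

a = fill_random(np.zeros(100))  # 100 uniform floats in [0, 1)
```

The remaining cost is one Python-level call per element, which is why this is only a stopgap.
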
Thanks. I wasn't aware of `vectorize`.
As for yasmarang, I think it really is too spartan... a 32 bit state is not enough, even for a microcontroller. Especially seeing as in the other thread you were worried about quality (of seeding).
I think that any hardware capable of running a python-like interpreter with per-object overhead and a GC can afford a few 10s of bytes for an RNG with several 100 bits of state. Even more so if the system is capable of holding `ndarray`s. No need to be frugal about it.
If you do end up implementing an RNG, I implore you to forget micropython's poor choice of yasmarang and consider the much more modern xoroshiro, that has a 32-bit variant with a 128-bit state and a 64-bit variant with a 256-bit state, useful for the fortunate who have double precision floats.
Looking at the `next()` function in the xoroshiro source, it looks like code size is comparable to yasmarang, and can be tuned with the choice of inlining and constant folding.
On a microcontroller, the sub-ns/bit performance they promise is not attainable, but it will still be fast.
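
To show just how small it is, here is the `next()` step of xoroshiro128+ transcribed to Python (a sketch; the rotation constants 24/16/37 are from what I believe is the current reference implementation, and should be verified against the official source before use). A transcription like this is also handy for generating reference sequences on a PC, and passing the state in explicitly is what makes multiple independent generators cheap:

```python
MASK64 = (1 << 64) - 1

def rotl64(x, k):
    return ((x << k) | (x >> (64 - k))) & MASK64

def xoroshiro128plus_next(state):
    # state is a two-element list of 64-bit ints: 128 bits in total
    s0, s1 = state
    result = (s0 + s1) & MASK64
    s1 ^= s0
    state[0] = rotl64(s0, 24) ^ s1 ^ ((s1 << 16) & MASK64)
    state[1] = rotl64(s1, 37)
    return result
```
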
Also I would advise against taking micropython's (following original Python's) path of having one global static generator state. If we already paid the code size overhead, why hardwire it to a single piece of memory...
Thank you so much for the effort you put into `ulab`.
> As for yasmarang, I think it really is too spartan... a 32 bit state is not enough, even for a microcontroller. Especially seeing as in the other thread you were worried about quality (of seeding).

OK, then I would start here https://www.pcg-random.org/.
> I think that any hardware capable of running a python-like interpreter with per-object overhead and a GC can afford a few 10s of bytes for an RNG with several 100 bits of state. Even more so if the system is capable of holding `ndarray`s. No need to be frugal about it.
>
> If you do end up implementing an RNG, I implore you to forget micropython's poor choice of yasmarang and consider the much more modern xoroshiro, that has a 32-bit variant with a 128-bit state and a 64-bit variant with a 256-bit state, useful for the fortunate who have double precision floats.

I think the actual choice of the RNG is irrelevant from the viewpoint of implementation effort. If you don't like a particular RNG, we can always replace it later by a better one. But could we use https://github.com/numpy/numpy/blob/3d8cfd68d8df1af4e25e8dffd0a155fda20dc770/numpy/random/src/pcg64/pcg64.c#L64 as a starting point?
> Also I would advise against taking micropython's (following original Python's) path of having one global static generator state. If we already paid the code size overhead, why hardwire it to a single piece of memory...

Again, I feel that this is a minor implementation detail that we could change, if necessary. Do you happen to know how numpy deals with the question?

Also, do you need distributions, too, or just a uniform RNG? Which methods of the `random` module would be important in your application?
> I think the actual choice of the RNG is irrelevant from the viewpoint of implementation effort. If you don't like a particular RNG, we can always replace it later by a better one. But could we use https://github.com/numpy/numpy/blob/3d8cfd68d8df1af4e25e8dffd0a155fda20dc770/numpy/random/src/pcg64/pcg64.c#L64 as a starting point?

You are right, of course, about the actual RNG being immaterial to the `ulab` code that provides the user-facing functionality. The problem with PCG specifically is that at its heart are 64-bit by 64-bit multiplications, used to implement even larger 128-bit multiplications. This is not cheap for microcontrollers that may struggle to get 32-bit multiplications done quickly. XOR-based RNGs are cheap on virtually any hardware.
Anyway, as you said, the RNG itself is interchangeable, and you can also choose to support multiple and have them compiled selectively via a configuration file for code size efficiency (see `BitGenerator` below).
> Again, I feel that this is a minor implementation detail that we could change, if necessary.

True again. But change after deployment is itself expensive. Your discussion thread in the micropython repo proves just that. There is nothing in the yasmarang and seeding code in micropython that prevents your desired use pattern (obtain random seed) or that prevents using yasmarang with multiple generator states instead of a single static one. But their choice not to export the proper interface makes this later extension difficult.
> Do you happen to know how numpy deals with the question?

Sure. Old `numpy.random` had the `RandomState` class that you could instantiate as many times as you need to create new generator states. This is now considered "legacy".

Modern `numpy.random` revised this (at the painful cost of breaking compatibility; see above, cost of change after deployment) so that there is a clear architectural separation between generating random bits (the `BitGenerator` class) and the subsequent generation of distributions from those bits (the `Generator` class).

The discussion we had above about yasmarang, xoroshiro and PCG is all about `BitGenerator`s. And you can choose which one(s) to implement.
The whole point of the `Generator` class is to be invariant to the `BitGenerator`. And the nice thing is that the `ulab` version of `Generator` can start narrow and get expanded later as needed. I doubt many microcontroller users will find the `dirichlet` function necessary.

I would start with the following `Generator` methods:

- `.integers()`
- `.random()`, the classic [0.0, 1.0)
- `.bytes()`, is essentially the same as `.integers()`, possibly slightly more efficient
- `.normal()`, is slightly harder and usually twice the effort of `.random()`
- `.uniform()` is a minor extension of `.random()`
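
For concreteness, this is what those calls look like in numpy's modern API; presumably the `ulab` versions would mirror these signatures (a sketch, not a final interface):

```python
import numpy as np

rng = np.random.default_rng(1)

rng.integers(0, 10, size=4)     # random ints in [0, 10)
rng.random(size=4)              # uniform floats in [0.0, 1.0)
rng.bytes(16)                   # 16 random bytes
rng.normal(0.0, 1.0, size=4)    # normal deviates, mean 0.0, sigma 1.0
rng.uniform(-1.0, 1.0, size=4)  # uniform floats in [-1.0, 1.0)
```
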
In all of the above, I think it's also important to support the `out=` keyword argument, so that memory allocation is not forced on users who already own some memory and just want some computation done into it.

BTW I think this `out=` feature is generally missing in `ulab`.
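
In numpy the pattern looks like this, and the same shape would make sense in `ulab`:

```python
import numpy as np

rng = np.random.default_rng(1234)
buf = np.empty((5, 5))        # caller-owned memory

rng.random(out=buf)           # filled in place, no fresh allocation
rng.standard_normal(out=buf)  # likewise for normal deviates
```
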
If you feel like being even more generous to users, these are the second tier methods. Useful, but not as often as the ones above, and require a lot more code. I think they are entirely optional in a microcontroller environment:
- `.choice()`
- `.shuffle()`, `.permutation()`, `.permuted()`
- `.binomial()`, `.poisson()`
> You are right, of course, about the actual RNG being immaterial to the `ulab` code that provides the user-facing functionality. The problem with PCG specifically is that at its heart are 64-bit by 64-bit multiplications, used to implement even larger 128-bit multiplications. This is not cheap for microcontrollers that may struggle to get 32-bit multiplications done quickly. XOR-based RNGs are cheap on virtually any hardware.

OK, this is a good point.
> Anyway, as you said, the RNG itself is interchangeable, and you can also choose to support multiple and have them compiled selectively via a configuration file for code size efficiency (see `BitGenerator` below).

Yes, this would be a reasonable approach. This could even be done automatically, based on whether you have floats or doubles on a microcontroller.
> True again. But change after deployment is itself expensive. Your discussion thread in the micropython repo proves just that. There is nothing in the yasmarang and seeding code in micropython that prevents your desired use pattern (obtain random seed) or that prevents using yasmarang with multiple generator states instead of a single static one. But their choice not to export the proper interface makes this later extension difficult.

I understand that, and there is no disagreement in this regard. But we don't have to stick with micropython's interface; in fact, we should conform to numpy's. I would, perhaps, suggest that we hash out the interface first, and settle on the particular generator later.
> I would start with the following `Generator` methods:
>
> - `.integers()`
> - `.random()`, the classic [0.0, 1.0)
> - `.bytes()`, is essentially the same as `.integers()`, possibly slightly more efficient
> - `.normal()`, is slightly harder and usually twice the effort of `.random()`
> - `.uniform()` is a minor extension of `.random()`

OK.
> In all of the above I think it's also important to support the `out=` keyword argument, so that memory allocation is not forced on users who already own some memory and just want some computation done into it. BTW I think this `out=` feature is generally missing in `ulab`.

I agree. You're not the first who noticed this: https://github.com/v923z/micropython-ulab/issues/592
Just out of curiosity, what is your use case? One reason I didn't originally consider the `random` module is that it's hard to test on a microcontroller, given that you can't easily plot/analyse generated data.
My use case is quite trivial, I just need some randomness for real-time graphics with many objects moving on the screen.
This is roughly similar to game programming, where to maintain a reasonable frame rate I need to be quick about computing many object state updates. Without the array processing that `ulab` allows, using Python loops it would probably be way too slow. So `ulab` allows me to maintain states for 100s of tiny objects, with floating point precision, and still have enough time to convert final screen positions to integers and render objects.
So for my use case, just having `.uniform()` is enough. `.normal()` would be nice to have.
> One reason I didn't originally consider the random module is that it's hard to test on a microcontroller, given that you can't easily plot/analyse generated data.

I don't see why you'd need to plot / analyse, and you definitely don't have to do that on-device.
The thing with PRNGs is that you can easily seed them with a known value, generate some sequence data and store it to a file on an SD card. If you want to be really fancy you can send that out over Ethernet or Wifi if so equipped. This data just needs to be bit-identical to the well known PRNG sequence, which can be generated on a laptop/desktop machine from the PRNG's reference code.
As for the `Generator` part of it, you create a `BitGenerator` subclass that produces a pre-stored or known-seed sequence. Feed that into numpy's internal `Generator` class and get reference results.

Then you do the same with the `ulab` implementation of `Generator`, again save that to a file and analyze on a "real" computer. Results need to be identical down to floating point roundoff errors, where applicable.
I think this would be practical to test up to sequences of several 100GB using a cheap SD card and a lot of patience to wait. I'm not sure it's necessary to test large volumes of data. Once a PRNG follows the sequence, it will not stray from it, unless its code is insanely written.
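
A minimal sketch of such a check, assuming the device dumped raw little-endian 32-bit outputs to a file; `reference_next` stands for any PC-side reference generator (both names are illustrative):

```python
import struct

def check_sequence(dump_path, reference_next, n):
    # compare n device outputs bit-for-bit against the reference PRNG
    with open(dump_path, "rb") as f:
        for i in range(n):
            (got,) = struct.unpack("<I", f.read(4))
            expected = reference_next() & 0xFFFFFFFF
            assert got == expected, f"mismatch at output {i}: {got:#x} != {expected:#x}"
```
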
For the `Generator` part of things that deals with floating point, basically what's required is porting the code in numpy's `Generator` from Python/C to micropython-friendly C. There should be no interface or algorithm changes whatsoever, just thinning down to supported methods and dtypes. In testing you may want to ensure that some floating point edge cases (denormalized, infinities, boundaries etc.) are reached, where relevant. But I think for a first version there is not even a need to bother with that.
> My use case is quite trivial, I just need some randomness for real-time graphics with many objects moving on the screen. This is roughly similar to game programming, where to maintain a reasonable frame rate I need to be quick about computing many object state updates.

In that case, the quality of the generator is not really an issue, I guess. In any case, here is a first implementation: https://github.com/v923z/micropython-ulab/tree/random. The `out` keyword is not yet supported, but you can do
```
MicroPython v1.20.0-311-gea1a5e43d on 2023-07-23; linux [GCC 12.2.0] version
Use Ctrl-D to exit, Ctrl-E for paste mode
>>> from ulab import numpy as np
>>> rng = np.random.Generator()
>>> rng.random(12)
array([1.688118533849092e-10, 0.7703293013781235, 0.6074833036730274, ..., 0.2697689388821915, 0.6484892791277977, 0.6955782450470177], dtype=float64)
>>> rng.random(size=(5, 5))
array([[0.7724060907456224, 0.1599236337293051, 0.7339406510831528, 0.5274410894040502, 0.6714481027287427],
       [0.6492265779888077, 0.06322272656419292, 0.4808784435223349, 0.8659584576783896, 0.6510361989559311],
       [0.7035330373879642, 0.1143488863528934, 0.3228607724092158, 0.9696441515163603, 0.3353713223418759],
       [0.4890397567261386, 0.8449387836308111, 0.4607909039388876, 0.1653575696739978, 0.4966210702938057],
       [0.3707932026109159, 0.7451473233433786, 0.1918072763583848, 0.7469201739384557, 0.7688914647671239]], dtype=float64)
```
Great. That was fast... Looks good.
This certainly goes the numpy way about having as many generator states as needed. Just need a way to seed them differently from each other :-)
And while you're right that the quality of the generator is not an issue, its speed is, so I would love to have an option to use an XOR-based `BitGenerator` under the hood of `Generator`, in addition to the PCG multiplier-based one. Doesn't need to go all the way with separating to two classes as in numpy.
Something like `np.random.Generator(seed: Union[None, int32, tuple[int32, ...]], bitgen: Literal['pcg', 'xoroshiro'] = 'pcg')`. The seed tuple is either truncated or repeated to match the BitGenerator's state size. When seed is an int, treat it as a tuple of length 1. When seed is None, I'm not sure what to do... perhaps require it not to be. This leaves the problem of proper seeding to the user, who probably knows better what is 'proper' for the application.
> Great. That was fast... Looks good. This certainly goes the numpy way about having as many generator states as needed. Just need a way to seed them differently from each other :-)

Sure. I haven't yet had time to add that, but it should be trivial.
> And while you're right that the quality of the generator is not an issue, its speed is, so I would love to have an option to use an XOR-based `BitGenerator` under the hood of `Generator`, in addition to the PCG multiplier-based one. Doesn't need to go all the way with separating to two classes as in numpy.

PCG64 has a single integer multiplication, and bit shifts. Are you concerned about the time it takes to perform the multiplication?
> Something like `np.random.Generator(seed: Union[None, int32, tuple[int32, ...]], bitgen: Literal['pcg', 'xoroshiro'] = 'pcg')`. The seed tuple is either truncated or repeated to match the BitGenerator's state size. When seed is an int, treat it as a tuple of length 1. When seed is None I'm not sure what to do... perhaps require it not to be. This leaves the problem of proper seeding to the user, who probably knows better what is 'proper' for the application.

Well, we've got to be compatible with numpy. Unfortunately, the documentation on this is rather sketchy, and we can't expect micropython to support the `typing` module, so I'm afraid `Union` and the like are out of the question. I'd also like to be flash-aware. What you suggest can be done in a simple `for` loop in python as
```python
seeds = [1, 2, 3, 4, 5]
generators = [random.Generator(seed) for seed in seeds]
```
so I wouldn't waste flash space for having that in C. In fact, if you really want to, you can also hook into https://github.com/v923z/micropython-ulab/tree/master/snippets, and extend the C code with python, pretty much as is done in numpy.
> PCG64 has a single integer multiplication, and bit shifts. Are you concerned about the time it takes to perform the multiplication?

Yes. I'm working with ESP32 which has an Xtensa LX6 CPU. I'm not sure exactly what configuration of Xtensa was licensed by Espressif for the chip but with the LX6 core, a hardware multiplier is optional, and when implemented it is either a 32-bit or a 16-bit hardware integer multiplier (16×16→32). IIRC these are not single cycle and not pipelined. PCG needs 64×64 multiplication which may require 16 times 16×16 or 4 times 32×32 (or perhaps half that if only the lower half of the result is needed). In contrast, XOR is single cycle and pipelined on almost any architecture. Perhaps I'm too worried about this...
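
To make the cost concrete, this is the decomposition a compiler has to emit for just the low 64 bits of a 64x64 product on a 32-bit core, written out in Python for illustration. It already takes three 32x32->64 multiplies; the full 128-bit product PCG64 wants needs a fourth, plus carry handling:

```python
M32 = 0xFFFFFFFF
M64 = (1 << 64) - 1

def mul64_lo(a, b):
    # three 32x32 -> 64 partial products instead of one 64x64 multiply
    a_lo, a_hi = a & M32, a >> 32
    b_lo, b_hi = b & M32, b >> 32
    cross = (a_lo * b_hi + a_hi * b_lo) & M32  # bits above 32 shift out anyway
    return (a_lo * b_lo + (cross << 32)) & M64

assert mul64_lo(0x123456789ABCDEF0, 0xFEDCBA9876543210) == \
    (0x123456789ABCDEF0 * 0xFEDCBA9876543210) & M64
```
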
> Well, we've got to be compatible with numpy. Unfortunately, the documentation on this is rather sketchy, and we can't expect micropython to support the typing module, so I'm afraid Union and the like are out of the question.

My bad, I was trying to be succinct, so I used `typing` annotations for communicating between us, not as a suggestion to actually use typing in the code.
To really be compatible with numpy, you'd have to implement a `BitGenerator` class. Then each `Generator` instance would have to keep a reference to an instance of `BitGenerator`, and would have to delegate to it every time it wants random bits. This is great flexibility, but too inefficient for an embedded scenario.
Here is the gist of my suggestion above; it makes things match numpy almost exactly (slightly different from the previous suggestion):

- Create a `PCG64` class as in `ulab`. Unlike numpy, this class will not actually be used to produce random bits. It will only be used to indicate the bit-generation algorithm, and optionally pass a seeded state.
- The constructor for `PCG64` follows the numpy guideline for a general `BitGenerator`, i.e. it takes one of the following:
  - `None` (default), indicating some default seeding method, preferably using HW random or a clock-based seed
  - an `int` or a sequence (tuple?) of `int`: numbers from which to generate the initial RNG state. Since different RNGs have different state sizes, there needs to be a mapping from these ints to the RNG state. Numpy uses a class called `SeedSequence`. I think this is too much for embedded. I suggested truncation/repetition of the int sequence, as needed to fill the state buffer (see the sketch after this list). In any case, this only seeds a single generator; it is not meant to create multiple generator instances.
- This leaves an opening to later add different subtypes of `BitGenerator`, e.g. `Philox` or `xoroshiro`. Again, for efficiency, they will not really be independent of `Generator`. Instead, they only tell the `Generator` constructor which algorithm to use, and optionally pass an initial state.
- Inside the `Generator` constructor, you examine the object passed. If it is one of the implemented `BitGenerator`s, you initialize the `Generator` by allocating an appropriately sized state buffer and setting a pointer to the algorithm's `next` function that yields 32 random bits on request. Everything else in the implementation of `Generator` is then agnostic to the `BitGenerator` used.

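The truncation/repetition mapping mentioned in the list above could be as simple as this (a sketch, assuming 32-bit state words; the names are illustrative only):

```python
def seed_to_state(seed, n_words):
    # truncate or repeat the given ints until the state buffer is full
    seq = (seed,) if isinstance(seed, int) else tuple(seed)
    return [seq[i % len(seq)] & 0xFFFFFFFF for i in range(n_words)]

seed_to_state((1, 2), 4)  # -> [1, 2, 1, 2], e.g. a 128-bit state
seed_to_state(7, 4)       # -> [7, 7, 7, 7]
```
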
I just read a little more in the numpy docs; apparently this is exactly what numpy does too, they don't delegate to the `BitGenerator` via Python calls. Instead, the concrete `BitGenerator` (e.g. `PCG64`) provides C function pointers that yield the next 32 bits (or more, in full-blown numpy). The `Generator` class just stores a reference to the `PCG64` object (to keep it alive). When it needs to use it, it fetches the function pointer (and probably a state buffer pointer), and then uses it to make C-level calls when it needs random bits:
> `PCG64` provides a capsule containing function pointers that produce doubles, and unsigned 32- and 64-bit integers. These are not directly consumable in Python and must be consumed by a `Generator` or similar object that supports low-level access.

> Yes. I'm working with ESP32 which has an Xtensa LX6 CPU. I'm not sure exactly what configuration of Xtensa was licensed by Espressif for the chip but with the LX6 core, a hardware multiplier is optional, and when implemented it is either a 32-bit or a 16-bit hardware integer multiplier (16×16→32). IIRC these are not single cycle and not pipelined. PCG needs 64×64 multiplication which may require 16 times 16×16 or 4 times 32×32 (or perhaps half that if only the lower half of the result is needed). In contrast, XOR is single cycle and pipelined on almost any architecture. Perhaps I'm too worried about this...

We could replace the RNG, if you insist.
> To really be compatible with numpy, you'd have to implement a `BitGenerator` class. Then each `Generator` instance would have to keep a reference to an instance of `BitGenerator`, and would have to delegate to it every time it wants random bits. This is great flexibility, but too inefficient for an embedded scenario.

I don't really care about compatibility to the very last bit. It should be compatible in the standard use cases, and only in the user-facing implementation. In fact, using a XOR-based RNG would already break compatibility, for numpy has PCG64 as default.
> - Create a `PCG64` class as in `ulab`. Unlike numpy, this class will not actually be used to produce random bits. It will only be used to indicate the bit-generation algorithm, and optionally pass a seeded state.

I don't want to implement everything just because it's done so in numpy. We have to keep in mind that MCUs are significantly smaller, in terms of flash, RAM, everything.
> - Inside the `Generator` constructor, you examine the object passed. If it is one of the implemented `BitGenerator`s, you initialize the `Generator` by allocating an appropriately sized state buffer and setting a pointer to the algorithm's `next` function that yields 32 random bits on request. Everything else in the implementation of `Generator` is then agnostic to the `BitGenerator` used.

Is a function pointer to an inline function going to work? I thought that you wanted to inline the function, so that you can save on execution time. Now, using a pointer to a function is not going to speed up the code.
> We could replace the RNG, if you insist.

I don't. I'm just hoping we can keep it open to additions.
> I don't really care about compatibility to the very last bit. It should be compatible in the standard use cases, and only in the user-facing implementation. In fact, using a XOR-based RNG would already break compatibility, for numpy has PCG64 as default.

I agree. I would keep PCG64 and add an optional xoroshiro. If you leave it open to extension, I might find time to add that later myself (under a #if config). Still specifically in this case it seems that maintaining an identical user-facing interface is not difficult or expensive.
> Is a function pointer to an inline function going to work? I thought that you wanted to inline the function, so that you can save on execution time. Now, using a pointer to a function is not going to speed up the code.

No, obviously a function called via a pointer cannot be inlined. Good point. I wonder what numpy does about that. Perhaps vectorization to avoid per-element call overhead.
Anyway, since there is no intention to make this open to adding `BitGenerator` classes at runtime, there really isn't a need to use function pointers. The `Generator` code could `switch` on the actual type of `BitGenerator` from the small finite set of implemented ones and make inlined calls to the appropriate function.
> > We could replace the RNG, if you insist.
>
> I don't. I'm just hoping we can keep it open to additions.

I don't mind having multiple RNG implementations, and as you pointed out, performance could be platform-dependent.
> I agree. I would keep PCG64 and add an optional xoroshiro. If you leave it open to extension, I might find time to add that later myself (under a #if config). Still specifically in this case it seems that maintaining an identical user-facing interface is not difficult or expensive.

No, that should be trivial. Either we simply decide this at compile time with the `#if` directive, or we find a way of switching between various implementations at run time. In any case, it would be great if you could report on execution time/performance when all the dust settles.

One thing to keep in mind, however, is that it's not only the flash for the RNG that you pay for this, but the loops, too. If you want to inline the RNG, then you've got to write out the loops for each case that you implement. So you would have something like this (pseudocode):
```
if(RNG == PCG64) {
    i = 0
    while(i < ndarray.shape[3]) {
        j = 0
        while(j < ndarray.shape[2]) {
            k = 0
            while(k < ndarray.shape[1]) {
                l = 0
                while(l < ndarray.shape[0]) {
                    // PCG64 implementation here
                    *array = ...
                    array += ndarray.strides[0]
                    l++
                }
                array += ndarray.strides[1]
                k++
            }
            array += ndarray.strides[2]
            j++
        }
        array += ndarray.strides[3]
        i++
    }
} else if(RNG == xoroshiro) {
    i = 0
    while(i < ndarray.shape[3]) {
        j = 0
        while(j < ndarray.shape[2]) {
            k = 0
            while(k < ndarray.shape[1]) {
                l = 0
                while(l < ndarray.shape[0]) {
                    // xoroshiro implementation here
                    *array = ...
                    array += ndarray.strides[0]
                    l++
                }
                array += ndarray.strides[1]
                k++
            }
            array += ndarray.strides[2]
            j++
        }
        array += ndarray.strides[3]
        i++
    }
}
```
There is no other way of avoiding the function call overhead.
> you've got to write out the loops for each case that you implement

That would be true if the inner operation was on the order of one or two cycles. Since we are dealing with `next()` functions that are much more complex, we can sacrifice a branch in the inner loop. Using `switch` is usually well optimized:
```c
int const algo_index = algo_index_for_bitgen_type(figure_out_type(bitgen_argument));
void * const state_ptr = get_state_pointer(bitgen_argument);

while(...)
    while(...)
        while(...) {
            register uint32_t bits;
            switch(algo_index) {
                case 0: bits = algo_0_next(state_ptr); break;
                case 1: bits = algo_1_next(state_ptr); break;
                ...
            }
            *array = do_something_with(bits);
        }
```
> That would be true if the inner operation was on the order of one or two cycles. Since we are dealing with `next()` functions that are much more complex, we can sacrifice a branch in the inner loop.

OK.
The switch can actually always be placed outside the innermost while, which would amortize it better and might enable use of zero overhead loops:
```c
while(...)
    while(...)
        while(...)
            switch(algo_index) {
                case 0: for(i = 0; i < inner_extent; ++i, array += inner_stride) *array = do_something_with(algo_0_next(state_ptr)); break;
                case 1: for(i = 0; i < inner_extent; ++i, array += inner_stride) *array = do_something_with(algo_1_next(state_ptr)); break;
            }
```
Also, you could (and perhaps already do) mark ndarrays as contiguous when they are, and then run lower-dimensionality loops on them. That would save time on loop code when users run something on arrays of shape (1000, 2, 1) and strides (2, 1, 1), by looping on shape (2000,) with stride (1,) instead. But that is all second order, I don't think it's really necessary.
You can specify the seed now, an integer, or a tuple of integers.
Thanks. I'm going on vacation for a while. Will test this when I get back.
Hello All, I see there was quite a lot of progress here 6 months ago. Any expectations when it could reach the closed/prebuilt stage? I would love to use this.
I think the only outstanding issue is the `out` keyword argument. In the case of `random` that's rather trivial, so I might be able to add that today, and create a PR.
Awesome, I don't want to rush anything so I'll patiently wait for "random". Glad that this is not dead. Thanks! ;)
I think https://github.com/v923z/micropython-ulab/pull/654/ is pretty close to what you wanted, so we could in principle merge that, and add new features in separate PRs.
Amazing work, yeah that would unblock me for a start if I will find anything missing I will let you know but this is HUGE! Thanks a lot!
Completed in https://github.com/v923z/micropython-ulab/pull/654.