v923z / micropython-ulab

a numpy-like fast vector module for micropython, circuitpython, and their derivatives
https://micropython-ulab.readthedocs.io/en/latest
MIT License

[FEATURE REQUEST] Add np.random Module #569

Closed MichaelRinger closed 10 months ago

MichaelRinger commented 1 year ago

The module is useful for machine learning related applications. I would be glad if someone could add the np.random.randn function since I don't know how to do it in C.

v923z commented 1 year ago

Do you need only the random.randn function? I would actually go for random.standard_normal, and wrap it with random.randn, as is done in numpy. See also the comments in Quick start.

MichaelRinger commented 1 year ago

Atm I only need the random.randn function. Yeah, creating random.standard_normal and adding the alias random.randn, like it's done in numpy, is better.

v923z commented 1 year ago

I would vote for the new nomenclature in https://numpy.org/doc/stable/reference/random/generator.html#numpy.random.Generator.

However, a more fundamental question is what we should do with the None seed.

If None, then fresh, unpredictable entropy will be pulled from the OS.

Where should we pull this information from, if there is no underlying OS?

MichaelRinger commented 1 year ago

There already is a function random.seed(n=None) in micropython's random standard library that generates a seed by getting a random number from the hardware.

When no argument (or None) is passed in it will (if supported by the port) initialise the PRNG with a true random number (usually a hardware generated random number).

The None case only works if MICROPY_PY_URANDOM_SEED_INIT_FUNC is enabled by the port, otherwise it raises ValueError.
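
For illustration, the two seeding modes look like this on a MicroPython REPL (a minimal sketch; only random.seed from the standard module is used):

import random

random.seed(1234)   # deterministic seeding always works
random.seed()       # pulls a hardware seed only if the port enables
                    # MICROPY_PY_URANDOM_SEED_INIT_FUNC; otherwise ValueError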

v923z commented 1 year ago

OK, but if that is the case, then you can already use this via the vectorize method. Is that prohibitively expensive?

Another question is, if we were to re-implement this in ulab, what would happen, if a particular port does not support a hardware random number generator?

v923z commented 1 year ago

@MichaelRinger Do you have comments on the second question?

MichaelRinger commented 1 year ago

@MichaelRinger Do you have comments on the second question? I am using the vectorize method of numpy in combination with the base random func in micropython to solve it. Thanks for the help.

v923z commented 1 year ago

@MichaelRinger Did you close this by accident? I was actually going to add the module...

ammar-bitar commented 1 year ago

Hi @v923z, sorry to bump this, did you end up adding the module? Can't seem to find it.

v923z commented 1 year ago

@ammar-bitar No, I haven't added it, because the OP closed the issue, and there was no response to the question after that. Do you need the module? If so, which methods? It would not be too much work to add, but it was not clear to me whether there is interest.

mcskatkat commented 1 year ago

I'd like to upvote this. I think numpy is half crippled without a vectorized RNG. The micropython random module is a usable workaround only when speed is not an issue.

Not sure though which RNG has decent entropy while being small enough in code and work memory sizes.

Specifically the ESP32 that I'm interested in has a hardware RNG that according to the docs yields 32-bit integers in <100 CPU cycles.

v923z commented 1 year ago

micropython uses yasmarang, which fares quite well in random number tests, and is cheap.

I opened a discussion some time ago here: https://github.com/orgs/micropython/discussions/10550, and I think I could pick up the thread from there, and implement the module in a reasonable time.

In the meantime, as suggested above, you can use the vectorize function. That could easily save you a factor of two in execution time.
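
A minimal sketch of that workaround, assuming the port ships the standard random module and float support (the input array only fixes the output's shape):

import random
from ulab import numpy as np

# wrap micropython's scalar PRNG so it can fill an ndarray element-wise
uniform = np.vectorize(lambda _: random.random())
samples = uniform(np.zeros(100))   # 100 pseudo-random floats in [0.0, 1.0)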

mcskatkat commented 1 year ago

Thanks. I wasn't aware of vectorize.

As for yasmarang, I think it really is too spartan... a 32 bit state is not enough, even for a microcontroller. Especially seeing as in the other thread you were worried about quality (of seeding).

I think that any hardware capable of running a python-like interpreter with per-object overhead and a GC, can afford a few 10s of bytes for an RNG with several 100 bits of state. Even more so if the system is capable of holding ndarrays. No need to be frugal about it.

If you do end up implementing an RNG, I implore you to forget micropython's poor choice of yasmarang and consider the much more modern xoroshiro, that has a 32-bit variant with a 128-bit state and a 64-bit variant with a 256-bit state, useful for the fortunate who have double precision floats.

Looking at the next() function in the xoroshiro source, it looks like code size is comparable to yasmarang, and can be tuned with choice of inlining and constant folding. On a microcontroller, the sub-ns/bit performance they promise is not attainable, but it will still be fast.

Also I would advise against taking micropython's (following original Python's) path of having one global static generator state. If we already paid the code size overhead, why hardwire it to a single piece of memory...

Thank you so much for the effort you put into ulab.

v923z commented 1 year ago

As for yasmarang, I think it really is too spartan... a 32 bit state is not enough, even for a microcontroller. Especially seeing as in the other thread you were worried about quality (of seeding).

OK, then I would start here https://www.pcg-random.org/.

I think that any hardware capable of running a python-like interpreter with per-object overhead and a GC, can afford a few 10s of bytes for an RNG with several 100 bits of state. Even more so if the system is capable of holding ndarrays. No need to be frugal about it.

If you do end up implementing an RNG, I implore you to forget micropython's poor choice of yasmarang and consider the much more modern xoroshiro, that has a 32-bit variant with a 128-bit state and a 64-bit variant with a 256-bit state, useful for the fortunate who have double precision floats.

I think the actual choice of the RNG is irrelevant from the viewpoint of implementation effort. If you don't like a particular RNG, we can always replace it later by a better one. But could we use https://github.com/numpy/numpy/blob/3d8cfd68d8df1af4e25e8dffd0a155fda20dc770/numpy/random/src/pcg64/pcg64.c#L64 as a starting point?

Also I would advise against taking micropython's (following original Python's) path of having one global static generator state. If we already paid the code size overhead, why hardwire it to a single piece of memory...

Again, I feel that this is a minor implementation detail that we could change, if necessary. Do you happen to know how numpy deals with the question?

Also, do you need distributions, too, or just a uniform RNG? Which method of the 'random` module would be important in your application?

mcskatkat commented 1 year ago

I think the actual choice of the RNG is irrelevant from the viewpoint of implementation effort. If you don't like a particular RNG, we can always replace it later by a better one. But could we use https://github.com/numpy/numpy/blob/3d8cfd68d8df1af4e25e8dffd0a155fda20dc770/numpy/random/src/pcg64/pcg64.c#L64 as a starting point?

You are right of course about the actual RNG being immaterial to the ulab code that provides the user facing functionality. The problem with PCG specifically is that at its heart are 64-bit by 64-bit multiplications used to implement even larger 128 bit multiplications. This is not cheap for microcontrollers that may struggle to get 32 bit multiplications done quickly. XOR based RNGs are cheap on virtually any hardware.
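
To make the cost argument concrete, here is the decomposition sketched in Python: the low 64 bits of a 64×64 product need three 32×32 partial products, because the a_hi*b_hi term only affects the discarded upper half:

MASK64 = (1 << 64) - 1

def mul64_lo(a, b):
    # split each operand into 32-bit halves, as a 32-bit core must
    a_lo, a_hi = a & 0xFFFFFFFF, a >> 32
    b_lo, b_hi = b & 0xFFFFFFFF, b >> 32
    cross = (a_lo * b_hi + a_hi * b_lo) << 32
    return (a_lo * b_lo + cross) & MASK64

assert mul64_lo(0x123456789ABCDEF0, 0xFEDCBA9876543210) == \
       (0x123456789ABCDEF0 * 0xFEDCBA9876543210) & MASK64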

Anyway as you said the RNG itself is interchangeable and you can also choose to support multiple and have them compiled selectively via configuration file for code size efficiency (see BitGenerator below).

Again, I feel that this is a minor implementation detail that we could change, if necessary.

True again. But change after deployment is itself expensive. Your discussion thread in the micropython repo proves just that. There is nothing in the yasmarang and seeding code in micropython that prevents your desired use pattern (obtain random seed) or that prevents using yasmarang with multiple generator states instead of a single static one. But their choice not to export the proper interface makes this later extension difficult.

Do you happen to know how numpy deals with the question?

Sure. Old numpy.random had the RandomState class that you could instantiate as many times as you need to create new generator states. This is now considered "legacy".

Modern numpy.random revised this (at the painful cost of breaking compatibility, see above, cost of change after deployment) so that there is clear architectural separation between generating random bits (BitGenerator class) and the subsequent generation of distributions from those bits (Generator class). The discussion we had above about yasmarang, xoroshiro and PCG is all about BitGenerators. And you can choose which one(s) to implement.

The whole point of the Generator class is to be invariant to BitGenerator. And the nice thing is that the ulab version of Generator can start narrow and get expanded later as needed. I doubt many microcontroller users will find the dirichlet function necessary.
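
On desktop numpy, the separation looks like this (any BitGenerator instance can be handed to the same Generator class):

import numpy as np

legacy = np.random.RandomState(1234)   # old "legacy" interface, one flat object
bits = np.random.PCG64(1234)           # BitGenerator: raw random bits
rng = np.random.Generator(bits)        # Generator: distributions on top of the bits
x = rng.standard_normal(8)             # same call works with any BitGenerator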

I would start with the following Generator methods:

  • .integers()
  • .random(), the classic [0.0, 1.0)
  • .bytes(), essentially the same as .integers(), possibly slightly more efficient
  • .normal(), slightly harder and usually twice the effort of .random()
  • .uniform(), a minor extension of .random()

In all of the above I think it's also important to support the out= keyword argument, so that memory allocation is not forced on users who already own some memory and just want some computation done into it.

BTW I think this out= feature is generally missing in ulab.

If you feel like being even more generous to users, these are the second tier methods. Useful, but not as often as the ones above, and require a lot more code. I think they are entirely optional in a microcontroller environment:

v923z commented 1 year ago

You are right of course about the actual RNG being immaterial to the ulab code that provides the user facing functionality. The problem with PCG specifically is that at its heart are 64-bit by 64-bit multiplications used to implement even larger 128 bit multiplications. This is not cheap for microcontrollers that may struggle to get 32 bit multiplications done quickly. XOR based RNGs are cheap on virtually any hardware.

OK, this is a good point.

Anyway as you said the RNG itself is interchangeable and you can also choose to support multiple and have them compiled selectively via configuration file for code size efficiency (see BitGenerator below).

Yes, this would be a reasonable approach. This could even be done automatically, based on whether you have floats or doubles on a microcontroller.

True again. But change after deployment is itself expensive. Your discussion thread in the micropython repo proves just that. There is nothing in the yasmarang and seeding code in micropython that prevents your desired use pattern (obtain random seed) or that prevents using yasmarang with multiple generator states instead of a single static one. But their choice not to export the proper interface makes this later extension difficult.

I understand that, and there is no disagreement in this regard. But we don't have to stick with micropython's interface, in fact, we should conform to numpy's. I would, perhaps, suggest that we hash out the interface first, and settle on the particular generator later.

I would start with the following Generator methods:

  • .integers()
  • .random(), the classic [0.0, 1.0)
  • .bytes(), essentially the same as .integers(), possibly slightly more efficient
  • .normal(), slightly harder and usually twice the effort of .random()
  • .uniform(), a minor extension of .random()

OK.

In all of the above I think it's also important to support the out= keyword argument, so that memory allocation is not forced on users who already own some memory and just want some computation done into it. BTW I think this out= feature is generally missing in ulab.

I agree. You're not the first to notice this: https://github.com/v923z/micropython-ulab/issues/592

Just out of curiosity, what is your use case? One reason I didn't originally consider the random module is that it's hard to test on a microcontroller, given that you can't easily plot/analyse generated data.

mcskatkat commented 1 year ago

My use case is quite trivial, I just need some randomness for real-time graphics with many objects moving on the screen.

This is roughly similar to game programming, where to maintain a reasonable frame rate I need to be quick about computing many object state updates. Without the array processing that ulab allows, using Python loops would probably be way too slow. So ulab allows me to maintain states for 100s of tiny objects, with floating point precision, and still have enough time to convert final screen positions to integers and render objects.

So for my use case, just having .uniform() is enough. .normal() would be nice to have.

One reason I didn't originally consider the random module is that it's hard to test on a microcontroller, given that you can't easily plot/analyse generated data.

I don't see why you'd need to plot / analyse, and you definitely don't have to do that on-device.

The thing with PRNGs is that you can easily seed them with a known value, generate some sequence data and store it to a file on an SD card. If you want to be really fancy you can send that out over Ethernet or Wifi if so equipped. This data just needs to be bit-identical to the well known PRNG sequence, which can be generated on a laptop/desktop machine from the PRNG's reference code.

As for the Generator part of it, you create a BitGenerator subclass that produces a pre-stored or known-seed sequence. Feed that into numpy's internal Generator class and get reference results. Then you do the same with the ulab implementation of Generator, again save that to a file and analyze on a "real" computer. Results need to be identical down to floating point roundoff errors, where applicable.

I think this would be practical to test up to sequences of several 100GB using a cheap SD card and a lot of patience to wait. I'm not sure it's necessary to test large volumes of data. Once a PRNG follows the sequence, it will not stray from it, unless its code is insanely written.

For the Generator part of things that deals with floating point - basically what's required is porting the code in numpy's Generator from Python/C to micropython-friendly C. There should be no interface or algorithm changes whatsoever. Just thinning down to supported methods and dtypes. In testing you may want to ensure that some floating point edge cases (denormalized, infinities, boundaries etc) are reached, where relevant. But I think for a first version there is not even a need to bother with that.
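
As a sketch of that test loop: the device-side seeding call is an assumption based on later comments in this thread, and the comparison presumes the ulab generator reproduces numpy's PCG64 stream for the same seed:

# on the device: dump a seeded sequence to the SD card
from ulab import numpy as np

rng = np.random.Generator(42)           # assumed seeding API
with open('/sd/seq.bin', 'wb') as f:
    f.write(rng.random(4096).tobytes())

# on the desktop: regenerate the reference with plain numpy and compare
# import numpy
# ref = numpy.random.Generator(numpy.random.PCG64(42)).random(4096)
# dev = numpy.fromfile('seq.bin', dtype=numpy.float64)
# assert (ref == dev).all()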

v923z commented 1 year ago

My use case is quite trivial, I just need some randomness for real-time graphics with many objects moving on the screen.

This is roughly similar to game programming, where to maintain a reasonable frame rate I need to be quick about computing many object state updates.

In that case, the quality of the generator is not really an issue, I guess. In any case, here is a first implementation: https://github.com/v923z/micropython-ulab/tree/random

The out keyword is not yet supported, but you can do


MicroPython v1.20.0-311-gea1a5e43d on 2023-07-23; linux [GCC 12.2.0] version
Use Ctrl-D to exit, Ctrl-E for paste mode
>>> from ulab import numpy as np
>>> rng = np.random.Generator()
>>> rng.random(12)
array([1.688118533849092e-10, 0.7703293013781235, 0.6074833036730274, ..., 0.2697689388821915, 0.6484892791277977, 0.6955782450470177], dtype=float64)
>>> rng.random(size=(5, 5))
array([[0.7724060907456224, 0.1599236337293051, 0.7339406510831528, 0.5274410894040502, 0.6714481027287427],
       [0.6492265779888077, 0.06322272656419292, 0.4808784435223349, 0.8659584576783896, 0.6510361989559311],
       [0.7035330373879642, 0.1143488863528934, 0.3228607724092158, 0.9696441515163603, 0.3353713223418759],
       [0.4890397567261386, 0.8449387836308111, 0.4607909039388876, 0.1653575696739978, 0.4966210702938057],
       [0.3707932026109159, 0.7451473233433786, 0.1918072763583848, 0.7469201739384557, 0.7688914647671239]], dtype=float64)

mcskatkat commented 1 year ago

Great. That was fast... Looks good.

This certainly goes the numpy way about having as many generator states as needed. Just need a way to seed them differently from each other :-)

And while you're right that the quality of the generator is not an issue, its speed is, so I would love to have an option to use an XOR based BitGenerator under the hood of Generator in addition to the PCG multiplier based one. Doesn't need to go all the way with separating to two classes as in numpy.

Something like np.random.Generator(seed: Union[None, int32, tuple[int32,...]], bitgen: Literal['pcg', 'xoroshiro'] = 'pcg'). The seed tuple is either truncated or repeated to match the BitGenerator's state size. When seed is an int, treat it as a tuple of length 1. When seed is None I'm not sure what to do... perhaps require it not to be. This leaves the problem of proper seeding to the user, who probably knows better what is 'proper' for the application.
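
A hypothetical Python model of that seeding rule (normalize_seed and state_words are illustrative names, not an existing API):

def normalize_seed(seed, state_words):
    # treat a single int as a tuple of length 1
    if isinstance(seed, int):
        seed = (seed,)
    # repeat the tuple until it covers the state, then truncate
    return (seed * (state_words // len(seed) + 1))[:state_words]

normalize_seed(0xDEADBEEF, 4)   # -> (0xDEADBEEF, 0xDEADBEEF, 0xDEADBEEF, 0xDEADBEEF)
normalize_seed((1, 2, 3), 2)    # -> (1, 2)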

v923z commented 1 year ago

Great. That was fast... Looks good.

This certainly goes the numpy way about having as many generator states as needed. Just need a way to seed them differently from each other :-)

Sure. I haven't yet had time to add that, but it should be trivial.

And while you're right that the quality of the generator is not an issue, its speed is, so I would love to have an option to use an XOR based BitGenerator under the hood of Generator in addition to the PCG multiplier based one. Doesn't need to go all the way with separating to two classes as in numpy.

PCG64 has a single integer multiplication, and bit shifts. Are you concerned about the time it takes to perform the multiplication?

Something like np.random.Generator(seed: Union[None, int32, tuple[int32,...]], bitgen: Literal['pcg', 'xoroshiro'] = 'pcg'). The seed tuple is either truncated or repeated to match the BitGenerator's state size. When seed is an int, treat it as a tuple of length 1. When seed is None I'm not sure what to do... perhaps require it not to be. This leaves the problem of proper seeding to the user, who probably knows better what is 'proper' for the application.

Well, we've got to be compatible with numpy. Unfortunately, the documentation on this is rather sketchy, and we can't expect micropython to support the typing module, so I'm afraid Union and the like are out of the question. I'd also like to be flash-aware. What you suggest can be done in a simple loop in python as

seeds = [1, 2, 3, 4, 5]
generators = [random.Generator(seed) for seed in seeds]

so I wouldn't waste flash space for having that in C. In fact, if you really want to, you can also hook into https://github.com/v923z/micropython-ulab/tree/master/snippets, and extend the C code with python, pretty much as is done in numpy.

mcskatkat commented 1 year ago

PCG64 has a single integer multiplication, and bit shifts. Are you concerned about the time it takes to perform the multiplication?

Yes. I'm working with ESP32 which has an Xtensa LX6 CPU. I'm not sure exactly what configuration of Xtensa was licensed by Espressif for the chip but with the LX6 core, a hardware multiplier is optional, and when implemented it is either a 32-bit or a 16-bit hardware integer multiplier (16×16→32). IIRC these are not single cycle and not pipelined. PCG needs 64×64 multiplication which may require 16 times 16×16 or 4 times 32×32 (or perhaps half that if only the lower half of the result is needed). In contrast, XOR is single cycle and pipelined on almost any architecture. Perhaps I'm too worried about this...

Well, we've got to be compatible with numpy. Unfortunately, the documentation on this is rather sketchy, and we can't expect micropython to support the typing module, so I'm afraid Union and the like are out of the question.

My bad, I was trying to be succinct so I used typing annotations for communicating between us, not as a suggestion to actually use typing in the code.

To really be compatible with numpy you'd have to implement a BitGenerator class. Then each Generator instance would have to keep a reference to an instance of BitGenerator, and would have to delegate to it every time it wants random bits. This is great flexibility but too inefficient for an embedded scenario.

Here is the gist of my suggestion above, it makes things match numpy almost exactly (slightly different from the previous suggestion):

  • Create a PCG64 class in ulab. Unlike numpy, this class will not actually be used to produce random bits. It will only be used to indicate the bit-generation algorithm, and optionally pass a seeded state.

  • Inside the Generator constructor, you examine the object passed. If it is one of the implemented BitGenerators, you initialize the Generator by allocating an appropriately sized state buffer and setting a pointer to the algorithm's next function that yields 32 random bits on request. Everything else in the implementation of Generator is then agnostic to the BitGenerator used.

mcskatkat commented 1 year ago

I just read a little more in the numpy docs; apparently this is exactly what numpy does too: they don't delegate to the BitGenerator via Python calls.

Instead, the concrete BitGenerator (e.g. PCG64) provides C function pointers that yield the next 32 bits (or more, in full blown numpy). The Generator class just stores a reference to the PCG64 object (to keep it alive). When it needs to use it, it fetches the function pointer (and probably a state buffer pointer), and then uses it to make C-level calls when it needs random bits:

PCG64 provides a capsule containing function pointers that produce doubles, and unsigned 32- and 64-bit integers. These are not directly consumable in Python and must be consumed by a Generator or similar object that supports low-level access.
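
On desktop numpy, that low-level interface is also reachable from Python through the ctypes view the bit generators expose, which shows the state pointer plus next_* function pointers in action:

import numpy as np

bg = np.random.PCG64(1234)
iface = bg.ctypes                        # state pointer + next_* function pointers
raw = iface.next_uint32(iface.state)     # one raw 32-bit draw, made at the C level
val = iface.next_double(iface.state)     # one double in [0.0, 1.0)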

v923z commented 1 year ago

Yes. I'm working with ESP32 which has an Xtensa LX6 CPU. I'm not sure exactly what configuration of Xtensa was licensed by Espressif for the chip but with the LX6 core, a hardware multiplier is optional, and when implemented it is either a 32-bit or a 16-bit hardware integer multiplier (16×16→32). IIRC these are not single cycle and not pipelined. PCG needs 64×64 multiplication which may require 16 times 16×16 or 4 times 32×32 (or perhaps half that if only the lower half of the result is needed). In contrast, XOR is single cycle and pipelined on almost any architecture. Perhaps I'm too worried about this...

We could replace the RNG, if you insist.

To really be compatible with numpy you'd have to implement a BitGenerator class. Then each Generator instance would have to keep a reference to an instance of BitGenerator, and would have to delegate to it every time it wants random bits. This is great flexibility but too inefficient for an embedded scenario.

I don't really care about compatibility to the very last bit. It should be compatible in the standard use cases, and only in the user-facing interface. In fact, using an XOR-based RNG would already break compatibility, for numpy has PCG64 as the default.

  • Create a PCG64 class in ulab. Unlike numpy, this class will not actually be used to produce random bits. It will only be used to indicate the bit-generation algorithm, and optionally pass a seeded state.

I don't want to implement everything, just because it's done so in numpy. We have to keep in mind that MCUs are significantly smaller, in terms of flash, RAM, everything.

  • Inside the Generator constructor, you examine the object passed. If it is one of the implemented BitGenerators, you initialize the Generator by allocating an appropriately sized state buffer and setting a pointer to the algorithm's next function that yields 32 random bits on request. Everything else in the implementation of Generator is then agnostic to the BitGenerator used.

Is a function pointer to an inline function going to work? I thought that you wanted to inline the function, so that you can save on execution time. Now, using a pointer to a function is not going to speed up the code.

mcskatkat commented 1 year ago

We could replace the RNG, if you insist.

I don't. I'm just hoping we can keep it open to additions.

I don't really care about compatibility to the very last bit. It should be compatible in the standard use cases, and only in the user-facing interface. In fact, using an XOR-based RNG would already break compatibility, for numpy has PCG64 as the default.

I agree. I would keep PCG64 and add an optional xoroshiro. If you leave it open to extension, I might find time to add that later myself (under a #if config). Still, specifically in this case, it seems that maintaining an identical user-facing interface is not difficult or expensive.

Is a function pointer to an inline function going to work? I thought that you wanted to inline the function, so that you can save on execution time. Now, using a pointer to a function is not going to speed up the code.

No, obviously a function called via a pointer cannot be inlined. Good point. I wonder what numpy does about that. Perhaps vectorization to avoid per-element call overhead.

Anyway, since there is no intention to make this open to adding BitGenerator classes at runtime, there really isn't a need to use function pointers. The Generator code could switch on the actual type of BitGenerator from the small finite set of implemented ones and make inlined calls to the appropriate function.

v923z commented 1 year ago

We could replace the RNG, if you insist.

I don't. I'm just hoping we can keep it open to additions.

I don't mind having multiple RNG implementations, and as you pointed out, performance could be platform-dependent.

I agree. I would keep PCG64 and add an optional xoroshiro. If you leave it open to extension, I might find time to add that later myself (under a #if config). Still, specifically in this case, it seems that maintaining an identical user-facing interface is not difficult or expensive.

No, that should be trivial. Either we simply decide this at compile time with the #if directive, or we find a way of switching between various implementations at run time. In any case, it would be great if you could report on execution time/performance, when all the dust settles.

One thing to keep in mind, however, is that it's not only the flash for the RNG that you pay for, but the loops, too. If you want to inline the RNG, then you've got to write out the loops for each case that you implement. So, you would have something like this (pseudocode):

if(RNG == PCG64) {
    i = 0
    while(i < ndarray.shape[3]) {
        j = 0
        while(j < ndarray.shape[2]) {
            k = 0
            while(k < ndarray.shape[1]) {
                l = 0
                while(l < ndarray.shape[0]) {
                    // PCG64 implementation here
                    *array = ...
                    array += ndarray.strides[0]
                    l++
                }
                array += ndarray.strides[1]
                k++
            }
            array += ndarray.strides[2]
            j++
        }
        array += ndarray.strides[3]
        i++
    }
} else if(RNG == xoroshiro) {
    i = 0
    while(i < ndarray.shape[3]) {
        j = 0
        while(j < ndarray.shape[2]) {
            k = 0
            while(k < ndarray.shape[1]) {
                l = 0
                while(l < ndarray.shape[0]) {
                    // xoroshiro implementation here
                    *array = ...
                    array += ndarray.strides[0]
                    l++
                }
                array += ndarray.strides[1]
                k++
            }
            array += ndarray.strides[2]
            j++
        }
        array += ndarray.strides[3]
        i++
    }
}

There is no other way of avoiding the function call overhead.

mcskatkat commented 1 year ago

you've got to write out the loops for each case that you implement

That would be true if the inner operation was on the order of one or two cycles. Since we are dealing with next() functions that are much more complex, we can sacrifice a branch in the inner loop. Using switch is usually well optimized:

int const algo_index = algo_index_for_bitgen_type(figure_out_type(bitgen_argument));
void * const state_ptr = get_state_pointer(bitgen_argument);
while(...)
  while (...)
    while (...) {
        register uint32_t bits;
        switch(algo_index) {
            case 0: bits = algo_0_next(state_ptr); break;
            case 1: bits = algo_1_next(state_ptr); break;
            ...
        }
        *array = do_something_with(bits);
    }

v923z commented 1 year ago

That would be true if the inner operation was on the order of one or two cycles. Since we are dealing with next() functions that are much more complex, we can sacrifice a branch in the inner loop.

OK.

mcskatkat commented 1 year ago

The switch can actually always be placed outside the innermost while, which would amortize it better and might enable use of zero overhead loops:

while(...)
  while (...)
    while (...)
        switch(algo_index) {
            case 0:  for(i=0; i<inner_extent; ++i, array += inner_stride) *array = do_something_with(algo_0_next(state_ptr)); break; 
            case 1:  for(i=0; i<inner_extent; ++i, array += inner_stride) *array = do_something_with(algo_1_next(state_ptr)); break; 
        }

Also, you could (and perhaps already do) mark ndarrays as contiguous when they are, and then run lower-dimensionality loops on them. That would save time on loop code when users run something on arrays of shape (1000, 2, 1) and strides (2, 1, 1) by looping on shape (2000,) stride (1,) instead. But that is all second order, I don't think it's really necessary.
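
A desktop-numpy sketch of that collapsing idea: when the array is C-contiguous, a flat view lets one 1-D loop replace the nested per-dimension loops:

import numpy as np

a = np.empty((1000, 2, 1))
if a.flags['C_CONTIGUOUS']:
    flat = a.reshape(-1)    # shape (2000,), unit stride
    flat[:] = np.random.default_rng(0).random(flat.size)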

v923z commented 1 year ago

You can specify the seed now: an integer, or a tuple of integers.

mcskatkat commented 1 year ago

Thanks. I'm going on vacation for a while. Will test this when I get back.

JanekBaumgartRelativity commented 10 months ago

Hello All, I see that there was quite a lot of progress here 6 months ago. Any expectations for when it could reach the closed/prebuilt stage? I would love to use this.

v923z commented 10 months ago

I think the only outstanding issue is the out keyword argument. In the case of random that's rather trivial, so I might be able to add that today, and create a PR.

JanekBaumgartRelativity commented 10 months ago

Awesome, I don't want to rush anything so I'll patiently wait for "random". Glad that this is not dead. Thanks! ;)

v923z commented 10 months ago

I think https://github.com/v923z/micropython-ulab/pull/654/ is pretty close to what you wanted, so we could in principle merge that, and add new features in separate PRs.

JanekBaumgartRelativity commented 10 months ago

Amazing work, yeah that would unblock me for a start. If I find anything missing I will let you know, but this is HUGE! Thanks a lot!

v923z commented 10 months ago

Completed in https://github.com/v923z/micropython-ulab/pull/654.