moble / quaternion

Add built-in support for quaternions to numpy
MIT License

ValueError: size must be a divisor of the total size in bytes of the last axis of the array #85

Closed BEpresent closed 6 years ago

BEpresent commented 6 years ago

I have a numpy array with shape (4, 364, 3969) and try to convert it into a quat_array, e.g.,

quat.as_quat_array(np.random.rand(4, 364, 3969))

and get the error

------------------------------------------------------------
ValueError                 Traceback (most recent call last)
<ipython-input-35-79d845f77eba> in <module>()
----> 1 quat.as_quat_array(testarray)

~/miniconda3/lib/python3.6/site-packages/quaternion/__init__.py in as_quat_array(a)
     90     if not a.flags['C_CONTIGUOUS'] or a.strides[-1] != a.itemsize:
     91         a = a.copy(order='C')
---> 92     av = a.view(np.quaternion)
     93 
     94     # special case - don't create an axis for a single quaternion, to match

ValueError: When changing to a larger dtype, its size must be a divisor of the total size in bytes of the last axis of the array.

Interestingly,

quat.as_quat_array(np.random.rand(4, 364, 3968)) works, and so does (4, 364, 3964), while quat.as_quat_array(np.random.rand(4, 364, 3967)) does not, so it seems that the last dimension must be divisible by 4.

The same applies to larger arrays, e.g.

quat.as_quat_array(np.random.rand(4, 364, 3964, 16)) works

quat.as_quat_array(np.random.rand(4, 364, 3964, 15)) does not.

The workaround would probably be to design only arrays whose last dimension is divisible by 4 (as the error suggests, "its size must be a divisor of the total size in bytes of the last axis of the array").

Is there another way to handle this behavior? Thank you!

BEpresent commented 6 years ago

Update:

Transposing the array, so that the last dimension has size 4, is another solution.

quat.as_quat_array(np.random.rand(4, 364, 3969).transpose([1, 2, 0])) works

I assumed that as_quat_array would automatically notice that one of the dimensions already has size 4. Maybe an axis keyword would be a solution (e.g., quat.as_quat_array(np.random.rand(4, 364, 3969), axis=0))?
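In the meantime, something like np.moveaxis gives the same effect as that hypothetical axis keyword (a sketch, assuming the quaternion components live along axis 0):

import numpy as np
import quaternion as quat

a = np.random.rand(4, 364, 3969)  # quaternion components along axis 0
# Move the size-4 axis to the end; the copy makes the result C-contiguous,
# so the four components of each quaternion become consecutive in memory.
q = quat.as_quat_array(np.moveaxis(a, 0, -1).copy())  # shape (364, 3969)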

The problem with an array of shape (364, 3969, 4) (slower) vs. (4, 364, 3969) (faster) is the performance difference in the several computation steps that run before as_quat_array is called, probably because of the arrays' strides.

Even transposing is slow, since it creates a completely new copy of the array (correct me if I am wrong), so the shape of the array may be important if performance is critical.

Another question: is it possible to change the components of a quat_array in a vectorized way?

e.g.


quat.as_float_array(quatarray)[:, :, 1:3] *= const  # for the x and y components of each quaternion
quat.as_float_array(quatarray)[:, :, -1] += anotherconst  # for the z component of each quaternion

So my idea would be to convert to a float_array and then change the columns with element-wise operations. I am not sure if this is the fastest way, or if there is already a built-in function that accomplishes this?

Thank you for the great library.

moble commented 6 years ago

Yes, as it says in the docstring for as_quat_array, the input array must have a final dimension whose size is divisible by four. The reason is that this function is essentially a wrapper around the numpy view method, so the intent is that you leave data in place (if possible), but just reinterpret it as a series of quaternions. But quaternions are represented by four consecutive floats, so their corresponding entries in the input array must also be consecutive. In a C-contiguous array (the most common type), that means that as you increment the index in the last dimension, you move to the next element of the quaternion. One way or another, memory will have to be copied (which is slow) if your quaternion components are not consecutive.

Moreover, as you also seem to understand, the output from as_float_array has the same dimensions and sizes, but an extra axis of size 4 at the end of the shape. So these two functions are basically inverses of each other.
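To make that inverse relationship concrete, here is a minimal sketch (the shapes are just the ones from your example):

import numpy as np
import quaternion as quat

a = np.random.rand(364, 3969, 4)  # last axis holds the four components (w, x, y, z)
q = quat.as_quat_array(a)         # shape (364, 3969); just a view, no copy
b = quat.as_float_array(q)        # shape (364, 3969, 4) again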

The fact that quaternions are represented as four consecutive floats is absolutely necessary for speed. There's really no getting around this. In fact, it's hard coded in C — a choice has to be made, and that's the most sensible one for most purposes. The reason it's the sensible choice is that the vast majority of operations that use quaternions require knowledge of all four components simultaneously.

Which brings me to your proposed operations at the end of your last comment. That would be the fastest way to accomplish those operations, and I doubt that even a custom-coded C function could do it significantly faster for large arrays. It would also be possible to add to the z component with something like quatarray += anotherconst * quaternion.z. The only thing I'd caution you about is that you should be sure that you actually want such an operation. It's unusual to find a useful application of something other than one of the basic algebraic operations on quaternions — addition, subtraction, multiplication, division, exponentiation, logarithm, and closely related things like inversion and square-root. These are available already. In fact, there aren't even many useful applications for quaternions apart from rotations, which typically only require the group structure of the unit quaternions (multiplication and relatives), as opposed to the full algebraic structure. I can't think of any good reason to multiply just the x and y components by some number.
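For concreteness, a minimal sketch of that quaternion.z approach (the data here is made up):

import numpy as np
import quaternion

quatarray = np.zeros(1000, dtype=np.quaternion)  # made-up data
anotherconst = 0.25
quatarray += anotherconst * quaternion.z  # adds 0.25 to the z component of every element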

BEpresent commented 6 years ago

In fact, there aren't even many useful applications for quaternions apart from rotations, which typically only require the group structure of the unit quaternions (multiplication and relatives), as opposed to the full algebraic structure. I can't think of any good reason to multiply just the x and y components by some number.

This is true, I try to use quaternions for rotations.

I am trying to simulate a decaying rotation vector that decays at a different rate for each individual component at each time-step. This is why I would like to scale the x, y and z components independently. E.g., a decaying magnetization vector of a spin system shows exactly this kind of behavior.

My hope is to simulate the rotations with quaternions faster than with other approaches, in a vectorized way for thousands of positions.

I am not sure if my approach quatarray += anotherconst * quaternion.z would be the most efficient way, or if my desired behavior could be accomplished more efficiently with some attributes of quaternions that I am not aware of?

moble commented 6 years ago

I am trying to simulate a decaying rotation vector that decays at a different rate for each individual component at each time-step. This is why I would like to scale the x, y and z components independently. E.g., a decaying magnetization vector of a spin system shows exactly this kind of behavior.

I'm not sure what you're saying here — and specifically what you mean by "rotation vector", and how you're trying to relate that to the magnetization vector, especially since the magnetization vector is not only rotating but also changing length. Presumably, you're not just expressing the magnetization vector as the vector part of a quaternion, since that wouldn't be of any help. So I'm left to guess at what else you might mean.

It obviously doesn't make sense for the unit quaternion representing a rotation to decay. It might make sense for components of the rotation axis to "decay", but this isn't very physical: because of non-commutativity, complicated rotations tend to decompose into products of simpler rotations, rather than the quaternion's components decomposing in any sensible way. For example, my vague recollection of NMR physics suggests that a better model would be an overall rotation about the static magnetic-field direction (say z) at the Larmor frequency, operating on another rotation describing the effect of some pulse orthogonal to z and/or thermal decay, spin-spin interaction, etc. But when you do that multiplication, you don't get a quaternion where the z component oscillates at the Larmor frequency (or increases uniformly at a related rate). Instead, you get a more complicated result. If that's what you're trying to do, I would recommend actually factoring out the different effects. It may even be possible to plug such an ansatz into your equations, and literally factor out the rotation for the Larmor precession for example, leaving you to only need to simulate a somewhat slower (and thus easier-to-evolve) effect.

Alternatively, it is actually possible to model a vector that is not only rotating but also changing length by using non-unit quaternions: instead of conjugating some constant vector v with some time-dependent unit quaternion R according to R * v * R.inverse(), you "sandwich" it with a quaternion Q according to Q * v * Q.conjugate(), where the magnitude of result is just the square of the magnitude of Q. But here again, the components of Q change in complicated ways for any non-trivial motion due to the non-commutativity of quaternion multiplication.
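A tiny numerical illustration of that scaling (my own example):

import numpy as np
import quaternion

v = quaternion.quaternion(0, 1, 0, 0)  # a vector as a pure quaternion
Q = 2 * quaternion.one                 # deliberately non-unit: |Q| = 2
w = Q * v * Q.conjugate()
print(np.abs(w))                       # 4.0 == |Q|**2 * |v|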

In any case, the upshot is that if you really want to manipulate individual components of a bunch of quaternions at a time, there's no faster way than just converting to a numpy array of floats, operating on some axis, and then converting back to quaternions.

BEpresent commented 6 years ago

Please excuse the delayed answer, and thank you for your response! Please also excuse my lack of clarity; let me explain my intent a bit:

It's about NMR, exactly. As I already have a verified (Bloch) simulator of such systems, my main goal is an increase in calculation performance. The magnetization vector's x and y components decay at each time-step, and its z component recovers at each time-step, due to various energy losses, spin-spin interactions and relaxation processes.

The simulation happens within the frame of reference that co-rotates at the Larmor frequency, so it is essentially a frame at rest, which makes things easier (just a rotation from the vertical into the transverse plane, without high-frequency oscillations).

That means for each time-step we have the rotated vector

v_rot = Q * v_rot_init * Q.conjugate()
# scale its components (after converting to a numpy array):
v_rot_numpy = quat.as_float_array(v_rot)
v_rot_numpy[:, 1:3] *= decay_constant  # scale x and y components
v_rot_numpy[:, 3] += (1 - v_rot_numpy[:, 3]) * relaxation_constant  # add relaxation for z component
# convert back to a quat array
v_rot_init = quat.as_quat_array(v_rot_numpy)

Then this repeats for a certain number of time-steps.

First, this is already faster than my previous approach, which is nice! Profiling shows that the whole scaling process, i.e. the lines

v_rot_numpy = quat.as_float_array(v_rot)
v_rot_numpy[:, 1:3] *= decay_constant  # scale x and y components
v_rot_numpy[:, 3] += (1 - v_rot_numpy[:, 3]) * relaxation_constant  # add relaxation for z component
v_rot_init = quat.as_quat_array(v_rot_numpy)

take quite a bit of computation time, amounting to ~50% of the whole computation time.

Your point about non-unit quaternions is quite interesting, as I have never used R * v * R.inverse() (what is this operation reverse?); I always use Q * v * np.conjugate(Q) without thinking much about it. After this operation the resulting rotated vector v_rot is itself an array of quaternions, which is why I try to change its individual quaternion components to model the spin-relaxation/decay processes. Do I understand correctly that I could use some non-unit quaternions to model such behavior instead?

moble commented 6 years ago

Ah, I see. I have a few suggestions that may improve your code's performance.

First, your previous comments and the use of expressions like v_rot_numpy[:,1:3] suggest that you have many vectors to rotate at the same time by the same Q. In that case, it's actually faster (and just as accurate) to convert the quaternions to a rotation matrix, and just rotate in the usual way with matrix multiplication. This means that you won't have to convert your various v vectors to and from quaternions; instead you can just use standard matrix multiplication. So you could replace that entire first block of code in your last comment with

v_rot = (quat.as_rotation_matrix(Q) @ v_rot_init[..., np.newaxis])[..., 0]
v_rot[:, 0:2] *= decay_constant  # scale x and y components
v_rot[:, 2] += (1 - v_rot[:, 2]) * relaxation_constant  # add relaxation for z components

Here, I'm assuming that v_rot_init is not a quaternion array, but just an Nx3 float array. Depending on how many of these vectors you've got around, this approach to rotation will be somewhere around 40% faster, and won't require conversion to and from quaternion arrays. [However, note that you still probably want to use quaternions to create the rotation matrices. Trying to construct rotation matrices in other ways generally leads to very poor results. While it's safe to convert from a quaternion to a matrix, you shouldn't go the other way.]

Second, the part where you're manipulating the individual components is in pure numpy, which is good, but it might still be possible to improve. In particular, the problem with numpy is that it sometimes makes more loops through memory than necessary, and makes more copies of data than necessary. If those two lines are still one of your biggest slowdowns, you might try using numba, which avoids those problems. In this case, you'd define a function like this:

import numba as nb
import numpy as np

@nb.njit
def update_v_rot(v_rot):
    # decay_constant and relaxation_constant are picked up as compile-time globals
    for i in range(v_rot.shape[0]):
        v_rot[i, 0] *= decay_constant  # scale x component
        v_rot[i, 1] *= decay_constant  # scale y component separately for speed
        v_rot[i, 2] += (1 - v_rot[i, 2]) * relaxation_constant  # add relaxation for z component

update_v_rot(np.random.rand(2, 3))  # optionally, ensure the function compiles before using it in real code

Then the block of code would just be

v_rot = (quat.as_rotation_matrix(Q) @ v_rot_init[..., np.newaxis])[..., 0]
update_v_rot(v_rot)  # operates in place

Depending on just how big that first dimension of v_rot is, this could run in a fifth of the time — or it might actually run more slowly. So you should test it for your data and your machine.

Third, and perhaps most importantly, if this is the actual code you're using, it looks like you're basically using the Euler method to integrate your evolution equations. This is not very accurate, and might even be unstable (depending on details of your problem). You can almost definitely get way faster and more accurate results using an actual ODE integrator like the ones available in scipy. Those integrators also have nice features like error control with automatic timestep-size adjustment.
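For example, a minimal single-spin sketch with solve_ivp (the right-hand side and the constants here are made up for illustration, not taken from your code):

import numpy as np
from scipy.integrate import solve_ivp

omega, T1, T2 = 2 * np.pi, 1.0, 0.5  # hypothetical precession and relaxation rates

def bloch_rhs(t, M):
    # A Bloch-like right-hand side in the rotating frame (illustrative only).
    Mx, My, Mz = M
    return [omega * My - Mx / T2,
            -omega * Mx - My / T2,
            (1.0 - Mz) / T1]

sol = solve_ivp(bloch_rhs, (0.0, 10.0), [1.0, 0.0, 0.0], rtol=1e-8, atol=1e-10)
# sol.t holds the adaptively chosen time steps; sol.y has shape (3, len(sol.t)).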

(what is this operation reverse?)

It's just another name for conjugate. It comes from a deeper way to understand quaternions and more general things like them called "geometric algebra".

Your point about non-unit quaternions is quite interesting, as I have never used R * v * R.inverse() ...; I always use Q * v * np.conjugate(Q) without thinking much about it.

If you're just rotating things, I generally recommend using inverse instead of conjugate unless you explicitly normalize Q at each time step. Depending on how you update the value of Q at each time step, errors may creep in — possibly even growing exponentially. Using inverse can dampen that kind of exponential growth in many cases, or just make it irrelevant to the result. Of course, if you explicitly ensure that Q is normalized, it doesn't make any difference because the inverse is just the conjugate divided by the squared norm.
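A quick check of that relationship, using a deliberately non-unit quaternion:

import numpy as np
import quaternion

R = quaternion.quaternion(1.0, 2.0, 3.0, 4.0)  # deliberately non-unit
print(R * R.inverse())                         # quaternion(1, 0, 0, 0)
print(R.conjugate() / np.abs(R)**2)            # the same thing: the conjugate divided by the squared norm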

Do I understand correctly, that I could use some non-unit quaternions to model such a behavior instead?

Potentially, depending on exactly what your evolution equations are. The basic idea is that instead of a differential equation for v_rot, you look for a differential equation for Q itself. That is, somehow you must have some evolution equation that tells you what d(v_rot)/dt is, which is what you use to update v_rot at each time step. Well, you could potentially plug in v_rot = Q * v_rot_init * Q.conjugate() to get an equivalent expression for dQ/dt. Then, you would just update the value of Q at each time step, and calculate v_rot using that, rather than trying to evolve v_rot. In this case, the changing norm of Q would also lead to a changing norm for v_rot. There are lots of physics problems where the expression for dQ/dt is much simpler and/or easier to numerically integrate — the classic one being eccentric orbits in Newtonian gravity. I don't know if your problem is one of them.

BEpresent commented 6 years ago

Those are great suggestions, thank you!

Interestingly, numpy's dot product is really slow

quat.as_rotation_matrix(Q) @ v_rot_init

e.g., for 3969 3x3 rotation matrices (a dot product of shape (3969, 3, 3) with v_rot_init of shape (3,), with the resulting v_rot of shape (3969, 3)), the operation is 10 times slower per call, and the whole function is 50% slower than the previous approach. I also tried tensordot and np.einsum, with the same result; using Q * v * Q.conjugate() is still faster.

The numba function works really well though! In the past, I never had much success with it, but it seems it has improved quite a bit, or I just have to think about it differently, since I usually try to vectorize most operations instead of C-style looping over indices.

So Q.conjugate() (= reverse) is different from np.conjugate(Q)? Q.reverse() is not an implemented function, is it? (I get an error when using it.)

The approach of scipy.integrate.solve_ivp seems very interesting indeed, as my main motivation is calculation speed. I need to think how to translate the problem to this scipy solver domain.

moble commented 6 years ago

Interestingly, numpy's dot product is really slow

No, the dot product is actually crazy fast — it's as_rotation_matrix that's "slow". The point I was trying to make is that you should bite the bullet and use this slow as_rotation_matrix function if you're using one quaternion to rotate more than a few vectors. But now it looks like you're using 3969 quaternions to rotate 3969 vectors, so each quaternion just rotates one vector. And in that case, you should just stick to Q * v * Q.conjugate().

I usually try to vectorize most operations instead of C-style looping over indices.

Yeah, that's what you want to do when you're using just numpy, but it's also exactly what you want to avoid when you're using numba. The driving principle is that memory access is slow, so you want to do it as infrequently as possible. And that means that each time you read or write memory, you need to make every effort to ensure that you've done all the operations you will ever need for that piece of memory or locations near it — as far as possible.

The other thing that's slow is memory allocation, and your line

v_rot[:, 2] += (1 - v_rot[:, 2]) * relaxation_constant

actually creates a whole new array on the right-hand side before adding each element of that memory into the chosen column of v_rot.
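One numpy-only way around that temporary (just an illustration, assuming v_rot and relaxation_constant as in your code) is to reuse a preallocated buffer via the out= arguments:

import numpy as np

tmp = np.empty(v_rot.shape[0])          # allocate once, outside the time loop
np.subtract(1.0, v_rot[:, 2], out=tmp)  # tmp = 1 - v_rot[:, 2], without a new array
tmp *= relaxation_constant
v_rot[:, 2] += tmp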

Even worse, Q * v * Q.conjugate() actually creates a new array to compute Q * v, then creates another new array to compute Q.conjugate(), and then creates a third new array to compute the product of those two. So if you really want high performance, you could even bring that calculation into your numba function. Unfortunately, that means that you have to program all the individual terms in that product yourself because numba doesn't know about quaternion's C code. I haven't checked this thoroughly, but this gives me a 4x speedup with your array shapes:

import numba as nb
import numpy as np

@nb.njit
def sv_plus_rxv(q, v, w):
    w[0] = q[0] * v[0] + q[2]*v[2] - q[3]*v[1]
    w[1] = q[0] * v[1] + q[3]*v[0] - q[1]*v[2]
    w[2] = q[0] * v[2] + q[1]*v[1] - q[2]*v[0]
    return

@nb.njit
def v_plus_2rxv_over_m(q, v, w, two_over_m):
    v[0] = v[0] + two_over_m * (q[2]*w[2] - q[3]*w[1])
    v[1] = v[1] + two_over_m * (q[3]*w[0] - q[1]*w[2])
    v[2] = v[2] + two_over_m * (q[1]*w[1] - q[2]*w[0])
    return

@nb.njit
def rotate(q, v, w):
    m = q[1]*q[1]+q[2]*q[2]+q[3]*q[3]
    sv_plus_rxv(q, v, w)
    v_plus_2rxv_over_m(q, v, w, 2.0/m)
    return

@nb.njit
def rotate_and_update(Q, v, decay_constant, relaxation_constant):
    w = np.empty((3,))  # temporary working memory; allocate once to save time
    for i in range(v.shape[0]):
        rotate(Q[i], v[i], w)  # assumes Q.shape[0] == v.shape[0]
        v[i, 0] *= decay_constant  # scale x component
        v[i, 1] *= decay_constant  # scale y component separately for speed
        v[i, 2] += (1 - v[i, 2]) * relaxation_constant  # add relaxation for z component

So Q.conjugate() (= reverse) is different from np.conjugate(Q)?

Only slightly. np.conjugate just calls Q.conjugate().

Q.reverse() is not an implemented function, is it? (I get an error when using it.)

It is not. I was wondering why you were asking about it. I don't think I mentioned it anywhere...

The approach of scipy.integrate.solve_ivp seems very interesting indeed, as my main motivation is calculation speed. I need to think how to translate the problem to this scipy solver domain.

I would bet that this is by far the best thing you could do to improve your code, and probably wouldn't be too hard once you get a little practice using solve_ivp, so you understand how to write the function.

BEpresent commented 6 years ago

No, the dot product is actually crazy fast

Exactly, this is why I was surprised. You are correct, it is indeed the as_rotation_matrix operation that is the bottleneck.

Since I got some numba errors, I may have misunderstood what you mean:

So you suggest replacing the rotation Q * v * Q.conjugate() with your explicit rotation (plus updates) rotate_and_update. Does this assume Q is still an array of quaternions and v is still at least a quaternion (0,0,0,1), or is Q already an array of rotation matrices (= quat.as_rotation_matrix(Q))?

Judging from the indices in sv_plus_rxv, the Q[i] passed to rotate(Q[i], v[i], w) must have at least 4 components (i.e., a quaternion as floats), but v[i] has only 3 components (i.e., a plain vector)?

moble commented 6 years ago

I'm suggesting that you use a call to rotate_and_update to replace the whole code block that you originally had as

v_rot = Q * v_rot_init * Q.conjugate()
v_rot_numpy = quat.as_float_array(v_rot)
v_rot_numpy[:, 1:3] *= decay_constant  # scale x and y components
v_rot_numpy[:, 3] += (1 - v_rot_numpy[:, 3]) * relaxation_constant  # add relaxation for z component
v_rot_init = quat.as_quat_array(v_rot_numpy)

The Q in the call to rotate_and_update should just be an array of floats. Here's my notebook where I tested this out. In particular, for the test, I just made up data using

Q = np.random.rand(3969, 4)
v = np.random.rand(3969, 3)

In this case, you would replace that whole code block above with

rotate_and_update(Q, v_rot_init, decay_constant, relaxation_constant)

Note that if you originally have Q as an array of quaternions and you convert them to an array of floats to call into this function, there's no reason to convert them back. But if you use this code, you should test it to make sure that it's actually doing what you want; I had to make some assumptions.

Of course, I just want to emphasize again that all this pales in comparison to the benefits that you would presumably get from using an actual ODE integrator.

BEpresent commented 6 years ago

Thank you very much again for the nice examples.

As an update, I tried the ODE integrator from scipy, and I am pleased that it gives results quite consistent with the other approach. However, it is actually 2-3 times slower for a vectorized problem of the same size.

The Q I have is indeed an array of quaternions that gets constructed from the rotation angles and rotation axes at each step (since they change over time), similar to this simple example (to demonstrate how I construct my Q):

import numpy as np
import quaternion as quat

v = [3, 5, 0]  # vector to be rotated
axis = [4, 4, 1]
theta = 1.2

vector = np.array([0.] + v)  # the vector as a pure quaternion (0, x, y, z)
rot_axis = np.array([0.] + axis)
axis_angle = (theta * 0.5) * rot_axis / np.linalg.norm(rot_axis)

vec = quat.quaternion(*vector)
qlog = quat.quaternion(*axis_angle)
q = np.exp(qlog)  # rotation quaternion via the quaternion exponential

v_prime = q * vec * np.conjugate(q)

It now turns out that the bottleneck of the whole operation is numpy's np.exp() function on the whole array of quaternions constructed as in the example above, only vectorized over a couple of thousand cases. I.e., the Q * v_rot_init * Q.conjugate() line is not the problem; it is the construction of Q via np.exp(qlog). So given the rotation angles/axes at each position, is this the fastest way to construct the array of quaternions? You mentioned above that it is still strongly recommended to construct rotations from quaternions, as other methods may lead to poor results. May I ask why this is the case? Does it have to do with an inherent stability of quaternion multiplications?

Regarding your code examples, given the quaternion array Q:

Q * v_rot_init * Q.conjugate() leads to a different result than your rotate(Q[i], v[i], w) function (looking just at a single step, e.g. i=0), even without the update. (Here Q is quat.as_float_array(Q), and v is just the 3 vector components to be rotated, i.e. the last 3 entries of each quaternion.)

I am really grateful for your insightful answers, as I have already learned quite a bit from this GitHub issue thread. Who would have thought?

moble commented 6 years ago

As an update, I tried the ODE integrator from scipy, and I am pleased that it gives results quite consistent with the other approach. However, it is actually 2-3 times slower for a vectorized problem of the same size.

I don't see how this could be true if you're comparing apples to apples. My guess is that you're taking larger time steps and getting far less accurate results, whereas solve_ivp is taking smaller time steps for far more accurate results. Forward Euler integration (again, I'm assuming that's what you were using originally) is not very accurate, and quite possibly downright unstable — plus it doesn't sound like you're doing anything to measure or control the error. At each time step you'll introduce some error, no matter what method you're using, but with a proper ODE integrator you have some idea of what the errors are and you can control them. For example, in solve_ivp you can adjust rtol and atol, and it will adjust the time step automatically to ensure that the error is within the tolerances. So it's certainly true that you can use your method with very large time steps and it will be faster than solve_ivp with the default settings, but your errors will be far larger. Conversely, if you increase rtol and atol to get the same errors as you get with your original method, the time steps will be substantially larger, and thus it will run faster. Of course, solve_ivp has the added bonus of detecting when anything interesting is happening and adaptively slowing down to get good results during that time.

The standard way of testing accuracy in a routine (like your original one) that doesn't have error control is to check the difference between solutions as you cut the time step size down — run the simulation once, then run it again with time steps that are twice as small, then twice as small again, etc., and just compare the solution values on your original set of time steps. If your system is stable, the difference between successive runs should converge to zero as the time step gets smaller and smaller. But even if it is stable, it will converge far more slowly than the default RK45. (Your version is "first-order accurate", while RK45 is "fifth-order".) You could also run solve_ivp for very small atol and rtol (maybe 1e-12 each), and ask for the solution (t_eval) at each of your original large time steps, and treat that as the "correct" version just to see how your version compares.
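In code, that convergence check might look like this (a sketch, assuming a hypothetical simulate(dt) that returns the solution sampled on your original time grid):

import numpy as np

coarse = simulate(dt)    # simulate is your fixed-step integrator (hypothetical)
fine = simulate(dt / 2)
finer = simulate(dt / 4)

# For a stable method, these differences should shrink as the step size shrinks.
print(np.max(np.abs(fine - coarse)))
print(np.max(np.abs(finer - fine)))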

For a given level of error, solve_ivp will take fewer time steps, and thus be faster. The only way this isn't the case is if either your variables are evolving so trivially that even a first-order method can handle it precisely (which shouldn't be the case for your problem), or there's a major bug in scipy (which I find hard to believe).

So given the rotation angles/axes at each position, is this the fastest way to construct the array of quaternions?

No, for that specific application I'd use something like

import numpy as np
import quaternion

axis = np.asarray(axis, dtype=float)  # ensure an array so the division below works
sto2 = np.sin(theta/2)
cto2 = np.cos(theta/2)
axis = axis / np.linalg.norm(axis)
q = quaternion.quaternion(cto2, sto2*axis[0], sto2*axis[1], sto2*axis[2])
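For thousands of rotations at once, the same construction vectorizes naturally (a sketch; the names axes, with shape (N, 3), and thetas, with shape (N,), are mine):

import numpy as np
import quaternion as quat

axes = axes / np.linalg.norm(axes, axis=-1, keepdims=True)  # normalize each axis
Qf = np.empty(thetas.shape + (4,))
Qf[..., 0] = np.cos(thetas / 2)
Qf[..., 1:] = np.sin(thetas / 2)[..., np.newaxis] * axes
Q = quat.as_quat_array(Qf)  # shape (N,), with no calls to np.exp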

You mentioned above that it is still strongly recommended to construct rotations from quaternions, as other methods may lead to poor results. May I ask why this is the case? Does it have to do with an inherent stability of quaternion multiplications?

Basically, yes. Typically, the issue is that you need to decide what rotation you want. And if that decision process requires more than the most basic of operations, anything other than a quaternion will amplify errors — in some cases to the point of giving you total nonsense. Of course this is a general prescription, so you may be able to eke out extra nanoseconds of performance if you have additional information or special symmetries. But in general, you will never introduce errors by using quaternions instead of matrices or (god forbid!) Euler angles. Axis-angle is basically the quaternion logarithm, and thus every bit as good as long as you know what you're doing. Logarithms can behave very oddly in many cases, so you'd have to be aware of those issues. And of course, the logarithm and exponential are slow, so you may not want to use them anyway.

Q * v_rot_init * Q.conjugate() leads to a different result than your rotate(Q[i], v[i], w) function

I told you I hadn't checked that thoroughly! :) I think the problem is just because I should have defined m = q[0]*q[0]+q[1]*q[1]+q[2]*q[2]+q[3]*q[3]. Of course, you can just eliminate m from the code if you know that it's going to be 1 (or 1 within high numerical accuracy, as in your axis-angle case).