Open dmopalmer opened 8 years ago
I would like to add my support to this issue. I was previously using mlab.rec_append_fields, which is ~5x faster than recfunctions.rec_append_fields for small arrays and ~2x faster for large arrays. Unfortunately, this function has been deprecated in matplotlib v2.2. Here's a timing example:
import numpy as np
import numpy.lib.recfunctions as recfn
from matplotlib import mlab
x = np.recarray(int(1e4),dtype=[('x',int),('y',int)])
%timeit mlab.rec_append_fields(x,['z'],[np.zeros(len(x))])
MatplotlibDeprecationWarning: The rec_append_fields function was deprecated in version 2.2.
1 loop, best of 10000: 140 µs per loop
%timeit recfn.rec_append_fields(x,['z'],[np.zeros(len(x))])
1 loop, best of 10000: 684 µs per loop
x = np.recarray(int(1e6),dtype=[('x',int),('y',int)])
%timeit mlab.rec_append_fields(x,['z'],[np.zeros(len(x))])
MatplotlibDeprecationWarning: The rec_append_fields function was deprecated in version 2.2.
1 loop, best of 10000: 33.8 ms per loop
%timeit recfn.rec_append_fields(x,['z'],[np.zeros(len(x))])
1 loop, best of 10000: 80.2 ms per loop
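For anyone who wants the faster behaviour without depending on the deprecated matplotlib function, here is a minimal sketch of the preallocate-and-copy idea. The helper name `append_fields_prealloc` is made up for this comment, and I'm not claiming this is what mlab does internally. It assumes the new columns have the same length as the record array and that all field names are unique.

```python
import numpy as np
import numpy.lib.recfunctions as recfn


def append_fields_prealloc(rec, names, arrs):
    """Hypothetical helper: append equal-length columns to a record array.

    Assumes len(a) == len(rec) for every new column and unique field names.
    """
    arrs = [np.asarray(a) for a in arrs]
    # Combined dtype: existing fields first, then the new ones.
    old_fields = [(n, rec.dtype[n]) for n in rec.dtype.names]
    new_fields = [(n, a.dtype) for n, a in zip(names, arrs)]
    out = np.recarray(rec.shape, dtype=np.dtype(old_fields + new_fields))
    # One vectorized copy per field, no per-row Python loop.
    for n in rec.dtype.names:
        out[n] = rec[n]
    for n, a in zip(names, arrs):
        out[n] = a
    return out


x = np.recarray(1000, dtype=[("x", int), ("y", int)])
x["x"] = np.arange(len(x))
x["y"] = 0

fast = append_fields_prealloc(x, ["z"], [np.zeros(len(x))])
ref = recfn.rec_append_fields(x, ["z"], [np.zeros(len(x))])
```

On the simple case above, `fast` and `ref` should hold the same fields and values; the sketch does no handling of masked data or mismatched lengths.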
Part of the problem would seem to be that append_fields only appends non-structured arrays. Inside the actual function, recursive_fill_fields is used, which should be fairly fast. But it seems to me there is room for a new join function (which, effectively, joins by index) that is nicely optimized for equal-length arrays.
Alternatively, and probably better, it may be an idea to special-case equal-length arrays in merge_arrays (which does the same thing with flatten=True; currently, it is even slower). In either case, I think one should recommend usemask=False (which gives roughly a 1.5x speedup).
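To make the "join by index" idea concrete, here is a rough sketch of what an equal-length fast path could look like, checked against merge_arrays with flatten=True and usemask=False. The name `join_by_index` is hypothetical and not an existing NumPy function; the sketch assumes plain (unmasked) structured arrays whose field names do not collide.

```python
import numpy as np
import numpy.lib.recfunctions as recfn


def join_by_index(arrays):
    """Hypothetical helper: join equal-length structured arrays row by row."""
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same length")
    # Flattened dtype: the fields of each input, in order.
    fields = [(name, a.dtype[name]) for a in arrays for name in a.dtype.names]
    out = np.empty(n, dtype=fields)
    # One vectorized copy per field.
    for a in arrays:
        for name in a.dtype.names:
            out[name] = a[name]
    return out


a = np.zeros(4, dtype=[("x", int), ("y", float)])
a["x"] = np.arange(4)
b = np.zeros(4, dtype=[("u", int), ("v", float)])
b["v"] = np.linspace(0.0, 1.0, 4)

joined = join_by_index([a, b])
merged = recfn.merge_arrays((a, b), flatten=True, usemask=False)
```

The cost here is one array copy per field rather than work per row, which is why a special case for equal-length inputs seems worthwhile. Unequal lengths and fill values are deliberately left out.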
My understanding is that recfunctions.rec_append_fields already has usemask=False built in. So the timing improvement is relative to the first comment and not the second.
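A quick way to check this on a small array: as far as I can tell, rec_append_fields behaves like append_fields called with usemask=False and asrecarray=True, whereas append_fields with its defaults returns a masked array.

```python
import numpy as np
import numpy.lib.recfunctions as recfn

base = np.zeros(3, dtype=[("x", int), ("y", int)])
z = np.ones(3)

# rec_append_fields: no mask, record array out.
via_rec = recfn.rec_append_fields(base, "z", z)

# The same result requested explicitly through append_fields.
via_append = recfn.append_fields(base, "z", z, usemask=False, asrecarray=True)

# Default append_fields call: usemask=True, so a masked array comes back.
masked = recfn.append_fields(base, "z", z)

print(type(via_rec).__name__, type(via_append).__name__, type(masked).__name__)
```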
very slow
lib.recfunctions.append_fields is too slow to be usable for any reasonably-sized (~million row) record array.
This is discussed in http://stackoverflow.com/questions/5355744/numpy-joining-structured-arrays, which gives solutions that are two orders of magnitude faster.
Including this speed-up in the library would be useful.
How slow, you might ask?
For joining a couple of two-field record arrays, 1M rows takes more than 20 seconds.
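For anyone who wants to measure this on their own machine and NumPy version, here is a small script using the standard timeit module. The helper names and the field names are placeholders of mine, and I'm assuming "joining" means something like merge_arrays with flatten=True. The sizes are kept small so it finishes quickly; raise n toward 10**6 to approach the case described above.

```python
import timeit

import numpy as np
import numpy.lib.recfunctions as recfn


def best_time(func, repeat=3):
    """Best wall-clock time in seconds over `repeat` single runs of func."""
    return min(timeit.repeat(func, repeat=repeat, number=1))


def time_recfunctions(n):
    """Time append_fields and merge_arrays on two-field arrays with n rows."""
    a = np.zeros(n, dtype=[("a", int), ("b", float)])
    b = np.zeros(n, dtype=[("c", int), ("d", float)])
    col = np.zeros(n)
    return {
        "append_fields": best_time(
            lambda: recfn.append_fields(a, "z", col, usemask=False)
        ),
        "merge_arrays": best_time(
            lambda: recfn.merge_arrays((a, b), flatten=True, usemask=False)
        ),
    }


for n in (10**3, 10**4):
    for name, seconds in time_recfunctions(n).items():
        print(f"{name:>13} n={n:>7}: {seconds * 1e3:.3f} ms")
```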