pandas-dev / pandas

No way to construct mixed dtype DataFrame without total copy, proposed solution #9216

Closed: quicknir closed this issue 2 years ago

quicknir commented 9 years ago

After hours of tearing my hair out, I've come to the conclusion that it is impossible to create a mixed dtype DataFrame without copying all of its data in. That is, no matter what you do, if you want to create a mixed dtype DataFrame, you will inevitably create a temporary version of the data (e.g. using np.empty), and the various DataFrame constructors will always make copies of this temporary. This issue was already brought up a year ago: https://github.com/pydata/pandas/issues/5902.

This is especially terrible for interoperability with other programming languages. If you plan to populate the data in the DataFrame from e.g. a call to C, the easiest way to do it by far is to create the DataFrame in python, get pointers to the underlying data, which are np.arrays, and pass these np.arrays along so that they can be populated. In this situation, you simply don't care what data the DataFrame starts off with, the goal is just to allocate the memory so you know what you're copying to.

This is also just generally frustrating because it implies that, in principle (depending on the specific situation, implementation details, etc.), it is hard to guarantee that you will not end up using twice the memory you really should.

This has an extremely simple solution that is already grounded in the quantitative python stack: have a method analogous to numpy's empty. This allocates the space but does not actually waste any time writing or copying anything. Since empty is already taken, I would propose calling the method from_empty. It would accept an index (mandatory; the most common use case would be to pass np.arange(N)), columns (mandatory, typically a list of strings), and types (a list of acceptable types for the columns, same length as columns). The list of types should include support for all numpy numeric types (ints, floats), as well as special Pandas columns such as DatetimeIndex and Categorical.

As an added bonus, since the implementation is in a completely separate method, it will not interfere with the existing API at all.

jreback commented 9 years ago

you can simply create an empty frame with an index and columns, then assign ndarrays - these won't copy if you assign all of a particular dtype at once

you could create these with np.empty if you wish

quicknir commented 9 years ago
df = pd.DataFrame(index=range(2), columns=["dude", "wheres"])

df
Out[12]:
  dude wheres
0  NaN    NaN
1  NaN    NaN

x = np.empty(2, np.int32)

x
Out[14]: array([6, 0], dtype=int32)

df.dude = x

df
Out[16]:
   dude wheres
0     6    NaN
1     0    NaN

x[0] = 0

x
Out[18]: array([0, 0], dtype=int32)

df
Out[19]:
   dude wheres
0     6    NaN
1     0    NaN

Looks like it's copying to me. Unless the code I wrote isn't what you meant, or the copying that occurred is not the copy you thought I was trying to elide.

jreback commented 9 years ago

you changed the dtype, that's why it copied. try with a float

quicknir commented 9 years ago
y = np.empty(2, np.float64)

df
Out[21]:
   dude wheres
0     6    NaN
1     0    NaN

df.wheres = y

y
Out[23]: array([  2.96439388e-323,   2.96439388e-323])

y[0] = 0

df
Out[25]:
   dude         wheres
0     6  2.964394e-323
1     0  2.964394e-323

df = pd.DataFrame(index=range(2), columns=["dude", "wheres"])

df.dtypes
Out[27]:
dude      object
wheres    object
dtype: object

The dtype is object, so it's changed regardless of whether I use a float or an int.

jreback commented 9 years ago
In [25]: arr = np.ones((2,3))

In [26]: df = DataFrame(arr,columns=['a','b','c'])

In [27]: arr[0,1] = 5

In [28]: df
Out[28]: 
   a  b  c
0  1  5  1
1  1  1  1

Constructing w/o a copy on mixed type could be done but is quite tricky. The problem is that some types require a copy (e.g. object, to avoid memory contention issues). And the internal structure consolidates different types, so adding a new type will necessitate a copy. Avoiding a copy is pretty difficult in most cases.

You should just create what you need, get pointers to the data and then overwrite it. Why is that a problem?

quicknir commented 9 years ago

The problem is that in order to create what I need, I have to copy in stuff of the correct dtype, the data of which I have no intention of using. Even assuming that your suggestion of creating an empty DataFrame uses no significant RAM, this doesn't alleviate the cost of copying. If I want to create a 1 gigabyte DataFrame and populate it somewhere else, I'll have to pay the cost of copying a gigabyte of garbage around in memory, which is completely needless. Do you not see this as a problem?

Yes, I understand that the internal structure consolidates different types. I'm not sure exactly what you mean by memory contention issues, but in any case objects are not really what's of interest here.

Actually, while avoiding copies in general is a hard problem, avoiding them in the way I suggested is fairly easy because I'm supplying all the necessary information from the get-go. It's identical to constructing from data, except that instead of inferring the dtypes and the # of rows from data and copying the data, you specify the dtypes and # of rows directly, and do everything else exactly as you would have done minus the copy.

You need an "empty" constructor for every supported column type. For numpy numeric types this is obvious, it needs non-zero work for Categorical, unsure about DatetimeIndex.

jreback commented 9 years ago

passing a dict to the constructor and copy=False should work
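
For example, something along these lines (a sketch only; whether the copy is actually elided depends on the pandas version and should be verified, as discussed later in this thread):

import numpy as np
import pandas as pd

# pre-allocated, single-dtype arrays handed to the dict constructor
data = {'a': np.empty(1000, dtype=np.float64),
        'b': np.empty(1000, dtype=np.int32)}
df = pd.DataFrame(data, copy=False)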

jreback commented 9 years ago

So this will work. But you have to be SURE that the arrays that you are passing are distinct dtypes. And once you do anything to this it could copy the underlying data. So YMMV. You can of course pass in np.empty instead of the ones/zeros that I am using.

In [75]: arr = np.ones((2,3))

In [76]: arr2 = np.zeros((2,2),dtype='int32')

In [77]: df = DataFrame(arr,columns=list('abc'))

In [78]: df2 = DataFrame(arr2,columns=list('de'))

In [79]: result = pd.concat([df,df2],axis=1,copy=False)

In [80]: arr2[0,1] = 20

In [81]: arr[0,1] = 10

In [82]: result
Out[82]: 
   a   b  c  d   e
0  1  10  1  0  20
1  1   1  1  0   0

In [83]: result._data
Out[83]: 
BlockManager
Items: Index([u'a', u'b', u'c', u'd', u'e'], dtype='object')
Axis 1: Int64Index([0, 1], dtype='int64')
FloatBlock: slice(0, 3, 1), 3 x 2, dtype: float64
IntBlock: slice(3, 5, 1), 2 x 2, dtype: int32

In [84]: result._data.blocks[0].values.base
Out[84]: 
array([[  1.,  10.,   1.],
       [  1.,   1.,   1.]])

In [85]: result._data.blocks[1].values.base
Out[85]: 
array([[ 0, 20],
       [ 0,  0]], dtype=int32)
bashtage commented 9 years ago

Initial attempt deleted since it does not work: reindex forces casting, which is a strange "feature".

Have to use 'method', which makes this attempt a little less satisfactory:

arr = np.empty(1, dtype=[('x', np.float), ('y', np.int)])
df = pd.DataFrame.from_records(arr).reindex(np.arange(100))

If you are really worried about performance, I'm not sure why one wouldn't just use numpy as much as possible since it is conceptually much simpler.

quicknir commented 9 years ago

jreback, thank you for your solution. This seems to work, even for Categoricals (which surprised me). If I encounter issues I'll let you know. I'm not sure what you mean by: if you do anything to this, it could copy. What do you mean by anything? Unless there are COW semantics I would think what you see is what you get with regards to deep vs shallow copies, at construction time.

I still think a from_empty constructor should be implemented, and I don't think it would be that difficult; while this technique works, it does involve a lot of code overhead. In principle this could be done by specifying a single composite dtype and a number of rows.

bashtage, these solutions still write into the entire DataFrame. Since writing is generally slower than reading, this means at best it saves less than half the overhead in question.

Obviously if I haven't gone and used numpy, it's because pandas has many awesome features and capabilities that I love, and I don't want to give those up. Were you really asking, or just implying that I should use numpy if I don't want to take this performance hit?

quicknir commented 9 years ago

Scratch this, please, user error, and my apologies. reindex_axis with copy=False worked perfectly.

bashtage commented 9 years ago

bashtage, these solutions still write into the entire DataFrame. Since writing is generally slower than reading, this means at best it saves less than half the overhead in question.

True, but all you need is a new method for reindex that will not fill with anything, and then you can allocate a typed array with arbitrary column types without writing/copying.

Obviously if I haven't gone and used numpy, its because pandas has many awesome features and capabilities that I love, and I don't want to give those up. Were you really asking, or just implying that I should use numpy if I don't want to take this performance hit?

It was a bit rhetorical - although also a serious suggestion from a performance point of view since numpy makes it much easier to get close to the data-as-a-blob-of-memory access that is important if you are trying to write very high performance code. You can always convert from numpy to pandas when code simplicity is more important than performance.

quicknir commented 9 years ago

I see what you are saying. I still think it should more cleanly be part of the interface rather than a workaround, but as workarounds go it is a good one and easy to implement.

Pandas still emphasizes performance as one of its main objectives. Obviously it has higher level features compared to numpy, and those have to be paid for. What we're talking about has nothing to do with those higher level features, and there's no reason why one should be paying for massive copies in places where you don't need them. Your suggestion would be appropriate if someone was making a stink about the cost of setting up the columns, index, etc., which is completely different from this discussion.

bashtage commented 9 years ago

I think you are overestimating the cost of writing vs. the cost of allocating memory in Python -- the expensive part is the memory allocation. The object creation is also expensive.

Both allocate 1GB of memory, one empty and one zeros.

%timeit np.empty(1, dtype=[('x', float), ('y', int), ('z', float)])
100000 loops, best of 3: 2.44 µs per loop

%timeit np.zeros(1, dtype=[('x', float), ('y', int), ('z', float)])
100000 loops, best of 3: 2.47 µs per loop

%timeit np.zeros(50000000, dtype=[('x', float), ('y', int), ('z', float)])
100000 loops, best of 3: 11.7 µs per loop

%timeit np.empty(50000000, dtype=[('x', float), ('y', int), ('z', float)])
100000 loops, best of 3: 11.4 µs per loop

3µs for zeroing 150,000,000 values.

Now compare these for a trivial DataFrame.

%timeit pd.DataFrame([[0]])
1000 loops, best of 3: 426 µs per loop

Around 200 times slower for the trivial case. But it is far worse for larger arrays.

%timeit pd.DataFrame(np.empty((50000000, 3)),copy=False)
1 loops, best of 3: 275 ms per loop

Now it takes 275ms -- note that this is not copying anything. The cost is in setting up the index, etc which is clearly very slow when the array is nontrivially big.

This feels like a premature optimization to me, since the other overheads in pandas are so large that the malloc + filling component is near zero cost.

It seems that if you want to allocate anything in a tight loop, it must be a numpy array for performance reasons.

jreback commented 9 years ago

ok, here's what I think we should do, @quicknir if you'd like to make some improvements. 2 issues.

This is slightly non-trivial but would then allow one to pass in an already created ndarray (could be empty) with mixed types pretty easily. Note that this would likely (in a first pass implementation) handle only (int/float/string), as datetime/timedelta need special sanitizing and would make this slightly more complicated.

so @bashtage is right from a perf perspective. It makes a lot of sense to simply construct the frame as you want then modify the ndarrays (but you MUST do this by grabbing the blocks, otherwise you will get copies).

What I meant above is this. Pandas groups any like-dtyped columns (e.g. int64 and int32 are different) into a 'block' (2-d in a frame). These are a contiguous-memory ndarray (that is newly allocated, unless it is simply passed in, which currently works only for a single dtype). If you then do a setitem, e.g. df['new_columns'] = 5, and you already have an int64 block, then this new column will ultimately be concatenated to it (resulting in a new memory allocation for that dtype). If you were using a reference as a view on this, it will no longer be valid. That's why this is not a strategy you can employ w/o peering at the DataFrame internals.

jreback commented 9 years ago

@bashtage yeh the big cost is the index as you have noted. a RangeIndex (see #939) would solve this problem completely. (it is actually almost done in a side branch, just needs some dusting off).

bashtage commented 9 years ago

Even with an optimized RangeIndex it will still be 2 orders of magnitude slower than constructing a NumPy array, which is fair enough given the much heavier weight nature and additional capabilities of a DataFrame.

I think this can only be considered a convenience function, and not a performance issue. It could be useful to initialize a mixed-type DataFrame or Panel, e.g.:

dtype=np.dtype([('GDP', np.float64), ('Population', np.int64)])
pd.Panel(items=['AU','AT'],
         major_axis=['1972','1973'],
         minor_axis=['GDP','Population'], 
         dtype=[np.float, np.int64])
jreback commented 9 years ago

this is only an API / convenience issue

agreed the perf is really an incidental issue (and not the driver)

quicknir commented 9 years ago

@bashtage

%timeit pd.DataFrame(np.empty((100, 1000000)))
100 loops, best of 3: 15.6 ms per loop

%timeit pd.DataFrame(np.empty((100, 1000000)), copy=True)
1 loops, best of 3: 302 ms per loop

So copying into a DataFrame seems to take 20 times longer than all the other work involved in creating the DataFrame, i.e. the copy (and extra allocation) is 95% of the time. The benchmarks you did do not benchmark the correct thing. Whether the copy itself or the allocation is what's taking the time doesn't really matter; the point is that if I could avoid copies for a multiple-dtype DataFrame the way I can for a single-dtype DataFrame, I could save a huge amount of time.

Your two-orders-of-magnitude reasoning is also misleading. This is not the only operation being performed; there are other operations being performed that take time, like disk reads. Right now, the extra copy I need to do to create the DataFrame is taking about half the time in my simple program that just reads the data off disk and into a DataFrame. If it took 1/20th as much time, then the disk read would be dominant (as it should be) and further improvements would have almost no effect.

So I want to again emphasize to both of you: this is a real performance issue.

@jreback, given that the concatenation strategy does not work for Categoricals, I don't think the improvements you suggested above will work. I think a better starting point would be reindex. The issue right now is that reindex does lots of extra stuff. But in principle, a DataFrame with zero rows has all the information necessary to allow the creation of a DataFrame with the correct number of rows, without doing any unnecessary work. Btw, this makes me really feel like pandas needs a schema object, but that's a discussion for another day.

bashtage commented 9 years ago

I think we will have to agree to disagree. IMO DataFrames are not extreme performance objects in the numeric ecosystem, as shown by the order of magnitude difference between a basic numpy array and a DataFrame creation.

%timeit np.empty((1000000, 100))
1000 loops, best of 3: 1.61 ms per loop

%timeit pd.DataFrame(np.empty((1000000,100)))
100 loops, best of 3: 15.3 ms per loop

Right now, the extra copy I need to do to create the DataFrame is taking about half the time in my simple program that just reads the data off disk and into a DataFrame. If it took 1/20 th as much time, then the disk read would be dominant (as it should be) and further improvements would have almost no effect.

I think this is even less reason to care about DataFrame performance -- even if you can make it 100% free, the total program time only declines by 50%.

I agree that there is scope for you to do a PR here to resolve this issue, whether you want to think of it as a performance issue or as a convenience issue. From my POV, I see it as the latter, since I will always use a numpy array when I care about performance. Numpy does other things, like not using a block manager, which is relatively efficient for some things (like growing the array by adding columns) but bad from other points of view.

There could be two options. The first is an empty constructor, as in the example I gave above. This would not copy anything, but would probably null-fill to be consistent with other things in pandas. Null filling is pretty cheap and is not at the root of the problem IMO.

The other would be to have a method DataFrame.from_blocks that would take preformed blocks to pass straight to the block manager. Something like

DataFrame.from_blocks([np.empty((100,2)), 
                       np.empty((100,3), dtype=np.float32), 
                       np.empty((100,1), dtype=np.int8)],
                     columns=['f8_0','f8_1','f4_0','f4_1','f4_2','i1_0'],
                     index=np.arange(100))

A method of this type would enforce that the blocks have compatible shapes and that all blocks have unique types, as well as the usual checks for the shape of the index and columns. This type of method would do nothing to the data and would use it in the BlockManager.
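
A sketch of just that validation layer (no such from_blocks method exists in pandas; the names here are illustrative only):

import numpy as np

def validate_blocks(blocks, columns, index):
    # Checks a hypothetical from_blocks constructor might perform before
    # handing the arrays straight to the block manager.
    n_rows = len(index)
    if any(b.shape[0] != n_rows for b in blocks):
        raise ValueError("every block must have len(index) rows")
    if sum(b.shape[1] for b in blocks) != len(columns):
        raise ValueError("block columns must add up to len(columns)")
    dtypes = [b.dtype for b in blocks]
    if len(set(dtypes)) != len(dtypes):
        raise ValueError("each block must have a distinct dtype")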

jreback commented 9 years ago

@quicknir you are trying to combine pretty complicated things. Categoricals don't exist in numpy; rather, they are a compound-dtype-like construct that is specific to pandas. You have to construct and assign them separately (which is actually quite cheap; these are not combined into blocks like other singular dtypes).

@bashtage's soln seems reasonable. This could provide some simple checks and simply pass thru the data (and be called by the other internal routines). Normally the user need not concern themselves with the internal repr. Since you really, really want to, you need to be cognizant of this.

All that said, I am still not sure why you don't just create a frame exactly like you want. Then grab the block pointers and change the values. It costs the same memory, and as @bashtage points out, it is pretty cheap to create what is essentially a null frame (with all of the dtypes, index, and columns already set).

quicknir commented 9 years ago

Not sure what you mean by the empty constructor, but if you mean constructing a dataframe with no rows and the desired schema and calling reindex, this is the same amount of time as creating with copy=True.

Your second proposal is reasonable, but only if you can figure out how to do Categoricals. On that subject, I was going through the code and I realized that Categoricals are non-consolidatable. So on a hunch, I created an integer array and two categorical Series, I then created three DataFrames, and concatenated all three. Sure enough, it did not perform a copy even though two of the DataFrames had the same dtype. I will try to see how to get this to work for Datetime Index.
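
A sketch of that experiment (I print rather than assert, since whether memory is actually shared may vary across pandas versions):

import numpy as np
import pandas as pd

ints = pd.DataFrame({'i': np.empty(10, dtype=np.int64)})
cat1 = pd.DataFrame({'c1': pd.Categorical(list('ababababab'))})
cat2 = pd.DataFrame({'c2': pd.Categorical(list('xyxyxyxyxy'))})

result = pd.concat([ints, cat1, cat2], axis=1, copy=False)

# Categorical columns sit in non-consolidatable blocks, so the result
# should still share the original codes arrays.
print(np.shares_memory(result['c1'].values.codes, cat1['c1'].values.codes))
print(np.shares_memory(result['c2'].values.codes, cat2['c2'].values.codes))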

@jreback I still do not follow what you mean by create the frame exactly like you want.

jreback commented 9 years ago

@quicknir why don't you show a code/pseudo-code sample of what you are actually trying to do.

quicknir commented 9 years ago
def read_dataframe(filename, ....):
   f = my_library.open(filename)
   schema = f.schema()
   row_count = f.row_count()
   df = pd.DataFrame.from_empty(schema, row_count)
   dict_of_np_arrays = get_np_arrays_from_DataFrame(df)
   f.read(dict_of_np_arrays)
   return df

The previous code constructed a dictionary of numpy arrays first and then constructed a DataFrame from that, which copied everything. About half the time was being spent on that. So I am trying to change it to this scheme. The thing is that constructing df as above, even when you don't care about the contents, is extremely expensive.

jreback commented 9 years ago

@quicknir dict of np arrays requires lots of copying.

You should simply do this:

# construct your biggest block type (e.g. say you have mostly floats)
df = DataFrame(np.empty((....)),index=....,columns=....)

# then add in other things you need (say strings)
df['foo'] = np.empty(.....)

# say ints
df['foo2'] = np.empty(...)

if you do this by dtype it will be cheap

then.

for dtype, block in df.as_blocks().items():
    # fill the values in place
    block.values[0,0] = 1

as these block values are views into numpy arrays

quicknir commented 9 years ago

The composition of types isn't known in advance in general, and in the most common use case there is a healthy mix of floats and ints. I guess I don't follow how this will be cheap: if I have 30 float columns and 10 int columns, then yes, the floats will be very cheap. But when you do the ints, unless there is some way to do them all at once that I'm missing, each time you add one more column of ints it will cause the entire int block to be reallocated.

The solution you gave me previously is close to working, I can't seem to make it work out for DatetimeIndex.

bashtage commented 9 years ago

Not sure what you mean by the empty constructor, but if you mean constructing a dataframe with no rows and the desired schema and calling reindex, this is the same amount of time as creating with copy=True.

An empty constructor would look like

dtype=np.dtype([('a', np.float64), ('b', np.int64), ('c', np.float32)])
df = pd.DataFrame(columns=list('abc'),index=np.arange(100),dtype=dtype)

This would produce the same output as

dtype=np.dtype([('a', np.float64), ('b', np.int64), ('c', np.float32)])
arr = np.empty(100, dtype=dtype)
df = pd.DataFrame.from_records(arr, index=np.arange(100))

only it wouldn't copy data.

Basically, the constructor would allow a mixed dtype for the following call, which currently works but only with a single basic dtype.

df = pd.DataFrame(columns=['a','b','c'],index=np.arange(100), dtype=np.float32)

The only other feature would be to prevent it from null-filling int arrays which has the side effect of converting them to object dtype since there is no missing value for ints.

bashtage commented 9 years ago

Your second proposal is reasonable, but only if you can figure out how to do Categoricals. On that subject, I was going through the code and I realized that Categoricals are non-consolidatable. So on a hunch, I created an integer array and two categorical Series, I then created three DataFrames, and concatenated all three. Sure enough, it did not perform a copy even though two of the DataFrames had the same dtype. I will try to see how to get this to work for Datetime Index.

The from_blocks method would have to know the rules of consolidation, so that it would allow multiple categoricals, but only one of each other basic type.

jreback commented 9 years ago

yep...this is not that difficult to do....looking for someone who wants to have a gentle introduction to the internals..... hint.hint.hint.... :)

quicknir commented 9 years ago

Haha, I am willing to do some implementation work, don't get me wrong. I will try to look at the internals this weekend and get a sense of which constructor is easier to implement. First though I need to deal with some DatetimeIndex issues I'm having in a separate thread.

ARF1 commented 9 years ago

@quicknir Have you found a solution to this?

I am looking for a way to cheaply allocate (but not fill) a mixed-dtype dataframe to allow copy-less filling of the columns from a cython library.

It would be great if you were willing to share any code you have (even semi-working) to help me get started.

ARF1 commented 9 years ago

Would the following be a sensible approach? I side-stepped re-creating the blocking logic by working from a prototype dataframe.

Which dtypes need special treatment apart from categoricals?

Of course, using the created dataframe is not safe until it has been filled...

import numpy as np
from pandas.core.index import _ensure_index
from pandas.core.internals import BlockManager
from pandas.core.generic import NDFrame
from pandas.core.frame import DataFrame
from pandas.core.common import CategoricalDtype
from pandas.core.categorical import Categorical
from pandas.core.index import Index

def allocate_like(df, size, keep_categories=False):
    # define axes (waiting for #939 (RangeIndex))
    axes = [df.columns.values.tolist(), Index(np.arange(size))]

    # allocate and create blocks
    blocks = []
    for block in df._data.blocks:
        # special treatment for non-ordinary block types
        if isinstance(block.dtype, CategoricalDtype):
            if keep_categories:
                categories = block.values.categories
            else:
                categories = Index([])
            values = Categorical(values=np.empty(shape=(size,),
                                                 dtype=block.values.codes.dtype),
                                 categories=categories,
                                 fastpath=True)
        # ordinary block types
        else:
            new_shape = (block.values.shape[0], size)
            values = np.empty(shape=new_shape, dtype=block.dtype)

        new_block = block.make_block_same_class(values=values,
                                                placement=block.mgr_locs.as_array)
        blocks.append(new_block)

    # create block manager
    mgr = BlockManager(blocks, axes)

    # create dataframe
    return DataFrame(mgr)

# create a prototype dataframe
import pandas as pd
a = np.empty(0, dtype=('i4,i4,f4,f4,f4,a10'))
df = pd.DataFrame(a)
df['cat_col'] = pd.Series(list('abcabcdeff'), dtype="category")

# allocate an alike dataframe
df1 = allocate_like(df, size=10)
jreback commented 9 years ago

@ARF1 not really sure what the end goal is can u provide a simple example

further concat with copy=False will generally side step this

ARF1 commented 9 years ago

@jreback I want to use a cython library to read large-volume data column-by-column from a compressed data store which I want to uncompress directly into a dataframe without intermediary copying for performance reasons.

Borrowing from the usual numpy solution in such cases, I want to pre-allocate the memory for a dataframe so that I can pass pointers to these allocated memory regions to my cython library which can then use ordinary c-pointers/c-arrays corresponding to those memory regions to fill the dataframe directly without intermediary copying steps (or the generation of intermediary python objects). The option to fill the dataframes with multiple cython threads in parallel with released gil would be a fringe benefit.

In (simplified) pseudo-code the idiom would be something like:

df = fn_to_allocate_memory()
columns = df.columns.values
column_indexes = []
for i in xrange(len(df._data.blocks)):
    column_indexes.extend(df._data.blocks[i].mgr_locs.as_array)
block_arrays = [df._data.blocks[i].values for i in xrange(len(df._data.blocks))]

some_cython_library.fill_dataframe_with_content(columns, column_indexes, block_arrays)

Does this make any sense to you?

As I understand it, concat with copy=False will not coalesce columns with identical dtypes into blocks, but operations down the line will trigger this - resulting in the copying I am trying to avoid. Or did I misunderstand the internal operation of pandas?

While I have made some progress with the instantiation of large (non-filled) dataframes (factor ~6.7) I am still far from numpy speeds. Only another factor of ~90 to go...

In [157]: a = np.empty(int(1e6), dtype=('i4,i4,f4,f4,f4,a10'))

In [158]: df = pd.DataFrame(a)

In [162]: %timeit np.empty(int(1e6), dtype=('i8,i4,i4,f4,f4,f4,a10'))
1000 loops, best of 3: 247 µs per loop

In [163]: %timeit allocate_like(df, size=int(1e6))
10 loops, best of 3: 22.4 ms per loop

In [164]: %timeit pd.DataFrame(np.empty(int(1e6), dtype=('i4,i4,f4,f4,f4,a10')))
10 loops, best of 3: 150 ms per loop

Another hope was that this approach might also allow quicker repeated instantiation of identically shaped DataFrames when small-volume data is read frequently. That has not been the main objective so far but inadvertently I made better progress with this: only a factor of ~4.8 to go to numpy speed.

In [157]: a = np.empty(int(1e6), dtype=('i4,i4,f4,f4,f4,a10'))

In [158]: df = pd.DataFrame(a)

In [159]: %timeit np.empty(0, dtype=('i8,i4,i4,f4,f4,f4,a10'))
10000 loops, best of 3: 79.9 µs per loop

In [160]: %timeit allocate_like(df, size=0)
1000 loops, best of 3: 379 µs per loop

In [161]: %timeit pd.DataFrame(np.empty(0, dtype=('i4,i4,f4,f4,f4,a10')))
1000 loops, best of 3: 983 µs per loop

Edit

The above timings paint a far too pessimistic picture as they compare apples to oranges: while the numpy string column is created as fixed-length native strings, the equivalent column in pandas will be created as a python object array. Comparing like with like pushes DataFrame instantiation to numpy speeds, with the exception of the index generation, which is responsible for about 92% of the instantiation time.

jreback commented 9 years ago

@ARF1 if you want numpy speeds, then just use numpy. I am not sure what you are actually doing or what you are doing in cython. The usual solns are to chunk your calculations, pass single dtypes to cython or just get a bigger machine.

DataFrames do quite a lot more than numpy in how they describe and manipulate data. It is not clear what you are actually doing with them.

jreback commented 9 years ago

almost all pandas operations copy (as do most numpy operations), so not sure what you are after.

ARF1 commented 9 years ago

@jreback I am currently using numpy, but I have mixed dtypes which can only (conveniently) be handled with structured arrays. Structured arrays, however, are inherently row-major ordered, which clashes with my typical analysis dimension, leading to poor performance. Pandas looks like the natural alternative due to its column-major ordering - if I can get the data into the dataframe at a good speed.

Of course the alternative would be using a dict of differently dtyped numpy arrays but that makes analysis a pain since slicing etc is no longer possible.

The usual solns are to chunk your calculations, pass single dtypes to cython.

That is what I am doing with the block_arrays variable in my example.

or just get a bigger machine.

A factor of 100+ faster is a bit of a financial challenge to me. ;-)

jreback commented 9 years ago

@ARF1 you have a very odd model of how things work. Typically you create a small number of data frames, then work on them. The creation speed is a tiny fraction of any real computation or manipulation.

quicknir commented 9 years ago

@jreback: this is not an odd model. Maybe it is an odd model if you view things from a pure python perspective. If you are working with C++ code, the easiest way to read data into python objects is to pass it pointers to pre-existing python objects. If you are doing this in a performance sensitive context, you want a cheap and stable (in the sense of memory location) way to create the python object.

I'm honestly not sure why this attitude is common on the pandas boards. I think it's unfortunate, insofar as while I understand that pandas is a higher level construct than numpy, it could still be easier for people to develop "on top" of pandas. The pandas DataFrame is by far the most desirable type to work with if you have C code that wants to spit tabular data into python, so this really seems like an important use case.

Please don't take what I'm writing negatively, if I didn't think pandas DataFrames were so awesome, I would just use numpy records or something like that and be done with it.

@ARF1: Ultimately, I don't remember the reasons, but the best I was able to do was to create a DataFrame for each numeric type from a numpy array with copy=False, and then use pandas.concat with copy=False again to concatenate them. When you create a single-type DataFrame from a numpy array, be very careful about the orientation of the numpy array. If the orientation is wrong, then the numpy arrays corresponding to each column will be non-trivially strided, and pandas doesn't like this and will make a copy at the first opportunity. You can tack on the Categoricals at the end, as they do not get consolidated and shouldn't trigger any copies of the rest of the frame.

I recommend writing some unit tests that perform this operation step by step and continually grab the pointers to the underlying data (via the __array_interface__ of the underlying numpy array) and verify that they are the same, to ensure that the copy is actually being elided. It's a very unfortunate decision by pandas that copy/inplace parameters do NOT have to be honored. That is, even if you set e.g. copy=False for a DataFrame constructor, pandas will still perform a copy if it decides that it needs to in order to construct the DataFrame. The fact that pandas does this instead of throwing when arguments cannot be honored makes reliably writing code that elides copies very exhausting, and requires being extremely methodical. If you don't write unit tests to verify, you may accidentally tweak something later that causes a copy to be made, and it will happen silently and ruin your performance.
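
A minimal sketch of such a check, using a single-dtype, Fortran-ordered source array (exact behaviour varies across pandas versions, particularly under copy-on-write, which is exactly why the checks are worth writing):

import numpy as np
import pandas as pd

# column-major source so that each column is a contiguous, unstrided view
arr = np.empty((1000, 3), order='F')
df = pd.DataFrame(arr, copy=False)

# if the copy was elided, the frame shares memory with the source array
# (equivalently, compare arr.__array_interface__['data'][0] addresses)
print(np.shares_memory(np.asarray(df[0]), arr))

# and writes through the source array are visible in the frame
arr[0, 0] = 123.0
print(df.iloc[0, 0] == 123.0)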

jreback commented 9 years ago

@quicknir if you say so. I think you should simply profile before you try to optimize things. As I said before, and prob will again. Construction time should not dominate anything. If it does, then you are just using the DataFrame to hold things, so what's the point of using it in the first place? If it doesn't dominate, then what is the problem?

quicknir commented 9 years ago

@jreback You write that, assuming that I have not already profiled. In fact, I have. We have c++ and python code that both de-serialize tabular data from the same data format. While I expected the python code to have a bit of overhead, I figured the difference should be small, because disk read time should dominate. This was not the case, before I went and extremely carefully reworked things to minimize copies, the python version was taking twice as long or worse compared to the C++ code, and almost all the overhead was just in creating the DataFrame. In other words, it took roughly as long to simply create a DataFrame of a certain very large size whose contents I didn't care about at all, as to read, decompress, and write the data I cared about into that DataFrame. That's extremely poor performance.

If I was an end user of this code with specific operations in mind, maybe what you're saying about construction not dominating would be valid. In reality, I'm a developer, and the end users of this code are other people. I don't know exactly what they will be doing with the DataFrame, the DataFrame is the one way to get an in memory representation of the data on disk. If they want to do something very simple with the data on disk, they still have to go through the DataFrame format.

Obviously, I could support more ways of getting at the data (e.g. numpy constructs), but this would greatly increase branching in the code, and make things much harder for me as a developer. If there was some fundamental reason why DataFrames need to be so slow I would understand, and decide whether to support DataFrame, numpy, or both. But there is no real reason why it needs to be so slow. One could write a DataFrame.empty method that takes an array of tuples where each tuple contains the column name and type, and the number of rows.

This is the difference I mean between supporting users and library writers. It's easier to write your own code than to write a library. And it's easier to have your library only support users instead of other library writers. I just think in this case, empty allocation of DataFrames would be low hanging fruit in pandas that would make the life of people like me and @ARF1 easier.

jreback commented 9 years ago

well if you would like to have a reasonable tested documented soln, all ears. pandas has quite a few users / developers. That is the reason the DataFrame is so versatile and the same reason why it needs lots of error checking and inference. You are welcome to see what you can do as described above.

quicknir commented 9 years ago

I am willing to put some time implementing this, but only if there is some reasonable consensus on the design from a few of the pandas developers. If I submit a pull request and there's certain things people want to change, that's cool. Or if I realize after I've put ten hours into it that there's no way to do something cleanly, and the only way to do it might involve something people think is objectionable, that's cool too. But I'm not really cool with spending X hours and being told this isn't that useful, the implementation is messy, we don't think it can really be cleaned up, complicates the codebase, etc. I don't know if I'm way off with this sentiment, I haven't made major contributions to an OSS project before so I don't know how it works. It's just that in my initial post I started off proposing this very thing, and then frankly I got the impression from you that it was sort of "out of scope" for pandas.

If you want I can open a new issue, create as specific a design proposal as I can, and once there is feedback/tentative approval I will work on it when I'm able.

jreback commented 9 years ago

@quicknir the key thing is that it must pass the entire test suite, which is pretty comprehensive.

This is not out of scope of pandas, but the API must be somewhat user-friendly.

I am not sure why you didn't like

concat(list_of_arrays, axis=1, copy=False). I believe this does exactly what you want (and if not, then it's not clear what you actually do want).

quicknir commented 9 years ago

I ended up using a similar technique, but with a list of DataFrames that were created from a single numpy array, each of different types.

First off, I think I still ran into some copies when I did this technique. As I said, pandas doesn't always honor copy=False, so it's very exhausting to see if your code is actually copying or not. I really wish that for pandas 17, devs would consider making copy=True the default, and then copy=False throws when a copy cannot be elided. But anyhow.

Second, another issue was having to reorder the columns afterwards. This was surprisingly awkward, the only way I could find to do this without a copy being made was to originally make the column names integers that were ordered in the desired final order. I then did an index sort in place. I then changed the column names.

Third, I found that copies were unavoidable for timestamp types (numpy datetime64).

I wrote this code a while ago so it's not fresh in my mind. It's possible I made mistakes, but I went through it pretty carefully and those were the results I came up with at the time.

The code you give above does not even work for numpy arrays. It fails with: TypeError: cannot concatenate a non-NDFrame object. You have to make them DataFrames first.

It's not that I don't like the solution you've given here, or above. I just have yet to see a simple one that works.

jreback commented 9 years ago

@quicknir well my example above works. pls provide exactly what you are doing and I can try to help you.

quicknir commented 9 years ago

pd.concat([np.zeros((2,2))], axis=1, copy=False)

I'm on pandas 0.15.2, so perhaps this started working in 0.16?

jreback commented 9 years ago

pls read the doc-string of pd.concat. you need to pass a DataFrame

jreback commented 9 years ago

btw copy=True IS the default

quicknir commented 9 years ago

Right, that's what I wrote. The code snippet you wrote above had list_of_arrays, not list_of_dataframes. Anyhow, I think we understand each other. I did end up using the pd.concat method, but it's pretty non-trivial; there's a whole bunch of gotchas to trip people up:

1) You must create a list of DataFrames. Each DataFrame must have exactly one distinct dtype. So you have to collect up all the different dtypes before you start.

2) Each DataFrame must be created from a single numpy array of the desired dtype, the same number of rows, the desired number of columns, and the order='F' flag; if order='C' (the default) then pandas will often make copies when it otherwise wouldn't.

3) Disregard 1) for Categoricals; they are not amalgamated into a block, so you can tack them on later.

4) When you create all the individual DataFrames, the columns should be named using integers that represent the order you want them in. Otherwise there may be no way to change the column order without triggering copies.

5) Having created your list of DataFrames, use concat. You'll have to painstakingly verify that you didn't mess anything up, because copy=False will not throw if a copy cannot be elided, but rather silently copy.

6) Sort the column index to achieve the ordering you want, then rename the columns.

I applied this procedure rigorously. It's not a one-liner, there are a lot of places to make mistakes, I'm pretty sure it still didn't work for timestamps, and there's a lot of unnecessary overhead that could be elided by not restricting oneself to the public interface. If you like, I can write up a draft of what this function looks like using only the public API, maybe combined with some tests to see if it's really eliding copies, and for which dtypes; a rough sketch follows below.
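
A rough sketch of that recipe, written against the pandas API of that era (the copy keyword on concat, order='F' source arrays); the helper and column names are illustrative, categoricals (step 3) are not handled, and copy elision still has to be verified as described:

import numpy as np
import pandas as pd

def empty_frame(schema, rows):
    # schema is an ordered list of (column_name, dtype) pairs
    # steps 1-2 and 4: one F-ordered, single-dtype array per dtype group,
    # with integer column labels that encode the desired final position
    by_dtype = {}
    for pos, (name, dtype) in enumerate(schema):
        by_dtype.setdefault(np.dtype(dtype), []).append(pos)
    frames = []
    for dtype, positions in by_dtype.items():
        arr = np.empty((rows, len(positions)), dtype=dtype, order='F')
        frames.append(pd.DataFrame(arr, columns=positions, copy=False))
    # step 5: concatenate without copying (verify this for your version!)
    out = pd.concat(frames, axis=1, copy=False)
    # step 6: sort the integer labels in place, then restore the real names
    out.sort_index(axis=1, inplace=True)
    out.columns = [name for name, _ in schema]
    return out

df = empty_frame([('a', np.float64), ('b', np.int32), ('c', np.float64)], rows=1000)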

Also, copy=False is the default for e.g. the DataFrame constructor. My main point is more that a function that cannot honor its arguments should throw rather than "do something reasonable". That is, if copy=False cannot be honored, an exception should be thrown so the user knows they either have to change other inputs so that copy elision can take place, or they have to change copy to True. A copy should never happen silently when copy=False, this is more surprising and less conducive to a performance conscious user finding bugs.