pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Expose the blocks API and disable automatic consolidation #10556

Closed: shoyer closed this issue 1 year ago

shoyer commented 9 years ago

In my discussions with Jonathan and others at the SciPy sprints, we agreed that it would be really nice to expose some minimal tools for manipulating and viewing the internal pandas blocks system. For example, it should be possible to:

  1. manually consolidate blocks
  2. view a representation of the internal blocking of a dataframe (via matplotlib?)

It's not so much that we want users to create and use blocks directly, but that we want to make it easier to understand the internal data model and make performance more predictable.

At the same time, we would like to disable automatic consolidation of blocks in the DataFrame constructor and when inserting new columns. Consolidation is certainly a useful feature, but it is currently not always possible to even predict when it will happen.

Most users never notice or care about consolidation. Power users (concerned about memory or performance) are at least as likely to find it frustrating as helpful, so we should make this something that they can trigger explicitly (as part of the blocks API). This would make it possible to create dataframes while guaranteeing that none of the data is copied (#9216).
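For concreteness, the current block behavior can be inspected through the private `_mgr` attribute; this is an internal, unstable API, shown here only as a sketch of what a public consolidation method would wrap (assumes a recent pandas where column insertion no longer consolidates silently):

```python
import numpy as np
import pandas as pd

# A homogeneous 2D array becomes a single consolidated float64 block.
df = pd.DataFrame(np.ones((4, 3)), columns=list("abc"))
print(df._mgr.nblocks)  # 1

# Inserting a column adds a second block rather than consolidating eagerly.
df["d"] = np.arange(4.0)
print(df._mgr.nblocks)  # 2

# Explicit consolidation merges same-dtype blocks back into one.
print(df._mgr.consolidate().nblocks)  # 1
```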

cc @jonathanrocher @sinhrks @jreback @cpcloud @TomAugspurger @ARF1 @quicknir

jonathanrocher commented 9 years ago

Yes, I am definitely a big +1 on both of these ideas. I need to understand the current state better, before experimenting with such a tool. Just to clarify what @shoyer means by the blocks API, we discussed exposing these view and consolidation methods via a blocks attribute: df.blocks.view(), df.blocks.consolidate(), or something like that.

jreback commented 9 years ago

well

df.as_blocks(copy=False) (the copy flag was added in master recently) gives you a view right now

and as discussed, I would like to see why this should actually be exposed to the user. What is the purpose? A good use case would go a long way. Exposing an implementation is only guaranteed to cause issues down the road when one wants to change it (and cannot, because it's exposed)

shoyer commented 9 years ago

I don't think we should expose the full blocks API. Rather, we should expose a limited set of functionality suitable for basic understanding and manipulation of blocks (mostly .consolidate_blocks()), so we can have the option of disabling automatic consolidation. If we ever change the underlying implementation, we can simply make this method a no-op.

For viewing block information, we might expose a tuple, similarly to the .chunks attribute on dask arrays.

sinhrks commented 9 years ago

I assume exposing some API is about letting power users experiment, rather than daily usage? I think adding more detail to internals.rst would be very nice (#4082), but I am not sure what level of "public" is actually required (guaranteeing backward compat, adding to api.rst, etc.).

+1 for optionally disabling auto-consolidation; it should allow pandas to handle larger data.

jreback commented 9 years ago

I still have not seen any reason to actually expose anything about the internals as an API.

It seems that one might want to control how the internal engine works, e.g. simply have an option at construction that allows a potentially different internal representation.

No consolidation is a nice idea on the surface, and there may be a reason to allow it and/or switch it on as a default. HOWEVER, I would like to see someone do a performance study to see whether it actually makes a difference in a non-trivial case.

Turning off auto-consolidation is actually trivial, so if someone would like to step up, all ears.

quicknir commented 9 years ago

@shoyer Thanks for tagging me on this. I am a strong +1. As an example use case, @ARF1 showed in #9216 how one could use the block manager interface to create an empty DataFrame nearly 100 times faster than the naive approach. It would be nice to be able to do this without accessing underscore-prefixed implementation methods that are subject to change without warning.

When developing a C/C++ library that interfaces with pandas, memory layout is critical. A larger public interface with better guarantees and more explicit control over memory layout would be very helpful. I can give other examples in pandas where similar issues arise.

jreback commented 9 years ago

https://github.com/jreback/pandas/commit/44c280bbebe7f4e7f1c041ee40708a51fa8d03a6

is a branch that allows a parameter policy to control how consolidation works.

It's not completely automatic at this point: you have to pass policy='split' or policy='column' to get this behavior (which could become an option, e.g. support.consolidation). Further, for a dict we can simply default to column, I think.

note that:

jreback commented 9 years ago

Here are some perf considerations

In [3]: np.random.seed(1234)

# blocks
In [4]: df_blocks = DataFrame(np.random.randn(100000,1000))

# create a single ndarray, then split it
In [5]: df_split = DataFrame(np.random.randn(100000,1000),policy='split')

# create a dict of ndarrays, no need to split
In [10]: df_dsplit = DataFrame(dict([ (i,np.random.randn(100000)) for i in range(1000) ]),policy='split')

Creations

# block
In [7]: %timeit DataFrame(np.random.randn(100000,1000))
1 loops, best of 3: 3.79 s per loop

# ndarray-split
In [6]: %timeit DataFrame(np.random.randn(100000,1000),policy='split')
1 loops, best of 3: 5.93 s per loop

# dict-split
In [14]: %timeit DataFrame(dict([ (i,np.random.randn(100000)) for i in range(1000) ]),policy='split')
1 loops, best of 3: 3.93 s per loop

row-wise ops

# blocked
In [8]: %timeit df_blocks.sum()
1 loops, best of 3: 1.55 s per loop

# ndarray-split
In [9]: %timeit df_split.sum()
1 loops, best of 3: 607 ms per loop

# dict-split
In [15]: %timeit df_dsplit.sum()
1 loops, best of 3: 617 ms per loop

So you may be surprised by the speed of the splits. This is because the split columns are already C-contiguous, while the block memory is NOT contiguous for this access pattern: it's in C order, but not laid out for a row-wise op.
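The contiguity effect can be illustrated with plain NumPy, independent of pandas internals (a sketch; the array sizes are arbitrary):

```python
import numpy as np

# One consolidated block: a single C-contiguous 2D array.
block = np.random.randn(1000, 100)
print(block.flags["C_CONTIGUOUS"])  # True

# Slicing out one column yields a strided view: consecutive elements
# sit 100 doubles apart in memory, defeating cache-friendly access.
strided = block[:, 0]
print(strided.flags["C_CONTIGUOUS"])  # False

# A "split" column is copied into its own contiguous 1D buffer.
split = np.ascontiguousarray(block[:, 0])
print(split.flags["C_CONTIGUOUS"])  # True
```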

shoyer commented 9 years ago

For the creation benchmarks, it would be helpful to put generating the random values outside the loop. This does look very encouraging, though.
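A minimal sketch of that benchmarking pattern with the stdlib `timeit` module (the sizes are arbitrary; the point is only that the random input is generated once, outside the timed statement):

```python
import timeit

import numpy as np
import pandas as pd

# Generate the input once, outside the timed loop, so the benchmark
# measures DataFrame construction rather than np.random.randn.
arr = np.random.randn(10_000, 100)

elapsed = timeit.timeit(lambda: pd.DataFrame(arr), number=100)
print(f"{elapsed / 100:.6f} s per construction")
```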

jbrockmendel commented 7 years ago

Would this make the "dict of Series" description of a DataFrame more reliable? The sometimes-vexing behavior I'm thinking of is

>>> arr = np.random.randn(1000, 5)
>>> df = pd.DataFrame(arr, columns=list('ABCDE'))
>>> A = df['A']
>>> G = 2*A
>>> G.name = 'G'
>>> df['G'] = G
>>> df['G'] is G
False
>>> df['A'] is A
False

Having id(df['A']) be consistent would make keeping track of _metadata simpler in some cases.

jreback commented 7 years ago

this issue is not going to happen for pandas 1.x. In pandas 2.x it will be moot, as there will be no more blocks.

jorisvandenbossche commented 6 years ago

I rebased the commit of @jreback to play a bit with it to do some more timings. Using the same example as above, but now excluding the array creation and on more operations:

In [2]: np.random.seed(1234)
   ...: array = np.random.randn(100000,1000)
   ...: array_dict = dict([ (i,np.random.randn(100000)) for i in range(1000) ])

In [3]: # blocks
   ...: df_blocks = DataFrame(array)
   ...: 
   ...: # create a single ndarray, then split it
   ...: df_split = DataFrame(array, policy='split')
   ...: 
   ...: # create a dict of ndarrays, no need to split
   ...: df_dsplit = DataFrame(array_dict, policy='split')

Creation: splitting the array is much slower, but when you already have a dict of arrays it is, logically, faster not to consolidate them:

In [4]: %timeit DataFrame(array)
102 µs ± 932 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [5]: %timeit DataFrame(array, policy='split')
3.01 s ± 93.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit DataFrame(array_dict, policy='split')
29.5 ms ± 1.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit DataFrame(array_dict, policy='block')
627 ms ± 4.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Operations column-wise:

In [8]: # blocked
   ...: %timeit df_blocks.sum()
994 ms ± 19.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: # splitted
    ...: %timeit df_split.sum()
1.99 s ± 84.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Operation row-wise:

In [11]: # blocked
    ...: %timeit df_blocks.sum(axis=1)
889 ms ± 29.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [12]: # splitted
    ...: %timeit df_split.sum(axis=1)
1.85 s ± 27.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Slicing (much slower!):

In [21]: %timeit df_blocks.iloc[0:1000]
106 µs ± 600 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [22]: %timeit df_split.iloc[0:1000]
    ...: 
16.9 ms ± 66.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So in general, as expected, it gives a slowdown on the above operations (this is, of course, the specific case of many all-float columns; only iterating over the rows with e.g. itertuples was faster with the split version). But the slicing operation in particular is much, much slower.

So in any case, if we want to explore this further, I think it will certainly mean refactoring the BlockManager to make operations that loop over the columns faster.

wesm commented 4 years ago

I've run up against this issue myself (like fighting with my own ghost) in the context of ARROW-3789 as there are use cases where users are converting very large (tens of gigabytes) Arrow tables and wish to avoid excess allocations / memory copying. In certain narrow use cases (i.e. no strings, no nulls), it is actually possible to do fully zero-copy construction of DataFrame from Arrow memory.

This brings up the spectre of unpredictable consolidation. I don't know whether any mitigations have been implemented in the last 4 years, but I'm interested to hear if there are ideas.

ARF1 commented 4 years ago

@wesm I had a similar problem with performant bcolz integration a few years back. Would this approach help in the meantime? While it is not zero-copy, in my case the memory copy had a negligible performance impact compared to the pandas block construction.

This worked for my fairly limited use case then. No thorough testing was done. Also, things may have changed since then. I have not used this in years.

wesm commented 4 years ago

I've seen use cases where the memory doubling will cause OOM, so zero-copy / no consolidation is essential
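For the all-numeric case, copy-free construction is already expressible through the public `copy` keyword; a sketch (behavior may vary with copy-on-write settings in newer pandas):

```python
import numpy as np
import pandas as pd

arr = np.random.randn(100_000, 10)  # stand-in for a large external buffer
df = pd.DataFrame(arr, copy=False)  # request a view: no memory doubling

# The frame's single block still shares the original buffer.
print(np.shares_memory(arr, df.to_numpy()))  # True
```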

jorisvandenbossche commented 4 years ago

I don't know if there have been any mitigations implemented in the last 4 years about this but I'm interested to see if there are ideas

Personally, I don't think getting such a feature into pandas would be that hard. The commit linked above is not that complicated. And now that we have the new metadata machinery (the attrs attribute), we could maybe use it to attach the "consolidation policy" as metadata to the DataFrame.

If there is some agreement that we would still want to merge something like this, I could try to revive that branch.


Generally speaking, I think the long-term path forward is the 1D ExtensionArray-based columns, which as a direct consequence of being 1D never get consolidated, moving us towards a "collection of 1D arrays" model for the DataFrame. We already have a few such dtypes, and I think the plan is to keep adding more. But even if we get there eventually, this is long-term work (certainly until those become the default). So personally, I think it is still worth considering the "no consolidation" policy for the shorter term.
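The non-consolidation of 1D columns is already observable with the nullable extension dtypes, again peeking at the private and unstable `_mgr` attribute (a sketch):

```python
import pandas as pd

# Two float64 columns consolidate into one 2D NumPy-backed block...
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
print(df._mgr.nblocks)  # 1

# ...but ExtensionArray-backed columns each keep their own 1D block.
df_ea = pd.DataFrame({"a": pd.array([1, 2], dtype="Int64"),
                      "b": pd.array([3, 4], dtype="Int64")})
print(df_ea._mgr.nblocks)  # 2
```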

jorisvandenbossche commented 4 years ago

I have updated the branch adding a consolidation policy once again (starting from @TomAugspurger's recent rebase, see the linked commit by github above).

And with that, I experimented a bit comparing a few operations with a consolidated DataFrame vs non-consolidated DataFrame: see notebook at https://gist.github.com/jorisvandenbossche/b8ae071ab7823f7547567b1ab9d4c20c

Of course, for several operations it is faster to operate on a 2D array than on many 1D arrays. However, with a decent-sized dataframe (I used 100_000 rows x 100 columns here), the difference is generally not that big: at this size, the actual operation dominates the overhead of doing things column-wise. A few operations were actually faster column-wise (although the block-wise versions could quite likely be made just as fast with some time investment).

My general sense (based, of course, on very limited tests) is that the slowdown with non-consolidated dataframes for typical operations is not gigantic. Right now there is often a lot of overhead in finding the correct block type, reconstructing blocks, etc., which we should be able to optimize for non-consolidated dataframes, removing a large part of this slowdown. And of course, there are also operations that will actually benefit from non-consolidated dataframes (especially where copies can be avoided this way).

jreback commented 4 years ago

@jorisvandenbossche you are missing a big issue here

just operating column wise is not a good test

you need to actually copy the original frame into single columns so the memory gets realigned; taking a view does nothing here, which is effectively what you are doing

you will see massive speed-ups if you get proper cache access for some types of operations

please see my original PR on non-consolidation (a while back)

jorisvandenbossche commented 4 years ago

@jreback I am not really sure what you mean. Can you clarify a bit?

just operating column wise is not a good test

Why not? That's what we would do if we didn't have consolidation. Of course, it's still using 2D blocks to store 1D arrays, which is not ideal (we should be able to optimize this once we actually have 1D blocks), but it should already give an idea of the overhead of doing things column-wise.

you need to actually copy the original frame into single columns where the memory gets realigned; taking a view does nothing here which is effectively what you are doing

Can you explain what you mean? (maybe a code snippet what I should do) You mean that df2 = pd.DataFrame(arr, policy="split") is not actually properly splitting the array? (as far as I can see, it is actually taking a copy, as the df2 is no longer a view on the original array)

please see me original PR in non consolidation (a while back)

Do you have a link? (I was not aware of a PR about this)

jreback commented 4 years ago

see above : https://github.com/pandas-dev/pandas/issues/10556#issuecomment-143594292

you need to make really sure that ops are actually working with cache aligned data and not with just a view from a 2D array which is not aligned at all, otherwise your timings are not real

jorisvandenbossche commented 4 years ago

see above : #10556 (comment)

I am still not fully clear what you are pointing to. The linked comment explains the added features in your branch (the different policy options, mentioning that policy="split" results in "proper memory layout"), but it is this branch (well, your old branch but rebased on current master) that I have been using for running the notebook.

you need to make really sure that ops are actually working with cache aligned data and not with just a view from a 2D array which is not aligned at all, otherwise your timings are not real

Yes, but as far as I can see, this is the case (the Block's values in the "split" case are properly laid out, not a strided view into the original)

jbrockmendel commented 4 years ago

FWIW, i think the lowest-hanging fruit is to not call consolidate_inplace in create_block_manager_from_arrays and create_block_manager_from_blocks. Also some avoidable places like BlockManager.is_numeric_mixed_type


jbrockmendel commented 1 year ago

We now fully avoid silent consolidation. Is this still active?

mroeschke commented 1 year ago

This discussion doesn't seem relevant anymore with the removal of silent consolidation, so closing.