scikit-hep / awkward

Manipulate JSON-like data with NumPy-like idioms.
https://awkward-array.org
BSD 3-Clause "New" or "Revised" License
829 stars 85 forks source link

Exponential memory growth in UnionArray broadcasting #3118

Closed Teddy-Curtis closed 4 months ago

Teddy-Curtis commented 4 months ago

Version of Awkward Array

2.6.4

Description and code to reproduce

Hey!

To jump right in: I have recently switched from using awkward v1.10.3, to the newest version v2.6.4. I have, however, been running into some very weird issues when adding fields to union arrays.

So a bit of backstory: I am working with electrons and muons, I then try and find dilepton pairs (e-e, mu-mu) and finally I am adding extra information about these dileptons (e.g. angle of dileptons to the missing energy of the event). For some reason, however, after switching to the newer version of awkward I am suddenly running into issues. Specifically, I am running into issues when I add the new fields to the dilepton array. For example, if I try and add new fields like this: dileptons[("Dilepton", "mass")] = <Some array> it will eventually crash from using too much memory, but this issue does not happen when I use awkward v1.10.3. Furthermore, even adding a single field without crashing seems to take longer in v2.6.4 than v1.10.3.

Here is a short snippet that hopefully highlights the issue that I am having. However, instead of having much larger arrays, to show what I mean I am using much smaller ones, and so I have to repeatedly apply the buggy line before the memory gets so big it crashes. In my actual code I am not just repeatedly trying to add the same field!

import awkward as ak
import vector
vector.register_awkward()

# Array of electrons
electrons = ak.Array(
        [[],
        [{'pt': 80.5, 'eta': -1.11, 'phi': 1.46, 'mass': 0.0236, "WP90" : True}, {'pt': 25.9, 'eta': -0.954, 'phi': 1.71, 'mass': 0.00993, "WP90" : True}],
        [{'pt': 40.8, 'eta': 0.449, 'phi': 1.88, 'mass': 0.00444, "WP90" : True}, {'pt': 14.7, 'eta': 0.787, 'phi': 3.04, 'mass': 0.00324, "WP90" : True}],
        [{'pt': 55.3, 'eta': 0.449, 'phi': -1.16, 'mass': -0.0147, "WP90" : True}, {'pt': 40.2, 'eta': 0.664, 'phi': -1.95, 'mass': 0.0129, "WP90" : True}]]
)
muons = ak.Array(
        [[{'pt': 41.2, 'eta': -2.24, 'phi': -0.45, 'mass': 0.106}, {'pt': 39.3, 'eta': -1.65, 'phi': -1.01, 'mass': 0.106, "tightID" : True}],
        [],
        [],
        []]
)

# Convert to Momentum4D vectors so that we can do vector operations e.g. add as 4-vectors
muons_4V = ak.Array(muons, with_name = "Momentum4D")
elecs_4V = ak.Array(electrons, with_name = "Momentum4D")

# Find e-e and mu-mu pairs
mm_pairs = ak.combinations(muons_4V, 2, fields = ["LeadLepton", "SubleadLepton"])
ee_pairs = ak.combinations(elecs_4V, 2, fields = ["LeadLepton", "SubleadLepton"])

# Concatenate these together
dileptons = ak.concatenate([ee_pairs, mm_pairs], axis = 1)

# Add the 4-vectors to get the 4-vector of the dilepton
dileptons["Dilepton"] = dileptons.LeadLepton + dileptons.SubleadLepton 

# 

mass = dileptons.Dilepton.mass

# This eventually crashes! It also goes very slow before it crashes!
for i in range(50):
    dileptons[("Dilepton", "mass")] = mass

I should note that this doesn't crash if I just ran once:

dileptons[("Dilepton", "mass")] = mass

It seems to be the creating of the field that retains something which causes memory to build up after repeated calls and then it crashes.

If I run the above code with the older awkward version then it doesn't crash. If I do this in a jupyter notebook as well, I can't even run

%%timeit
dileptons[("Dilepton", "mass")] = mass

because this ends up crashing before it finishes.

Finally, I should add why I specifically mentioned union arrays. As you can see in the electron and muon fields, the electron array contains the field "WP90" which is not in muons, and the muon array contains the field "tightID" which is not in the electron array. Therefore when you concatenate the ee_pairs and mm_pairs you get a union array:

[[{LeadLepton: {pt: 41.2, eta: -2.24, ...}, SubleadLepton: {...}, ...}],
 [{LeadLepton: {pt: 80.5, eta: -1.11, ...}, SubleadLepton: {...}, ...}],
 [{LeadLepton: {pt: 40.8, eta: 0.449, ...}, SubleadLepton: {...}, ...}],
 [{LeadLepton: {pt: 55.3, eta: 0.449, ...}, SubleadLepton: {...}, ...}]]
------------------------------------------------------------------------
type: 4 * var * union[
    {
        LeadLepton: Momentum4D[
            pt: float64,
            eta: float64,
            phi: float64,
            mass: float64,
            WP90: bool
        ],
        SubleadLepton: Momentum4D[
            pt: float64,
            eta: float64,
            phi: float64,
            mass: float64,
            WP90: bool
        ],
        Dilepton: Momentum4D[
            rho: float64,
            phi: float64,
            eta: float64,
            tau: float64
        ]
    },
    {
        LeadLepton: Momentum4D[
            pt: float64,
            eta: float64,
            phi: float64,
            mass: float64,
            tightID: ?bool
        ],
        SubleadLepton: Momentum4D[
            pt: float64,
            eta: float64,
            phi: float64,
            mass: float64,
            tightID: ?bool
        ],
        Dilepton: Momentum4D[
            rho: float64,
            phi: float64,
            eta: float64,
            tau: float64
        ]
    }
]

If I now take out both the fields "WP90", "tightID" from the electron and muons arrays, respectively, I no longer get a union array (I guess because all the fields match):

[[{LeadLepton: {pt: 41.2, eta: -2.24, ...}, SubleadLepton: {...}, ...}],
 [{LeadLepton: {pt: 80.5, eta: -1.11, ...}, SubleadLepton: {...}, ...}],
 [{LeadLepton: {pt: 40.8, eta: 0.449, ...}, SubleadLepton: {...}, ...}],
 [{LeadLepton: {pt: 55.3, eta: 0.449, ...}, SubleadLepton: {...}, ...}]]
------------------------------------------------------------------------
type: 4 * var * {
    LeadLepton: Momentum4D[
        pt: float64,
        eta: float64,
        phi: float64,
        mass: float64
    ],
    SubleadLepton: Momentum4D[
        pt: float64,
        eta: float64,
        phi: float64,
        mass: float64
    ],
    Dilepton: Momentum4D[
        rho: float64,
        phi: float64,
        eta: float64,
        tau: float64
    ]
}

Now if I run the code above, but with the "WP90", "tightID" removed, it no longer crashes! So I am guessing that it is something to do with the union bit? Also not sure exactly why it becomes a union because as far as I can tell the fields "WP90", "tightID" aren't even accessible through the dilepton array?

Not exactly sure what is going on here, but would be interesting to hear your thoughts!

Thanks!

agoose77 commented 4 months ago

@jpivarski @tcawlfield this is interesting! I think this is a leak / bug of some kind; even manually stepping through the loop eventually blows up my RSS and leads to paging. i.e. it feels non time-dependent (i.e. not gc; I checked manually!) and rather it feels stateful.

I would suspect this might be our growable buffer (due to the stateful-memory aspect), but I haven't done anything beyond checking this reproduces for me on main.

jpivarski commented 4 months ago

Here's a simpler reproducer. I found that the behaviors are unnecessary and the first level of list-nesting is unnecessary (axis=10).

import awkward as ak

one_a = ak.Array([{"x": 1, "y": 2}], with_name="T")
one_b = ak.Array([{"x": 1, "y": 2}], with_name="T")
two_a = ak.Array([{"x": 1, "z": 3}], with_name="T")
two_b = ak.Array([{"x": 1, "z": 3}], with_name="T")
three = ak.Array([{"x": 1}, {"x": 1}], with_name="T")

first = ak.zip({"a": one_a, "b": one_b})
second = ak.zip({"a": two_a, "b": two_b})

cat = ak.concatenate([first, second], axis=0)

cat["another"] = three

for i in range(50):
    cat["another", "w"] = three.x

If you do fewer than 50 steps, you can examine this object. It takes a noticeably long time to show you its layout. The layout itself doesn't look bad, but it's odd that it's unresponsive. The issue must involve arrays that are somehow invisible, and they must get double in each step or something, so that the memory use can actually get large after only 50 steps (some $\mathcal{O}(\exp n)$ scaling).

I don't see how GrowableBuffers could be involved. ArrayBuilder and GrowableBuffers aren't used in the broadcast-and-apply step in the range(50) loop. (We could build one_a, one_b, two_a, two_b, three from layouts to avoid all ArrayBuilders, but I don't think that would help much.)

jpivarski commented 4 months ago

I inserted

import psutil

p = psutil.Process()

at the top and

print(i, p.memory_full_info().uss / 1024**2)

in the loop and got

0 89.3359375
1 89.44140625
2 89.46484375
3 89.46875
4 89.47265625
5 89.484375
6 89.5
7 89.51953125
8 89.53125
9 89.54296875
10 89.546875
11 89.57421875
12 89.6328125
13 89.76171875
14 90.015625
15 90.5234375
16 91.5234375
17 93.5234375
18 97.5234375
19 105.52734375
20 121.52734375
21 153.5078125
22 185.53515625
23 249.53515625
24 377.54296875
25 633.546875
26 1145.546875
27 2169.55078125
28 4217.55078125
29 8313.55078125
30 16505.5546875

It doubles in every step.

jpivarski commented 4 months ago

Just before the for loop, cat has value

[{a: {x: 1, y: 2}, b: {x: 1, y: 2}, another: {x: 1}},
 {a: {x: 1, z: 3}, b: {x: 1, z: 3}, another: {x: 1}}]

type

2 * union[
    {
        a: T[
            x: int64,
            y: int64
        ],
        b: T[
            x: int64,
            y: int64
        ],
        another: T[
            x: int64
        ]
    },
    {
        a: T[
            x: int64,
            z: int64
        ],
        b: T[
            x: int64,
            z: int64
        ],
        another: T[
            x: int64
        ]
    }
]

and layout

<UnionArray len='2'>
    <tags><Index dtype='int8' len='2'>
        [0 1]
    </Index></tags>
    <index><Index dtype='int64' len='2'>
        [0 0]
    </Index></index>
    <content index='0'>
        <RecordArray is_tuple='false' len='1'>
            <content index='0' field='a'>
                <IndexedArray len='1'>
                    <index><Index dtype='int64' len='1'>[0]</Index></index>
                    <content><RecordArray is_tuple='false' len='1'>
                        <parameter name='__record__'>'T'</parameter>
                        <content index='0' field='x'>
                            <NumpyArray dtype='int64' len='1'>[1]</NumpyArray>
                        </content>
                        <content index='1' field='y'>
                            <NumpyArray dtype='int64' len='1'>[2]</NumpyArray>
                        </content>
                    </RecordArray></content>
                </IndexedArray>
            </content>
            <content index='1' field='b'>
                <IndexedArray len='1'>
                    <index><Index dtype='int64' len='1'>[0]</Index></index>
                    <content><RecordArray is_tuple='false' len='1'>
                        <parameter name='__record__'>'T'</parameter>
                        <content index='0' field='x'>
                            <NumpyArray dtype='int64' len='1'>[1]</NumpyArray>
                        </content>
                        <content index='1' field='y'>
                            <NumpyArray dtype='int64' len='1'>[2]</NumpyArray>
                        </content>
                    </RecordArray></content>
                </IndexedArray>
            </content>
            <content index='2' field='another'>
                <RecordArray is_tuple='false' len='1'>
                    <parameter name='__record__'>'T'</parameter>
                    <content index='0' field='x'>
                        <IndexedArray len='1'>
                            <index><Index dtype='int64' len='1'>
                                [0]
                            </Index></index>
                            <content><NumpyArray dtype='int64' len='2'>[1 1]</NumpyArray></content>
                        </IndexedArray>
                    </content>
                </RecordArray>
            </content>
        </RecordArray>
    </content>
    <content index='1'>
        <RecordArray is_tuple='false' len='1'>
            <content index='0' field='a'>
                <IndexedArray len='1'>
                    <index><Index dtype='int64' len='1'>[0]</Index></index>
                    <content><RecordArray is_tuple='false' len='1'>
                        <parameter name='__record__'>'T'</parameter>
                        <content index='0' field='x'>
                            <NumpyArray dtype='int64' len='1'>[1]</NumpyArray>
                        </content>
                        <content index='1' field='z'>
                            <NumpyArray dtype='int64' len='1'>[3]</NumpyArray>
                        </content>
                    </RecordArray></content>
                </IndexedArray>
            </content>
            <content index='1' field='b'>
                <IndexedArray len='1'>
                    <index><Index dtype='int64' len='1'>[0]</Index></index>
                    <content><RecordArray is_tuple='false' len='1'>
                        <parameter name='__record__'>'T'</parameter>
                        <content index='0' field='x'>
                            <NumpyArray dtype='int64' len='1'>[1]</NumpyArray>
                        </content>
                        <content index='1' field='z'>
                            <NumpyArray dtype='int64' len='1'>[3]</NumpyArray>
                        </content>
                    </RecordArray></content>
                </IndexedArray>
            </content>
            <content index='2' field='another'>
                <RecordArray is_tuple='false' len='1'>
                    <parameter name='__record__'>'T'</parameter>
                    <content index='0' field='x'>
                        <IndexedArray len='1'>
                            <index><Index dtype='int64' len='1'>
                                [1]
                            </Index></index>
                            <content><NumpyArray dtype='int64' len='2'>[1 1]</NumpyArray></content>
                        </IndexedArray>
                    </content>
                </RecordArray>
            </content>
        </RecordArray>
    </content>
</UnionArray>

The NumpyArray for cat["another", "x"] in the first union branch is referentially identical to the one in the second union branch:

>>> (
...     cat.layout.contents[0].contents[2].contents[0].content
...     is cat.layout.contents[1].contents[2].contents[0].content
... )
True

None of that sounds wrong.

jpivarski commented 4 months ago

Surrounding the call with Pympler summary tables unsurprisingly says that it's all in NumPy arrays:

                                       types |   # objects |   total size
============================================ | =========== | ============
                               numpy.ndarray |           5 |     16.00 GB
                                        code |           0 |     14.13 KB
                                        dict |           7 |   1024     B
                       weakref.ReferenceType |           5 |    360     B
                  builtin_function_or_method |           5 |    360     B
                           function (finder) |           2 |    288     B
                       awkward.index.Index64 |           4 |    192     B
  awkward.contents.indexedarray.IndexedArray |           2 |     96     B
                                        list |           1 |     72     B
                    functools._lru_list_elem |           1 |     56     B
                                       tuple |           1 |     48     B
      awkward.contents.numpyarray.NumpyArray |           1 |     48     B
jpivarski commented 4 months ago

Since all NumPy array creations must go through an nplike, here:

https://github.com/scikit-hep/awkward/blob/3317b0c8f2656e47222d03a57bc2b14b1c56227d/src/awkward/_nplikes/array_module.py#L50-L70

I added a line to raise an exception if the new array's length is greater than 1000 (because they grow exponentially). Here's the stack trace:

>>> for i in range(30):
...     cat["another", "w"] = three.x
... 
Traceback (most recent call last):
  File "/home/jpivarski/irishep/awkward/src/awkward/highlevel.py", line 1145, in __setitem__
    self._layout = ak.operations.with_field(
  File "/home/jpivarski/irishep/awkward/src/awkward/_dispatch.py", line 64, in dispatch
    next(gen_or_result)
  File "/home/jpivarski/irishep/awkward/src/awkward/operations/ak_with_field.py", line 48, in with_field
    return _impl(array, what, where, highlevel, behavior, attrs)
  File "/home/jpivarski/irishep/awkward/src/awkward/operations/ak_with_field.py", line 68, in _impl
    _impl(base[where[0]], what, where[1:], highlevel, behavior, attrs),
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/content.py", line 512, in __getitem__
    return self._getitem(where)
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/content.py", line 529, in _getitem
    return self._getitem_field(where)
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/unionarray.py", line 564, in _getitem_field
    return UnionArray.simplified(
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/unionarray.py", line 371, in simplified
    contents[k] = contents[k]._mergemany([self_cont])
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/recordarray.py", line 745, in _mergemany
    merged = forfield[0]._mergemany(forfield[1:])
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/indexedarray.py", line 678, in _mergemany
    nextcontent = contents[0]._mergemany(tail_contents)
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/numpyarray.py", line 526, in _mergemany
    next = NumpyArray(
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/numpyarray.py", line 119, in __init__
    self._data = backend.nplike.asarray(data)
  File "/home/jpivarski/irishep/awkward/src/awkward/_nplikes/array_module.py", line 66, in asarray
    raise Exception("HERE")
Exception: HERE

Going from bottom to top, contiguous_arrays is growing out of control here:

https://github.com/scikit-hep/awkward/blob/3317b0c8f2656e47222d03a57bc2b14b1c56227d/src/awkward/contents/numpyarray.py#L506-L524

which comes from the contents here:

https://github.com/scikit-hep/awkward/blob/3317b0c8f2656e47222d03a57bc2b14b1c56227d/src/awkward/contents/indexedarray.py#L604-L678

which are arrays in one of the slots of for_each_field here:

https://github.com/scikit-hep/awkward/blob/3317b0c8f2656e47222d03a57bc2b14b1c56227d/src/awkward/contents/recordarray.py#L675-L745

which is one of the contents here:

https://github.com/scikit-hep/awkward/blob/3317b0c8f2656e47222d03a57bc2b14b1c56227d/src/awkward/contents/unionarray.py#L252-L371

I don't think the issue goes any higher.

jpivarski commented 4 months ago

The exponential memory growth (not a memory leak) is not related to the internal cross-references (the fact that cat.layout.contents[0].contents[2].contents[0].content is cat.layout.contents[1].contents[2].contents[0].content) because if I break that connection by projecting the two IndexedArrays into simple NumpyArrays:

cat.layout.contents[0].contents[2]._contents[0] = cat.layout.contents[0].contents[2].contents[0].project()
cat.layout.contents[1].contents[2]._contents[0] = cat.layout.contents[1].contents[2].contents[0].project()

the issue persists. Implicitly, this also rules out the IndexedArray part of the stack trace.

I doubt it has anything to do with the NumpyArray or RecordArray parts, either. It's very likely an issue in UnionArray.simplified, particularly since the issue only ever happens if the arrays are UnionArrays.

jpivarski commented 4 months ago

Even if we use the above to project out the original IndexedArray, (cat.layout just before the loop):

<UnionArray len='2'>
    <tags><Index dtype='int8' len='2'>
        [0 1]
    </Index></tags>
    <index><Index dtype='int64' len='2'>
        [0 0]
    </Index></index>
    <content index='0'>
        <RecordArray is_tuple='false' len='1'>
            <content index='0' field='a'>
                <IndexedArray len='1'>
                    <index><Index dtype='int64' len='1'>[0]</Index></index>
                    <content><RecordArray is_tuple='false' len='1'>
                        <parameter name='__record__'>'T'</parameter>
                        <content index='0' field='x'>
                            <NumpyArray dtype='int64' len='1'>[1]</NumpyArray>
                        </content>
                        <content index='1' field='y'>
                            <NumpyArray dtype='int64' len='1'>[2]</NumpyArray>
                        </content>
                    </RecordArray></content>
                </IndexedArray>
            </content>
            <content index='1' field='b'>
                <IndexedArray len='1'>
                    <index><Index dtype='int64' len='1'>[0]</Index></index>
                    <content><RecordArray is_tuple='false' len='1'>
                        <parameter name='__record__'>'T'</parameter>
                        <content index='0' field='x'>
                            <NumpyArray dtype='int64' len='1'>[1]</NumpyArray>
                        </content>
                        <content index='1' field='y'>
                            <NumpyArray dtype='int64' len='1'>[2]</NumpyArray>
                        </content>
                    </RecordArray></content>
                </IndexedArray>
            </content>
            <content index='2' field='another'>
                <RecordArray is_tuple='false' len='1'>
                    <parameter name='__record__'>'T'</parameter>
                    <content index='0' field='x'>
                        <NumpyArray dtype='int64' len='1'>[1]</NumpyArray>
                    </content>
                </RecordArray>
            </content>
        </RecordArray>
    </content>
    <content index='1'>
        <RecordArray is_tuple='false' len='1'>
            <content index='0' field='a'>
                <IndexedArray len='1'>
                    <index><Index dtype='int64' len='1'>[0]</Index></index>
                    <content><RecordArray is_tuple='false' len='1'>
                        <parameter name='__record__'>'T'</parameter>
                        <content index='0' field='x'>
                            <NumpyArray dtype='int64' len='1'>[1]</NumpyArray>
                        </content>
                        <content index='1' field='z'>
                            <NumpyArray dtype='int64' len='1'>[3]</NumpyArray>
                        </content>
                    </RecordArray></content>
                </IndexedArray>
            </content>
            <content index='1' field='b'>
                <IndexedArray len='1'>
                    <index><Index dtype='int64' len='1'>[0]</Index></index>
                    <content><RecordArray is_tuple='false' len='1'>
                        <parameter name='__record__'>'T'</parameter>
                        <content index='0' field='x'>
                            <NumpyArray dtype='int64' len='1'>[1]</NumpyArray>
                        </content>
                        <content index='1' field='z'>
                            <NumpyArray dtype='int64' len='1'>[3]</NumpyArray>
                        </content>
                    </RecordArray></content>
                </IndexedArray>
            </content>
            <content index='2' field='another'>
                <RecordArray is_tuple='false' len='1'>
                    <parameter name='__record__'>'T'</parameter>
                    <content index='0' field='x'>
                        <NumpyArray dtype='int64' len='1'>[1]</NumpyArray>
                    </content>
                </RecordArray>
            </content>
        </RecordArray>
    </content>
</UnionArray>

(See? The field='another' part of the outer RecordArray has a one-field RecordArray of just NumpyArray—no IndexedArray.)

Then, running the for loop for 30 steps turns it into

<UnionArray len='2'>
    <tags><Index dtype='int8' len='2'>
        [0 1]
    </Index></tags>
    <index><Index dtype='int64' len='2'>
        [0 0]
    </Index></index>
    <content index='0'>
        <RecordArray is_tuple='false' len='1'>
            <content index='0' field='a'>
                <IndexedArray len='1'>
                    <index><Index dtype='int64' len='1'>[0]</Index></index>
                    <content><RecordArray is_tuple='false' len='1'>
                        <parameter name='__record__'>'T'</parameter>
                        <content index='0' field='x'>
                            <NumpyArray dtype='int64' len='1'>[1]</NumpyArray>
                        </content>
                        <content index='1' field='y'>
                            <NumpyArray dtype='int64' len='1'>[2]</NumpyArray>
                        </content>
                    </RecordArray></content>
                </IndexedArray>
            </content>
            <content index='1' field='b'>
                <IndexedArray len='1'>
                    <index><Index dtype='int64' len='1'>[0]</Index></index>
                    <content><RecordArray is_tuple='false' len='1'>
                        <parameter name='__record__'>'T'</parameter>
                        <content index='0' field='x'>
                            <NumpyArray dtype='int64' len='1'>[1]</NumpyArray>
                        </content>
                        <content index='1' field='y'>
                            <NumpyArray dtype='int64' len='1'>[2]</NumpyArray>
                        </content>
                    </RecordArray></content>
                </IndexedArray>
            </content>
            <content index='2' field='another'>
                <RecordArray is_tuple='false' len='1'>
                    <parameter name='__record__'>'T'</parameter>
                    <content index='0' field='x'>
                        <IndexedArray len='1'>
                            <index><Index dtype='int64' len='1'>[0]</Index></index>
                            <content><NumpyArray dtype='int64' len='1073741824'>[1 1 1 ... 1 1 1]</NumpyArray></content>
                        </IndexedArray>
                    </content>
                    <content index='1' field='w'>
                        <IndexedArray len='1'>
                            <index><Index dtype='int64' len='1'>
                                [0]
                            </Index></index>
                            <content><NumpyArray dtype='int64' len='2'>[1 1]</NumpyArray></content>
                        </IndexedArray>
                    </content>
                </RecordArray>
            </content>
        </RecordArray>
    </content>
    <content index='1'>
        <RecordArray is_tuple='false' len='1'>
            <content index='0' field='a'>
                <IndexedArray len='1'>
                    <index><Index dtype='int64' len='1'>[0]</Index></index>
                    <content><RecordArray is_tuple='false' len='1'>
                        <parameter name='__record__'>'T'</parameter>
                        <content index='0' field='x'>
                            <NumpyArray dtype='int64' len='1'>[1]</NumpyArray>
                        </content>
                        <content index='1' field='z'>
                            <NumpyArray dtype='int64' len='1'>[3]</NumpyArray>
                        </content>
                    </RecordArray></content>
                </IndexedArray>
            </content>
            <content index='1' field='b'>
                <IndexedArray len='1'>
                    <index><Index dtype='int64' len='1'>[0]</Index></index>
                    <content><RecordArray is_tuple='false' len='1'>
                        <parameter name='__record__'>'T'</parameter>
                        <content index='0' field='x'>
                            <NumpyArray dtype='int64' len='1'>[1]</NumpyArray>
                        </content>
                        <content index='1' field='z'>
                            <NumpyArray dtype='int64' len='1'>[3]</NumpyArray>
                        </content>
                    </RecordArray></content>
                </IndexedArray>
            </content>
            <content index='2' field='another'>
                <RecordArray is_tuple='false' len='1'>
                    <parameter name='__record__'>'T'</parameter>
                    <content index='0' field='x'>
                        <IndexedArray len='1'>
                            <index><Index dtype='int64' len='1'>[1073741823]</Index></index>
                            <content><NumpyArray dtype='int64' len='1073741824'>[1 1 1 ... 1 1 1]</NumpyArray></content>
                        </IndexedArray>
                    </content>
                    <content index='1' field='w'>
                        <IndexedArray len='1'>
                            <index><Index dtype='int64' len='1'>
                                [1]
                            </Index></index>
                            <content><NumpyArray dtype='int64' len='2'>[1 1]</NumpyArray></content>
                        </IndexedArray>
                    </content>
                </RecordArray>
            </content>
        </RecordArray>
    </content>
</UnionArray>

The x field has been turned into an IndexedArray whose content has length 2**30, and the IndexedArray just picks out the last element.

jpivarski commented 4 months ago

Projecting away all of the IndexedArrays,

cat.layout.contents[0]._contents[0] = cat.layout.contents[0].contents[0].project()
cat.layout.contents[1]._contents[0] = cat.layout.contents[1].contents[0].project()
cat.layout.contents[0]._contents[1] = cat.layout.contents[0].contents[1].project()
cat.layout.contents[1]._contents[1] = cat.layout.contents[1].contents[1].project()
cat.layout.contents[0].contents[2]._contents[0] = cat.layout.contents[0].contents[2].contents[0].project()
cat.layout.contents[1].contents[2]._contents[0] = cat.layout.contents[1].contents[2].contents[0].project()

and raising an exception the next time an IndexedArray is constructed provides this stack trace:

Traceback (most recent call last):
  File "/home/jpivarski/irishep/awkward/src/awkward/highlevel.py", line 1145, in __setitem__
    self._layout = ak.operations.with_field(
  File "/home/jpivarski/irishep/awkward/src/awkward/_dispatch.py", line 64, in dispatch
    next(gen_or_result)
  File "/home/jpivarski/irishep/awkward/src/awkward/operations/ak_with_field.py", line 48, in with_field
    return _impl(array, what, where, highlevel, behavior, attrs)
  File "/home/jpivarski/irishep/awkward/src/awkward/operations/ak_with_field.py", line 68, in _impl
    _impl(base[where[0]], what, where[1:], highlevel, behavior, attrs),
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/content.py", line 512, in __getitem__
    return self._getitem(where)
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/content.py", line 529, in _getitem
    return self._getitem_field(where)
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/unionarray.py", line 564, in _getitem_field
    return UnionArray.simplified(
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/unionarray.py", line 434, in simplified
    next = contents[0]._carry(index, True)
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/recordarray.py", line 510, in _carry
    return ak.contents.IndexedArray(nextindex, self, parameters=None)
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/indexedarray.py", line 142, in __init__
    raise Exception("HERE")
Exception: HERE

But that's not it—even with allow_lazy=False in

https://github.com/scikit-hep/awkward/blob/3317b0c8f2656e47222d03a57bc2b14b1c56227d/src/awkward/contents/unionarray.py#L434

we get another IndexedArray creation with this stack trace:

Traceback (most recent call last):
  File "/home/jpivarski/irishep/awkward/src/awkward/highlevel.py", line 1145, in __setitem__
    self._layout = ak.operations.with_field(
  File "/home/jpivarski/irishep/awkward/src/awkward/_dispatch.py", line 64, in dispatch
    next(gen_or_result)
  File "/home/jpivarski/irishep/awkward/src/awkward/operations/ak_with_field.py", line 48, in with_field
    return _impl(array, what, where, highlevel, behavior, attrs)
  File "/home/jpivarski/irishep/awkward/src/awkward/operations/ak_with_field.py", line 66, in _impl
    return _impl(
  File "/home/jpivarski/irishep/awkward/src/awkward/operations/ak_with_field.py", line 158, in _impl
    out = ak._broadcasting.broadcast_and_apply(
  File "/home/jpivarski/irishep/awkward/src/awkward/_broadcasting.py", line 968, in broadcast_and_apply
    out = apply_step(
  File "/home/jpivarski/irishep/awkward/src/awkward/_broadcasting.py", line 946, in apply_step
    return continuation()
  File "/home/jpivarski/irishep/awkward/src/awkward/_broadcasting.py", line 915, in continuation
    return broadcast_any_list()
  File "/home/jpivarski/irishep/awkward/src/awkward/_broadcasting.py", line 622, in broadcast_any_list
    outcontent = apply_step(
  File "/home/jpivarski/irishep/awkward/src/awkward/_broadcasting.py", line 946, in apply_step
    return continuation()
  File "/home/jpivarski/irishep/awkward/src/awkward/_broadcasting.py", line 907, in continuation
    return broadcast_any_union()
  File "/home/jpivarski/irishep/awkward/src/awkward/_broadcasting.py", line 809, in broadcast_any_union
    nextinputs.append(x[mask].project(next(it_j_contents)))
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/unionarray.py", line 705, in project
    return self._contents[index]._carry(nextcarry, True)
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/recordarray.py", line 510, in _carry
    return ak.contents.IndexedArray(nextindex, self, parameters=None)
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/indexedarray.py", line 142, in __init__
    raise Exception("HERE")
Exception: HERE

And with allow_lazy=False in

https://github.com/scikit-hep/awkward/blob/3317b0c8f2656e47222d03a57bc2b14b1c56227d/src/awkward/contents/unionarray.py#L705

we get

Traceback (most recent call last):
  File "/home/jpivarski/irishep/awkward/src/awkward/highlevel.py", line 1145, in __setitem__
    self._layout = ak.operations.with_field(
  File "/home/jpivarski/irishep/awkward/src/awkward/_dispatch.py", line 64, in dispatch
    next(gen_or_result)
  File "/home/jpivarski/irishep/awkward/src/awkward/operations/ak_with_field.py", line 48, in with_field
    return _impl(array, what, where, highlevel, behavior, attrs)
  File "/home/jpivarski/irishep/awkward/src/awkward/operations/ak_with_field.py", line 66, in _impl
    return _impl(
  File "/home/jpivarski/irishep/awkward/src/awkward/operations/ak_with_field.py", line 158, in _impl
    out = ak._broadcasting.broadcast_and_apply(
  File "/home/jpivarski/irishep/awkward/src/awkward/_broadcasting.py", line 968, in broadcast_and_apply
    out = apply_step(
  File "/home/jpivarski/irishep/awkward/src/awkward/_broadcasting.py", line 946, in apply_step
    return continuation()
  File "/home/jpivarski/irishep/awkward/src/awkward/_broadcasting.py", line 915, in continuation
    return broadcast_any_list()
  File "/home/jpivarski/irishep/awkward/src/awkward/_broadcasting.py", line 622, in broadcast_any_list
    outcontent = apply_step(
  File "/home/jpivarski/irishep/awkward/src/awkward/_broadcasting.py", line 946, in apply_step
    return continuation()
  File "/home/jpivarski/irishep/awkward/src/awkward/_broadcasting.py", line 907, in continuation
    return broadcast_any_union()
  File "/home/jpivarski/irishep/awkward/src/awkward/_broadcasting.py", line 811, in broadcast_any_union
    nextinputs.append(x[mask])
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/content.py", line 512, in __getitem__
    return self._getitem(where)
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/content.py", line 657, in _getitem
    return self._getitem(layout)
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/content.py", line 614, in _getitem
    self._carry(carry, allow_lazy), where.shape
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/recordarray.py", line 510, in _carry
    return ak.contents.IndexedArray(nextindex, self, parameters=None)
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/indexedarray.py", line 142, in __init__
    raise Exception("HERE")
Exception: HERE

Turning off these two allow_lazy options:

--- a/src/awkward/contents/unionarray.py
+++ b/src/awkward/contents/unionarray.py
@@ -431,7 +431,7 @@ class UnionArray(UnionMeta[Content], Content):
             ]

         if len(contents) == 1:
-            next = contents[0]._carry(index, True)
+            next = contents[0]._carry(index, False)
             return next.copy(parameters=parameters_union(next._parameters, parameters))

         else:
@@ -702,7 +702,7 @@ class UnionArray(UnionMeta[Content], Content):
         nextcarry = ak.index.Index64(
             tmpcarry.data[: lenout[0]], nplike=self._backend.index_nplike
         )
-        return self._contents[index]._carry(nextcarry, True)
+        return self._contents[index]._carry(nextcarry, False)

     @staticmethod
     def regular_index(

is sufficient to stop the memory explosion. Turning them both off is also necessary: just one isn't enough to do it.

So now I need to think. Should these be turned off? Well,

Maybe the solution should be highly targeted to this situation: identify and catch content growth in RecordArray._carry with allow_lazy=True?

jpivarski commented 4 months ago

I decided to just set allow_lazy=False. The results won't be incorrect, and I don't think we'll see any bad performance because of this. (I think this is a case of trying it and finding out, since the definition of "good performance" depends on specific test-cases. In particular, real-world test-cases.)