Closed ikrommyd closed 3 months ago
a ping on this one @jpivarski @agoose77 ... this is rather badly noticeable.
I'm trying to reproduce this using only Awkward, both to narrow the scope of search for the bug and to add a test to our test suite. Since you've narrowed in on #3119 as a cause, I think it must have something to do with RecordArrays in IndexedArrays, and ak.argcombinations or the application of it (i.e. ak.combinations) is also suspicious because it makes combinations of records by hanging them inside of IndexedArrays. I don't think the issue could be related to any of the calculations on scalars, since #3119 can't have anything to do with non-RecordArrays.
But so far, I haven't been able to reproduce it. I'm starting from a dataset that's already grouped into "electron" and "muon" records, for convenience:
dataset = ak.from_parquet("https://github.com/jpivarski-talks/2024-07-24-codas-hep-ml/raw/main/data/SMHiggsToZZTo4L.parquet")
I make leptons, so that we have a UnionArray:
leptons = ak.concatenate([dataset.electron, dataset.muon], axis=1)
then make a TypeTracerArray report to track touching:
layout, report = ak.typetracer.typetracer_with_report(leptons.layout.form_with_key())
and it's initially empty:
print(report.data_touched, report.shape_touched)
# [] []
Slicing through both the electrons and muons touches only the expected buffers:
ak.Array(layout).pt
print(report.data_touched, report.shape_touched)
# ['node1', 'node3', 'node14'] ['node0', 'node1', 'node3', 'node14']
The node0 is the outer ListOffsetArray's offsets, the node1 is the UnionArray tags and index, node3 is the electron pT, and node14 is the muon pT.
None of the following touch any more buffers than the above:
ak.num(layout)
pair = ak.argcombinations(leptons, 2, fields=["l1", "l2"])
tmp = leptons[pair.l1]
ak.Array(layout)[pair.l1].pt
ak.Array(layout)[pair.l1].pt == ak.Array(layout)[pair.l2].pt
ak.Array(layout)[pair.l1][:, 0]
Printing arrays to the screen would touch more buffers (hence the precaution of assigning to tmp) and calculating masses would pull in all of the kinematics (node4, node5, node6, node15, node16, node17), but at that point, we're not working on RecordArrays anymore. I'm looking for something that you did that touches ~everything in this example.
Okay, ak.local_index touches everything:
ak.local_index(ak.Array(layout))
and I can't think of why that would be. I'm looking into it now...
Actually, the above only touches the shape of everything. Is that consistent with what you were seeing? That the shape of everything was being touched (as opposed to the data)?
But ak.local_index(ak.Array(layout)) touching the shape of many buffers also happens if I revert #3119, so it might not be your issue.
ak.local_index has a default argument of axis=-1, so determining the actual axis depth goes through ak._layout.maybe_posaxis, and this touches everybody's shape. It happens on this line:
Perhaps branch_depth shouldn't be touching shape. But before I dig deeper into this, is it your issue? For instance, is the issue caused by your
l3 = ak.local_index(events.leptons)
line? (If you stop computing just before this line, do you get no error? If you stop computing just after this line, do you get the error?)
Also, the maybe_posaxis/branch_depth touching would go away with
l3 = ak.local_index(events.leptons, axis=1)
(instead of the default axis=-1). Does that also make the error go away? If so, then we know the cause and I think I can find a way to fix it. If not, then this maybe_posaxis/branch_depth thing might or might not be an issue, but it isn't your issue.
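To illustrate why a negative axis can end up inspecting the whole layout while a non-negative one does not, here is a toy sketch in plain Python. It is not Awkward's implementation: the Node class, its depth method, and resolve_axis are hypothetical stand-ins, and the only assumption carried over from the discussion above is that counting from the innermost dimension requires knowing how deep the layout is.

```python
class Node:
    """Toy layout node that logs every time its shape is inspected."""

    def __init__(self, name, kind, children=(), log=None):
        self.name = name
        self.kind = kind  # "list", "record", or "leaf" (hypothetical kinds)
        self.children = list(children)
        self.log = log

    def depth(self):
        # In this toy, asking a node for its depth counts as touching its shape.
        self.log.append(self.name)
        if self.kind == "leaf":
            return 1
        below = max(child.depth() for child in self.children)
        # Lists add a dimension; records do not.
        return (below + 1) if self.kind == "list" else below


def resolve_axis(root, axis):
    """Turn a possibly negative axis into a non-negative one."""
    if axis >= 0:
        return axis  # nothing needs to be inspected
    return axis + root.depth()  # needs the full depth, so it walks the tree


log = []
fields = [Node(name, "leaf", log=log) for name in ("pt", "eta", "phi")]
leptons = Node("offsets", "list", [Node("lepton", "record", fields, log=log)], log=log)

axis_positive = resolve_axis(leptons, 1)
touched_positive = list(log)

axis_negative = resolve_axis(leptons, -1)
touched_negative = list(log)

print(axis_positive, touched_positive)
print(axis_negative, touched_negative)
```

Both calls resolve to the same axis in this toy, but only the negative one visits every node, which mirrors the shape-only touching described above.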
No, this does not fix the issue. :-/
I'll see if I can synthesize an Awkward-only reproducer.
Got it a little more minimal for now. This is still overtouching:
import awkward as ak
import numpy as np
from coffea.nanoevents import NanoEventsFactory
import dask_awkward as dak
events = NanoEventsFactory.from_root(
    {
        "https://github.com/CoffeaTeam/coffea/raw/master/tests/samples/nano_dy.root": "Events"
    },
).events()
events["leptons"] = ak.concatenate(
    [events.Electron, events.Muon],
    axis=1,
)
pair = ak.argcombinations(events.leptons, 2, fields=["l1", "l2"])
pair = pair[
    ak.argmin(
        (events.leptons[pair.l1] + events.leptons[pair.l2]).pt,
        axis=1,
    )
]
events = events[ak.num(pair) > 0]
l3 = events.leptons
print(dak.necessary_columns(l3.pt))
print(l3.pt.compute())
Everything here seems important. If I try to remove anything, like the events = events[ak.num(pair) > 0], or drop the concatenation and just do events["leptons"] = events.Electron, or within the argmin have something like events.leptons[pair.l1].pt that doesn't include both l1 and l2, the overtouching goes away.
This is also happening. There's overtouching on events.Muon.pt:
import awkward as ak
import numpy as np
from coffea.nanoevents import NanoEventsFactory
import dask_awkward as dak
events = NanoEventsFactory.from_root(
    {
        "https://github.com/CoffeaTeam/coffea/raw/master/tests/samples/nano_dy.root": "Events"
    },
).events()
events["leptons"] = ak.concatenate(
    [events.Electron, events.Muon],
    axis=1,
)
pair = ak.argcombinations(events.leptons, 2, fields=["l1", "l2"])
pair = pair[
    ak.argmin(
        (events.leptons[pair.l1] + events.leptons[pair.l2]).pt,
        axis=1,
    )
]
events = events[ak.num(pair) > 0]
print(dak.necessary_columns(events.Muon.pt))
{'from-uproot-85eb9d204817e27e080b72855fb7c5d3': frozenset({'Muon_isTracker', 'Electron_mass', 'Electron_mvaFall17V1noIso', 'nJet', 'Electron_eta', 'Muon_dxy', 'Electron_mvaTTH', 'Muon_softMvaId', 'Muon_sip3d', 'Muon_pfRelIso03_chg', 'nMuon', 'Electron_pfRelIso03_all', 'Muon_isGlobal', 'Muon_tunepRelPt', 'Muon_miniPFRelIso_chg', 'Electron_isPFcand', 'Muon_jetIdx', 'Electron_dz', 'nFsrPhoton', 'Muon_pfRelIso03_all', 'nGenPart', 'Electron_dzErr', 'nPhoton', 'Electron_pdgId', 'Muon_genPartIdx', 'Muon_highPtId', 'Muon_jetPtRelv2', 'Muon_tightCharge', 'Electron_mvaFall17V2Iso_WP80', 'Electron_mvaFall17V2noIso_WP80', 'Electron_convVeto', 'Electron_mvaFall17V1Iso', 'Electron_lostHits', 'Muon_tkIsoId', 'Electron_eCorr', 'Electron_dxy', 'Electron_genPartFlav', 'Muon_mediumPromptId', 'Muon_miniPFRelIso_all', 'Electron_miniPFRelIso_all', 'Muon_pfRelIso04_all', 'Muon_inTimeMuon', 'Electron_jetRelIso', 'Muon_pt', 'Muon_mvaLowPt', 'Electron_cutBased', 'Electron_cutBased_HEEP', 'Muon_fsrPhotonIdx', 'Electron_charge', 'Muon_miniIsoId', 'Muon_mass', 'Electron_cleanmask', 'Electron_mvaFall17V2Iso_WPL', 'Electron_mvaFall17V1noIso_WP90', 'Electron_vidNestedWPBitmapHEEP', 'Electron_pfRelIso03_chg', 'Muon_dz', 'Muon_isPFcand', 'Muon_triggerIdLoose', 'Electron_miniPFRelIso_chg', 'Electron_dr03EcalRecHitSumEt', 'Electron_sieie', 'Muon_tightId', 'Electron_photonIdx', 'Muon_pfIsoId', 'Electron_r9', 'Electron_mvaFall17V1noIso_WP80', 'Electron_mvaFall17V1noIso_WPL', 'Electron_mvaFall17V1Iso_WP90', 'Electron_mvaFall17V1Iso_WP80', 'Muon_multiIsoId', 'Electron_mvaFall17V1Iso_WPL', 'Muon_eta', 'Electron_mvaFall17V2noIso_WP90', 'Electron_phi', 'Muon_tkRelIso', 'Muon_ip3d', 'Electron_jetPtRelv2', 'Electron_cutBased_Fall17_V1', 'Muon_softMva', 'Electron_hoe', 'Muon_jetRelIso', 'Muon_dzErr', 'Muon_genPartFlav', 'Muon_pdgId', 'Electron_jetIdx', 'Muon_dxyErr', 'Electron_energyErr', 'Muon_cleanmask', 'Electron_deltaEtaSC', 'Electron_dxyErr', 'Muon_looseId', 'Muon_charge', 'Muon_mvaTTH', 
'Electron_seedGain', 'Electron_mvaFall17V2Iso', 'Electron_ip3d', 'nElectron', 'Electron_dr03HcalDepth1TowerSumEt', 'Electron_sip3d', 'Muon_softId', 'Electron_dr03TkSumPt', 'Electron_dr03TkSumPtHEEP', 'Electron_mvaFall17V2Iso_WP90', 'Electron_mvaFall17V2noIso', 'Muon_ptErr', 'Electron_tightCharge', 'Muon_mvaId', 'Muon_nStations', 'Muon_phi', 'Muon_segmentComp', 'Muon_mediumId', 'Electron_pt', 'Electron_eInvMinusPInv', 'Muon_nTrackerLayers', 'Electron_vidNestedWPBitmap', 'Electron_mvaFall17V2noIso_WPL', 'Electron_genPartIdx'})}
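Output like the above is hard to scan by eye, so a small helper that subtracts the columns you expected can make the over-touching explicit. This is only a sketch in plain Python: overtouched_columns is a hypothetical helper (not part of dask-awkward), the layer name and the shortened column set below are made up for illustration, and the expectation that events.Muon.pt needs only Muon_pt and nMuon is an assumption.

```python
def overtouched_columns(necessary, expected):
    """Return, per input layer, the requested columns that are not in `expected`.

    `necessary` is a mapping shaped like the printed output above:
    {layer_name: frozenset_of_column_names}.
    """
    expected = frozenset(expected)
    return {
        layer: sorted(frozenset(columns) - expected)
        for layer, columns in necessary.items()
    }


# Shortened, made-up stand-in for the real output printed above.
sample = {
    "from-uproot-example": frozenset(
        {"Muon_pt", "nMuon", "Electron_mass", "Muon_eta", "nJet"}
    )
}

# Assumption: only these two columns should be needed for events.Muon.pt.
extra = overtouched_columns(sample, {"Muon_pt", "nMuon"})
print(extra)
# {'from-uproot-example': ['Electron_mass', 'Muon_eta', 'nJet']}
```

An empty list for every layer would mean no over-touching relative to the expected set.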
Could https://github.com/dask-contrib/dask-awkward/issues/526 be somehow part of the problem as well, since concatenation and argcombinations are required?
Here is a minimal reproducer using awkward and one coffea behavior:
import awkward as ak
from coffea.nanoevents.methods import nanoaod
def test_necessary_columns():
    with open("nanoevents_form.json", "r") as fin:
        form = ak.forms.from_json(fin.read())
    events_layout, report = ak.typetracer.typetracer_with_report(form)
    events = ak.Array(events_layout, behavior=nanoaod.behavior)
    events["leptons"] = ak.concatenate(
        [events.Electron, events.Muon],
        axis=1,
    )
    pair = ak.argcombinations(events.leptons, 2, fields=["l1", "l2"])
    # looks like all that is needed to trigger it is the sum with behaviors?
    ptsum = (events.leptons[pair.l1] + events.leptons[pair.l2]).pt
    if len(str(report.data_touched)) > 1000:
        print("bad!")
        return 1
    else:
        print("good!")
        return 0

exit(test_necessary_columns())
I have attached the json of the input form to this post. nanoevents_form.json
@pfackeldey, @maxgalli, and I looked into this. The over-touching problem comes from broadcasting through a union (the leptons) to apply a ufunc (the np.add) here:
Before PR #3119, UnionArray.project replaced a RecordArray with an IndexedArray of RecordArray ("lazily projecting"), such that if the UnionArray only exposed a subset of records, the IndexedArray would only have indexes for those records. Whatever fields the record contained were not touched.
After PR #3119, UnionArray.project became eager: replacing a RecordArray with a RecordArray of only the records that the UnionArray exposed (the change is here). Since we're slicing the records non-lazily, it must touch all the fields of the record.
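The difference between the two projection strategies can be sketched with a toy model in plain Python. This is not Awkward's code: TrackedField, project_lazy, lazy_field, and project_eager are hypothetical names, and the record with pt, eta, and phi columns is made up. The lazy version keeps an index next to the untouched fields, while the eager version slices every field up front.

```python
class TrackedField:
    """Toy column that records its name whenever its values are read."""

    def __init__(self, name, values, touched):
        self.name = name
        self._values = list(values)
        self._touched = touched

    def read(self):
        self._touched.add(self.name)
        return self._values


def make_record(touched):
    columns = {"pt": [10.0, 20.0, 30.0], "eta": [0.1, 0.2, 0.3], "phi": [1.0, 2.0, 3.0]}
    return {name: TrackedField(name, values, touched) for name, values in columns.items()}


def project_lazy(record, index):
    # Like an IndexedArray of RecordArray: store the index, read nothing yet.
    return {"index": list(index), "record": record}


def lazy_field(projected, name):
    # Only the requested field is read, then the stored index is applied.
    values = projected["record"][name].read()
    return [values[i] for i in projected["index"]]


def project_eager(record, index):
    # Slice every field immediately, which reads every field.
    return {name: [field.read()[i] for i in index] for name, field in record.items()}


index = [0, 2]  # the subset of records the union exposes (made up)

touched_lazy = set()
pt_lazy = lazy_field(project_lazy(make_record(touched_lazy), index), "pt")

touched_eager = set()
pt_eager = project_eager(make_record(touched_eager), index)["pt"]

print(sorted(touched_lazy))   # ['pt']
print(sorted(touched_eager))  # ['eta', 'phi', 'pt']
```

Both strategies give the same pt values; they differ only in which fields were read along the way.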
PR #3119 was necessary because assignments like cat["another", "w"] = three.x were causing the memory to explode. As in tests/test_3118_prevent_exponential_memory_growth_in_unionarray.py, every time nested records in a union were assigned to with
cat["another", "w"] = three.x
the size of an internal buffer doubled. (So imagine assigning 20 fields into such a record.) The exponential growth of that internal buffer had something to do with the interplay between broadcasting and lazy projection, and so I fixed it by removing lazy projection.
@pfackeldey is looking into the problem, to see if there's some other way to avoid exponential memory growth of that buffer, while keeping this one project operation lazy so that it doesn't touch all fields.
Both of these problems are tightly coupled to the fact that these are UnionArrays. They wouldn't happen with any other array type.
Thanks for investigating!
Fixed by #3193.
Version of Awkward Array
>=2.6.5
Description and code to reproduce
Reproducer:
The bug is introduced by https://github.com/scikit-hep/awkward/pull/3119
My git bisect report: