Closed tristpinsm closed 4 months ago
Is there a correct way to delete a dataset from a container?
You should switch to using `draco.analysis.ringmapmaker`. This will save to a new `RingMap` container defined in draco that does not have an `rms` dataset. If you are using an old ringmap that has been saved to disk, then there is a task `ch_pipeline.analysis.ConvertRingMap` that will convert it to the draco container.
Thanks! I'll give that a shot. The ringmap I was using is `/project/rpp-chime/lgray/test_data/pipeline/fullstack/rev_01/stack_p0/ringmap_intercyl.h5`, which I thought should be current. I see also that the `RingMap` container defined in draco does allow for an `rms` dataset (not initialised by default).
Re. deleting a dataset: how were you trying to do it? `__delitem__` for containers is implemented in memh5.py at line 317, and would be called like `del ringmap["rms"]`. If that didn't work, then there might be a bug in the implementation.
That's what I'm doing in the change proposed here. I don't know what is causing the segfault; it may not be related, but it seemed suspicious.
Why do we have an `rms` dataset in the stacking pipeline ringmaps?
Hmm, sorry, I had a vague memory of us deciding to discontinue use of the `rms` dataset in favour of the `weight` dataset when the draco ringmapmaker was created. However, it does seem like `BeamformEW` creates an `rms` dataset, which duplicates information already in the `weight` dataset.
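To illustrate the duplication: if the `weight` dataset stores inverse noise variance (a common convention; whether draco's ringmaps follow it exactly is an assumption here), then the `rms` values carry no extra information, since they can be recovered directly from the weights:

```python
import math

def rms_from_weight(w):
    # Assuming weight = 1 / sigma**2, so sigma = weight ** -0.5.
    # A zero weight marks flagged data, conventionally mapped to rms = 0.
    return 0.0 if w == 0 else 1.0 / math.sqrt(w)

print(rms_from_weight(25.0))  # 0.2
print(rms_from_weight(0.0))   # 0.0
```

Under that assumption, dropping `rms` loses nothing that the `weight` dataset does not already encode.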
Ok yeah, that should probably be added as an issue in caput then.
~The stacking pipeline just calls `draco.analysis.ringmapmaker.RingMapMaker`, which is essentially just a chain calling `MakeVisGrid`, `BeamformNS`, and `BeamformEW`. Looking at those tasks, `BeamformEW` initializes the `rms` dataset and writes values to it, so to me it looks like this is the intended behaviour~ Seth already answered
Ah. Deleting the dataset is a red herring. This is the same issue referenced here: https://github.com/radiocosmology/caput/issues/165

Redistributing over any axis other than `el` should work fine to write out the file, so I think the fix would be to redistribute back to the `freq` axis at the end of the task.
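The suggested fix can be sketched with a toy container. The `redistribute` method name mirrors the draco/caput convention, but the class itself is a hypothetical stand-in: real redistribution moves data between MPI ranks, whereas here we only track which axis is active:

```python
class ToyDistributedContainer:
    """Minimal stand-in tracking which axis a dataset is distributed over."""

    AXES = ("freq", "ra", "el")

    def __init__(self, distributed_axis="freq"):
        self.distributed_axis = distributed_axis

    def redistribute(self, axis):
        # In caput this re-partitions data across MPI ranks; here we
        # only record the chosen axis.
        if axis not in self.AXES:
            raise ValueError(f"unknown axis {axis!r}")
        self.distributed_axis = axis


def finish_task(rm):
    """End-of-task fix from the thread: redistribute back to 'freq' so the
    file is never written while distributed over the problematic last axis."""
    rm.redistribute("freq")
    return rm.distributed_axis


rm = ToyDistributedContainer()
rm.redistribute("el")       # the state that triggered the write-out problem
print(finish_task(rm))      # prints freq
```

The point is purely structural: whatever the task does internally over `el`, the last thing it does before handing the container to a writer is switch back to `freq`.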
I didn't read through all of that but will try your suggestion. My pipeline was distributed over 128 ranks, and I'm pretty sure that's less than the size of `el`, though (?)
Yeah, I think the title of that issue was made before we figured out that that wasn't actually the problem. Unfortunately we never managed to figure out what the issue is, other than that it seems to be some sort of memory leak when trying to slice into a dataset that's distributed over its last axis. I'm going to change the name of the issue to something more correct
What's weird is that I was using this `DayenuDelayFilterMap` task all the time last year and never had this issue.
In any case, seems to work now. Thanks!
I will turn this into a real PR
Does the code crash because it can't distribute `rms` over `el`, or does it just issue a warning?
That's a good point. It was segfaulting, which I assumed was because it was trying to redistribute over an axis that isn't there, but in fact that may have been a symptom of this other issue. I can try and confirm that.
After the segfault, can you use `seff` to check the memory usage of the job? I would bet that the crash/segfault is actually due to an OOM, which cedar doesn't always report properly.
I can successfully filter ringmaps with only the `redistribute` added, so I removed the commit that deletes the `rms` dataset.
Redistributing over `el` fails on the `rms` dataset. I tried removing it, but now I'm getting a segfault when it tries to write out the ringmap to disk. So I think there may still be references to the `rms` dataset somewhere. Any help figuring this out would be appreciated.