radiocosmology / draco

A pipeline for the analysis and simulation of drift scan radio data
MIT License

fix(analysis.DayenuDelayFilterMap): Remove rms dset #276

Closed · tristpinsm closed this 4 months ago

tristpinsm commented 4 months ago

Redistributing over el fails on the rms dataset. I tried removing it, but now I'm getting a segfault when it tries to write out the ringmap to disk. So I think there may still be references to the rms dataset somewhere.

Any help figuring this out appreciated.

tristpinsm commented 4 months ago

Is there a correct way to delete a dataset from a container?

ssiegelx commented 4 months ago

You should switch to using `draco.analysis.ringmapmaker`. This will save to a new `RingMap` container defined in draco that does not have an rms dataset. If you are using an old ringmap that has been saved to disk, then there is a task `ch_pipeline.analysis.ConvertRingMap` that will convert it to the draco container.

tristpinsm commented 4 months ago

Thanks! I'll give that a shot. The ringmap I was using is `/project/rpp-chime/lgray/test_data/pipeline/fullstack/rev_01/stack_p0/ringmap_intercyl.h5`, which I thought should be current. I also see that the `RingMap` container defined in draco does allow for an rms dataset (not initialised by default).

ljgray commented 4 months ago

Re. deleting a dataset: how were you trying to do it? `__delitem__` for containers is implemented in memh5.py line 317 and would be called like `del ringmap["rms"]`. If that didn't work, then there might be a bug in the implementation.
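
For reference, a minimal sketch of that deletion on a container loaded from disk (the file name is a placeholder, and the calls assume the usual memh5/draco container interface):

```python
# Minimal sketch, assuming draco's RingMap container and the memh5
# __delitem__ support described above; the file name is a placeholder.
from draco.core import containers

rmap = containers.RingMap.from_file("ringmap_intercyl.h5")

# Drop the rms dataset before further processing; this goes through
# caput.memh5's __delitem__, i.e. the same call as del ringmap["rms"].
if "rms" in rmap:
    del rmap["rms"]
```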

tristpinsm commented 4 months ago

That's what I'm doing in the change proposed here. I don't know what is causing the segfault; it may not be related, but it seemed suspicious.

Why do we have an rms dataset in the stacking pipeline ringmaps?

ssiegelx commented 4 months ago

Hmm, sorry, I had a vague memory of us deciding to discontinue use of the rms dataset in favour of the weights dataset when the draco ringmapmaker was created. However, it does seem like `BeamformEW` creates an rms dataset, which duplicates the information in the weight dataset.

ljgray commented 4 months ago

Ok yeah, that should probably be added as an issue in caput then.

~~The stacking pipeline just calls `draco.analysis.ringmapmaker.RingMapMaker`, which is essentially just a chain calling `MakeVisGrid`, `BeamformNS`, and `BeamformEW`. Looking at those tasks, `BeamformEW` initializes the rms dataset and writes values to it, so to me it looks like this is the intended behaviour.~~ Seth already answered

ljgray commented 4 months ago

Ah. Deleting the dataset is a red herring. This is the same issue referenced here: https://github.com/radiocosmology/caput/issues/165

Redistributing over any axis other than el should work fine to write out the file, so I think the fix would be to redistribute back to the "freq" axis at the end of the task.
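
A rough sketch of what that could look like at the end of the task (illustrative only, not the actual DayenuDelayFilterMap code; the method body just shows the suggested pattern):

```python
# Illustrative only: the suggested "redistribute back to freq before writing"
# pattern, not the real DayenuDelayFilterMap implementation.
def process(self, ringmap):
    # Distribute over el so the frequency axis is local for the delay filter.
    ringmap.redistribute("el")

    # ... apply the delay filter here ...

    # Redistribute back over frequency before the container is written out,
    # avoiding the problem with datasets distributed over their last axis
    # (radiocosmology/caput#165).
    ringmap.redistribute("freq")

    return ringmap
```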

tristpinsm commented 4 months ago

> Ah. Deleting the dataset is a red herring. This is the same issue referenced here: radiocosmology/caput#165
>
> Redistributing over any axis other than el should work fine to write out the file, so I think the fix would be to redistribute back to the "freq" axis at the end of the task.

I didn't read through all of that but will try your suggestion. My pipeline was distributed over 128 ranks, and I'm pretty sure that's less than the size of el though (?)

ljgray commented 4 months ago

> I didn't read through all of that but will try your suggestion. My pipeline was distributed over 128 ranks, and I'm pretty sure that's less than the size of el though (?)

Yeah, I think the title of that issue was written before we figured out that that wasn't actually the problem. Unfortunately, we never managed to figure out what the issue is, other than that it seems to be some sort of memory leak when trying to slice into a dataset that's distributed over its last axis. I'm going to change the name of the issue to something more accurate.

tristpinsm commented 4 months ago

What's weird is that I was using this `DayenuDelayFilterMap` task all the time last year and never had this issue.

tristpinsm commented 4 months ago

In any case, seems to work now. Thanks!

I will turn this into a real PR.

ssiegelx commented 4 months ago

Does the code crash because it can't distribute rms over el, or does it just issue a warning?

tristpinsm commented 4 months ago

That's a good point. It was segfaulting, which I assumed was because it was trying to redistribute over an axis that isn't there, but in fact that may have been a symptom of this other issue. I can try and confirm that.

ljgray commented 4 months ago

> That's a good point. It was segfaulting, which I assumed was because it was trying to redistribute over an axis that isn't there, but in fact that may have been a symptom of this other issue. I can try and confirm that.

After the segfault, can you use `seff` to check the memory usage of the job? I would bet that the crash/segfault is actually due to an OOM, which cedar doesn't always report properly.

tristpinsm commented 4 months ago

I can successfully filter ringmaps with only the redistribute added, so I removed the commit that deletes the rms dataset.