ratt-ru / pfb-imaging

Preconditioned forward/backward clean algorithm
MIT License

Memory flooding #106

Open bngcebetsha opened 1 week ago

bngcebetsha commented 1 week ago

I was running these steps:

pfb init --ms ../1556467257_sdp_l0.f0.2nd_avg.1k_8s.ms -ipi -1 -cpi 128 -o output --data-column 'CORRECTED_DATA-MODEL_DATA' --weight-column WEIGHT_SPECTRUM --scheduler distributed -nw 32 -ntd 1 -nvt 1 --chan-average 1 --bda-decorr 0.98 -ldir pfb-logs

pfb grid -nb 8 -o output --nvthreads 8 -nw 4 -ntd 1 --scheduler distributed --nx 10000 --ny 10000 --cell-size 1.3 -ldir pfb-logs --robustness 0.0 # --overwrite

pfb fluxmop -o output -nb 8 -ldir pfb-logs --cg-tol 0.005 --cg-verbose 2 --cg-report-freq 1 --gamma 1.0 --mask mask4.fits --no-use-psf --memory-greedy --fits-cubes --scheduler threads -ntd 8 -nvt 8 --epsilon 0.0004

The first step ran indefinitely, so I ran htop to check the memory footprint. On the 500G machine (young.ru), the processes I started were approaching 100% consumption. I issued a Ctrl-C and waited, but the memory remained at 100% and the terminal was unresponsive. I can no longer log back in. I do hope this is something that can be fixed with a simple reboot by the sysadmin.

o-smirnov commented 1 week ago

Yeah you killed it... I rebooted, no worries.

Put "ulimit -m ..." in your .bashrc so this doesn't happen again. This puts a memory limit on your jobs so they get killed before they cramp up the system.

landmanbester commented 1 week ago

@bngcebetsha thanks for reporting. This is a known issue with the dask collections interface and the whole reason why I created the imit app as opposed to the init app (if you look carefully at the command I sent you, you will see it is for the former). They are supposed to do the same thing, but one uses dask collections and the other uses the client interface directly. The next release won't depend on the collections interface for this step anymore and init will be the command to use. Sorry if this is a bit confusing; it is only temporary.
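To illustrate the distinction in generic dask terms (a sketch only, not pfb code): the collections interface builds one lazy graph that the scheduler optimises as a whole, while the client interface submits work to the workers explicitly.

import dask.array as da
from dask.distributed import Client

# Both paths use the same cluster; only the way work is expressed differs.
client = Client()

# Collections interface: build a lazy graph, let dask schedule it as a whole.
x = da.ones((10_000, 10_000), chunks=(1_000, 1_000))
total = x.sum().compute()

# Client interface: submit explicit units of work and gather the futures.
futures = [client.submit(pow, i, 2) for i in range(16)]
results = client.gather(futures)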

landmanbester commented 1 week ago

Put "ulimit -m ..." in your .bashrc so this doesn't happen again. This puts a memory limit on your jobs so they get killed before they cramp up the system.

I should probably set up the dask cluster with a memory limit, but I've found that once it starts spilling to disk you might as well kill the run (at least that used to be the case). I probably need to look at this again. In the meantime, I can set the following at startup:

import psutil
import resource

# Cap the process's address space at the machine's total RAM so that an
# allocation beyond it fails with a MemoryError instead of swamping the host.
mem_total = psutil.virtual_memory().total
_, hardlim = resource.getrlimit(resource.RLIMIT_AS)  # keep the existing hard limit
resource.setrlimit(resource.RLIMIT_AS, (mem_total, hardlim))

which should raise a MemoryError if the total memory is exceeded. I'm just not sure whether that may have unintended consequences. What do you reckon @o-smirnov?
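For completeness, the cluster-level memory limit mentioned above would look roughly like the sketch below; the worker count and limit value are placeholders, not recommendations.

from dask.distributed import Client, LocalCluster

# Sketch: cap each worker's memory so dask pauses/restarts it before the
# host runs out of RAM; placeholder values only.
cluster = LocalCluster(n_workers=16, threads_per_worker=1, memory_limit="32GB")
client = Client(cluster)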

landmanbester commented 1 week ago

The next release won't depend on the collections interface for this step anymore and init will be the the command to use

This has been done in the band_actors branch you are using @bngcebetsha. Please let me know if you run into issues with this

bngcebetsha commented 6 days ago

The next release won't depend on the collections interface for this step anymore and init will be the the command to use

This has been done in the band_actors branch you are using @bngcebetsha. Please let me know if you run into issues with this

error.md

landmanbester commented 6 days ago

Bizarre! I don't see an actual error but the amount of spam produced by dask makes it hard to be sure. Point me at the data and I will take a look. I see it only has 16 chunks to work with (are there only 2 scans in this MS?). I've never tried with fewer chunks than workers, it may be that they are timing out or something. Maybe try with -nw 16

bngcebetsha commented 6 days ago

Bizarre! I don't see an actual error but the amount of spam produced by dask makes it hard to be sure. Point me at the data and I will take a look. I see it only has 16 chunks to work with (are there only 2 scans in this MS?). I've never tried with fewer chunks than workers, it may be that they are timing out or something. Maybe try with -nw 16

Yes there are only 2 scans - I have split out the target field. The data is on young (path -> bngcebetsha/cal_quartical/1556467257_sdp_l0.f0.2nd_avg.1k_8s.ms)

landmanbester commented 6 days ago

@bngcebetsha I can reproduce your error. It looks like it is indeed because there are more workers than tasks (see the workaround suggested in https://github.com/ratt-ru/pfb-imaging/issues/108), so this is a separate issue. Let's keep this one open until I have tested and merged my fix for the original memory issue you ran into. @Athanaseus note that the memory issues you were seeing should be fixed in the version of the init worker in the band_actors branch.

bngcebetsha commented 6 days ago

I reduced the number of workers (-nw=8) to match the number of frequency chunks implied by --channels-per-image (-cpi=128). It looked like it would go well. To avoid the problem reported here I kept watch on the resources using htop, and the screenshot below shows I might need to tweak other parameters to get this to work.

[screenshot: 20240623_112532_crop - htop showing memory usage]

landmanbester commented 6 days ago

This step is not fast when reading from a measurement set but I'm surprised at the memory footprint given that this is 1k data. I don't have access to young so I can't look at your data directly. You can reduce the memory footprint using the --integrations-per-image parameter. How many time integrations do you have per scan?

bngcebetsha commented 6 days ago

There are two scans spread over 875533 integrations (not evenly). I tried to see how the two scans are distributed: scan 1 -> 852840, scan 21 -> 22692.

landmanbester commented 6 days ago

Oh, wow! Are you sure those are integrations and not rows? With an 8s integration time that implies scan 1 is about 1895 hours on source

bngcebetsha commented 6 days ago

Oh, wow! Are you sure those are integrations and not rows? With an 8s integration time that implies scan 1 is about 1895 hours on source

You are right, that's the number of rows (I assumed that equals the number of integrations).

bngcebetsha commented 5 days ago

I guess the correct number of integrations is total_time/8s -> 4338.16/8 -> 542.27. We are looking at roughly 271 per scan on average.

landmanbester commented 5 days ago

Close enough. I'm surprised you are getting close to 500GB memory usage in that case. It could have something to do with the measurement set columns (eg. if one is tiled funny or something) but I would need access to take a closer look. The workaround is still to set -ipi to something smaller, maybe 100 or so. The grid worker will concatenate these in time (--concat-row is true by default) so it doesn't change anything else

bngcebetsha commented 5 days ago

Thanks @landmanbester - I'll keep tweaking, things seem to be improving a bit. The data looks like this:

2024-06-23 13:56:56     INFO    listobs::ms::summary+   Observation: MeerKAT
2024-06-23 13:56:56     INFO    listobs::MSMetaData::_computeScanAndSubScanProperties   Computing scan and subscan properties...
2024-06-23 13:56:56     INFO    listobs::ms::summary    Data records: 875533       Total elapsed time = 4338.16 seconds
2024-06-23 13:56:56     INFO    listobs::ms::summary+      Observed from   28-Apr-2019/16:03:41.9   to   28-Apr-2019/17:16:00.0 (UTC)
2024-06-23 13:56:56     INFO    listobs::ms::summary
2024-06-23 13:56:56     INFO    listobs::ms::summary+      ObservationID = 0         ArrayID = 0
2024-06-23 13:56:56     INFO    listobs::ms::summary+     Date        Timerange (UTC)          Scan  FldId FieldName             nRows     SpwIds   Average Interval(s)    ScanIntent
2024-06-23 13:56:56     INFO    listobs::ms::summary+     28-Apr-2019/16:03:41.9 - 17:03:42.4     1      0 J1429-6240T             852841  [0]  [7.98] [TARGET]
2024-06-23 13:56:56     INFO    listobs::ms::summary+                 17:14:26.1 - 17:16:00.0    21      0 J1429-6240T              22692  [0]  [7.68] [TARGET]
2024-06-23 13:56:56     INFO    listobs::ms::summary               (nRows = Total number of rows per scan)
2024-06-23 13:56:56     INFO    listobs::ms::summary    Fields: 1

I am currently getting this exception thrown:

kwargs:    {'dc1': 'CORRECTED_DATA', 'dc2': 'MODEL_DATA', 'operator': '-', 'ds': <xarray.Dataset> Size: 2GB
Dimensions:          (row: 189100, uvw: 3, chan: 128, corr: 4)
Coordinates:
    ROWID            (row) int32 756kB dask.array<chunksize=(189100,), meta=np.ndarray>
Dimensions without coordinates: row, uvw, chan, corr
Data variables:
    ANTENNA1         (row) int32 756kB dask.array<chunksize=(189100,), meta=np.ndarray>
    INTERVAL         (row) float64 2MB dask.array<chunksize=(189100,), meta=np.ndarray>
    UVW              (row, uvw) float64 5MB dask.array<chunksize=(189100, 3), meta=np.ndarray>
    TIME             (row) float64 2MB dask.array<chunksize=(189100,), meta=np.ndarray>
    FLAG             (row, chan, corr) bool 97MB dask.array<chunksize=(189100, 128, 4), meta=np.ndarray>
    FLAG_ROW         (row) bool 189kB dask.array<chunksize=(189100,), meta=np.ndarray>
    ANTENNA2         (row) int32 756kB dask.array<chunksize=(189100,), meta=np.ndarray>
    WEIGHT_SPECTRUM  (row, chan
Exception: 'AttributeError("\'UnicodeType\' object has no attribute \'literal_value\'")'

The init step proceeds but memory footprint goes beyond 400G so I stop the execution.

This might give more insight:

    ovf_result = self._overload_func(*args, **kws)
  File "/home/bngcebetsha/cal_quartical/pfb/pfb-imaging/pfb/utils/weighting.py", line 271, in nb_weight_data_impl
    vis_func, wgt_func = stokes_funcs(data, jones, product, pol, nc)
  File "/home/bngcebetsha/cal_quartical/pfb/pfb-imaging/pfb/utils/stokes.py", line 41, in stokes_funcs
    if pol.literal_value == 'linear':
AttributeError: 'UnicodeType' object has no attribute 'literal_value'

I tried making my ms as small as possible to at least get something working - it finally threw the above and bailed.

landmanbester commented 5 days ago

That exception must be coming from numexpr but I have not seen it before. Let me try to reproduce once I have access to the data.

The init step proceeds but memory footprint goes beyond 400G so I stop the execution.

I would be curious to find out if the change I made kills the program automatically when it hits max memory. I'm surprised by this but you can reduce the memory footprint further by using fewer workers or smaller time chunks.
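As a side note, a quick standalone check of that mechanism could look like the sketch below; it assumes the RLIMIT_AS soft limit has already been lowered as in the snippet above.

import resource
import numpy as np

# With the soft limit lowered, an allocation larger than the limit should
# fail fast with MemoryError rather than pushing the machine into swap.
soft, _ = resource.getrlimit(resource.RLIMIT_AS)
try:
    buf = np.empty(soft, dtype=np.uint8)  # deliberately exceeds the soft limit
    print("limit not enforced?")
except MemoryError:
    print("allocation refused, the limit kicks in as expected")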

sjperkins commented 5 days ago
    ovf_result = self._overload_func(*args, **kws)
  File "/home/bngcebetsha/cal_quartical/pfb/pfb-imaging/pfb/utils/weighting.py", line 271, in nb_weight_data_impl
    vis_func, wgt_func = stokes_funcs(data, jones, product, pol, nc)
  File "/home/bngcebetsha/cal_quartical/pfb/pfb-imaging/pfb/utils/stokes.py", line 41, in stokes_funcs
    if pol.literal_value == 'linear':
AttributeError: 'UnicodeType' object has no attribute 'literal_value'

That looks like a numba error; I imagine it's not getting called with numba.literally?
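For context, a minimal sketch of the distinction (generic numba, not pfb code): .literal_value only exists on literal types, so an overload can read it only when the string argument is dispatched as a literal (e.g. via numba.literally or types.literal).

from numba import types

# A plain unicode type is what the overload received here; it has no
# .literal_value, hence the AttributeError above.
uni = types.unicode_type
print(hasattr(uni, 'literal_value'))      # False

# A string literal type, as produced by types.literal() or by requesting
# literal dispatch with numba.literally(), does carry the value.
lit = types.literal('linear')
print(lit.literal_value)                  # 'linear'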

landmanbester commented 5 days ago
    ovf_result = self._overload_func(*args, **kws)
  File "/home/bngcebetsha/cal_quartical/pfb/pfb-imaging/pfb/utils/weighting.py", line 271, in nb_weight_data_impl
    vis_func, wgt_func = stokes_funcs(data, jones, product, pol, nc)
  File "/home/bngcebetsha/cal_quartical/pfb/pfb-imaging/pfb/utils/stokes.py", line 41, in stokes_funcs
    if pol.literal_value == 'linear':
AttributeError: 'UnicodeType' object has no attribute 'literal_value'

That looks like a numba error; I imagine it's not getting called with numba.literally?

Is this in the error.md file? Not sure how I missed that

sjperkins commented 5 days ago

Is this in the error.md file? Not sure how I missed that

Bottom of this comment: https://github.com/ratt-ru/pfb-imaging/issues/106#issuecomment-2185001113

landmanbester commented 5 days ago

Alright. I think I may have an answer. Looking at the tiling

[screenshot: data manager / tiling info for the MS columns]

we see that the frequency axis of the CORRECTED_DATA column essentially isn't tiled whereas QC_CORRECTED_DATA is tiled. If I run your command with QC_CORRECTED_DATA i.e.

pfb init --ms /net/young/home/bngcebetsha/cal_quartical/1556467257_sdp_l0.f0.2nd_avg.1k_8s.ms -ipi -1 -cpi 128 -o output --data-column 'QC_CORRECTED_DATA' --weight-column WEIGHT_SPECTRUM -nw 16 --bda-decorr 0.98 -ldir pfb-logs --overwrite

It runs through in just under 2mins with the expected memory usage

INFO      10:27:21 - INIT               |                output_filename = output
INFO      10:27:21 - INIT               |                  log_directory = pfb-logs
INFO      10:27:21 - INIT               |                        product = I
INFO      10:27:21 - INIT               |                             ms = ['file:///net/young/home/bngcebetsha/cal_quartical/1556467257_sdp_l0.f0.2nd_avg.1k_8s.ms']
INFO      10:27:21 - INIT               |                          scans = None
INFO      10:27:21 - INIT               |                          ddids = None
INFO      10:27:21 - INIT               |                         fields = None
INFO      10:27:21 - INIT               |                     freq_range = None
INFO      10:27:21 - INIT               |                      overwrite = True
INFO      10:27:21 - INIT               |                          radec = None
INFO      10:27:21 - INIT               |                    data_column = QC_CORRECTED_DATA
INFO      10:27:21 - INIT               |                  weight_column = WEIGHT_SPECTRUM
INFO      10:27:21 - INIT               |                   sigma_column = None
INFO      10:27:21 - INIT               |                    flag_column = FLAG
INFO      10:27:21 - INIT               |                     gain_table = None
INFO      10:27:21 - INIT               |         integrations_per_image = -1
INFO      10:27:21 - INIT               |             channels_per_image = 128
INFO      10:27:21 - INIT               |                      precision = double
INFO      10:27:21 - INIT               |                     bda_decorr = 0.98
INFO      10:27:21 - INIT               |                     beam_model = None
INFO      10:27:21 - INIT               |                   chan_average = 1
INFO      10:27:21 - INIT               |                         target = None
INFO      10:27:21 - INIT               |                    progressbar = True
INFO      10:27:21 - INIT               |                   host_address = None
INFO      10:27:21 - INIT               |                       nworkers = 16
INFO      10:27:21 - INIT               |                  nthreads_dask = 1
INFO      10:27:21 - INIT               |                      nvthreads = 1
INFO      10:27:21 - INIT               | Initialising client with LocalCluster.
INFO      10:27:22 - INIT               | Dask Dashboard URL at http://127.0.0.1:8787/status
INFO      10:27:23 - INIT               | Overwriting output_I.xds
INFO      10:27:23 - INIT               | Data products will be stored in file:///home/bester/projects/buntu/output_I.xds
INFO      10:27:23 - INIT               | Constructing mapping
INFO      10:27:24 - INIT               | Using weights from WEIGHT_SPECTRUM column
Processing: 16/16

INFO      10:29:18 - INIT               | Freq and time selection resulted in 8 output bands and 2 output times
INFO      10:29:20 - INIT               | All done after 118.50805354118347s

So I think the problem lies in the tiling of the MS columns. Note that QC_CORRECTED_DATA was created using dask-ms. @bngcebetsha do you know how the CORRECTED_DATA (and MODEL_DATA for that matter) column was created?
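For reference, the tiling can be checked programmatically with python-casacore along the lines of the sketch below; the exact keys in the returned data manager record can vary by storage manager.

from casacore.tables import table

# Sketch: print the data manager type and default tile shape for each column.
ms = "1556467257_sdp_l0.f0.2nd_avg.1k_8s.ms"
t = table(ms, ack=False)
for col in ("CORRECTED_DATA", "QC_CORRECTED_DATA"):
    dm = t.getdminfo(col)  # data manager record handling this column
    print(col, dm.get("TYPE"), dm.get("SPEC", {}).get("DEFAULTTILESHAPE"))
t.close()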

landmanbester commented 5 days ago
    ovf_result = self._overload_func(*args, **kws)
  File "/home/bngcebetsha/cal_quartical/pfb/pfb-imaging/pfb/utils/weighting.py", line 271, in nb_weight_data_impl
    vis_func, wgt_func = stokes_funcs(data, jones, product, pol, nc)
  File "/home/bngcebetsha/cal_quartical/pfb/pfb-imaging/pfb/utils/stokes.py", line 41, in stokes_funcs
    if pol.literal_value == 'linear':
AttributeError: 'UnicodeType' object has no attribute 'literal_value'

Hold on. This is unrelated and probably has something to do with your installation. Please open a separate issue if you still see this after creating a fresh virtualenv and updating pip setuptools wheel before installing pfb

bngcebetsha commented 5 days ago

So I think the problem lies in the tiling of the MS columns. Note that QC_CORRECTED_DATA was created using dask-ms. @bngcebetsha do you know how the CORRECTED_DATA (and MODEL_DATA for that matter) column was created?

Interesting - the CORRECTED_DATA and MODEL_DATA columns were created by me: I applied bandpass and gain calibration solutions located here on young (/net/sinatra/vault/oms/ProxCen-data/proxima_buntu_ben), derived from when I still worked on the SARAO servers. The FITS images I used to derive the model also live in that same path (under J1429-6240).

The MS I used there became corrupted when I tried to continue work on the Rhodes node, so I split the target field again and restarted from the applycal step. Long story short, the gain and bandpass solutions were averaged in frequency and time (I will need to check with @bennahugo what the values were).

landmanbester commented 5 days ago

Ok, so there are a number of options available to you. The easiest is just to do this step in serial (i.e. --nworkers 1). It may take a bit longer but given that this step only has to be done once upfront that shouldn't matter. Make sure to pull the latest band_workers branch and to

$ pip install -U pip setuptools wheel

before installing pfb

bngcebetsha commented 5 days ago

Ok, so there are a number of options available to you. The easiest is just to do this step in serial (i.e. --nworkers 1). It may take a bit longer but given that this step only has to be done once upfront that shouldn't matter. Make sure to pull the latest band_workers branch and to

$ pip install -U pip setuptools wheel

before installing pfb

That worked, and as you suggested I opened a new issue for the remaining error: https://github.com/ratt-ru/pfb-imaging/issues/109. I might need to close THIS issue soon.

landmanbester commented 5 days ago

Leave it open please. I will close it once I've confirmed a few things