Yeah you killed it... I rebooted, no worries.
Put "ulimit -m ..." in your .bashrc so this doesn't happen again. This puts a memory limit on your jobs so they get killed before they cramp up the system.
@bngcebetsha thanks for reporting. This is a known issue with the dask collections interface and the whole reason why I created the `imit` app as opposed to the `init` app (if you look carefully at the command I sent you, you will see it is for the former). They are supposed to do the same thing but one uses collections and the other uses the client interface directly. The next release won't depend on the collections interface for this step anymore and `init` will be the command to use. Sorry if this is a bit confusing, it is only temporary.
Put "ulimit -m ..." in your .bashrc so this doesn't happen again. This puts a memory limit on your jobs so they get killed before they cramp up the system.
I should probably set up the dask cluster with a memory limit but I've found that once it starts spilling to disk you might as well kill the run (at least it used to be the case). I probably need to look at this again. In the meantime, I can set the following at startup:

```python
import psutil
import resource

# Total physical RAM on the node
mem_total = psutil.virtual_memory().total

# Keep the existing hard limit, cap the soft address-space limit at physical RAM
_, hardlim = resource.getrlimit(resource.RLIMIT_AS)
resource.setrlimit(resource.RLIMIT_AS, (mem_total, hardlim))
```
which should raise a MemoryError if the total memory is exceeded. I'm just not sure if that may have unintended consequences. What do you reckon @o-smirnov?
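For reference, the cluster-level limit would look something like this (a minimal sketch with made-up numbers; `memory_limit` is the standard per-worker knob on dask.distributed's LocalCluster):

```python
from dask.distributed import Client, LocalCluster

# Illustrative values only: 8 workers, each capped at 16GB. Workers
# start spilling to disk near the default target fraction and the
# nanny restarts them if they exceed the hard limit.
cluster = LocalCluster(n_workers=8, threads_per_worker=1,
                       memory_limit="16GB")
client = Client(cluster)
```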
> The next release won't depend on the collections interface for this step anymore and `init` will be the command to use

This has been done in the band_actors branch you are using @bngcebetsha. Please let me know if you run into issues with this.
Bizarre! I don't see an actual error but the amount of spam produced by dask makes it hard to be sure. Point me at the data and I will take a look. I see it only has 16 chunks to work with (are there only 2 scans in this MS?). I've never tried with fewer chunks than workers, it may be that they are timing out or something. Maybe try with -nw 16
Yes there are only 2 scans - I have split out the target field. The data is on young (path -> `bngcebetsha/cal_quartical/1556467257_sdp_l0.f0.2nd_avg.1k_8s.ms`)
@bngcebetsha I can reproduce your error. It looks like it is indeed because there are more workers than tasks (see the workaround suggested in https://github.com/ratt-ru/pfb-imaging/issues/108) so this is a separate issue. Let's keep this one open until I have tested and merged my fix for the original memory issue you ran into. @Athanaseus note that the memory issues you were seeing should be fixed in the version of the `init` worker in the band_actors branch.
I reduced the number of workers (`-nw=8`) to match the `--channels-per-image` (`-cpi=128`) value. It looked like it would go well and avoid the problem reported here. I kept watch on the resources using `htop`, and the screenshot below shows I might need to tweak other parameters to get this to work.
This step is not fast when reading from a measurement set but I'm surprised at the memory footprint given that this is 1k data. I don't have access to young so I can't look at your data directly. You can reduce the memory footprint using the `--integrations-per-image` parameter. How many time integrations do you have per scan?
There are two scans spread over 875533 integrations (not evenly) - I tried to see how the two scans are distributed: scan 1 -> 852840, scan 21 -> 22692
Oh, wow! Are you sure those are integrations and not rows? With an 8s integration time that implies scan 1 is about 1895 hours on source
You are right, that's the number of rows (I assumed that equals the number of integrations). I guess the correct number of integrations is total_time/8s -> 4338.16/8 -> 542.27. We are looking at ~271 per scan.
Close enough. I'm surprised you are getting close to 500GB memory usage in that case. It could have something to do with the measurement set columns (e.g. if one is tiled funny or something) but I would need access to take a closer look. The workaround is still to set `-ipi` to something smaller, maybe 100 or so. The `grid` worker will concatenate these in time (`--concat-row` is true by default) so it doesn't change anything else.
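As a rough sanity check of the rows-versus-integrations arithmetic above (my own illustration; the antenna count is inferred from the numbers, not read from the MS):

```python
# Each integration produces roughly one row per baseline, which is
# why the row count dwarfs the integration count.
total_time = 4338.16   # seconds, from listobs
int_time = 8.0         # nominal integration time in seconds
n_rows = 875533        # total rows, from listobs

n_integrations = total_time / int_time            # ~542
rows_per_integration = n_rows / n_integrations    # ~1615

# 1615 ~ n_ant * (n_ant - 1) / 2 for n_ant ~ 57, i.e. a subset of
# MeerKAT's 64 dishes.
print(f"{n_integrations:.0f} integrations, "
      f"{rows_per_integration:.0f} rows per integration")
```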
Thanks @landmanbester - I'll keep tweaking around, things seem to be improving a bit. The data looks like this:
```
2024-06-23 13:56:56 INFO listobs::ms::summary+ Observation: MeerKAT
2024-06-23 13:56:56 INFO listobs::MSMetaData::_computeScanAndSubScanProperties Computing scan and subscan properties...
2024-06-23 13:56:56 INFO listobs::ms::summary Data records: 875533 Total elapsed time = 4338.16 seconds
2024-06-23 13:56:56 INFO listobs::ms::summary+ Observed from 28-Apr-2019/16:03:41.9 to 28-Apr-2019/17:16:00.0 (UTC)
2024-06-23 13:56:56 INFO listobs::ms::summary
2024-06-23 13:56:56 INFO listobs::ms::summary+ ObservationID = 0 ArrayID = 0
2024-06-23 13:56:56 INFO listobs::ms::summary+ Date Timerange (UTC) Scan FldId FieldName nRows SpwIds Average Interval(s) ScanIntent
2024-06-23 13:56:56 INFO listobs::ms::summary+ 28-Apr-2019/16:03:41.9 - 17:03:42.4 1 0 J1429-6240T 852841 [0] [7.98] [TARGET]
2024-06-23 13:56:56 INFO listobs::ms::summary+ 17:14:26.1 - 17:16:00.0 21 0 J1429-6240T 22692 [0] [7.68] [TARGET]
2024-06-23 13:56:56 INFO listobs::ms::summary (nRows = Total number of rows per scan)
2024-06-23 13:56:56 INFO listobs::ms::summary Fields: 1
```
I am currently getting this exception thrown:
```
kwargs: {'dc1': 'CORRECTED_DATA', 'dc2': 'MODEL_DATA', 'operator': '-', 'ds': <xarray.Dataset> Size: 2GB
Dimensions: (row: 189100, uvw: 3, chan: 128, corr: 4)
Coordinates:
    ROWID            (row) int32 756kB dask.array<chunksize=(189100,), meta=np.ndarray>
Dimensions without coordinates: row, uvw, chan, corr
Data variables:
    ANTENNA1         (row) int32 756kB dask.array<chunksize=(189100,), meta=np.ndarray>
    INTERVAL         (row) float64 2MB dask.array<chunksize=(189100,), meta=np.ndarray>
    UVW              (row, uvw) float64 5MB dask.array<chunksize=(189100, 3), meta=np.ndarray>
    TIME             (row) float64 2MB dask.array<chunksize=(189100,), meta=np.ndarray>
    FLAG             (row, chan, corr) bool 97MB dask.array<chunksize=(189100, 128, 4), meta=np.ndarray>
    FLAG_ROW         (row) bool 189kB dask.array<chunksize=(189100,), meta=np.ndarray>
    ANTENNA2         (row) int32 756kB dask.array<chunksize=(189100,), meta=np.ndarray>
    WEIGHT_SPECTRUM  (row, chan
Exception: 'AttributeError("\'UnicodeType\' object has no attribute \'literal_value\'")'
```
The `init` step proceeds but memory footprint goes beyond 400G so I stop the execution.
This might give more insight:
```
ovf_result = self._overload_func(*args, **kws)
  File "/home/bngcebetsha/cal_quartical/pfb/pfb-imaging/pfb/utils/weighting.py", line 271, in nb_weight_data_impl
    vis_func, wgt_func = stokes_funcs(data, jones, product, pol, nc)
  File "/home/bngcebetsha/cal_quartical/pfb/pfb-imaging/pfb/utils/stokes.py", line 41, in stokes_funcs
    if pol.literal_value == 'linear':
AttributeError: 'UnicodeType' object has no attribute 'literal_value'
```
I tried making my ms as small as possible to at least get something working - it finally threw the above and bailed.
That exception must be coming from numexpr but I have not seen it before. Let me try to reproduce once I have access to the data.
> The `init` step proceeds but memory footprint goes beyond 400G so I stop the execution.
I would be curious to find out if the change I made kills the program automatically when it hits max memory. I'm surprised by this but you can reduce the memory footprint further by using fewer workers or smaller time chunks.
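A quick way to check that behaviour locally (my own sketch, not project code; the 2 GiB cap is arbitrary and numpy just provides a big allocation):

```python
import resource
import numpy as np

# Cap the address space at 2 GiB; keep the existing hard limit.
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
resource.setrlimit(resource.RLIMIT_AS, (2 * 1024**3, hard))

try:
    x = np.ones(10 * 1024**3 // 8)  # ~10 GiB of float64
except MemoryError:
    # The allocation is refused at the rlimit instead of the kernel
    # OOM killer taking out the whole process (or the machine).
    print("caught MemoryError, process survives")
```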
```
ovf_result = self._overload_func(*args, **kws)
  File "/home/bngcebetsha/cal_quartical/pfb/pfb-imaging/pfb/utils/weighting.py", line 271, in nb_weight_data_impl
    vis_func, wgt_func = stokes_funcs(data, jones, product, pol, nc)
  File "/home/bngcebetsha/cal_quartical/pfb/pfb-imaging/pfb/utils/stokes.py", line 41, in stokes_funcs
    if pol.literal_value == 'linear':
AttributeError: 'UnicodeType' object has no attribute 'literal_value'
```
That looks like a numba error, I imagine it's not getting called with `literally`?
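For context, this is numba's literal-dispatch pattern (a minimal sketch with hypothetical names, not the pfb code): `.literal_value` only exists once the argument is a `types.Literal`, which is what `numba.literally` forces.

```python
import numba
from numba.extending import overload

def select_stokes(pol):
    # Pure-python stub; only the overload below runs under njit.
    raise NotImplementedError

@overload(select_stokes)
def select_stokes_impl(pol):
    if not isinstance(pol, numba.types.Literal):
        # pol arrived as a plain unicode_type, so pol.literal_value
        # does not exist -- exactly the AttributeError above. Ask
        # numba to recompile with pol as a literal instead.
        return lambda pol: numba.literally(pol)
    if pol.literal_value == "linear":
        return lambda pol: 0
    return lambda pol: 1

@numba.njit
def demo():
    return select_stokes("linear")

print(demo())  # prints 0
```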
Is this in the error.md file? Not sure how I missed that
Bottom of this comment: https://github.com/ratt-ru/pfb-imaging/issues/106#issuecomment-2185001113
Alright. I think I may have an answer. Looking at the tiling we see that the frequency axis of the CORRECTED_DATA column essentially isn't tiled whereas QC_CORRECTED_DATA is tiled. If I run your command with QC_CORRECTED_DATA, i.e.

```
pfb init --ms /net/young/home/bngcebetsha/cal_quartical/1556467257_sdp_l0.f0.2nd_avg.1k_8s.ms -ipi -1 -cpi 128 -o output --data-column 'QC_CORRECTED_DATA' --weight-column WEIGHT_SPECTRUM -nw 16 --bda-decorr 0.98 -ldir pfb-logs --overwrite
```

it runs through in just under 2 mins with the expected memory usage:
```
INFO 10:27:21 - INIT | output_filename = output
INFO 10:27:21 - INIT | log_directory = pfb-logs
INFO 10:27:21 - INIT | product = I
INFO 10:27:21 - INIT | ms = ['file:///net/young/home/bngcebetsha/cal_quartical/1556467257_sdp_l0.f0.2nd_avg.1k_8s.ms']
INFO 10:27:21 - INIT | scans = None
INFO 10:27:21 - INIT | ddids = None
INFO 10:27:21 - INIT | fields = None
INFO 10:27:21 - INIT | freq_range = None
INFO 10:27:21 - INIT | overwrite = True
INFO 10:27:21 - INIT | radec = None
INFO 10:27:21 - INIT | data_column = QC_CORRECTED_DATA
INFO 10:27:21 - INIT | weight_column = WEIGHT_SPECTRUM
INFO 10:27:21 - INIT | sigma_column = None
INFO 10:27:21 - INIT | flag_column = FLAG
INFO 10:27:21 - INIT | gain_table = None
INFO 10:27:21 - INIT | integrations_per_image = -1
INFO 10:27:21 - INIT | channels_per_image = 128
INFO 10:27:21 - INIT | precision = double
INFO 10:27:21 - INIT | bda_decorr = 0.98
INFO 10:27:21 - INIT | beam_model = None
INFO 10:27:21 - INIT | chan_average = 1
INFO 10:27:21 - INIT | target = None
INFO 10:27:21 - INIT | progressbar = True
INFO 10:27:21 - INIT | host_address = None
INFO 10:27:21 - INIT | nworkers = 16
INFO 10:27:21 - INIT | nthreads_dask = 1
INFO 10:27:21 - INIT | nvthreads = 1
INFO 10:27:21 - INIT | Initialising client with LocalCluster.
INFO 10:27:22 - INIT | Dask Dashboard URL at http://127.0.0.1:8787/status
INFO 10:27:23 - INIT | Overwriting output_I.xds
INFO 10:27:23 - INIT | Data products will be stored in file:///home/bester/projects/buntu/output_I.xds
INFO 10:27:23 - INIT | Constructing mapping
INFO 10:27:24 - INIT | Using weights from WEIGHT_SPECTRUM column
Processing: 16/16
INFO 10:29:18 - INIT | Freq and time selection resulted in 8 output bands and 2 output times
INFO 10:29:20 - INIT | All done after 118.50805354118347s
```
So I think the problem lies in the tiling of the MS columns. Note that QC_CORRECTED_DATA was created using `dask-ms`. @bngcebetsha do you know how the CORRECTED_DATA (and MODEL_DATA for that matter) column was created?
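For anyone wanting to check the tiling on their own MS, python-casacore exposes it via `getdminfo` (a minimal sketch; the MS path is the one from this thread and the exact `SPEC` keys depend on the storage manager):

```python
from casacore.tables import table

ms = "1556467257_sdp_l0.f0.2nd_avg.1k_8s.ms"
with table(ms, ack=False) as t:
    for col in ("CORRECTED_DATA", "QC_CORRECTED_DATA"):
        dm = t.getdminfo(col)
        # For a TiledColumnStMan, DEFAULTTILESHAPE gives the
        # (corr, chan, row) tile shape; a tile spanning all channels
        # means the frequency axis is effectively untiled.
        print(col, dm["TYPE"], dm["SPEC"].get("DEFAULTTILESHAPE"))
```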
```
ovf_result = self._overload_func(*args, **kws)
  File "/home/bngcebetsha/cal_quartical/pfb/pfb-imaging/pfb/utils/weighting.py", line 271, in nb_weight_data_impl
    vis_func, wgt_func = stokes_funcs(data, jones, product, pol, nc)
  File "/home/bngcebetsha/cal_quartical/pfb/pfb-imaging/pfb/utils/stokes.py", line 41, in stokes_funcs
    if pol.literal_value == 'linear':
AttributeError: 'UnicodeType' object has no attribute 'literal_value'
```
Hold on. This is unrelated and probably has something to do with your installation. Please open a separate issue if you still see this after creating a fresh virtualenv and updating `pip setuptools wheel` before installing pfb.
Interesting - the `CORRECTED_DATA` and `MODEL_DATA` columns were created by me. I applied bandpass and gain calibration solutions located here on young (`/net/sinatra/vault/oms/ProxCen-data/proxima_buntu_ben`) - derived from when I still worked on the SARAO servers. The fits images I used to derive the model also live in that same path (under `J1429-6240`).
The MS I used there became corrupted when I tried to continue work on the Rhodes node, so I split the target field again and started at the `applycal` step again. Long story short, the gain and bandpass solutions were averaged in frequency and time (I will need to check with @bennahugo what the values were).
Ok, so there are a number of options available to you. The easiest is just to do this step in serial (i.e. `--nworkers 1`). It may take a bit longer but given that this step only has to be done once upfront that shouldn't matter. Make sure to pull the latest band_workers branch and to `pip install -U pip setuptools wheel` before installing pfb.
That worked, and as you suggested I opened a new issue for the other error: https://github.com/ratt-ru/pfb-imaging/issues/109. I might need to close THIS issue soon.
Leave it open please. I will close it once I've confirmed a few things.
I was running these steps:
The first step ran indefinitely, so I ran `htop` to see the memory footprint. On the 500G machine (young.ru), the processes I started were approaching 100% consumption. I issued a Ctrl-C and waited, but the memory remained at 100% and the terminal was unresponsive. I can no longer log back in - I do hope this is something that can be fixed with a simple reboot by Sys Admin.