ratt-ru / CubiCal

A fast radio interferometric calibration suite.
GNU General Public License v2.0
18 stars 13 forks source link

wisdom calculation uses num_workers even when max_chunks is smaller #420

Closed o-smirnov closed 3 years ago

o-smirnov commented 3 years ago

@francescaLoi has a job that's bailing out on the wisdom calculation based on excessive memory (see also @paoloserra). See this log: https://pastebin.com/2F9EsPNT

Strange thing is, it's got a max-chunks setting of something small, like 1. So only 1 worker (out of the 50 configured) is able to get busy at any one time. But it looks like the wisdom calculation bases its estimate on the full number of workers, and bails out in a most alarmist fashion.

Perhaps the solution is not to fix the wisdom calculation, but to adjust the number of workers down based on the number of chunks available. I remember fiddling with this logic, it gets rather twisted... Hmmm, perhaps I'm already doing this: https://github.com/ratt-ru/CubiCal/blob/master/cubical/main.py#L541. Is it just a question of updating the CubiCal version shipped in Stimela?

(Our log should really contain a version number!)

o-smirnov commented 3 years ago

Ahh I see, while the main section does set max_workers based on the number of chunks, the wisdom calculation blissfully ignores this, and reverts to the config setting. The solution is to pass max_workers to estimate_mem as an explicit parameter, as is done for setup_parallelism.

JSKenyon commented 3 years ago

I am a little confused - fiddling with this locally, and the memory estimation uses workers.num_workers, which seems to correctly change when I specify --dist-ncpu 6 but --dist-max-chunks 1. To be clear, this makes workers.num_workers equal to one. Is it plausible that the version of CubiCal used in the example above predates your changes @o-smirnov?

JSKenyon commented 3 years ago

421 doesn't really change anything - I just added a log statement so that the log will contain the CubiCal version. This will make it easier to tell which version is actually being run.

JSKenyon commented 3 years ago

Actually, of course this predates the fix. I believe it was only merged as part of the polcal stuff.

o-smirnov commented 3 years ago

Ah but of course! And now I see wisdom uses workers.num_workers, which should be the correct revised-down number in this scenario.

So it's just a matter of packaging a new release for @francescaLoi.

JSKenyon commented 3 years ago

Happy to do another release and bug one of the Stimela people about it. Before I do so, would you mind reviewing my super tiny PR? I think having the version printed at the top of the log will make life much easier.

o-smirnov commented 3 years ago

bug one of the Stimela people

That'd be me!

o-smirnov commented 3 years ago

OK @francescaLoi @paoloserra, there's a cubical cab tagged 1.5.8 now. Try it in your pipelines?

francescaLoi commented 3 years ago

So it's just a matter of packaging a new release for @francescaLoi.

I'm really honored!

Pipeline is running, I'll give you an answer asap.

francescaLoi commented 3 years ago

It did the self calibration without memory error now:

# INFO      20:27:19 - main               [0.2 4.4 0.0Gb] [dist] Parallelization and distribution options
# INFO      20:27:19 - main               [0.2 4.4 0.0Gb]  - ncpu .............................................. = 50
# INFO      20:27:19 - main               [0.2 4.4 0.0Gb]  - nworker ........................................... = 0
# INFO      20:27:19 - main               [0.2 4.4 0.0Gb]  - nthread ........................................... = 0
# INFO      20:27:19 - main               [0.2 4.4 0.0Gb]  - max-chunks ........................................ = 4
# INFO      20:27:19 - main               [0.2 4.4 0.0Gb]  - min-chunks ........................................ = 0
# INFO      20:27:19 - main               [0.2 4.4 0.0Gb]  - pin ............................................... = 0
# INFO      20:27:19 - main               [0.2 4.4 0.0Gb]  - pin-io ............................................ = False
# INFO      20:27:19 - main               [0.2 4.4 0.0Gb]  - pin-main .......................................... = io
# INFO      20:27:19 - main               [0.2 4.4 0.0Gb]  - safe .............................................. = 1.0
# INFO      20:27:49 - data_handler       [1.1 9.2 0.0Gb]   generated 270 row chunks based on time and DDID
# INFO      20:27:49 - data_handler       [1.1 9.2 0.0Gb]   row chunks yield 270 potential tiles
# INFO      20:27:51 - data_handler       [1.2 9.3 0.0Gb]   coarsening this to 68 tiles (max 4 chunks per tile, based on 49.0/4 requested)
# INFO      20:27:51 - main               [1.2 9.3 0.0Gb] multi-process mode: 4+1 workers, single thread
# INFO      20:27:51 - wisdom             [1.2 9.3 0.0Gb] Detected a total of 503.78GiB of system memory.
# INFO      20:27:51 - wisdom             [1.2 9.3 0.0Gb] Per-solver (worker) memory use estimated at 10.67GiB: 2.12% of total system memory.
# INFO      20:27:51 - wisdom             [1.2 9.3 0.0Gb] Peak I/O memory use estimated at 15.15GiB: 3.01% of total system memory.
# INFO      20:27:51 - wisdom             [1.2 9.3 0.0Gb] Total peak memory usage estimated at 57.81GiB: 11.48% of total system memory.

@o-smirnov, tell me if you need more details.

o-smirnov commented 3 years ago

Thanks @francescaLoi, this confirms we're all sorted!