Closed o-smirnov closed 3 years ago
Ahh I see, while the main section does set max_workers based on the number of chunks, the wisdom calculation blissfully ignores this and reverts to the config setting. The solution is to pass max_workers to estimate_mem as an explicit parameter, as is done for setup_parallelism.
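A minimal sketch of what passing the count explicitly might look like (the signature, argument names, and fallback are illustrative assumptions, not CubiCal's actual API):

```python
# Hypothetical sketch: the estimator takes the effective worker count as
# an explicit argument instead of re-reading the configured --dist-ncpu.
def estimate_mem(per_solver_gib, dist_opts, max_workers=None):
    # Fall back to the config-derived value only when no override is given
    # (one process of the configured ncpu is reserved for I/O).
    if max_workers is None:
        max_workers = dist_opts["ncpu"] - 1
    return per_solver_gib * max_workers

# With --dist-ncpu 50 but --dist-max-chunks 1, only one worker can ever be
# busy, so the caller passes the revised-down count explicitly.
naive = estimate_mem(10.67, {"ncpu": 50})                  # scales by 49 workers
capped = estimate_mem(10.67, {"ncpu": 50}, max_workers=1)  # scales by 1 worker
```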
I am a little confused - fiddling with this locally, the memory estimation uses workers.num_workers, which seems to correctly change when I specify --dist-ncpu 6 but --dist-max-chunks 1. To be clear, this makes workers.num_workers equal to one. Is it plausible that the version of CubiCal used in the example above predates your changes @o-smirnov?
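For reference, a toy version of the clamping behaviour described here (the names and the I/O-process convention are illustrative assumptions; the real logic lives in cubical/main.py):

```python
def effective_workers(ncpu, max_chunks):
    """Revise the solver worker count down to the number of chunks
    that can actually be processed at once (illustrative only)."""
    # One process of ncpu is assumed reserved for I/O; the rest solve.
    solver_workers = max(ncpu - 1, 1)
    # A max_chunks of 0 means "no limit" in this sketch.
    if max_chunks > 0:
        solver_workers = min(solver_workers, max_chunks)
    return solver_workers

effective_workers(6, 1)    # --dist-ncpu 6, --dist-max-chunks 1: one worker
effective_workers(50, 4)   # matches the "4+1 workers" log line further down
```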
Actually, of course this predates the fix. I believe it was only merged as part of the polcal stuff.
Ah but of course! And now I see wisdom uses workers.num_workers, which should be the correct revised-down number in this scenario.
So it's just a matter of packaging a new release for @francescaLoi.
Happy to do another release and bug one of the Stimela people about it. Before I do so, would you mind reviewing my super tiny PR? I think having the version printed at the top of the log will make life much easier.
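For anyone curious, printing the installed version at the top of the log only takes a few lines. This is a generic sketch, not the actual PR; the distribution name and logger name are assumptions:

```python
import logging
from importlib.metadata import version, PackageNotFoundError

log = logging.getLogger("main")
logging.basicConfig(level=logging.INFO)

# Emit the package version as the very first log line, so a pasted log
# immediately reveals which build produced it.
try:
    ver = version("cubical")  # assumed distribution name
except PackageNotFoundError:
    ver = "unknown"
log.info("CubiCal version %s", ver)
```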
> bug one of the Stimela people
That'd be me!
OK @francescaLoi @paoloserra, there's a cubical cab tagged 1.5.8 now. Try it in your pipelines?
> So it's just a matter of packaging a new release for @francescaLoi.
I'm really honored!
Pipeline is running, I'll give you an answer asap.
It did the self-calibration without a memory error now:
# INFO 20:27:19 - main [0.2 4.4 0.0Gb] [dist] Parallelization and distribution options
# INFO 20:27:19 - main [0.2 4.4 0.0Gb] - ncpu .............................................. = 50
# INFO 20:27:19 - main [0.2 4.4 0.0Gb] - nworker ........................................... = 0
# INFO 20:27:19 - main [0.2 4.4 0.0Gb] - nthread ........................................... = 0
# INFO 20:27:19 - main [0.2 4.4 0.0Gb] - max-chunks ........................................ = 4
# INFO 20:27:19 - main [0.2 4.4 0.0Gb] - min-chunks ........................................ = 0
# INFO 20:27:19 - main [0.2 4.4 0.0Gb] - pin ............................................... = 0
# INFO 20:27:19 - main [0.2 4.4 0.0Gb] - pin-io ............................................ = False
# INFO 20:27:19 - main [0.2 4.4 0.0Gb] - pin-main .......................................... = io
# INFO 20:27:19 - main [0.2 4.4 0.0Gb] - safe .............................................. = 1.0
# INFO 20:27:49 - data_handler [1.1 9.2 0.0Gb] generated 270 row chunks based on time and DDID
# INFO 20:27:49 - data_handler [1.1 9.2 0.0Gb] row chunks yield 270 potential tiles
# INFO 20:27:51 - data_handler [1.2 9.3 0.0Gb] coarsening this to 68 tiles (max 4 chunks per tile, based on 49.0/4 requested)
# INFO 20:27:51 - main [1.2 9.3 0.0Gb] multi-process mode: 4+1 workers, single thread
# INFO 20:27:51 - wisdom [1.2 9.3 0.0Gb] Detected a total of 503.78GiB of system memory.
# INFO 20:27:51 - wisdom [1.2 9.3 0.0Gb] Per-solver (worker) memory use estimated at 10.67GiB: 2.12% of total system memory.
# INFO 20:27:51 - wisdom [1.2 9.3 0.0Gb] Peak I/O memory use estimated at 15.15GiB: 3.01% of total system memory.
# INFO 20:27:51 - wisdom [1.2 9.3 0.0Gb] Total peak memory usage estimated at 57.81GiB: 11.48% of total system memory.
@o-smirnov, tell me if you need more details.
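The percentages in the wisdom lines above are consistent with a simple peak = workers × per-solver + I/O model. A sketch using the figures from this log (the formula is inferred from the output, not taken from CubiCal's source; the ~0.02GiB mismatch with the logged 57.81GiB is rounding in the logged inputs):

```python
def wisdom_summary(total_gib, per_solver_gib, io_gib, n_workers):
    # Assumed model: every solver worker peaks simultaneously at the
    # per-solver estimate, on top of the I/O process's peak.
    peak_gib = per_solver_gib * n_workers + io_gib
    return {
        "per_solver_pct": 100 * per_solver_gib / total_gib,
        "io_pct": 100 * io_gib / total_gib,
        "peak_gib": peak_gib,
        "peak_pct": 100 * peak_gib / total_gib,
    }

# Figures from the log above: 503.78GiB total, 4 solver workers.
s = wisdom_summary(503.78, 10.67, 15.15, 4)
```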
Thanks @francescaLoi, this confirms we're all sorted!
@francescaLoi has a job that's bailing out on the wisdom calculation based on excessive memory (see also @paoloserra). See this log: https://pastebin.com/2F9EsPNT
Strange thing is, it's got a max-chunks setting of something small, like 1. So only 1 worker (out of the 50 configured) is able to get busy at any one time. But it looks like the wisdom calculation bases its estimate on the full number of workers, and bails out in a most alarmist fashion.
Perhaps the solution is not to fix the wisdom calculation, but to adjust the number of workers down based on the number of chunks available. I remember fiddling with this logic; it gets rather twisted... Hmmm, perhaps I'm already doing this: https://github.com/ratt-ru/CubiCal/blob/master/cubical/main.py#L541. Is it just a question of updating the CubiCal version shipped in Stimela?
(Our log should really contain a version number!)