insarlab / MintPy

Miami InSAR time-series software in Python
https://mintpy.readthedocs.io

ifgram_inversion: need a command that waits until all dask workers have completed. #73

Closed: falkamelung closed this issue 4 years ago

falkamelung commented 5 years ago

Hi David, after the dask task I have added a function that moves the 40 worker*.e and worker*.o files into separate directories, as below. However, after pysar is completed there are still some worker*.o files in the pysar directory, and it appears that no *.e files exist yet when ut.move_dask_stdout_stderr_files() is run. Is there a command to wait until all dask workers are completed? I tried future.result() and similar, but that did not solve the problem.

            # collect results as each dask future finishes
            # (as_completed is imported from dask.distributed)
            for future, result in as_completed(futures, with_results=True):
                i_future += 1
                print("FUTURE #" + str(i_future), "complete in", time.time() - start_time_subboxes,
                      "seconds. Box:", subbox, "Time:", time.time())
                tsi, temp_cohi, ts_stdi, ifg_numi, subbox = result

                # write the patch results back into the full-size arrays
                ts[:, subbox[1]:subbox[3], subbox[0]:subbox[2]] = tsi
                ts_std[:, subbox[1]:subbox[3], subbox[0]:subbox[2]] = ts_stdi
                temp_coh[subbox[1]:subbox[3], subbox[0]:subbox[2]] = temp_cohi
                num_inv_ifg[subbox[1]:subbox[3], subbox[0]:subbox[2]] = ifg_numi

            # shut down dask workers gracefully
            cluster.close()
            client.close()

        # move the worker*.o / worker*.e files out of the working directory
        ut.move_dask_stdout_stderr_files()
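
For reference, dask.distributed does provide a blocking wait() call; a minimal sketch of holding off the file move until every future has finished (futures, cluster, client and ut as in the snippet above):

from dask.distributed import wait

# block until every submitted future has finished (or raised)
wait(futures)

# close the workers before touching their stdout/stderr files
cluster.close()
client.close()

ut.move_dask_stdout_stderr_files()

One caveat: with dask-jobqueue the worker*.o / worker*.e files are written by LSF itself when each job exits, so they may still appear slightly after cluster.close() returns.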
ehavazli commented 5 years ago

Hello @2gotgrossman @yunjunz ,

I have a question related to enabling dask on local systems. I enabled dask parallelization in my template and ran into some problems in ifgram inversion. First I needed to add cores and memory to line 1131 in ifgram_inversion.py:

cluster = LSFCluster(walltime=inps.walltime, python=python_executable_location, cores=1, memory='5 GB')

but after that dask tried to create and use LSF jobs (which is expected when using LSFCluster). I was wondering if there is a local dask option that hasn't been implemented yet?
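
For what it's worth, dask.distributed ships a LocalCluster that could stand in for LSFCluster on a laptop; a minimal sketch, with illustrative worker settings (not MintPy's defaults):

from dask.distributed import Client, LocalCluster

# run workers as local processes instead of submitting LSF jobs
cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit='5GB')
client = Client(cluster)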

yunjunz commented 5 years ago

Hi @ehavazli, I don't have an answer to your question. @2gotgrossman is better placed to answer it, but he won't be at work anytime soon. All I know is that a PBS scheduler can be added easily; I am not sure about all the other options.

falkamelung commented 5 years ago

I am using it a lot and it works fine for me as it is. The defaults are set in ./MintPy/mintpy/defaults/dask.yaml. I don't understand why you run into problems.

I am using the option below for larger runs:

//login3/projects/scratch/insarlab/famelung/KilaueaSenAT124[1086] grep wall /nethome/famelung/insarlab/OPERATIONS/TEMPLATES/KilaueaSenAT124.template
mintpy.networkInversion.walltime = 03:00

It would be nice to get it to work under PBS; it should be straightforward. I suspect on most systems you don't have to worry about the walltime and can just use a long one. On pegasus, you sometimes get into the queue earlier with short walltimes.
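
For the PBS case, dask-jobqueue already provides a PBSCluster with the same interface as LSFCluster, so the swap might look like the minimal sketch below (walltime and worker count are illustrative):

from dask_jobqueue import PBSCluster

# same constructor pattern as LSFCluster; cores/memory are per job
cluster = PBSCluster(walltime='03:00:00', cores=1, memory='5 GB')
cluster.scale(40)  # request 40 worker jobs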


ehavazli commented 5 years ago

@falkamelung

I am running a test case on my laptop with the following options in my template:

## Parallel processing with Dask for HPC
mintpy.networkInversion.parallel        = yes #[yes / no], auto for no, parallel processing using dask
mintpy.networkInversion.numWorker       = 4   #[int > 0], auto for 40, number of workers for dask cluster to use
mintpy.networkInversion.walltime        = 03:00 #[HH:MM], auto for 00:40, walltime for dask workers

When I run smallbaselineApp.py, I receive the error below:

split 13372 lines into 4 patches for processing
    with each patch up to 4000 lines
reference pixel in y/x: (10651, 3739) from dataset: unwrapPhase
Traceback (most recent call last):
  File "/Users/havazli/MintPy/mintpy/smallbaselineApp.py", line 1067, in <module>
    main()
  File "/Users/havazli/MintPy/mintpy/smallbaselineApp.py", line 1057, in main
    app.run(steps=inps.runSteps, plot=inps.plot)
  File "/Users/havazli/MintPy/mintpy/smallbaselineApp.py", line 1001, in run
    self.run_network_inversion(sname)
  File "/Users/havazli/MintPy/mintpy/smallbaselineApp.py", line 508, in run_network_inversion
    mintpy.ifgram_inversion.main(scp_args.split())
  File "/Users/havazli/MintPy/mintpy/ifgram_inversion.py", line 1258, in main
    ifgram_inversion(inps.ifgramStackFile, inps)
  File "/Users/havazli/MintPy/mintpy/ifgram_inversion.py", line 1131, in ifgram_inversion
    cluster = LSFCluster(walltime=inps.walltime, python=python_executable_location)
  File "/Users/havazli/miniconda3/envs/ts/lib/python3.7/site-packages/dask_jobqueue/lsf.py", line 89, in __init__
    super(LSFCluster, self).__init__(config_name=config_name, **kwargs)
  File "/Users/havazli/miniconda3/envs/ts/lib/python3.7/site-packages/dask_jobqueue/core.py", line 231, in __init__
    "You must specify how many cores to use per job like ``cores=8``"
ValueError: You must specify how many cores to use per job like ``cores=8``

Did you run into this problem before, or am I doing something wrong? I would like to activate dask on my local machine (macOS) right now and don't have a PBS system.
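
For reference, the ValueError just means LSFCluster found no cores setting, either as a keyword argument or in the dask config. Besides passing cores= explicitly, the per-job defaults can be supplied through the config; a minimal sketch with illustrative values, assuming the standard jobqueue.lsf.* keys that dask-jobqueue reads:

import dask

# provide the per-job defaults that LSFCluster would otherwise
# require as keyword arguments
dask.config.set({'jobqueue.lsf.cores': 1,
                 'jobqueue.lsf.memory': '5 GB'})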

falkamelung commented 5 years ago

I never tried it on a Mac. The test data that works for me with minsar on pegasus is the file below. I never used the numWorker option, as a good value is hardwired somewhere in the defaults. If it does not work for you, I could run it using your/my or somebody else's account.

https://github.com/geodesymiami/rsmas_insar/blob/master/samples/unittestGalapagosSenDT128.template

Ovec8hkin commented 4 years ago

@ehavazli This is old now, but I have been working on Dask for a bit and have been updating MintPy to be more flexible in how it performs its parallel computations. Are you still encountering this problem? I can look into how to make MintPy run in parallel locally, as that would be a useful default option for a lot of people who don't have access to an HPC cluster.

yunjunz commented 4 years ago

Hi @Ovec8hkin, NumPy and SciPy with the OpenBLAS binding (installed by default via conda or MacPorts) can usually take advantage of multiple cores on a local laptop. In my case, ifgram_inversion.py uses ~400% CPU. Therefore, I don't see a significant gain from parallelizing locally and believe this should be a low priority for now.
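
As a quick way to verify that on a given machine, the threadpoolctl package (an extra dependency assumed here, not something MintPy requires) can report how many threads NumPy's BLAS will use:

import numpy as np  # importing numpy loads its BLAS, so it shows up below
from threadpoolctl import threadpool_info

# list the native thread pools (OpenBLAS, MKL, OpenMP) linked into the process
for pool in threadpool_info():
    print(pool['internal_api'], 'threads:', pool['num_threads'])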

falkamelung commented 4 years ago

This is something we should look at. A regular mintpy run on Stampede uses only one core. But maybe, with different conda install options, we could make it use multiple cores? This would accelerate other portions of the workflow where dask has not yet been implemented (topographic residual estimation).

yunjunz commented 4 years ago

Hi @falkamelung, to follow up on the main topic of this issue, is it still current?

falkamelung commented 4 years ago

Part of this is discussed in a new issue.