Open cmkewish opened 6 years ago
Thanks, Cameron, for bringing this up. I have no answers, but I'm also interested.
I've done some highly non-systematic tests on a much smaller dataset (1000 diffraction patterns, each 128x128) on our cluster, which has 20 cores and 128 GB RAM per node with InfiniBand. For the first 40 cores the speed is reasonably close to linear in the number of processes (green and gray curves). I'm spreading the processes over all nodes here (`--map-by node` in mpiexec), and interestingly, spreading 40 processes over 8 nodes (gray curve) is faster than saturating two nodes (green curve). To me that's an indication that network communication is not the bottleneck.
Then adding more processes across the 8 nodes (bottom panel) eventually slows the whole thing down, but this is presumably because each process has too few PODs to work with (fewer than 10 PODs per process at the performance maximum), not a problem that we're likely to see with bigger datasets.
As I said I'm interested in this for Nanomax's sake too, so if useful I could run tests on our cluster here.
Here is a plot that shows how much time it takes to create 100 views, with the x-axis showing the number of views already present. I followed the instructions in the tutorial on Views and Containers for this, but created 100k views in 100 blocks. The cause appears to be a slow lookup in long Python lists: the Base class for all ptypy objects has a built-in lookup to check whether IDs in the pool have already been claimed. I had cached the pool keys in a list in an attempt to make it faster (because it was even slower before). That wasn't very clever, as I was basically bypassing the native (fast) lookup. 30d21cd should fix that. Please tell me if it helped; the plot looks promising. The spikes may correspond to heap allocation in Python, but that's just a guess. They could also be other hiccups on my PC.
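The list-vs-native-lookup difference is easy to reproduce outside ptypy. This toy sketch (the pool names are made up for illustration, not ptypy internals) shows why membership tests against a long list of IDs get slower as the pool grows, while a hash-based set stays flat:

```python
import timeit

# Simulated ID pool: a membership test on a list scans every element
# (O(n)), while a set does a hash lookup (O(1) on average).
pool_list = [f"V{i:06d}" for i in range(100_000)]
pool_set = set(pool_list)

probe = "V099999"  # worst case for the list: the last element

t_list = timeit.timeit(lambda: probe in pool_list, number=1000)
t_set = timeit.timeit(lambda: probe in pool_set, number=1000)

print(f"list lookup: {t_list:.4f}s, set lookup: {t_set:.4f}s")
```

With 100k IDs the set lookup is typically several orders of magnitude faster, which is consistent with view creation slowing down linearly in the number of existing views when a list is scanned on every creation.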
@alexbjorling Regarding the maxing-out of cores: does your cluster use hyperthreading? Our HPC service has identified that as a performance issue, which is why all Intel CPUs on our cluster have the additional thread per core deactivated.
Yes, the cluster uses hyperthreading (because some I/O-heavy users insisted on it), but I'm launching mpiexec with one process per physical core, so I think that should be OK.
Like I said, I'd be happy to benchmark bigger reconstructions if useful.
@cmkewish Regarding memory: since this is DM and not EPIE or ML, the biggest memory consumer is the exit wave buffer (as you can see in the reports above). There is a factor of 4 for the data type, and then it scales with the number of modes. If you have 160k 128x128 patterns, that is roughly 5.3 GB of raw data; 10 modes add a factor of 40, so you need about 200 GB, which fits. I agree that 256x256 cropping would then be problematic on a 512 GB setup, as it introduces another factor of 4. You could consider splitting the data into 4 sets, or wait until issue #116 comes to a conclusion.
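The figures above can be checked with back-of-envelope arithmetic. This is a rough sketch, assuming 16-bit raw detector pixels and complex64 (8-byte) exit waves, which is where the "factor of 4" over the raw data type comes from; the byte sizes are my assumptions, not quoted from ptypy:

```python
# Rough memory estimate for the DM exit wave buffer.
n_patterns = 160_000
shape = (128, 128)
raw_bytes = 2    # assumed: uint16 detector data
exit_bytes = 8   # assumed: complex64 exit waves (factor of 4 over raw)
modes = 10

pixels = n_patterns * shape[0] * shape[1]
raw_gb = pixels * raw_bytes / 1e9
exit_gb = pixels * exit_bytes * modes / 1e9  # factor of 40 total

print(f"raw data:   {raw_gb:.1f} GB")
print(f"exit waves: {exit_gb:.0f} GB")
```

This reproduces the ~5.3 GB raw / ~200 GB exit wave numbers, and doubling the ROI to 256x256 multiplies both by 4.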
@bjoernenders amazing improvement on the View creation!
With regard to memory, I know I found some memory waste in #60; I can't remember the details now, but I wrote that it's a small waste. How does that work, @bjoernenders?
@alexbjorling Thanks! I wonder if it makes a difference for the larger datasets, or if the culprit hides somewhere else now. Thanks to @stefanoschalkidis for the thorough class test, which made me confident the fix isn't doing something wrong. Praise the unittests. Yeah, regarding #60... I don't think the object copies matter much compared to the exit waves in DM. But I agree, I don't see it used. Feel free to remove it, I would say.
@alexbjorling If you are running one process per core, then the only remaining cause of the slowdown could be the operating system itself. I usually keep 2 cores in reserve for the OS, so if a node has 16 physical cores I allow a maximum of 14 processes.
I would also be interested in seeing benchmarks for bigger datasets.
@bjoernenders would you share your timing script with me? I'm about to request merging of the 2d/3d storages, but I should test the performance cost first.
It's not much of a script worth sharing. The gist is:

```python
import time

from ptypy.core.classes import View, Container, DEFAULT_ACCESSRULE

C = Container()
S1 = C.new_storage(shape=(1, 7, 7))
ar = DEFAULT_ACCESSRULE.copy()
ar.shape = (4, 4)
ar.storageID = S1.ID

def make(num):
    """Create `num` views on the storage."""
    for i in range(num):
        View(C, None, ar)

# run make(100) a thousand times, taking time.time()
# before and after each block
for block in range(1000):
    t0 = time.time()
    make(100)
    print(block * 100, time.time() - t0)
```
I'll keep this in mind for the next release, which focuses on scaling.
I'm looking for suggestions on scaling up ptypy reconstructions to large data sets. I am working on data sets of the order of 100,000 diffraction patterns which begins to become unwieldy with my current reconstruction setup.
I currently have access to a server with 48 cores, 512 GB RAM, which is shared for offline reconstruction among beamline users.
When I process, for example, a dataset with ~23,000 diffraction patterns and a 256x256 ROI, each DM iteration takes a few seconds, which is reasonable. But scaling this up to 10 probe modes really slows things down: it takes ~11 hrs to create PODs before the first iteration begins, and then at least 10 minutes per DM iteration.
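For scale, here is a rough sketch of why adding modes multiplies the POD-creation work (assuming one POD per diffraction pattern per probe/object mode combination; the exact bookkeeping inside ptypy may differ):

```python
# Hypothetical POD count: one POD per (pattern, probe mode, object mode).
patterns = 23_000
probe_modes = 10
object_modes = 1  # assumed single object mode

pods = patterns * probe_modes * object_modes
print(pods)  # 230000 PODs to create before the first iteration
```

So going from 1 to 10 probe modes means roughly 10x the PODs to set up, which is consistent with POD creation dominating the startup time.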
I plan to upgrade this computing setup in a few months anyway, because when I try to load a `.ptyd` file with 160,000 diffraction patterns, 512 GB RAM is insufficient, despite the full dataset only being 80 GB prior to ROI extraction. Do I simply need more CPUs per server, or should I replicate the setup to N servers? I.e., does anyone have a feel for where the communication overheads start to negatively impact parallelization when using multiple nodes? Previously I benchmarked ptypy on a cluster and found that about 80 CPUs was OK. Alternatively, one could break the data up into smaller blocks for sequential reconstruction, but this probably doesn't save a lot of time in the long run.
What are your feelings about scaling this up?
Cheers, Cameron
Below is an excerpt from the output when I run a reconstruction of the above dataset on 24 cores:
Many repetitions of the above warning, for the comparisons `str(cen)==cen`, `str(p.center)==p.center`, `center=='fftshift'`, `'geometric'`, `'fft'`, ...
[11 hours elapsed time...]
[23 similar Process reports...]
Many repetitions of above warnings.