ptycho / ptypy

Ptypy - main repository

Scaling up #107

Open cmkewish opened 6 years ago

cmkewish commented 6 years ago

I'm looking for suggestions on scaling up ptypy reconstructions to large data sets. I am working with data sets on the order of 100,000 diffraction patterns, which is becoming unwieldy with my current reconstruction setup.

I currently have access to a server with 48 cores, 512 GB RAM, which is shared for offline reconstruction among beamline users.

When I process, for example, a dataset with ~23,000 diffraction patterns and a 256x256 ROI, each DM iteration takes a few seconds, which is reasonable. But scaling this up to 10 probe modes really slows things down: it takes ~11 hrs to create PODs before the first iteration begins, and then at least 10 minutes per DM iteration.

I plan to upgrade this computing setup in a few months anyway, because when I try to load a .ptyd file with 160,000 diffraction patterns, 512 GB RAM is insufficient despite the full dataset only being 80 GB prior to ROI extraction.

Do I simply need more CPUs per server, or should I replicate the setup across N servers? In other words, does anyone have a feel for where communication overheads start to negatively impact parallelization when using multiple nodes? I have previously benchmarked ptypy on a cluster and found that about 80 CPUs was OK. Alternatively, one could break the data up into smaller blocks for sequential reconstruction, but that probably doesn't save much time in the long run.

What are your feelings about scaling this up?

Cheers, Cameron


Below is an excerpt from the output when I run a reconstruction of the above dataset on 24 cores:

(ptycho_27) [user@asci analysisSXDM]$ time nice -n 19 mpiexec -n 24 python python/ptypy_153435_26_256_53_to_23280.py 
ASCI compute node: appending python code paths
Verbosity set to 3
Data type:               single

---- Ptycho init level 1 -------------------------------------------------------
Model: sharing probe between scans (one new probe every 1 scan)
Model: sharing object between scans (one new object every 1 scan)

---- Ptycho init level 2 -------------------------------------------------------
No explicit source file was given. Will continue read only.
Processing new data.

---- Enter PtyScan.initialize() ------------------------------------------------
Common weight : False
All experimental positions : False
No scanning position have been provided at this stage.

---- Leaving PtyScan.initialize() ----------------------------------------------
/user/home/.local/lib/python2.7/site-packages/ptypy/core/data.py:629: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if str(cen) == cen:

Many repetitions of the above warning, for comparisons str(cen)==cen, str(p.center)==p.center, center=='fftshift','geometric','fft' ...

ROI center is [127 127], automatic guess is [ 127.13827543  127.95953334].
End of scan reached
End of scan reached
--- Scan scan00 photon report ---
Total photons   : 1.71e+11 
Average photons : 7.37e+06
Maximum photons : 7.93e+06
-----------------------------

---- Creating PODS -------------------------------------------------------------
Found these probes : 
Found these objects: 

[11 hours elapsed time...]

Process 19 created 232290 new PODs, 1 new probes and 1 new objects.
Process 16 created 232290 new PODs, 1 new probes and 1 new objects.
Process 6 created 232290 new PODs, 1 new probes and 1 new objects.
Process 18 created 232290 new PODs, 1 new probes and 1 new objects.
Process 1 created 232290 new PODs, 1 new probes and 1 new objects.
Process 2 created 232290 new PODs, 1 new probes and 1 new objects.
Process 14 created 232290 new PODs, 1 new probes and 1 new objects.
Process 11 created 232290 new PODs, 1 new probes and 1 new objects.
Process 23 created 232290 new PODs, 1 new probes and 1 new objects.
Process 17 created 232290 new PODs, 1 new probes and 1 new objects.
Process 22 created 232290 new PODs, 1 new probes and 1 new objects.
Process 4 created 232290 new PODs, 1 new probes and 1 new objects.
Process 5 created 232290 new PODs, 1 new probes and 1 new objects.
Process 9 created 232290 new PODs, 1 new probes and 1 new objects.
Process 20 created 232290 new PODs, 1 new probes and 1 new objects.
Process 10 created 232290 new PODs, 1 new probes and 1 new objects.
Process 21 created 232290 new PODs, 1 new probes and 1 new objects.
Process 13 created 232290 new PODs, 1 new probes and 1 new objects.
Process 8 created 232290 new PODs, 1 new probes and 1 new objects.
Process 3 created 232290 new PODs, 1 new probes and 1 new objects.
Process 7 created 232290 new PODs, 1 new probes and 1 new objects.
Process 15 created 232290 new PODs, 1 new probes and 1 new objects.
Process 12 created 232290 new PODs, 1 new probes and 1 new objects.
Process 0 created 232290 new PODs, 1 new probes and 1 new objects.

---- Probe initialization ------------------------------------------------------
Initializing probe storage S00G00 using scan scan00.
Found no photon count for probe in parameters.
Using photon count 7.93e+06 from photon report.

---- Object initialization -----------------------------------------------------
Initializing object storage S00G00 using scan scan00.

---- Creating exit waves -------------------------------------------------------

Process #1 ---- Total Pods 232290 (9670 active) ----
--------------------------------------------------------------------------------
(C)ontnr : Memory : Shape            : Pixel size      : Dimensions      : Views
(S)torgs : (MB)   : (Pixel)          : (meters)        : (meters)        : act. 
--------------------------------------------------------------------------------
Cprobe   :    5.2 : complex64
S00G00   :    5.2 :   10 * 256 * 256 : 2.63 * 2.63e-08 :   6.73*6.73e-06 : 32290
Cmask    :   63.4 :   bool
S0000    :   63.4 :  967 * 256 * 256 : 5.40 * 5.40e-05 :   1.38*1.38e-02 :   967
Cexit    : 5069.9 : complex64
S0000G00 : 5069.9 : 9670 * 256 * 256 : 2.63 * 2.63e-08 :   6.73*6.73e-06 :  9670
Cobj     :    5.6 : complex64
S00G00   :    5.6 :    1 * 827 * 849 : 2.63 * 2.63e-08 :   2.17*2.23e-05 : 32290
Cdiff    :  253.5 : float32
S0000    :  253.5 :  967 * 256 * 256 : 5.40 * 5.40e-05 :   1.38*1.38e-02 :   967

[23 similar Process reports...]

---- Ptycho init level 4 -------------------------------------------------------

==== Starting DM-algorithm. ====================================================

Parameter set:
* id3V712V1388           : ptypy.utils.parameters.Param(19)
  * average_interval     : 5
  * overlap_max_itera... : 100
  * fourier_relax_factor : 0.1
  * name                 : DM
  * average_start        : 400
  * overlap_converge_... : 0.05
  * probe_update_start   : 1
  * probe_center_tol     : None
  * numiter              : 500
  * obj_smooth_std       : None
  * update_object_first  : True
  * probe_inertia        : 0.01
  * alpha                : 1
  * numiter_contiguous   : 1
  * clip_object          : None
  * object_inertia       : 0.01
  * subpix_start         : 0
  * probe_support        : 1.0
  * subpix               : linear
================================================================================
Problem sending object {'status': <type 'exceptions.IndexError'>, 'ticket': 2, 'out': None}
/data/xfm/staff/data/2017r2/Paterson_12206f/analysisSXDM/siemens/153435_26/plots/ptypy_153435_26_256_53_to_23280/ptypy_153435_26_256_53_to_23280_None_0000.png
/user/home/.local/lib/python2.7/site-packages/ptypy/utils/plot_client.py:415: MatplotlibDeprecationWarning: The hold function was deprecated in version 2.0.
  plot_fig.hold(False)
/user/home/.local/lib/python2.7/site-packages/ptypy/utils/plot_client.py:418: MatplotlibDeprecationWarning: axes.hold is deprecated.
    See the API Changes document (http://matplotlib.org/api/api_changes.html) 
    for more details.
  pl.hold(False)

Many repetitions of above warnings.

/user/home/.conda/envs/ptycho_27/lib/python2.7/site-packages/matplotlib/backend_bases.py:2453: MatplotlibDeprecationWarning: Using default event loop until function specific to this GUI is implemented
  warnings.warn(str, mplDeprecation)
Time spent in Fourier update: 259.13
Time spent in Overlap update: 46.23
Iteration #1 of DM :: Time 306.08
Errors :: Fourier 2.31e+02, Photons 9.26e+01, Exit 6.11e+01
/data/xfm/staff/data/2017r2/Paterson_12206f/analysisSXDM/siemens/153435_26/plots/ptypy_153435_26_256_53_to_23280/ptypy_153435_26_256_53_to_23280_DM_0001.png
Time spent in Fourier update: 253.18
Time spent in Overlap update: 166.80
Iteration #2 of DM :: Time 420.48
Errors :: Fourier 9.41e+01, Photons 9.06e+01, Exit 1.17e+02
/data/xfm/staff/data/2017r2/Paterson_12206f/analysisSXDM/siemens/153435_26/plots/ptypy_153435_26_256_53_to_23280/ptypy_153435_26_256_53_to_23280_DM_0002.png
Time spent in Fourier update: 255.02
Time spent in Overlap update: 161.95
Iteration #3 of DM :: Time 417.38
[...]
alexbjorling commented 6 years ago

Thanks Cameron for bringing this up; I have no answers, but I'm also interested.

I've done some highly non-systematic tests on a much smaller dataset (1000 diffraction patterns, each 128x128) on our cluster, which has 20 cores and 128 GB RAM per node, with InfiniBand interconnect. For the first 40 cores, the speed scales reasonably close to linearly with the number of processes (green and gray curves). I'm spreading the processes over all nodes here (-map_by node in mpiexec), and interestingly, spreading 40 processes over 8 nodes (gray curve) is faster than saturating two nodes (green curve). To me that's an indication that network communication is not the bottleneck.

Then adding more processes across the 8 nodes (bottom panel) eventually slows the whole thing down, but this is presumably because each process has too few pods to work with (fewer than 10 pods per process at the performance maximum), not a problem that we're likely to see with bigger datasets.

As I said I'm interested in this for Nanomax's sake too, so if useful I could run tests on our cluster here.

bjoernenders commented 6 years ago

Here is a plot that shows how much time it takes to create 100 views, with the x-axis giving the number of views already present. I was following the instructions in the tutorial about Views and Containers for this, but just created 100k views in 100 blocks.

[plot: time_per_view]

The reason for this appears to be a slow lookup in long Python lists. The Base class for all ptypy objects has a lookup built in to check whether IDs in the pool have been claimed already. I had cached the pool keys in a list in an attempt to make it faster (because it was even slower before). That wasn't super clever, as I was basically bypassing the native (fast) lookup. 30d21cd should fix that. Please tell me if it helped. The plot looks promising.

[plot: time_per_view_after_fix]

The spikes maybe correspond to heap allocation in Python. Just a guess. Could also be other hiccups on my PC.
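
As a side note, the effect itself is easy to reproduce outside ptypy. The sketch below (plain Python, not ptypy code) just shows that membership tests against a growing list cost O(n) per check while a set stays roughly O(1), which is why caching the pool keys in a list made the per-view cost grow with the number of existing views:

import time

def time_id_checks(pool, add, n=20000):
    # Simulate claiming n IDs, checking each against the pool first
    # (roughly what the "has this ID been claimed?" lookup in Base does).
    t0 = time.time()
    for i in range(n):
        if i not in pool:
            add(i)
    return time.time() - t0

ids_list = []
ids_set = set()
print('list pool: %.2f s' % time_id_checks(ids_list, ids_list.append))
print('set pool : %.2f s' % time_id_checks(ids_set, ids_set.add))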

bjoernenders commented 6 years ago

@alexbjorling Regarding the maxing-out of cores: does your cluster use hyperthreading? Our HPC services have identified that as a performance issue here, which is why all Intel CPUs on our cluster have the additional thread per core deactivated.

alexbjorling commented 6 years ago

Yes, the cluster uses hyperthreading (because some I/O-heavy people insisted on it), but I'm launching mpiexec with one process per physical core, so I think that should be OK.
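
(If useful for others choosing the mpiexec process count: one quick way to check physical vs. logical core counts is via psutil; that's just a convenience package I'm assuming is installed, not something ptypy requires.)

import psutil

physical = psutil.cpu_count(logical=False)  # real cores only
logical = psutil.cpu_count(logical=True)    # includes hyperthreads
print('physical cores per node:', physical)
print('logical CPUs per node  :', logical)
# With hyperthreading on, launch at most `physical` MPI processes per node.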

Like I said, I'd be happy to benchmark bigger reconstructions if useful.

bjoernenders commented 6 years ago

@cmkewish Regarding memory: since it is DM and not ePIE or ML, the biggest memory consumer is the exit wave buffer (as you can see in the reports above). There is a factor of 4 for the data type, and then it scales with the number of modes. If you have 160k 128x128 patterns, that is roughly 5.3 GB of raw data; the data type and 10 modes together add a factor of 40, so you need about 200 GB, which fits. I agree that 256x256 cropping would then be problematic with a 512 GB setup, as it introduces another factor of 4. You could consider splitting the data into 4 sets, or wait until issue #116 comes to a conclusion.
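
For anyone wanting to redo that arithmetic for their own setup, here is the back-of-the-envelope version (a sketch assuming uint16 raw frames, which is where the ~5.3 GB figure comes from, and complex64 exit waves as in the reports above):

n_patterns = 160000          # diffraction patterns
frame = 128 * 128            # pixels per pattern before cropping to 256x256
modes = 10                   # probe modes

raw_gb = n_patterns * frame * 2 / 1e9            # uint16 raw data  -> ~5.2 GB
exit_gb = n_patterns * frame * modes * 8 / 1e9   # complex64 exit waves -> ~210 GB
cropped_gb = exit_gb * 4                         # 256x256 frames -> ~840 GB

print('raw data         : %.0f GB' % raw_gb)
print('exit waves       : %.0f GB' % exit_gb)
print('with 256x256 ROI : %.0f GB' % cropped_gb)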

alexbjorling commented 6 years ago

@bjoernenders amazing improvement on the View creation!

alexbjorling commented 6 years ago

With regard to memory, I know I found some memory waste in #60; I can't remember the details now, but I wrote at the time that it was a small waste. How does that work, @bjoernenders?

bjoernenders commented 6 years ago

@alexbjorling Thanks! I wonder if it makes a difference for the larger datasets, or if the culprit hides somewhere else now. Thanks to @stefanoschalkidis for the thorough class test; that made me confident the fix isn't doing something wrong. Praise the unit tests. Regarding #60: I don't think the object copies matter much compared to the exit waves in DM. But I agree, I don't see it used. Feel free to remove it, I would say.

bjoernenders commented 6 years ago

@alexbjorling If you are running one process per physical core, then the only remaining reason for a slowdown could be the operating system itself. I usually keep 2 cores in reserve for the OS, so if a node has 16 physical cores I allow a maximum of 14 processes.

bjoernenders commented 6 years ago

I would also be interested in seeing benchmarks for bigger datasets.

alexbjorling commented 6 years ago

@bjoernenders would you share your timing script with me? I'm about to request merging of the 2d/3d storages, but I should test the performance cost first.

bjoernenders commented 6 years ago

It's not much of a script worth sharing.

from ptypy.core.classes import View, Container, DEFAULT_ACCESSRULE
C = Container()
S1 = C.new_storage(shape=(1,7,7)) 
ar = DEFAULT_ACCESSRULE.copy()
ar.shape = (4,4); ar.storageID = S1.ID

and then just a function that creates 100 Views

def make(num):
    # create `num` new Views on container C using access rule ar
    for i in range(num):
        View(C, None, ar)

and then just running make(100) a thousand times and taking time.time() before and afterwards.
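
Put together, a minimal version of that timing loop might look like this (a sketch of the procedure just described, not the exact script):

import time
from ptypy.core.classes import View, Container, DEFAULT_ACCESSRULE

C = Container()
S1 = C.new_storage(shape=(1, 7, 7))
ar = DEFAULT_ACCESSRULE.copy()
ar.shape = (4, 4)
ar.storageID = S1.ID

def make(num):
    # create `num` new Views on container C using access rule ar
    for i in range(num):
        View(C, None, ar)

timings = []
for block in range(1000):        # 1000 blocks of 100 views = 100k views total
    t0 = time.time()
    make(100)
    timings.append(time.time() - t0)
# timings[k] is the cost of creating views 100*k .. 100*k+99, i.e. the
# per-block creation time as a function of views already present.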

bjoernenders commented 1 year ago

I'll keep this in mind for the next release that focuses on scaling.