Open cmkewish opened 6 years ago
Thanks, Cameron, for bringing this up. I have no answers, but I'm also interested.
I've done some highly non-systematic tests on a much smaller dataset (1000 diffraction patterns, each 128x128) on our cluster, which has 20 cores and 128 GB RAM per node with InfiniBand. For the first 40 cores the speed is reasonably close to linear in the number of processes (green and gray curves). I'm spreading the processes over all nodes here (`--map-by node` in mpiexec), and interestingly, spreading 40 processes over 8 nodes (gray curve) is faster than saturating two nodes (green curve). To me that's an indication that network communication is not the bottleneck.
Then adding more processes across the 8 nodes (bottom panel) eventually slows the whole thing down, but this is presumably because each process has too few PODs to work with (fewer than 10 PODs per process at the performance maximum), not a problem that we're likely to see with bigger datasets.
As I said I'm interested in this for Nanomax's sake too, so if useful I could run tests on our cluster here.
Here is a plot that shows how much time it takes to create 100 views, with the x-axis showing the number of views already present. I followed the instructions in the tutorial on Views and Containers for this, but created 100k views in 100 blocks. The cause appears to be a slow lookup in long Python lists: the Base class for all ptypy objects has a built-in lookup to check whether IDs in the pool have already been claimed. I had cached the pool keys in a list in an attempt to make it faster (because it was even slower before). That wasn't very clever, as I was basically bypassing the native (fast) lookup. 30d21cd should fix that. Please tell me if it helped; the plot looks promising. The spikes may correspond to heap allocation in Python, but that's just a guess. They could also be other hiccups on my PC.
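The list-vs-native-lookup difference is easy to reproduce outside ptypy. This toy sketch (the pool names are made up for illustration, not ptypy internals) shows why membership tests against a long list of IDs get slower as the pool grows, while a hash-based set stays flat:

```python
import timeit

# Simulated ID pool: a membership test on a list scans every element
# (O(n)), while a set does a hash lookup (O(1) on average).
pool_list = [f"V{i:06d}" for i in range(100_000)]
pool_set = set(pool_list)

probe = "V099999"  # worst case for the list: the last element

t_list = timeit.timeit(lambda: probe in pool_list, number=1000)
t_set = timeit.timeit(lambda: probe in pool_set, number=1000)

print(f"list lookup: {t_list:.4f}s, set lookup: {t_set:.4f}s")
```

With 100k IDs the set lookup is typically several orders of magnitude faster, which is consistent with view creation slowing down linearly in the number of existing views when a list is scanned on every creation.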
@alexbjorling Regarding the maxing-out of cores: does your cluster use hyperthreading? Our HPC service has identified that as a performance issue, which is why all Intel CPUs on our cluster have the additional thread per core deactivated.
Yes, the cluster uses hyperthreading (because some I/O-heavy users insisted on it), but I'm launching mpiexec with one process per physical core, so I think that should be OK.
Like I said, I'd be happy to benchmark bigger reconstructions if useful.
@cmkewish Regarding memory: since this is DM and not EPIE or ML, the biggest memory consumer is the exit wave buffer (as you can see in the reports above). There is a factor of 4 for the data type, and then it scales with the number of modes. If you have 160k 128x128 patterns, that is roughly 5.3 GB of raw data; 10 modes add a factor of 40, so you need about 200 GB, which fits. I agree that 256x256 cropping would then be problematic on a 512 GB setup, as it introduces another factor of 4. You could consider splitting the data into 4 sets, or wait until issue #116 comes to a conclusion.
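The figures above can be checked with back-of-envelope arithmetic. This is a rough sketch, assuming 16-bit raw detector pixels and complex64 (8-byte) exit waves, which is where the "factor of 4" over the raw data type comes from; the byte sizes are my assumptions, not quoted from ptypy:

```python
# Rough memory estimate for the DM exit wave buffer.
n_patterns = 160_000
shape = (128, 128)
raw_bytes = 2    # assumed: uint16 detector data
exit_bytes = 8   # assumed: complex64 exit waves (factor of 4 over raw)
modes = 10

pixels = n_patterns * shape[0] * shape[1]
raw_gb = pixels * raw_bytes / 1e9
exit_gb = pixels * exit_bytes * modes / 1e9  # factor of 40 total

print(f"raw data:   {raw_gb:.1f} GB")
print(f"exit waves: {exit_gb:.0f} GB")
```

This reproduces the ~5.3 GB raw / ~200 GB exit wave numbers, and doubling the ROI to 256x256 multiplies both by 4.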
@bjoernenders amazing improvement on the View creation!
With regard to memory, I know I found some memory waste in #60; I can't remember the details now, but I wrote that it's a small waste. How does that work, @bjoernenders?
@alexbjorling Thanks! I wonder if it makes a difference for the larger datasets, or if the culprit hides somewhere else now. Thanks to @stefanoschalkidis for the thorough class test, which made me confident the fix isn't doing something wrong. Praise the unittests. Yeah, regarding #60... I don't think the object copies matter much compared to the exit waves in DM. But I agree, I don't see it used. Feel free to remove it, I would say.
@alexbjorling If you are running one process per core, then the only remaining cause of the slowdown could be the operating system itself. I usually keep 2 cores in reserve for the OS, so if a node has 16 physical cores I allow a maximum of 14 processes.
I would also be interested in seeing benchmarks for bigger datasets.
@bjoernenders would you share your timing script with me? I'm about to request merging of the 2d/3d storages, but I should test the performance cost first.
It's not much of a script worth sharing. The gist is:

```python
import time

from ptypy.core.classes import View, Container, DEFAULT_ACCESSRULE

C = Container()
S1 = C.new_storage(shape=(1, 7, 7))
ar = DEFAULT_ACCESSRULE.copy()
ar.shape = (4, 4)
ar.storageID = S1.ID

def make(num):
    """Create `num` views on the storage."""
    for i in range(num):
        View(C, None, ar)

# run make(100) a thousand times, taking time.time()
# before and after each block
for block in range(1000):
    t0 = time.time()
    make(100)
    print(block * 100, time.time() - t0)
```
I'll keep this in mind for the next release, which focuses on scaling.
I'm looking for suggestions on scaling up ptypy reconstructions to large data sets. I am working on data sets of the order of 100,000 diffraction patterns which begins to become unwieldy with my current reconstruction setup.
I currently have access to a server with 48 cores, 512 GB RAM, which is shared for offline reconstruction among beamline users.
When I process, for example, a dataset with ~23,000 diffraction patterns and a 256x256 ROI, each DM iteration takes a few seconds, which is reasonable. But scaling this up to 10 probe modes really slows things down: it takes ~11 hrs to create PODs before the first iteration begins, and then at least 10 minutes per DM iteration.
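For scale, here is a rough sketch of why adding modes multiplies the POD-creation work (assuming one POD per diffraction pattern per probe/object mode combination; the exact bookkeeping inside ptypy may differ):

```python
# Hypothetical POD count: one POD per (pattern, probe mode, object mode).
patterns = 23_000
probe_modes = 10
object_modes = 1  # assumed single object mode

pods = patterns * probe_modes * object_modes
print(pods)  # 230000 PODs to create before the first iteration
```

So going from 1 to 10 probe modes means roughly 10x the PODs to set up, which is consistent with POD creation dominating the startup time.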
I plan to upgrade this computing setup in a few months anyway, because when I try to load a `.ptyd` file with 160,000 diffraction patterns, 512 GB RAM is insufficient, despite the full dataset only being 80 GB prior to ROI extraction. Do I simply need more CPUs per server, or should I replicate the setup to N servers? I.e., does anyone have a feel for where the communication overheads start to negatively impact parallelization when using multiple nodes? Previously I benchmarked ptypy on a cluster and found that about 80 CPUs was OK. Alternatively, one could break the data up into smaller blocks for sequential reconstruction, but this probably doesn't save a lot of time in the long run.
What are your feelings about scaling this up?
Cheers, Cameron
Below is an excerpt from the output when I run a reconstruction of the above dataset on 24 cores:
Many repetitions of the above warning, for the comparisons `str(cen)==cen`, `str(p.center)==p.center`, `center=='fftshift'`, `'geometric'`, `'fft'`, ...
[11 hours elapsed time...]
[23 similar Process reports...]
Many repetitions of above warnings.