Profiling on OpenMP Compute Canada cluster

jcohenadad commented 4 years ago

The OpenMP method (vs. MPI) restricts the number of jobs per node. Currently, the maximum is 32 jobs, so at most 32 subjects can be processed in parallel. It would be nice to split the subjects into chunks of 32 to speed up the processing.
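As a rough illustration of the chunking idea, here is a minimal sketch that splits a subject list into groups of at most 32. The function name `chunk_subjects` and the subject IDs are hypothetical, and how each chunk would actually be submitted to the cluster is left open.

```python
from typing import Iterator, List, Sequence

# Per-node limit reported above; everything else here is an assumption.
MAX_PARALLEL_JOBS = 32


def chunk_subjects(subjects: Sequence[str],
                   size: int = MAX_PARALLEL_JOBS) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` subjects."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(subjects), size):
        yield list(subjects[start:start + size])


if __name__ == "__main__":
    # Hypothetical subject IDs, for illustration only.
    subjects = [f"sub-{i:03d}" for i in range(1, 261)]
    chunks = list(chunk_subjects(subjects))
    print(len(chunks), "chunks; last chunk has", len(chunks[-1]), "subjects")
```

With 260 subjects this gives 9 chunks, the last one holding 4 subjects. Each chunk could then be handed to its own job, assuming the scheduler allows it.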

For the record, on graham, the following configuration took 7h36min to complete:

number of subjects: 260
number of iterations: 5
number of rescalings: 4

The computation time per subject varies between ~1h and ~2.5h.

Here is the batch log: log_results_csa_t1_20200822.zip

jcohenadad commented 4 years ago

Looking at the output timestamps for one subject, the heavy computation time does not seem to be related to the segmentation:

-rw-r----- 1 jcohen jcohen   9528 Aug 23 19:42 sub-unf06_T1w_RPI_r_crop_r1_t5_seg.nii.gz
-rw-r----- 1 jcohen jcohen 473647 Aug 23 19:42 sub-unf06_T1w_RPI_r_crop_r1_seg_labeled_t5.nii.gz
-rw-r----- 1 jcohen jcohen 857858 Aug 23 19:42 sub-unf06_T1w_RPI_r_crop_r1_t5.nii.gz
-rw-r----- 1 jcohen jcohen   8719 Aug 23 19:34 sub-unf06_T1w_RPI_r_crop_r1_t4_seg.nii.gz
-rw-r----- 1 jcohen jcohen 445661 Aug 23 19:34 sub-unf06_T1w_RPI_r_crop_r1_seg_labeled_t4.nii.gz
-rw-r----- 1 jcohen jcohen 851394 Aug 23 19:34 sub-unf06_T1w_RPI_r_crop_r1_t4.nii.gz
-rw-r----- 1 jcohen jcohen   9415 Aug 23 19:26 sub-unf06_T1w_RPI_r_crop_r1_t3_seg.nii.gz
-rw-r----- 1 jcohen jcohen 454807 Aug 23 19:26 sub-unf06_T1w_RPI_r_crop_r1_seg_labeled_t3.nii.gz
-rw-r----- 1 jcohen jcohen 850937 Aug 23 19:26 sub-unf06_T1w_RPI_r_crop_r1_t3.nii.gz
-rw-r----- 1 jcohen jcohen   9339 Aug 23 19:23 sub-unf06_T1w_RPI_r_crop_r1_t2_seg.nii.gz
-rw-r----- 1 jcohen jcohen 483162 Aug 23 19:23 sub-unf06_T1w_RPI_r_crop_r1_seg_labeled_t2.nii.gz
-rw-r----- 1 jcohen jcohen 854854 Aug 23 19:23 sub-unf06_T1w_RPI_r_crop_r1_t2.nii.gz
-rw-r----- 1 jcohen jcohen   7398 Aug 23 19:19 sub-unf06_T1w_RPI_r_crop_r1_t1_seg.nii.gz
-rw-r----- 1 jcohen jcohen 384677 Aug 23 19:19 sub-unf06_T1w_RPI_r_crop_r1_seg_labeled_t1.nii.gz
-rw-r----- 1 jcohen jcohen 815639 Aug 23 19:19 sub-unf06_T1w_RPI_r_crop_r1_t1.nii.gz
-rw-r----- 1 jcohen jcohen 566085 Aug 23 19:19 sub-unf06_T1w_RPI_r_crop_r1_seg_labeled.nii.gz
-rw-r----- 1 jcohen jcohen 647170 Aug 23 19:19 sub-unf06_T1w_RPI_r_crop_r1.nii.gz

@PaulBautin any idea where those several minutes (4-9 min depending on the iteration) could come from?
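To make the gaps between iterations explicit, here is a small sketch that computes them from the modification times in the listing above. The dictionary is transcribed by hand from that listing, and `gaps_in_minutes` is a hypothetical helper. Since `ls -l` only shows minutes, each gap is accurate to about one minute.

```python
from datetime import datetime

# Modification times of the *_r1_t<N>.nii.gz outputs, copied from the listing
# above. All files share the same date (Aug 23), so only the time is parsed.
MTIMES = {
    "t1": "19:19",
    "t2": "19:23",
    "t3": "19:26",
    "t4": "19:34",
    "t5": "19:42",
}


def gaps_in_minutes(mtimes):
    """Return (previous, current, minutes) for consecutive outputs, by time."""
    parsed = sorted(
        ((name, datetime.strptime(stamp, "%H:%M")) for name, stamp in mtimes.items()),
        key=lambda item: item[1],
    )
    return [
        (prev_name, cur_name, int((cur_time - prev_time).total_seconds() // 60))
        for (prev_name, prev_time), (cur_name, cur_time) in zip(parsed, parsed[1:])
    ]


if __name__ == "__main__":
    for prev_name, cur_name, minutes in gaps_in_minutes(MTIMES):
        print(f"{prev_name} -> {cur_name}: ~{minutes} min")
```

This only measures the time between the final outputs of successive iterations. It does not say which command inside an iteration is slow.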

jcohenadad commented 4 years ago

^ my best guesses are:

https://github.com/sct-pipeline/csa-atrophy/blob/28183002509b0f10ce061ae8e3f06d67570546b4/process_data.sh#L173

https://github.com/sct-pipeline/csa-atrophy/blob/28183002509b0f10ce061ae8e3f06d67570546b4/process_data.sh#L176

given that the command below should take no more than a few seconds: https://github.com/sct-pipeline/csa-atrophy/blob/28183002509b0f10ce061ae8e3f06d67570546b4/process_data.sh#L183

PaulBautin commented 4 years ago

The log is surprising! According to my tests (2 subjects), transformations with interpolation of order 5 on cropped images complete almost instantly.

jcohenadad commented 4 years ago

@PaulBautin could you open a branch where you display the time after each command (within the iteration loop) so we know what is taking time? It could be some RAM saturation somewhere.
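One way to get per-command timings is sketched below. It is only an illustration: `run_timed` is a hypothetical helper, and the commands are placeholders standing in for the real steps of the iteration loop in `process_data.sh`.

```python
import subprocess
import sys
import time
from typing import List, Sequence, Tuple


def run_timed(commands: Sequence[Sequence[str]]) -> List[Tuple[str, float]]:
    """Run each command in order and return (command line, elapsed seconds)."""
    timings = []
    for cmd in commands:
        start = time.perf_counter()
        subprocess.run(cmd, check=True)  # raises if the command fails
        elapsed = time.perf_counter() - start
        line = " ".join(cmd)
        timings.append((line, elapsed))
        # Flush so the timing shows up in the batch log right away.
        print(f"[{elapsed:8.2f} s] {line}", flush=True)
    return timings


if __name__ == "__main__":
    # Placeholder commands, not the actual pipeline steps.
    placeholder_steps = [
        [sys.executable, "-c", "pass"],
        [sys.executable, "-c", "import time; time.sleep(0.1)"],
    ]
    run_timed(placeholder_steps)
```

To look into the RAM hypothesis, peak memory of the child processes could be logged alongside the timings, for example with `resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss` on Linux.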

PaulBautin commented 4 years ago

> @PaulBautin could you open a branch where you display the time after each command (within the iteration loop) so we know what is taking time? It could be some RAM saturation somewhere.

Implemented in PR #53. I see nothing surprising when running on a very small dataset.
