sccn / amica

Code for AMICA: Adaptive Mixture ICA with shared components
BSD 2-Clause "Simplified" License

AMICA on distributed systems crashing with large number of models and high density decompositions #40

Open vlawhern opened 1 year ago

vlawhern commented 1 year ago

I'm trying to run AMICA across a large number of nodes in a distributed fashion, and I'm finding that AMICA crashes for certain combinations of (1) num_models, (2) the number of compute nodes, and (3) the number of channels in the EEG data. I suspect that with a very large number of compute nodes there isn't enough data per node to support high-density decompositions.

I was wondering if there are general rules for when AMICA will work in a distributed manner, for, say:

T = length of the data
N = number of compute nodes
M = number of models
C = number of channels

T/N is the amount of data given to each node, so it comes down to some relationship between M and C for a given T/N. It would be nice to have guidance on what values of T, N, M, and C work for distributed AMICA.
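
There is no official formula for this, but as a back-of-the-envelope check one can compare the per-node sample count T/N against the number of parameters the models must fit. The helper names and the threshold factor below are purely illustrative guesses, not anything from the AMICA documentation:

```python
def samples_per_node(T, N):
    """Samples each MPI worker receives when T total samples are split over N nodes."""
    return T // N

def rough_data_check(T, N, M, C, factor=8):
    """Illustrative heuristic only (the factor is a guess, not an AMICA rule):
    require the per-node sample count to be at least `factor` times the rough
    parameter count M * C * C (each of the M models fits a C x C unmixing matrix)."""
    return samples_per_node(T, N) >= factor * M * C * C
```

With the numbers reported later in this thread (T of about 1.74M samples, N = 8, M = 8), this check fails at C = 90 but passes at C = 32; the factor was chosen after the fact to match those observations, so treat it as a rule of thumb at best.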

japalmer29 commented 1 year ago

Can you give an example of a run that crashes? Have you checked the command line or out.txt for error messages?


vlawhern commented 1 year ago

I get errors from MPI, the most common error I get is this one:

Assertion failed in file src/mpid/ch3/src/ch3u_request.c at line 649: MPIU_Object_get_ref((req)) >= 0

Sometimes I'll get this one as well:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
amica15ub          00000000013E61B0  Unknown               Unknown  Unknown
amica15ub          0000000000591297  Unknown               Unknown  Unknown
amica15ub          00000000004D731C  Unknown               Unknown  Unknown
amica15ub          00000000004EFCFA  Unknown               Unknown  Unknown
amica15ub          00000000004F0081  Unknown               Unknown  Unknown
amica15ub          00000000004D64FE  Unknown               Unknown  Unknown
amica15ub          000000000048D296  Unknown               Unknown  Unknown
amica15ub          000000000047B29E  Unknown               Unknown  Unknown
amica15ub          00000000004798B1  Unknown               Unknown  Unknown
amica15ub          000000000047D750  Unknown               Unknown  Unknown
amica15ub          000000000047C875  Unknown               Unknown  Unknown
amica15ub          000000000046E078  Unknown               Unknown  Unknown
amica15ub          0000000000463F98  Unknown               Unknown  Unknown
amica15ub          000000000041E93B  Unknown               Unknown  Unknown
amica15ub          000000000040288D  Unknown               Unknown  Unknown
amica15ub          00000000013DCD4A  Unknown               Unknown  Unknown
amica15ub          00000000013DE5A7  Unknown               Unknown  Unknown
amica15ub          0000000000402765  Unknown               Unknown  Unknown
Fatal error in PMPI_Reduce: Unknown error class, error stack:
PMPI_Reduce(1258).....................: MPI_Reduce(sbuf=0x7f8c15ece8c0, rbuf=0x7f8c15e8c880, count=32768, MPI_DOUBLE_PRECISION, MPI_SUM, root=0, comm=0x84000001) failed
MPIR_Reduce_impl(1070)................:
MPIR_Reduce_intra(868)................:
MPIR_Reduce_redscat_gather(469).......:
MPIDU_Complete_posted_with_error(1137): Process failed

This seems to point to not enough data being available to each MPI worker when num_models and the number of workers are large (in this case, AMICA with 8 models distributed across 8 nodes on 90-channel EEG data, approximately 1.8M samples). I've confirmed that with less EEG data (90 channels reduced to 32) this works.

vlawhern commented 1 year ago

In addition, if I use fewer nodes (3 or 4) but still high-density data (~64 channels), this also works. So this points to some boundary determined by the length of the data, the number of workers, the number of models, and the number of channels in the EEG data.

japalmer29 commented 1 year ago

The number of models should just multiply the run time. Parallelization is only over the segmented data, as you said. Have you tried 1 model with the 90 x 1.8M data?


japalmer29 commented 1 year ago

In the Hsu et al. paper, AMICA was run with more models, nodes, and data than in your case, so I don't think there is a limit imposed by the code. There may be an issue with the cluster settings. The error you get apparently occurs before any output, perhaps while the data is being distributed. I thought it produced some output before doing the MPI reduce for the means and covariance.

vlawhern commented 1 year ago

Fatal error in PMPI_Reduce: Unknown error class, error stack:
PMPI_Reduce(1258).....................: MPI_Reduce(sbuf=0x7ff454872a00, rbuf=0x7ff456e109c0, count=64800, MPI_DOUBLE_PRECISION, MPI_SUM, root=0, comm=0x84000001) failed
MPIR_Reduce_impl(1070)................:
MPIR_Reduce_intra(868)................:
MPIR_Reduce_redscat_gather(622).......:
MPIDU_Complete_posted_with_error(1137): Process failed

japalmer29 commented 1 year ago

Do the errors you sent occur immediately, or after some normal output? If the latter, could you send the errors in context?

japalmer29 commented 1 year ago

Please paste the whole output from the crash, so I can check the initial output.

vlawhern commented 1 year ago

Processing arguments ...
 num_files =            1
 FILES:
 /mnt/growler/barleyhome/vlawhern/tmpdata63236.fdt
 num_dir_files =            1
 initial matrix block_size =          128
 do_opt_block =            0
 blk_min =          256
 blk_step =          256
 blk_max =         1024
 number of models =            8
 max_thrds =           20
 use_min_dll =            1
 min dll =   1.000000000000000E-009
 use_grad_norm =            1
 min grad norm =   1.000000000000000E-007
 number of density mixture components =            3
 pdf type =            0
 max_iter =         2000
 num_samples =            1
 data_dim =           90
 field_dim =      1744896
 do_history =            0
 histstep =           10
 share_comps =            0
 share_start =          100
 comp_thresh =   0.990000000000000
 share_int =          100
 initial lrate =   5.000000000000000E-002
 minimum lrate =   1.000000000000000E-008
 minimum data covariance eigenvalue =   1.000000000000000E-012
 lrate factor =   0.500000000000000
 initial rholrate =   5.000000000000000E-002
 rho0 =    1.50000000000000
 min rho =    1.00000000000000
 max rho =    2.00000000000000
 rho lrate factor =   0.500000000000000
 kurt_start =            3
 num kurt =            5
 kurt interval =            1
 do_newton =            1
 newt_start =           50
 newt_ramp =           10
 initial newton lrate =    1.00000000000000
 do_reject =            1
 num reject =            3
 reject sigma =    3.00000000000000
 reject start =            2
 reject interval =            3
 write step =           20
 write_nd =            0
 write_LLt =            1
 dec window =            1
 max_decs =            3
 fix_init =            0
 update_A =            1
 update_c =            1
 update_gm =            1
 update_alpha =            1
 update_mu =            1
 update_beta =            1
 invsigmax =    100.000000000000
 invsigmin =   0.000000000000000E+000
 do_rho =            1
 load_rej =            0
 load_c =            0
 load_gm =            0
 load_alpha =            0
 load_mu =            0
 load_beta =            0
 load_rho =            0
 load_comp_list =            0
 do_mean =            1
 do_sphere =            1
 pcakeep =           90
 pcadb =    30.0000000000000
 byte_size =            4
 doscaling =            1
 scalestep =            1
mkdir: cannot create directory '/mnt/growler/barleyhome/vlawhern/amicaouttmp/': File exists
 output directory = /mnt/growler/barleyhome/vlawhern/amicaouttmp/
           1 : setting num_thrds to           20  ...
           2 : setting num_thrds to           20  ...
           1 : using          20 threads.
           2 : using          20 threads.
           1 : node_thrds =           20          20
 bytes in real =            1
           1 : REAL nbyte =            1
 getting segment list ...
 blocks in sample =      1744896
 total blocks =      1744896
 node blocks =       872448      872448
 node            1  start: file            1  sample            1  index
           1
 node            1  stop : file            1  sample            1  index
      872448
 node            2  start: file            1  sample            1  index
      872449
 node            2  stop : file            1  sample            1  index
     1744896
           1 : data =   -2.10404443740845       -2.00284409523010
           2 : data =    2.67275834083557        2.30718922615051
 getting the mean ...
  mean =  -8.140678184955731E-002 -4.924335967420339E-002
  1.604046472164474E-002
 subtracting the mean ...
 getting the covariance matrix ...
 cnt =      1744896
 doing eig nx =           90  lwork =        81000
 doing eig nx =           90  lwork =        81000
 minimum eigenvalues =   3.979936589306554E-003  4.130480204090477E-003
  4.307901288673726E-003
 maximum eigenvalues =    2663.49088331849        1448.05399130587
   850.039381071617
 num eigs kept =           90
 getting the sphering matrix ...
 minimum eigenvalues =   3.979936589306554E-003  4.130480204090477E-003
  4.307901288673726E-003
 maximum eigenvalues =    2663.49088331849        1448.05399130587
   850.039381071617
 num eigs kept =           90
 sphering the data ...
 minimum eigenvalues =   3.979936589306554E-003  4.130480204090477E-003
  4.307901288673726E-003
 maximum eigenvalues =    2663.49088331849        1448.05399130587
   850.039381071617
 num eigs kept =            0
 numeigs =           90
           1 : Allocating variables ...
           2 : Allocating variables ...
           1 : Initializing variables ...
           2 : Initializing variables ...
           1 : block size =          128
           2 : block size =          128
           1 : entering the main loop ...
           2 : entering the main loop ...
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
amica15ub          00000000013E61B0  Unknown               Unknown  Unknown
amica15ub          0000000000591297  Unknown               Unknown  Unknown
amica15ub          00000000004D731C  Unknown               Unknown  Unknown
amica15ub          00000000004EFCFA  Unknown               Unknown  Unknown
amica15ub          00000000004F0081  Unknown               Unknown  Unknown
amica15ub          00000000004D64FE  Unknown               Unknown  Unknown
amica15ub          000000000048D296  Unknown               Unknown  Unknown
amica15ub          000000000047B29E  Unknown               Unknown  Unknown
amica15ub          00000000004798B1  Unknown               Unknown  Unknown
amica15ub          000000000047D750  Unknown               Unknown  Unknown
amica15ub          000000000047C875  Unknown               Unknown  Unknown
amica15ub          000000000046E078  Unknown               Unknown  Unknown
amica15ub          0000000000463F98  Unknown               Unknown  Unknown
amica15ub          000000000041E93B  Unknown               Unknown  Unknown
amica15ub          000000000040288D  Unknown               Unknown  Unknown
amica15ub          00000000013DCD4A  Unknown               Unknown  Unknown
amica15ub          00000000013DE5A7  Unknown               Unknown  Unknown
amica15ub          0000000000402765  Unknown               Unknown  Unknown
Fatal error in PMPI_Reduce: Unknown error class, error stack:
PMPI_Reduce(1258).....................: MPI_Reduce(sbuf=0x7ff454872a00, rbuf=0x7ff456e109c0, count=64800, MPI_DOUBLE_PRECISION, MPI_SUM, root=0, comm=0x84000001) failed
MPIR_Reduce_impl(1070)................:
MPIR_Reduce_intra(868)................:
MPIR_Reduce_redscat_gather(622).......:
MPIDU_Complete_posted_with_error(1137): Process failed

japalmer29 commented 1 year ago

I think the issue is that you are using too many threads (max_threads). The number of threads should be the number of cores on each node, or fewer; using more threads may cause problems. Can you try reducing max_threads to 4, 8, or 12?
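
The advice above (cap the thread count at the core count of each node) can be sketched as a small helper; `recommended_max_threads` is a hypothetical name for illustration, not part of AMICA:

```python
import os

def recommended_max_threads(requested):
    """Cap a requested max_threads value at the number of CPU cores visible
    to this process, per the advice that threads should not exceed cores."""
    cores = os.cpu_count() or 1
    return min(requested, cores)
```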

japalmer29 commented 1 year ago

It may be that the number of threads has to divide the matrix block size (128 in your case) evenly. If you use do_opt_block, it will try to optimize the block size over 128-1024 in steps of 128, but it may crash if it tries a block size too large for the nodes. You can manually set initial_matrix_block_size to something greater than 128.
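
The divisibility condition suggested above can be checked before launching a run. This is a speculative check based on that suggestion, not a documented AMICA constraint:

```python
def block_size_ok(block_size, num_threads):
    """Speculative: True if the thread count divides the matrix block size
    evenly, which the comment above suggests may be required."""
    return block_size % num_threads == 0
```

For example, the crashing run used block size 128 with 20 threads, which fails this check, while 8 threads would pass.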

vlawhern commented 1 year ago

So I've tried various combinations of block_size and max_threads, and now I get some different errors, which points to this being an MPI/cluster issue. Our cluster uses only 1 Gb Ethernet, so perhaps something is going wrong because of our slow network.

Processing arguments ...
 num_files =            1
 FILES:
 /barleyhome/vlawhern/tmpdata26667.fdt
 num_dir_files =            1
 initial matrix block_size =          256
 do_opt_block =            0
 blk_min =          256
 blk_step =          256
 blk_max =         1024
 number of models =            8
 max_thrds =            8
 use_min_dll =            1
 min dll =   9.999999999999999E-022
 use_grad_norm =            1
 min grad norm =   1.000000000000000E-019
 number of density mixture components =            3
 pdf type =            0
 max_iter =         2000
 num_samples =            1
 data_dim =           90
 field_dim =      1744896
 do_history =            0
 histstep =           10
 share_comps =            0
 share_start =          100
 comp_thresh =   0.990000000000000
 share_int =          100
 initial lrate =   5.000000000000000E-002
 minimum lrate =   1.000000000000000E-008
 minimum data covariance eigenvalue =   1.000000000000000E-012
 lrate factor =   0.500000000000000
 initial rholrate =   5.000000000000000E-002
 rho0 =    1.50000000000000
 min rho =    1.00000000000000
 max rho =    2.00000000000000
 rho lrate factor =   0.500000000000000
 kurt_start =            3
 num kurt =            5
 kurt interval =            1
 do_newton =            1
 newt_start =           50
 newt_ramp =           10
 initial newton lrate =    1.00000000000000
 do_reject =            1
 num reject =            3
 reject sigma =    3.00000000000000
 reject start =            2
 reject interval =            3
 write step =           20
 write_nd =            0
 write_LLt =            1
 dec window =            1
 max_decs =            3
 fix_init =            0
 update_A =            1
 update_c =            1
 update_gm =            1
 update_alpha =            1
 update_mu =            1
 update_beta =            1
 invsigmax =    100.000000000000
 invsigmin =   0.000000000000000E+000
 do_rho =            1
 load_rej =            0
 load_c =            0
 load_gm =            0
 load_alpha =            0
 load_mu =            0
 load_beta =            0
 load_rho =            0
 load_comp_list =            0
 do_mean =            1
 do_sphere =            1
 pcakeep =           90
 pcadb =    30.0000000000000
 byte_size =            4
 doscaling =            1
 scalestep =            1
mkdir: cannot create directory '/barleyhome/vlawhern/amicaouttmp/': File exists
 output directory = /barleyhome/vlawhern/amicaouttmp/
           1 : setting num_thrds to            8  ...
           2 : setting num_thrds to            8  ...
           3 : setting num_thrds to            8  ...
           4 : setting num_thrds to            8  ...
           1 : using           8 threads.
           3 : using           8 threads.
           2 : using           8 threads.
           4 : using           8 threads.
           1 : node_thrds =            8           8           8           8
 bytes in real =            1
           1 : REAL nbyte =            1
 getting segment list ...
 blocks in sample =      1744896
 total blocks =      1744896
 node blocks =       436224      436224      436224      436224
 node            1  start: file            1  sample            1  index
           1
 node            1  stop : file            1  sample            1  index
      436224
 node            2  start: file            1  sample            1  index
      436225
 node            2  stop : file            1  sample            1  index
      872448
 node            3  start: file            1  sample            1  index
      872449
 node            3  stop : file            1  sample            1  index
     1308672
 node            4  start: file            1  sample            1  index
     1308673
 node            4  stop : file            1  sample            1  index
     1744896
           4 : data =    1.82832920551300        1.67610311508179
           3 : data =    2.67275834083557        2.30718922615051
           2 : data =   0.152651995420456      -3.927214443683624E-002
           1 : data =   -2.10404443740845       -2.00284409523010
 getting the mean ...
  mean =  -8.140678184955731E-002 -4.924335967420339E-002
  1.604046472164474E-002
 subtracting the mean ...
 getting the covariance matrix ...
 cnt =      1744896
 doing eig nx =           90  lwork =        81000
 doing eig nx =           90  lwork =        81000
 minimum eigenvalues =   3.979936589245825E-003  4.130480204099402E-003
  4.307901288574601E-003
 maximum eigenvalues =    2663.49088331849        1448.05399130587
   850.039381071617
 num eigs kept =           90
 getting the sphering matrix ...
 doing eig nx =           90  lwork =        81000
 doing eig nx =           90  lwork =        81000
 minimum eigenvalues =   3.979936589245825E-003  4.130480204099402E-003
  4.307901288574601E-003
 maximum eigenvalues =    2663.49088331849        1448.05399130587
   850.039381071617
 num eigs kept =           90
 sphering the data ...
 minimum eigenvalues =   3.979936589245825E-003  4.130480204099402E-003
  4.307901288574601E-003
 minimum eigenvalues =   3.979936589245825E-003  4.130480204099402E-003
  4.307901288574601E-003
 minimum eigenvalues =   3.979936589245825E-003  4.130480204099402E-003
  4.307901288574601E-003
 maximum eigenvalues =    2663.49088331849        1448.05399130587
   850.039381071617
 num eigs kept =            0
 maximum eigenvalues =    2663.49088331849        1448.05399130587
   850.039381071617
 num eigs kept =            0
 maximum eigenvalues =    2663.49088331849        1448.05399130587
   850.039381071617
 num eigs kept =            0
 numeigs =           90
           3 : Allocating variables ...
           1 : Allocating variables ...
           4 : Allocating variables ...
           2 : Allocating variables ...
           4 : Initializing variables ...
           3 : Initializing variables ...
           2 : Initializing variables ...
           1 : Initializing variables ...
           1 : block size =          256
           2 : block size =          256
           3 : block size =          256
           4 : block size =          256
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
amica15ub          00000000013E61B0  Unknown               Unknown  Unknown
amica15ub          00000000004C07A4  Unknown               Unknown  Unknown
amica15ub          00000000004CB61B  Unknown               Unknown  Unknown
amica15ub          00000000004CDF08  Unknown               Unknown  Unknown
amica15ub          000000000048D1C5  Unknown               Unknown  Unknown
amica15ub          0000000000472D69  Unknown               Unknown  Unknown
amica15ub          000000000046DF8C  Unknown               Unknown  Unknown
amica15ub          000000000041E393  Unknown               Unknown  Unknown
amica15ub          000000000040288D  Unknown               Unknown  Unknown
amica15ub          00000000013DCD4A  Unknown               Unknown  Unknown
amica15ub          00000000013DE5A7  Unknown               Unknown  Unknown
amica15ub          0000000000402765  Unknown               Unknown  Unknown
Fatal error in PMPI_Barrier: Unknown error class, error stack:
PMPI_Barrier(425).....................: MPI_Barrier(comm=0x84000001) failed
MPIR_Barrier_impl(332)................: Failure during collective
MPIR_Barrier_impl(327)................:
MPIR_Barrier(292).....................:
MPIR_Barrier_intra(169)...............:
MPIDU_Complete_posted_with_error(1137): Process failed
Fatal error in PMPI_Barrier: Unknown error class, error stack:
PMPI_Barrier(425).....................: MPI_Barrier(comm=0x84000001) failed
MPIR_Barrier_impl(332)................: Failure during collective
MPIR_Barrier_impl(327)................:
MPIR_Barrier(292).....................:
MPIR_Barrier_intra(169)...............:
MPIDU_Complete_posted_with_error(1137): Process failed
Fatal error in PMPI_Barrier: Unknown error class, error stack:
PMPI_Barrier(425)......: MPI_Barrier(comm=0x84000001) failed
MPIR_Barrier_impl(332).: Failure during collective
MPIR_Barrier_impl(327).:
MPIR_Barrier(292)......:
MPIR_Barrier_intra(180): Failure during collective
japalmer29 commented 1 year ago

There may be an issue with the sphering. It seems to be duplicating the sphering over nodes, which may indicate that the MPI is not working correctly. Are you using the MPI from Intel OneAPI, or MPICH2 (or other)?


vlawhern commented 1 year ago

So I managed to test this out on a dedicated HPC platform (using the HPC system's MPI implementation, which is based on Intel OneAPI) across 4 nodes with AMICA at 8 models, and I get the same errors as above. That seems to suggest the MPI implementation isn't the problem, even though all the errors point to MPI.

I still find it suspicious that if I set 'pcakeep' to 32 (i.e., do a lot of dimension reduction), everything works, with AMICA at 8 models across 8 nodes. It also works if I just select 32 of the 90 channels for the decomposition, with no PCA dimension reduction. Once I hit around 40-50 channels, things fail. So it might be something with my data... if I check the data rank it comes out as 90, and the eigenvalues all appear to be fine (above, say, a 1E-7 tolerance). I'll have to see if I can find a similarly sized open-source dataset to test this further.
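For reference, the rank/eigenvalue check described above can be prototyped outside MATLAB. A minimal numpy sketch (the 1e-7 tolerance matches the one mentioned in the thread, but the function name and the synthetic rank-deficient data are purely illustrative):

```python
import numpy as np

def effective_rank(data, tol=1e-7):
    """Count covariance eigenvalues above a relative tolerance.

    data: (channels, samples) array, as stored in an EEGLAB .fdt file.
    """
    centered = data - data.mean(axis=1, keepdims=True)
    cov = centered @ centered.T / centered.shape[1]
    eigvals = np.linalg.eigvalsh(cov)
    return int(np.sum(eigvals > tol * eigvals.max()))

# Illustrative: 90-channel data that is actually rank 40
rng = np.random.default_rng(0)
mixing = rng.standard_normal((90, 40))     # 90 channels from 40 sources
sources = rng.standard_normal((40, 10000))
data = mixing @ sources
print(effective_rank(data))  # 40, not 90
```

Note that a check like this on the *full* recording can still pass while individual segments of the data are rank deficient, which is exactly the distributed failure mode discussed later in this thread.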

Setting max_thrds=1 produces a slightly different set of MPI errors, but in general I always get either an "MPI_Barrier" error or something with "MPI_Reduce"... the fact that I get different errors with different combinations of models/nodes also seems to point to the data rather than to MPI?

just for completeness, the command I call from MATLAB is

system(['mpirun -n ' num2str(numprocs) ' -machinefile ~/hostfile ~/software/amica/amica15ub ' outdir 'input.param' ]);

where hostfile is just a list of hostnames, one on each line.
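A sketch of the same command composition in Python, for readers not using MATLAB (the numprocs and outdir values are illustrative, and the paths are the ones quoted in this thread, not guaranteed to exist elsewhere):

```python
# Mirror of the MATLAB system() call above: splice the process count and
# output directory into an mpirun invocation of the amica15ub binary.
numprocs = 4
outdir = "/tmp/amicaouttmp/"  # hypothetical output directory
cmd = (f"mpirun -n {numprocs} -machinefile ~/hostfile "
       f"~/software/amica/amica15ub {outdir}input.param")
print(cmd)
```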

Processing arguments ...
 num_files =            1
 FILES:
 /mnt/growler/barleyhome/vlawhern/tmpdata84913.fdt
 num_dir_files =            1
 initial matrix block_size =          128
 do_opt_block =            0
 blk_min =          256
 blk_step =          256
 blk_max =         1024
 number of models =            8
 max_thrds =            1
 use_min_dll =            1
 min dll =   1.000000000000000E-009
 use_grad_norm =            1
 min grad norm =   1.000000000000000E-007
 number of density mixture components =            3
 pdf type =            0
 max_iter =         2000
 num_samples =            1
 data_dim =           90
 field_dim =      1744896
 do_history =            0
 histstep =           10
 share_comps =            0
 share_start =          100
 comp_thresh =   0.990000000000000
 share_int =          100
 initial lrate =   5.000000000000000E-002
 minimum lrate =   1.000000000000000E-008
 minimum data covariance eigenvalue =   1.000000000000000E-012
 lrate factor =   0.500000000000000
 initial rholrate =   5.000000000000000E-002
 rho0 =    1.50000000000000
 min rho =    1.00000000000000
 max rho =    2.00000000000000
 rho lrate factor =   0.500000000000000
 kurt_start =            3
 num kurt =            5
 kurt interval =            1
 do_newton =            1
 newt_start =           50
 newt_ramp =           10
 initial newton lrate =    1.00000000000000
 do_reject =            0
 num reject =            3
 reject sigma =    3.00000000000000
 reject start =            2
 reject interval =            3
 write step =           20
 write_nd =            0
 write_LLt =            1
 dec window =            1
 max_decs =            3
 fix_init =            0
 update_A =            1
 update_c =            1
 update_gm =            1
 update_alpha =            1
 update_mu =            1
 update_beta =            1
 invsigmax =    100.000000000000
 invsigmin =   0.000000000000000E+000
 do_rho =            1
 load_rej =            0
 load_c =            0
 load_gm =            0
 load_alpha =            0
 load_mu =            0
 load_beta =            0
 load_rho =            0
 load_comp_list =            0
 do_mean =            1
 do_sphere =            1
 pcakeep =           64
 pcadb =    30.0000000000000
 byte_size =            4
 doscaling =            1
 scalestep =            1
mkdir: cannot create directory '/mnt/growler/barleyhome/vlawhern/amicaouttmp/': File exists
 output directory = /mnt/growler/barleyhome/vlawhern/amicaouttmp/
           1 : setting num_thrds to            1  ...
           3 : setting num_thrds to            1  ...
           2 : setting num_thrds to            1  ...
           4 : setting num_thrds to            1  ...
           2 : using           1 threads.
           3 : using           1 threads.
           1 : using           1 threads.
           4 : using           1 threads.
           1 : node_thrds =            1           1           1           1
 bytes in real =            1
           1 : REAL nbyte =            1
 getting segment list ...
 blocks in sample =      1744896
 total blocks =      1744896
 node blocks =       436224      436224      436224      436224
 node            1  start: file            1  sample            1  index
           1
 node            1  stop : file            1  sample            1  index
      436224
 node            2  start: file            1  sample            1  index
      436225
 node            2  stop : file            1  sample            1  index
      872448
 node            3  start: file            1  sample            1  index
      872449
 node            3  stop : file            1  sample            1  index
     1308672
 node            4  start: file            1  sample            1  index
     1308673
 node            4  stop : file            1  sample            1  index
     1744896
           4 : data =    1.82832920551300        1.67610311508179
           3 : data =    2.67275834083557        2.30718922615051
           2 : data =   0.152651995420456      -3.927214443683624E-002
           1 : data =   -2.10404443740845       -2.00284409523010
 getting the mean ...
  mean =  -8.140678184955731E-002 -4.924335967420339E-002
  1.604046472164474E-002
 subtracting the mean ...
 getting the covariance matrix ...
 cnt =      1744896
 doing eig nx =           90  lwork =        81000
 doing eig nx =           90  lwork =        81000
 doing eig nx =           90  lwork =        81000
 minimum eigenvalues =   3.979936589214525E-003  4.130480204143833E-003
  4.307901288597552E-003
 doing eig nx =           90  lwork =        81000
 minimum eigenvalues =   3.979936589214525E-003  4.130480204143833E-003
  4.307901288597552E-003
 minimum eigenvalues =   3.979936589214525E-003  4.130480204143833E-003
  4.307901288597552E-003
 minimum eigenvalues =   3.979936589214525E-003  4.130480204143833E-003
  4.307901288597552E-003
 maximum eigenvalues =    2663.49088331849        1448.05399130587
   850.039381071617
 num eigs kept =            0
 maximum eigenvalues =    2663.49088331849        1448.05399130587
   850.039381071617
 num eigs kept =            0
 maximum eigenvalues =    2663.49088331849        1448.05399130587
   850.039381071617
 num eigs kept =            0
 maximum eigenvalues =    2663.49088331849        1448.05399130587
   850.039381071617
 num eigs kept =           64
 getting the sphering matrix ...
 minimum eigenvalues =   3.979936589214525E-003  4.130480204143833E-003
  4.307901288597552E-003
 maximum eigenvalues =    2663.49088331849        1448.05399130587
   850.039381071617
 num eigs kept =           64
 sphering the data ...
 numeigs =           64
           1 : Allocating variables ...
           3 : Allocating variables ...
           2 : Allocating variables ...
           4 : Allocating variables ...
           3 : Initializing variables ...
           2 : Initializing variables ...
           4 : Initializing variables ...
           1 : Initializing variables ...
           1 : block size =          128
           3 : block size =          128
           2 : block size =          128
           4 : block size =          128
           4 : entering the main loop ...
           1 : entering the main loop ...
           3 : entering the main loop ...
           2 : entering the main loop ...
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
amica15ub          000000000116EB35  Unknown               Unknown  Unknown
amica15ub          000000000116C8F7  Unknown               Unknown  Unknown
amica15ub          0000000001122954  Unknown               Unknown  Unknown
amica15ub          0000000001122766  Unknown               Unknown  Unknown
amica15ub          00000000010D4D19  Unknown               Unknown  Unknown
amica15ub          00000000010D8F90  Unknown               Unknown  Unknown
amica15ub          00000000005D21F0  Unknown               Unknown  Unknown
amica15ub          00000000005993FC  Unknown               Unknown  Unknown
amica15ub          00000000004DC39A  Unknown               Unknown  Unknown
amica15ub          00000000004F5900  Unknown               Unknown  Unknown
amica15ub          00000000004F5C41  Unknown               Unknown  Unknown
amica15ub          00000000004DB633  Unknown               Unknown  Unknown
amica15ub          000000000048E485  Unknown               Unknown  Unknown
amica15ub          000000000047AB46  Unknown               Unknown  Unknown
amica15ub          00000000004790B9  Unknown               Unknown  Unknown
amica15ub          000000000047CF50  Unknown               Unknown  Unknown
amica15ub          000000000047C0D4  Unknown               Unknown  Unknown
amica15ub          000000000046CD18  Unknown               Unknown  Unknown
amica15ub          0000000000462215  Unknown               Unknown  Unknown
amica15ub          0000000000417EA3  Unknown               Unknown  Unknown
amica15ub          00000000004021DE  Unknown               Unknown  Unknown
amica15ub          000000000118C1A4  Unknown               Unknown  Unknown
amica15ub          00000000004020C1  Unknown               Unknown  Unknown
Fatal error in PMPI_Reduce: Unknown error class, error stack:
PMPI_Reduce(1258).....................: MPI_Reduce(sbuf=0x7fb954cbe010, rbuf=0x7fb954cff010, count=32768, MPI_DOUBLE_PRECISION, MPI_SUM, root=0, comm=0x84000001) failed
MPIR_Reduce_impl(1070)................:
MPIR_Reduce_intra(868)................:
MPIR_Reduce_redscat_gather(469).......:
MPIDU_Complete_posted_with_error(1137): Process failed
MPIR_Reduce_redscat_gather(605).......:
MPIC_Send(300)........................:
MPID_Send(75).........................: Communication error with rank 1
Fatal error in PMPI_Reduce: Unknown error class, error stack:
PMPI_Reduce(1258).....................: MPI_Reduce(sbuf=0x7fa27dbd7010, rbuf=0x7fa27dc18010, count=32768, MPI_DOUBLE_PRECISION, MPI_SUM, root=0, comm=0x84000001) failed
MPIR_Reduce_impl(1070)................:
MPIR_Reduce_intra(868)................:
MPIR_Reduce_redscat_gather(622).......:
MPIDU_Complete_posted_with_error(1137): Process failed
vlawhern commented 1 year ago

Interestingly, if I keep running the same command, once in a while it will work (maybe 1 out of 10 attempts). So it's possible that something in the random initialization of the weights is causing issues?

japalmer29 commented 1 year ago

Given the dependence on channel number, I think it might be an issue with stack overflow. Can you try setting the initial matrix block size to 32 and see if that improves the success rate?


vlawhern commented 1 year ago

Just getting back to this... unfortunately, setting block_size to 32 doesn't fix it.

So the summary of my issue is that I can run AMICA on a single node with any combination of channels/models without issue, but when I start distributing the AMICA model I run into issues that go away if I reduce the number of channels, the number of models, and/or the number of nodes. Generally speaking, the more models I try to fit, the fewer nodes I can use. The same goes for the number of channels: if I distribute the model across many nodes, I sometimes have to reduce the channel count for it to run.

Strangely enough, AMICA will sometimes work without my having to reduce channels/nodes/models (maybe 1 out of 20 attempts), which suggests possibly some kind of bad initialization of variables?

I cannot rule out the issue being the data itself however.

vlawhern commented 1 year ago

On one run that did manage to start, it threw NaNs though...

 iter     1 lrate =  0.0500000000 LL =  -1.4122133988 nd =  0.0377896272, D =   0.34899E-01  0.32313E-01  ( 13.41 s,   0.1 h)
 iter     2 lrate =  0.0500000000 LL =  -1.3408517205 nd =  0.0458156159, D =   0.35362E-01  0.32421E-01  ( 13.62 s,   0.1 h)
Doing rejection ....
 maximum likelihood value =  -0.825738566146084
 minimum likelihood value =   -33.4286597657578
 average likelihood value =   -1.34085172047297
 standard deviation       =   0.331891165020141
 rejecting data with likelihood less than   -2.33652521553340
 rejected        17341  data points so far. Will perform rejection            2
  more times at intervals of            3  iterations.
 iter     3 lrate =  0.0500000000 LL =  -1.2840568489 nd =  0.0584500863, D =   0.16122E+00  0.42687E-01  ( 13.62 s,   0.1 h)
 iter     4 lrate =  0.0500000000 LL =  -1.2541466667 nd =  0.0716875099, D =   0.53389E+00  0.10074E+00  ( 13.89 s,   0.1 h)
 iter     5 lrate =  0.0500000000 LL =  -1.2235514437 nd =  0.0770498115, D =   0.12020E+01  0.24694E+00  ( 14.01 s,   0.1 h)
Doing rejection ....
 maximum likelihood value =  -0.433675846405276
 minimum likelihood value =   -2.30432150510246
 average likelihood value =   -1.22355144373999
 standard deviation       =   0.221314189217103
 rejecting data with likelihood less than   -1.88749401139129
 rejected        24580
  data points so far. Will perform rejection one more time after            3
  iterations.
 iter     6 lrate =  0.0500000000 LL =  -1.1898742223 nd =  0.0795726861, D =   0.22151E+01  0.51242E+00  ( 14.01 s,   0.1 h)
 iter     7 lrate =  0.0500000000 LL =  -1.1590266760 nd =  0.0807673208, D =   0.35559E+01  0.87024E+00  ( 14.00 s,   0.1 h)
 iter     8 lrate =  0.0500000000 LL =  -1.1299826128 nd =  0.0809009705, D =   0.52734E+01  0.13227E+01  ( 14.00 s,   0.0 h)
Doing rejection ....
 maximum likelihood value =  -0.308914812838824
 minimum likelihood value =   -2.10933240322047
 average likelihood value =   -1.12998261283671
 standard deviation       =   0.206985385449063
 rejecting data with likelihood less than   -1.75093876918390
 rejected        33264  data points. No further rejections will be performed.
 iter     9 lrate =  0.0500000000 LL =  -1.0981002609 nd =  0.0806803867, D =   0.72688E+01  0.18587E+01  ( 14.01 s,   0.0 h)
 iter    10 lrate =  0.0500000000 LL =  -1.0701106300 nd =  0.0879733818, D =   0.94008E+01  0.24690E+01  ( 13.82 s,   0.0 h)
 iter    11 lrate =  0.0500000000 LL =  -1.0699184876 nd =  0.3447542953, D =   0.11750E+02  0.31434E+01  ( 13.81 s,   0.0 h)
 iter    12 lrate =  0.0500000000 LL =  -1.0346399046 nd =  0.1538665440, D =   0.16827E+02  0.38720E+01  ( 13.97 s,   0.0 h)
 iter    13 lrate =  0.0500000000 LL =  -1.0143801026 nd =  0.4059233348, D =   0.18721E+02  0.46582E+01  ( 13.98 s,   0.0 h)
 iter    14 lrate =  0.0500000000 LL =  -1.0762979759 nd =  0.9911425666, D =   0.24802E+02  0.54779E+01  ( 13.91 s,   0.0 h)
 Likelihood decreasing!
 iter    15 lrate =  0.0500000000 LL =  -1.0585870497 nd =  1.4571456080, D =   0.37602E+02  0.63122E+01  ( 13.96 s,   0.0 h)
 iter    16 lrate =  0.0500000000 LL =  -1.0390555955 nd =  3.9317816941, D =   0.44403E+02  0.72538E+01  ( 13.85 s,   0.0 h)
 iter    17 lrate =  0.0500000000 LL =  -1.1500486359 nd = 11.7298001155, D =   0.10362E+03  0.81841E+01  ( 13.86 s,   0.0 h)
 Likelihood decreasing!
 iter    18 lrate =  0.0500000000 LL =  -1.0736096696 nd = *************, D =   0.21085E+03  0.91530E+01  ( 13.91 s,   0.0 h)
 iter    19 lrate =  0.0500000000 LL =            NaN nd =           NaN, D =   0.46975E+03  0.10185E+02  ( 13.83 s,   0.0 h)
 Got NaN! Exiting ...

So maybe this again points to the data itself as the problem.

vlawhern commented 1 year ago

So just circling back on this thread... it turns out our data is the problem. Checking the rank of the data in smaller chunks (as opposed to checking the rank of the full data) shows that the data isn't always full rank. Our processing pipeline includes artifact rejection with ASR (artifact subspace reconstruction), and since ASR uses PCA to remove components, many components are being removed in certain sections of the data. I now strongly believe the issue is that in distributed mode each AMICA worker gets a chunk of the data, and because of ASR each chunk is no longer likely to be full rank; this causes all the issues discussed in this thread.

More workers -> each worker gets a smaller chunk of the data -> more likely to run into rank issues on at least one worker -> crashes.

I wonder if there should be a rank check of the data for each chunk when AMICA is run in distributed mode, to avoid this issue going forward.

However, I do have a question: do you have any intuition about the behavior of AMICA decompositions when the data is not full rank across subsets of the data? Say for 64-channel data you check the rank in 30-second chunks and get a range like [16, 64]. If you then fit AMICA with 32 components, the 32-component decomposition of a section where the data is truly rank 16 could look really strange. Should you fit to min(rank of chunks) to guarantee that each worker gets "good" (i.e., full-rank) data?
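The per-chunk rank check proposed in this thread is easy to prototype. A hedged numpy sketch (the chunk length, tolerance, sampling rate, and the synthetic ASR-like data are illustrative; a real implementation would read the .fdt data and use the recording's actual sampling rate):

```python
import numpy as np

def chunk_ranks(data, srate, chunk_sec=30, tol=1e-7):
    """Effective rank of each chunk_sec-long chunk of (channels, samples) data."""
    step = int(srate * chunk_sec)
    ranks = []
    for start in range(0, data.shape[1] - step + 1, step):
        chunk = data[:, start:start + step]
        chunk = chunk - chunk.mean(axis=1, keepdims=True)
        eig = np.linalg.eigvalsh(chunk @ chunk.T / step)
        ranks.append(int(np.sum(eig > tol * eig.max())))
    return ranks

# Illustrative: 64-channel data where ASR-style cleaning has reduced
# one 30 s chunk to rank 16 while the other chunks stay full rank
rng = np.random.default_rng(0)
srate = 100
data = rng.standard_normal((64, 4 * srate * 30))
low = rng.standard_normal((64, 16)) @ rng.standard_normal((16, srate * 30))
data[:, srate * 30:2 * srate * 30] = low

ranks = chunk_ranks(data, srate)
print(ranks)       # [64, 16, 64, 64]
print(min(ranks))  # 16: the conservative candidate for pcakeep
```

Under the min(rank of chunks) heuristic floated above, the conservative choice here would be pcakeep = 16, at the cost of discarding dimensions that are perfectly valid in the clean sections.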