vlawhern opened this issue 1 year ago

I'm trying to run AMICA across a large number of nodes in a distributed fashion, and I'm finding that AMICA crashes for certain combinations of (1) num_models, (2) the number of compute nodes, and (3) the number of channels of the EEG data. I suspect that with a very large number of compute nodes there isn't enough data per node to do high-density decompositions.

I was wondering whether there are general rules about when AMICA will work in a distributed manner for, say,

T = length of the data
N = number of compute nodes
M = number of models
C = number of channels

T/N is the amount of data given to each node, so it comes down to some relationship between M and C for a given T/N, but it would be nice to have guidance on which values of T, N, M, and C will work for distributed AMICA.

Can you give an example of a run that crashes? Have you checked the command line or out.txt for error messages?
I get errors from MPI; the most common one is this:
Assertion failed in file src/mpid/ch3/src/ch3u_request.c at line 649: MPIU_Object_get_ref((req)) >= 0
Sometimes I'll get this one as well:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
amica15ub 00000000013E61B0 Unknown Unknown Unknown
amica15ub 0000000000591297 Unknown Unknown Unknown
amica15ub 00000000004D731C Unknown Unknown Unknown
amica15ub 00000000004EFCFA Unknown Unknown Unknown
amica15ub 00000000004F0081 Unknown Unknown Unknown
amica15ub 00000000004D64FE Unknown Unknown Unknown
amica15ub 000000000048D296 Unknown Unknown Unknown
amica15ub 000000000047B29E Unknown Unknown Unknown
amica15ub 00000000004798B1 Unknown Unknown Unknown
amica15ub 000000000047D750 Unknown Unknown Unknown
amica15ub 000000000047C875 Unknown Unknown Unknown
amica15ub 000000000046E078 Unknown Unknown Unknown
amica15ub 0000000000463F98 Unknown Unknown Unknown
amica15ub 000000000041E93B Unknown Unknown Unknown
amica15ub 000000000040288D Unknown Unknown Unknown
amica15ub 00000000013DCD4A Unknown Unknown Unknown
amica15ub 00000000013DE5A7 Unknown Unknown Unknown
amica15ub 0000000000402765 Unknown Unknown Unknown
Fatal error in PMPI_Reduce: Unknown error class, error stack:
PMPI_Reduce(1258).....................: MPI_Reduce(sbuf=0x7f8c15ece8c0, rbuf=0x7f8c15e8c880, count=32768, MPI_DOUBLE_PRECISION, MPI_SUM, root=0, comm=0x84000001) failed
MPIR_Reduce_impl(1070)................:
MPIR_Reduce_intra(868)................:
MPIR_Reduce_redscat_gather(469).......:
MPIDU_Complete_posted_with_error(1137): Process failed
This seems to point to there not being enough data available for all the MPI workers when num_models and num_workers are large (in this case AMICA with 8 models distributed across 8 nodes on 90-channel EEG data, approximately 1.8M samples). I've confirmed that it works if I use less EEG data (90 channels -> 32 channels).
In addition, it also works if I use fewer nodes (3 or 4) but keep high-density data (~64 channels). So this points to some boundary based on the length of the data, the number of workers, the number of models, and the number of channels of the EEG data.
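For concreteness, a rough back-of-the-envelope sketch of the arithmetic for the failing configuration (purely illustrative; as noted in the reply below, AMICA only segments the data over workers and accumulates statistics globally, so none of these numbers is a documented limit):

T = 1.8e6;   % total samples
N = 8;       % MPI workers
M = 8;       % models
C = 90;      % channels
samples_per_node = T / N                  % 225000 samples held by each worker
params_per_model = C^2                    % ~8100, dominated by the C-by-C unmixing matrix
                                          % (per-source density parameters add only O(C))
total_params = M * params_per_model       % 64800 parameters across all 8 models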
The number of models should just multiply the run time. Parallelization is only over segmented data, as you said. Have you tried 1 model with the 90 x 1.8M data?
In the Hsu et al. paper, AMICA was run with more models, nodes, and data than in your case, so I don't think there is a limit built into the code. There may be an issue with the cluster settings. The error you get apparently occurs before any output, perhaps while the data is being distributed. I thought it printed something before doing the MPI reduce etc. for getting the means and covariance.
Fatal error in PMPI_Reduce: Unknown error class, error stack:
PMPI_Reduce(1258).....................: MPI_Reduce(sbuf=0x7ff454872a00, rbuf=0x7ff456e109c0, count=64800, MPI_DOUBLE_PRECISION, MPI_SUM, root=0, comm=0x84000001) failed
MPIR_Reduce_impl(1070)................:
MPIR_Reduce_intra(868)................:
MPIR_Reduce_redscat_gather(622).......:
MPIDU_Complete_posted_with_error(1137): Process failed
Do the errors you sent occur immediately, or after some normal output? If the latter, could you send the errors in context?
Please paste the whole output from the crash, so I can check the initial output.
Processing arguments ...
num_files = 1
FILES:
/mnt/growler/barleyhome/vlawhern/tmpdata63236.fdt
num_dir_files = 1
initial matrix block_size = 128
do_opt_block = 0
blk_min = 256
blk_step = 256
blk_max = 1024
number of models = 8
max_thrds = 20
use_min_dll = 1
min dll = 1.000000000000000E-009
use_grad_norm = 1
min grad norm = 1.000000000000000E-007
number of density mixture components = 3
pdf type = 0
max_iter = 2000
num_samples = 1
data_dim = 90
field_dim = 1744896
do_history = 0
histstep = 10
share_comps = 0
share_start = 100
comp_thresh = 0.990000000000000
share_int = 100
initial lrate = 5.000000000000000E-002
minimum lrate = 1.000000000000000E-008
minimum data covariance eigenvalue = 1.000000000000000E-012
lrate factor = 0.500000000000000
initial rholrate = 5.000000000000000E-002
rho0 = 1.50000000000000
min rho = 1.00000000000000
max rho = 2.00000000000000
rho lrate factor = 0.500000000000000
kurt_start = 3
num kurt = 5
kurt interval = 1
do_newton = 1
newt_start = 50
newt_ramp = 10
initial newton lrate = 1.00000000000000
do_reject = 1
num reject = 3
reject sigma = 3.00000000000000
reject start = 2
reject interval = 3
write step = 20
write_nd = 0
write_LLt = 1
dec window = 1
max_decs = 3
fix_init = 0
update_A = 1
update_c = 1
update_gm = 1
update_alpha = 1
update_mu = 1
update_beta = 1
invsigmax = 100.000000000000
invsigmin = 0.000000000000000E+000
do_rho = 1
load_rej = 0
load_c = 0
load_gm = 0
load_alpha = 0
load_mu = 0
load_beta = 0
load_rho = 0
load_comp_list = 0
do_mean = 1
do_sphere = 1
pcakeep = 90
pcadb = 30.0000000000000
byte_size = 4
doscaling = 1
scalestep = 1
mkdir: cannot create directory '/mnt/growler/barleyhome/vlawhern/amicaouttmp/': File exists
output directory = /mnt/growler/barleyhome/vlawhern/amicaouttmp/
1 : setting num_thrds to 20 ...
2 : setting num_thrds to 20 ...
1 : using 20 threads.
2 : using 20 threads.
1 : node_thrds = 20 20
bytes in real = 1
1 : REAL nbyte = 1
getting segment list ...
blocks in sample = 1744896
total blocks = 1744896
node blocks = 872448 872448
node 1 start: file 1 sample 1 index
1
node 1 stop : file 1 sample 1 index
872448
node 2 start: file 1 sample 1 index
872449
node 2 stop : file 1 sample 1 index
1744896
1 : data = -2.10404443740845 -2.00284409523010
2 : data = 2.67275834083557 2.30718922615051
getting the mean ...
mean = -8.140678184955731E-002 -4.924335967420339E-002
1.604046472164474E-002
subtracting the mean ...
getting the covariance matrix ...
cnt = 1744896
doing eig nx = 90 lwork = 81000
doing eig nx = 90 lwork = 81000
minimum eigenvalues = 3.979936589306554E-003 4.130480204090477E-003
4.307901288673726E-003
maximum eigenvalues = 2663.49088331849 1448.05399130587
850.039381071617
num eigs kept = 90
getting the sphering matrix ...
minimum eigenvalues = 3.979936589306554E-003 4.130480204090477E-003
4.307901288673726E-003
maximum eigenvalues = 2663.49088331849 1448.05399130587
850.039381071617
num eigs kept = 90
sphering the data ...
minimum eigenvalues = 3.979936589306554E-003 4.130480204090477E-003
4.307901288673726E-003
maximum eigenvalues = 2663.49088331849 1448.05399130587
850.039381071617
num eigs kept = 0
numeigs = 90
1 : Allocating variables ...
2 : Allocating variables ...
1 : Initializing variables ...
2 : Initializing variables ...
1 : block size = 128
2 : block size = 128
1 : entering the main loop ...
2 : entering the main loop ...
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
amica15ub 00000000013E61B0 Unknown Unknown Unknown
amica15ub 0000000000591297 Unknown Unknown Unknown
amica15ub 00000000004D731C Unknown Unknown Unknown
amica15ub 00000000004EFCFA Unknown Unknown Unknown
amica15ub 00000000004F0081 Unknown Unknown Unknown
amica15ub 00000000004D64FE Unknown Unknown Unknown
amica15ub 000000000048D296 Unknown Unknown Unknown
amica15ub 000000000047B29E Unknown Unknown Unknown
amica15ub 00000000004798B1 Unknown Unknown Unknown
amica15ub 000000000047D750 Unknown Unknown Unknown
amica15ub 000000000047C875 Unknown Unknown Unknown
amica15ub 000000000046E078 Unknown Unknown Unknown
amica15ub 0000000000463F98 Unknown Unknown Unknown
amica15ub 000000000041E93B Unknown Unknown Unknown
amica15ub 000000000040288D Unknown Unknown Unknown
amica15ub 00000000013DCD4A Unknown Unknown Unknown
amica15ub 00000000013DE5A7 Unknown Unknown Unknown
amica15ub 0000000000402765 Unknown Unknown Unknown
Fatal error in PMPI_Reduce: Unknown error class, error stack:
PMPI_Reduce(1258).....................: MPI_Reduce(sbuf=0x7ff454872a00, rbuf=0x7ff456e109c0, count=64800, MPI_DOUBLE_PRECISION, MPI_SUM, root=0, comm=0x84000001) failed
MPIR_Reduce_impl(1070)................:
MPIR_Reduce_intra(868)................:
MPIR_Reduce_redscat_gather(622).......:
MPIDU_Complete_posted_with_error(1137): Process failed
I think the issue is that you are using too many threads (max_threads). The number of threads should be the number of cores (or fewer) on each node; using more threads may cause problems. Can you try reducing max_threads to 4, 8, or 12?
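A quick, hedged way to check this from MATLAB is to ask every host in the machinefile for its core count; the sketch below assumes passwordless ssh to the hosts and that GNU nproc is installed on each of them:

hostfile = fullfile(getenv('HOME'), 'hostfile');    % same machinefile passed to mpirun
hosts = strsplit(strtrim(fileread(hostfile)));      % one hostname per line
cores = nan(1, numel(hosts));
for k = 1:numel(hosts)
    [status, out] = system(['ssh ' hosts{k} ' nproc']);   % core count on that host
    if status == 0
        cores(k) = str2double(strtrim(out));
    end
end
fprintf('cores per node: %s -> keep max_threads <= %d\n', mat2str(cores), min(cores));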
It may be that the number of threads has to divide the matrix block size (128 in your case). If you use do_opt_block, it will try to optimize the block size over 128-1024 in steps of 128, but it may crash if it tries a block size too large for the nodes. You can also manually set the initial_matrix_block_size to something greater than 128.
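For reference, a minimal sketch of patching those settings directly in the generated parameter file before the mpirun call; the key names used here (block_size, do_opt_block, max_threads) are assumptions based on the values echoed in the output above, so verify the exact spelling in the input.param that your wrapper (e.g. runamica15.m) writes:

paramfile = 'input.param';   % path to the parameter file passed to amica15ub
txt = fileread(paramfile);
txt = regexprep(txt, 'block_size\s+\d+',   'block_size 256');    % assumed key name
txt = regexprep(txt, 'do_opt_block\s+\d+', 'do_opt_block 0');    % assumed key name
txt = regexprep(txt, 'max_threads\s+\d+',  'max_threads 8');     % assumed key name
fid = fopen(paramfile, 'w'); fwrite(fid, txt); fclose(fid);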
So I've tried various combinations of block_size and max_threads, and now I get some different errors, which just points to this being an MPI/cluster issue... our cluster is just 1 Gb Ethernet, so perhaps something is going wrong because of our slow network.
Processing arguments ...
num_files = 1
FILES:
/barleyhome/vlawhern/tmpdata26667.fdt
num_dir_files = 1
initial matrix block_size = 256
do_opt_block = 0
blk_min = 256
blk_step = 256
blk_max = 1024
number of models = 8
max_thrds = 8
use_min_dll = 1
min dll = 9.999999999999999E-022
use_grad_norm = 1
min grad norm = 1.000000000000000E-019
number of density mixture components = 3
pdf type = 0
max_iter = 2000
num_samples = 1
data_dim = 90
field_dim = 1744896
do_history = 0
histstep = 10
share_comps = 0
share_start = 100
comp_thresh = 0.990000000000000
share_int = 100
initial lrate = 5.000000000000000E-002
minimum lrate = 1.000000000000000E-008
minimum data covariance eigenvalue = 1.000000000000000E-012
lrate factor = 0.500000000000000
initial rholrate = 5.000000000000000E-002
rho0 = 1.50000000000000
min rho = 1.00000000000000
max rho = 2.00000000000000
rho lrate factor = 0.500000000000000
kurt_start = 3
num kurt = 5
kurt interval = 1
do_newton = 1
newt_start = 50
newt_ramp = 10
initial newton lrate = 1.00000000000000
do_reject = 1
num reject = 3
reject sigma = 3.00000000000000
reject start = 2
reject interval = 3
write step = 20
write_nd = 0
write_LLt = 1
dec window = 1
max_decs = 3
fix_init = 0
update_A = 1
update_c = 1
update_gm = 1
update_alpha = 1
update_mu = 1
update_beta = 1
invsigmax = 100.000000000000
invsigmin = 0.000000000000000E+000
do_rho = 1
load_rej = 0
load_c = 0
load_gm = 0
load_alpha = 0
load_mu = 0
load_beta = 0
load_rho = 0
load_comp_list = 0
do_mean = 1
do_sphere = 1
pcakeep = 90
pcadb = 30.0000000000000
byte_size = 4
doscaling = 1
scalestep = 1
mkdir: cannot create directory '/barleyhome/vlawhern/amicaouttmp/': File exists
output directory = /barleyhome/vlawhern/amicaouttmp/
1 : setting num_thrds to 8 ...
2 : setting num_thrds to 8 ...
3 : setting num_thrds to 8 ...
4 : setting num_thrds to 8 ...
1 : using 8 threads.
3 : using 8 threads.
2 : using 8 threads.
4 : using 8 threads.
1 : node_thrds = 8 8 8 8
bytes in real = 1
1 : REAL nbyte = 1
getting segment list ...
blocks in sample = 1744896
total blocks = 1744896
node blocks = 436224 436224 436224 436224
node 1 start: file 1 sample 1 index
1
node 1 stop : file 1 sample 1 index
436224
node 2 start: file 1 sample 1 index
436225
node 2 stop : file 1 sample 1 index
872448
node 3 start: file 1 sample 1 index
872449
node 3 stop : file 1 sample 1 index
1308672
node 4 start: file 1 sample 1 index
1308673
node 4 stop : file 1 sample 1 index
1744896
4 : data = 1.82832920551300 1.67610311508179
3 : data = 2.67275834083557 2.30718922615051
2 : data = 0.152651995420456 -3.927214443683624E-002
1 : data = -2.10404443740845 -2.00284409523010
getting the mean ...
mean = -8.140678184955731E-002 -4.924335967420339E-002
1.604046472164474E-002
subtracting the mean ...
getting the covariance matrix ...
cnt = 1744896
doing eig nx = 90 lwork = 81000
doing eig nx = 90 lwork = 81000
minimum eigenvalues = 3.979936589245825E-003 4.130480204099402E-003
4.307901288574601E-003
maximum eigenvalues = 2663.49088331849 1448.05399130587
850.039381071617
num eigs kept = 90
getting the sphering matrix ...
doing eig nx = 90 lwork = 81000
doing eig nx = 90 lwork = 81000
minimum eigenvalues = 3.979936589245825E-003 4.130480204099402E-003
4.307901288574601E-003
maximum eigenvalues = 2663.49088331849 1448.05399130587
850.039381071617
num eigs kept = 90
sphering the data ...
minimum eigenvalues = 3.979936589245825E-003 4.130480204099402E-003
4.307901288574601E-003
minimum eigenvalues = 3.979936589245825E-003 4.130480204099402E-003
4.307901288574601E-003
minimum eigenvalues = 3.979936589245825E-003 4.130480204099402E-003
4.307901288574601E-003
maximum eigenvalues = 2663.49088331849 1448.05399130587
850.039381071617
num eigs kept = 0
maximum eigenvalues = 2663.49088331849 1448.05399130587
850.039381071617
num eigs kept = 0
maximum eigenvalues = 2663.49088331849 1448.05399130587
850.039381071617
num eigs kept = 0
numeigs = 90
3 : Allocating variables ...
1 : Allocating variables ...
4 : Allocating variables ...
2 : Allocating variables ...
4 : Initializing variables ...
3 : Initializing variables ...
2 : Initializing variables ...
1 : Initializing variables ...
1 : block size = 256
2 : block size = 256
3 : block size = 256
4 : block size = 256
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
amica15ub 00000000013E61B0 Unknown Unknown Unknown
amica15ub 00000000004C07A4 Unknown Unknown Unknown
amica15ub 00000000004CB61B Unknown Unknown Unknown
amica15ub 00000000004CDF08 Unknown Unknown Unknown
amica15ub 000000000048D1C5 Unknown Unknown Unknown
amica15ub 0000000000472D69 Unknown Unknown Unknown
amica15ub 000000000046DF8C Unknown Unknown Unknown
amica15ub 000000000041E393 Unknown Unknown Unknown
amica15ub 000000000040288D Unknown Unknown Unknown
amica15ub 00000000013DCD4A Unknown Unknown Unknown
amica15ub 00000000013DE5A7 Unknown Unknown Unknown
amica15ub 0000000000402765 Unknown Unknown Unknown
Fatal error in PMPI_Barrier: Unknown error class, error stack:
PMPI_Barrier(425).....................: MPI_Barrier(comm=0x84000001) failed
MPIR_Barrier_impl(332)................: Failure during collective
MPIR_Barrier_impl(327)................:
MPIR_Barrier(292).....................:
MPIR_Barrier_intra(169)...............:
MPIDU_Complete_posted_with_error(1137): Process failed
Fatal error in PMPI_Barrier: Unknown error class, error stack:
PMPI_Barrier(425).....................: MPI_Barrier(comm=0x84000001) failed
MPIR_Barrier_impl(332)................: Failure during collective
MPIR_Barrier_impl(327)................:
MPIR_Barrier(292).....................:
MPIR_Barrier_intra(169)...............:
MPIDU_Complete_posted_with_error(1137): Process failed
Fatal error in PMPI_Barrier: Unknown error class, error stack:
PMPI_Barrier(425)......: MPI_Barrier(comm=0x84000001) failed
MPIR_Barrier_impl(332).: Failure during collective
MPIR_Barrier_impl(327).:
MPIR_Barrier(292)......:
MPIR_Barrier_intra(180): Failure during collective
There may be an issue with the sphering. It seems to be duplicating the sphering over nodes, which may indicate that MPI is not working correctly. Are you using the MPI from Intel oneAPI, MPICH2, or another implementation?
So I managed to test this on a dedicated HPC platform (using the HPC system's MPI implementation, which is based on Intel oneAPI) across 4 nodes with AMICA at 8 models, and I get the same errors as above. That seems to suggest the MPI implementation is not the problem, even though all the errors point to MPI.
I still find it suspicious that everything works if I set pcakeep to 32 (i.e., do a lot of dimension reduction) with AMICA at 8 models across 8 nodes. It also works if I just select 32 of the 90 channels for the decomposition, with no PCA dimension reduction. Once I hit around 40-50 channels, things fail. So it might be something with my data... if I check the rank of the data it reports 90, and the eigenvalues all appear to be fine (above, say, a 1E-7 tolerance). I'll have to see if I can find a similarly sized open-source dataset to test this further.
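For reference, a minimal sketch of this kind of whole-dataset rank/eigenvalue check, assuming data is the channels x samples EEG array before any reduction:

X  = double(data);                        % data: channels x samples (assumed variable)
ev = sort(eig(cov(X')), 'descend');       % eigenvalues of the channel covariance
fprintf('rank = %d of %d channels, smallest eigenvalue = %g\n', rank(X), size(X,1), ev(end));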
Setting num_threads=1 produces a slightly different set of MPI errors, but in general I always get either the "MPI_Barrier" error or something with "MPI_Reduce"... the fact that I get different errors with different combinations of models/nodes also seems to point to the data rather than MPI?
Just for completeness, the command I call from MATLAB is
system(['mpirun -n ' num2str(numprocs) ' -machinefile ~/hostfile ~/software/amica/amica15ub ' outdir 'input.param' ]);
where hostfile is just a list of hostnames, one per line.
Processing arguments ...
num_files = 1
FILES:
/mnt/growler/barleyhome/vlawhern/tmpdata84913.fdt
num_dir_files = 1
initial matrix block_size = 128
do_opt_block = 0
blk_min = 256
blk_step = 256
blk_max = 1024
number of models = 8
max_thrds = 1
use_min_dll = 1
min dll = 1.000000000000000E-009
use_grad_norm = 1
min grad norm = 1.000000000000000E-007
number of density mixture components = 3
pdf type = 0
max_iter = 2000
num_samples = 1
data_dim = 90
field_dim = 1744896
do_history = 0
histstep = 10
share_comps = 0
share_start = 100
comp_thresh = 0.990000000000000
share_int = 100
initial lrate = 5.000000000000000E-002
minimum lrate = 1.000000000000000E-008
minimum data covariance eigenvalue = 1.000000000000000E-012
lrate factor = 0.500000000000000
initial rholrate = 5.000000000000000E-002
rho0 = 1.50000000000000
min rho = 1.00000000000000
max rho = 2.00000000000000
rho lrate factor = 0.500000000000000
kurt_start = 3
num kurt = 5
kurt interval = 1
do_newton = 1
newt_start = 50
newt_ramp = 10
initial newton lrate = 1.00000000000000
do_reject = 0
num reject = 3
reject sigma = 3.00000000000000
reject start = 2
reject interval = 3
write step = 20
write_nd = 0
write_LLt = 1
dec window = 1
max_decs = 3
fix_init = 0
update_A = 1
update_c = 1
update_gm = 1
update_alpha = 1
update_mu = 1
update_beta = 1
invsigmax = 100.000000000000
invsigmin = 0.000000000000000E+000
do_rho = 1
load_rej = 0
load_c = 0
load_gm = 0
load_alpha = 0
load_mu = 0
load_beta = 0
load_rho = 0
load_comp_list = 0
do_mean = 1
do_sphere = 1
pcakeep = 64
pcadb = 30.0000000000000
byte_size = 4
doscaling = 1
scalestep = 1
mkdir: cannot create directory '/mnt/growler/barleyhome/vlawhern/amicaouttmp/': File exists
output directory = /mnt/growler/barleyhome/vlawhern/amicaouttmp/
1 : setting num_thrds to 1 ...
3 : setting num_thrds to 1 ...
2 : setting num_thrds to 1 ...
4 : setting num_thrds to 1 ...
2 : using 1 threads.
3 : using 1 threads.
1 : using 1 threads.
4 : using 1 threads.
1 : node_thrds = 1 1 1 1
bytes in real = 1
1 : REAL nbyte = 1
getting segment list ...
blocks in sample = 1744896
total blocks = 1744896
node blocks = 436224 436224 436224 436224
node 1 start: file 1 sample 1 index
1
node 1 stop : file 1 sample 1 index
436224
node 2 start: file 1 sample 1 index
436225
node 2 stop : file 1 sample 1 index
872448
node 3 start: file 1 sample 1 index
872449
node 3 stop : file 1 sample 1 index
1308672
node 4 start: file 1 sample 1 index
1308673
node 4 stop : file 1 sample 1 index
1744896
4 : data = 1.82832920551300 1.67610311508179
3 : data = 2.67275834083557 2.30718922615051
2 : data = 0.152651995420456 -3.927214443683624E-002
1 : data = -2.10404443740845 -2.00284409523010
getting the mean ...
mean = -8.140678184955731E-002 -4.924335967420339E-002
1.604046472164474E-002
subtracting the mean ...
getting the covariance matrix ...
cnt = 1744896
doing eig nx = 90 lwork = 81000
doing eig nx = 90 lwork = 81000
doing eig nx = 90 lwork = 81000
minimum eigenvalues = 3.979936589214525E-003 4.130480204143833E-003
4.307901288597552E-003
doing eig nx = 90 lwork = 81000
minimum eigenvalues = 3.979936589214525E-003 4.130480204143833E-003
4.307901288597552E-003
minimum eigenvalues = 3.979936589214525E-003 4.130480204143833E-003
4.307901288597552E-003
minimum eigenvalues = 3.979936589214525E-003 4.130480204143833E-003
4.307901288597552E-003
maximum eigenvalues = 2663.49088331849 1448.05399130587
850.039381071617
num eigs kept = 0
maximum eigenvalues = 2663.49088331849 1448.05399130587
850.039381071617
num eigs kept = 0
maximum eigenvalues = 2663.49088331849 1448.05399130587
850.039381071617
num eigs kept = 0
maximum eigenvalues = 2663.49088331849 1448.05399130587
850.039381071617
num eigs kept = 64
getting the sphering matrix ...
minimum eigenvalues = 3.979936589214525E-003 4.130480204143833E-003
4.307901288597552E-003
maximum eigenvalues = 2663.49088331849 1448.05399130587
850.039381071617
num eigs kept = 64
sphering the data ...
numeigs = 64
1 : Allocating variables ...
3 : Allocating variables ...
2 : Allocating variables ...
4 : Allocating variables ...
3 : Initializing variables ...
2 : Initializing variables ...
4 : Initializing variables ...
1 : Initializing variables ...
1 : block size = 128
3 : block size = 128
2 : block size = 128
4 : block size = 128
4 : entering the main loop ...
1 : entering the main loop ...
3 : entering the main loop ...
2 : entering the main loop ...
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
amica15ub 000000000116EB35 Unknown Unknown Unknown
amica15ub 000000000116C8F7 Unknown Unknown Unknown
amica15ub 0000000001122954 Unknown Unknown Unknown
amica15ub 0000000001122766 Unknown Unknown Unknown
amica15ub 00000000010D4D19 Unknown Unknown Unknown
amica15ub 00000000010D8F90 Unknown Unknown Unknown
amica15ub 00000000005D21F0 Unknown Unknown Unknown
amica15ub 00000000005993FC Unknown Unknown Unknown
amica15ub 00000000004DC39A Unknown Unknown Unknown
amica15ub 00000000004F5900 Unknown Unknown Unknown
amica15ub 00000000004F5C41 Unknown Unknown Unknown
amica15ub 00000000004DB633 Unknown Unknown Unknown
amica15ub 000000000048E485 Unknown Unknown Unknown
amica15ub 000000000047AB46 Unknown Unknown Unknown
amica15ub 00000000004790B9 Unknown Unknown Unknown
amica15ub 000000000047CF50 Unknown Unknown Unknown
amica15ub 000000000047C0D4 Unknown Unknown Unknown
amica15ub 000000000046CD18 Unknown Unknown Unknown
amica15ub 0000000000462215 Unknown Unknown Unknown
amica15ub 0000000000417EA3 Unknown Unknown Unknown
amica15ub 00000000004021DE Unknown Unknown Unknown
amica15ub 000000000118C1A4 Unknown Unknown Unknown
amica15ub 00000000004020C1 Unknown Unknown Unknown
Fatal error in PMPI_Reduce: Unknown error class, error stack:
PMPI_Reduce(1258).....................: MPI_Reduce(sbuf=0x7fb954cbe010, rbuf=0x7fb954cff010, count=32768, MPI_DOUBLE_PRECISION, MPI_SUM, root=0, comm=0x84000001) failed
MPIR_Reduce_impl(1070)................:
MPIR_Reduce_intra(868)................:
MPIR_Reduce_redscat_gather(469).......:
MPIDU_Complete_posted_with_error(1137): Process failed
MPIR_Reduce_redscat_gather(605).......:
MPIC_Send(300)........................:
MPID_Send(75).........................: Communication error with rank 1
Fatal error in PMPI_Reduce: Unknown error class, error stack:
PMPI_Reduce(1258).....................: MPI_Reduce(sbuf=0x7fa27dbd7010, rbuf=0x7fa27dc18010, count=32768, MPI_DOUBLE_PRECISION, MPI_SUM, root=0, comm=0x84000001) failed
MPIR_Reduce_impl(1070)................:
MPIR_Reduce_intra(868)................:
MPIR_Reduce_redscat_gather(622).......:
MPIDU_Complete_posted_with_error(1137): Process failed
Interestingly, if I keep re-running the same command, once in a while it will work (maybe once out of 10 attempts). So it's possible that something with the random initialization of the weights is causing issues?
Given the dependence on channel number, I think it might be an issue with stack overflow. Can you try setting the initial matrix block size to 32 and see if that improves the success rate?
Just getting back to this... unfortunately, setting block_size to 32 doesn't fix it.
To summarize the issue: I can run AMICA on a single node with any combination of channels/models without issue, but when I start distributing the AMICA model I run into problems that are solved if I reduce the number of channels, the number of models, and/or the number of nodes. Generally speaking, the more models I try to fit, the fewer nodes I can use. The same goes for the number of channels: if I distribute the model across many nodes, I sometimes have to reduce the channels for it to run.
Strangely enough, AMICA will sometimes work without having to reduce channels/nodes/models (maybe 1 out of 20 attempts), which suggests possibly some kind of bad initialization of variables?
I can't rule out the data itself as the cause, however.
On one run that did start, it threw NaNs though:
iter 1 lrate = 0.0500000000 LL = -1.4122133988 nd = 0.0377896272, D = 0.34899E-01 0.32313E-01 ( 13.41 s, 0.1 h)
iter 2 lrate = 0.0500000000 LL = -1.3408517205 nd = 0.0458156159, D = 0.35362E-01 0.32421E-01 ( 13.62 s, 0.1 h)
Doing rejection ....
maximum likelihood value = -0.825738566146084
minimum likelihood value = -33.4286597657578
average likelihood value = -1.34085172047297
standard deviation = 0.331891165020141
rejecting data with likelihood less than -2.33652521553340
rejected 17341 data points so far. Will perform rejection 2
more times at intervals of 3 iterations.
iter 3 lrate = 0.0500000000 LL = -1.2840568489 nd = 0.0584500863, D = 0.16122E+00 0.42687E-01 ( 13.62 s, 0.1 h)
iter 4 lrate = 0.0500000000 LL = -1.2541466667 nd = 0.0716875099, D = 0.53389E+00 0.10074E+00 ( 13.89 s, 0.1 h)
iter 5 lrate = 0.0500000000 LL = -1.2235514437 nd = 0.0770498115, D = 0.12020E+01 0.24694E+00 ( 14.01 s, 0.1 h)
Doing rejection ....
maximum likelihood value = -0.433675846405276
minimum likelihood value = -2.30432150510246
average likelihood value = -1.22355144373999
standard deviation = 0.221314189217103
rejecting data with likelihood less than -1.88749401139129
rejected 24580
data points so far. Will perform rejection one more time after 3
iterations.
iter 6 lrate = 0.0500000000 LL = -1.1898742223 nd = 0.0795726861, D = 0.22151E+01 0.51242E+00 ( 14.01 s, 0.1 h)
iter 7 lrate = 0.0500000000 LL = -1.1590266760 nd = 0.0807673208, D = 0.35559E+01 0.87024E+00 ( 14.00 s, 0.1 h)
iter 8 lrate = 0.0500000000 LL = -1.1299826128 nd = 0.0809009705, D = 0.52734E+01 0.13227E+01 ( 14.00 s, 0.0 h)
Doing rejection ....
maximum likelihood value = -0.308914812838824
minimum likelihood value = -2.10933240322047
average likelihood value = -1.12998261283671
standard deviation = 0.206985385449063
rejecting data with likelihood less than -1.75093876918390
rejected 33264 data points. No further rejections will be performed.
iter 9 lrate = 0.0500000000 LL = -1.0981002609 nd = 0.0806803867, D = 0.72688E+01 0.18587E+01 ( 14.01 s, 0.0 h)
iter 10 lrate = 0.0500000000 LL = -1.0701106300 nd = 0.0879733818, D = 0.94008E+01 0.24690E+01 ( 13.82 s, 0.0 h)
iter 11 lrate = 0.0500000000 LL = -1.0699184876 nd = 0.3447542953, D = 0.11750E+02 0.31434E+01 ( 13.81 s, 0.0 h)
iter 12 lrate = 0.0500000000 LL = -1.0346399046 nd = 0.1538665440, D = 0.16827E+02 0.38720E+01 ( 13.97 s, 0.0 h)
iter 13 lrate = 0.0500000000 LL = -1.0143801026 nd = 0.4059233348, D = 0.18721E+02 0.46582E+01 ( 13.98 s, 0.0 h)
iter 14 lrate = 0.0500000000 LL = -1.0762979759 nd = 0.9911425666, D = 0.24802E+02 0.54779E+01 ( 13.91 s, 0.0 h)
Likelihood decreasing!
iter 15 lrate = 0.0500000000 LL = -1.0585870497 nd = 1.4571456080, D = 0.37602E+02 0.63122E+01 ( 13.96 s, 0.0 h)
iter 16 lrate = 0.0500000000 LL = -1.0390555955 nd = 3.9317816941, D = 0.44403E+02 0.72538E+01 ( 13.85 s, 0.0 h)
iter 17 lrate = 0.0500000000 LL = -1.1500486359 nd = 11.7298001155, D = 0.10362E+03 0.81841E+01 ( 13.86 s, 0.0 h)
Likelihood decreasing!
iter 18 lrate = 0.0500000000 LL = -1.0736096696 nd = *************, D = 0.21085E+03 0.91530E+01 ( 13.91 s, 0.0 h)
iter 19 lrate = 0.0500000000 LL = NaN nd = NaN, D = 0.46975E+03 0.10185E+02 ( 13.83 s, 0.0 h)
Got NaN! Exiting ...
So maybe this again points to the data itself as the problem.
So just circling back on this thread... it turns out our data is the problem. Checking the rank of the data in smaller chunks (as opposed to checking the rank of the full dataset) shows that the data isn't always full rank. Our processing pipeline includes artifact rejection with ASR (artifact subspace reconstruction), and it's likely this is causing the issue: ASR uses PCA to remove components, and many components are being removed in certain sections of the data. I now strongly believe the issue is that, when AMICA runs in distributed mode, each worker gets a contiguous chunk of the data, and because of ASR each chunk is likely not full rank; that is what is causing all the issues discussed in this thread.
More workers -> each worker gets a smaller chunk of the data -> more likely to run into rank-issues for at least one of the workers -> thus causing crashes.
I wonder if there should be a rank-check of the data for each chunk when AMICA is being run in distributed mode to avoid this issue going forward.
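A minimal sketch of such a per-chunk rank check, assuming data is the channels x samples array and that the split mimics the contiguous, roughly equal "node blocks" segmentation shown in out.txt:

N = 8;                                            % planned number of MPI workers
edges = round(linspace(0, size(data,2), N+1));    % contiguous, roughly equal chunks
chunk_rank = zeros(1, N);
for k = 1:N
    chunk_rank(k) = rank(double(data(:, edges(k)+1:edges(k+1))));
end
fprintf('per-chunk ranks: %s (min = %d)\n', mat2str(chunk_rank), min(chunk_rank));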
However, I do have a question: do you have any intuition about the behavior of AMICA decompositions when the data is not full rank across subsets of the data? Say for 64-channel data you check the rank in 30-second chunks and get a range of something like [16, 64]. If you then fit AMICA with 32 components, in the sections where the data is truly rank 16 the 32-component decomposition could look really strange. Should you fit to min(rank of the chunks) to guarantee that each worker gets "good" (i.e., full-rank) data?