stan-dev / cmdstanpy

CmdStanPy is a lightweight interface to Stan for Python users which provides the necessary objects and functions to compile a Stan program and fit the model to data using CmdStan.
BSD 3-Clause "New" or "Revised" License

Stan terminating with no error messages on CentOS Linux 7 (Core) cluster, but runs on my personal device (Ubuntu LTS 22.04) #753

Open Garren-H opened 4 months ago

Garren-H commented 4 months ago

Description

Hi all. I ran a Stan program on a cluster running CentOS Linux 7 (Core), but Stan terminated without any warning or error message.

The Stan code is:

```
functions {
  vector NRTL(vector x, vector T, vector p12, vector p21, real a,
              matrix map_tij, matrix map_tij_dT) {
    int N = rows(x);
    vector[N] t12 = map_tij * p12;
    vector[N] t21 = map_tij * p21;
    vector[N] dt12_dT = map_tij_dT * p12;
    vector[N] dt21_dT = map_tij_dT * p21;
    vector[N] at12 = a * t12;
    vector[N] at21 = a * t21;
    vector[N] G12 = exp(-at12);
    vector[N] G21 = exp(-at21);
    vector[N] term1 = ( ( (1-x) .* G12 .* (1 - at12) + x .* square(G12) ) ./ square((1-x) + x .* G12) ) .* dt12_dT;
    vector[N] term2 = ( ( x .* G21 .* (1 - at21) + (1-x) .* square(G21) ) ./ square(x + (1-x) .* G21) ) .* dt21_dT;
    return -8.314 * square(T) .* x .* (1-x) .* ( term1 + term2 );
  }

  real ps_like(array[] int N_slice, int start, int end, vector y, vector x,
               vector T, array[] matrix U_raw, array[] matrix V_raw,
               vector v_ARD, vector v, vector scaling, real a, real error,
               array[] int N_points, array[,] int Idx_known,
               array[] matrix mapping, vector var_data) {
    real all_target = 0;
    for (i in start:end) {
      vector[4] p12_raw;
      vector[4] p21_raw;
      vector[N_points[i]] y_std = sqrt(var_data[sum(N_points[:i-1])+1:sum(N_points[:i])]+v[i]);
      vector[N_points[i]] y_means;

      for (j in 1:4) {
        p12_raw[j] = dot_product(U_raw[j,:,Idx_known[i,1]] .* v_ARD, V_raw[j,:,Idx_known[i,2]]);
        p21_raw[j] = dot_product(U_raw[j,:,Idx_known[i,2]] .* v_ARD, V_raw[j,:,Idx_known[i,1]]);
      }

      y_means = NRTL(x[sum(N_points[:i-1])+1:sum(N_points[:i])],
                     T[sum(N_points[:i-1])+1:sum(N_points[:i])],
                     p12_raw, p21_raw, a,
                     mapping[1][sum(N_points[:i-1])+1:sum(N_points[:i]),:],
                     mapping[2][sum(N_points[:i-1])+1:sum(N_points[:i]),:]);
      all_target += normal_lpdf(y[sum(N_points[:i-1])+1:sum(N_points[:i])] | y_means, y_std);
    }
    return all_target;
  }
}

data {
  int N_known;                     // number of known data points
  array[N_known] int N_points;     // number of data points in each known data set
  vector[sum(N_points)] x;         // mole fraction
  vector[sum(N_points)] T;         // temperature
  vector[sum(N_points)] y;         // excess enthalpy
  vector[4] scaling;               // scaling factor for NRTL parameter
  real a;                          // alpha value for NRTL model
  int grainsize;                   // grainsize for parallelization
  int N;                           // number of compounds
  int D;                           // number of features
  array[N_known,2] int Idx_known;  // indices of known data points
  vector[N_known] v;               // known data-model variance
}

transformed data {
  real error = 0.01;                                 // error in the data (fraction of experimental data)
  vector[sum(N_points)] var_data = square(error*y);  // variance of the data
  array[2] matrix[sum(N_points),4] mapping;          // temperature mapping
  array[N_known] int N_slice;                        // slice indices for parallelization

  for (i in 1:N_known) {
    N_slice[i] = i;
  }

  mapping[1] = append_col(append_col(append_col(rep_vector(1.0, sum(N_points)), T), 1.0 ./ T), log(T)); // mapping for tij
  mapping[1] = mapping[1] .* rep_matrix(scaling', sum(N_points)); // scaling the mapping

  mapping[2] = append_col(append_col(append_col(rep_vector(0.0, sum(N_points)), rep_vector(1.0, sum(N_points))), -1.0 ./ square(T)), 1.0 ./ T); // mapping for dtij_dT
  mapping[2] = mapping[2] .* rep_matrix(scaling', sum(N_points)); // scaling the mapping
}

parameters {
  array[4] matrix[D,N] U_raw;  // feature matrices U
  array[4] matrix[D,N] V_raw;  // feature matrices V
  real scale;                  // scale dictating the strength of ARD effect
  vector[D] v_ARD;             // ARD variances arranged in increasing order with lower bound zero
}

model {
  // Gamma Prior for scale
  profile("Scale Prior"){
    scale ~ gamma(1e-9, 1e-9);
  }

  // ARD Exponential prior
  profile("ARD Prior"){
    v_ARD ~ exponential(scale);
  }

  // Priors for feature matrices
  profile("Feature Matrices"){
    for (i in 1:4) {
      to_vector(U_raw[i]) ~ std_normal();
      to_vector(V_raw[i]) ~ std_normal();
    }
  }

  // Likelihood function
  profile("Likelihood"){
    target += reduce_sum(ps_like, N_slice, grainsize, y, x, T, U_raw,
                         V_raw, v_ARD, v, scaling, a, error, N_points,
                         Idx_known, mapping, var_data);
  }
}
```
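For readers following the indexing in `ps_like`: each data set `i` occupies rows `sum(N_points[:i-1])+1` through `sum(N_points[:i])` of the stacked vectors. The Python sketch below reproduces only that bookkeeping, using made-up set sizes, so the ranges can be inspected outside Stan. The helper name `slice_bounds` and the example sizes are illustrative assumptions, not part of the model.

```python
from itertools import accumulate


def slice_bounds(n_points):
    """Return 1-based inclusive (first, last) row ranges, one per data set.

    Mirrors the Stan expressions sum(N_points[:i-1])+1 and sum(N_points[:i]).
    """
    ends = list(accumulate(n_points))  # running totals, i.e. sum(N_points[:i])
    starts = [end - n + 1 for end, n in zip(ends, n_points)]
    return list(zip(starts, ends))


# Hypothetical sizes for three data sets (not taken from data.json)
n_points = [3, 5, 2]
print(slice_bounds(n_points))  # [(1, 3), (4, 8), (9, 10)]
```

Every range should be contiguous with the previous one, and the last upper bound should equal `sum(N_points)`.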
The model compiled and even completed the preliminary gradient evaluations. The (relevant) sample Python code is:

```
print('Step1: Sampling sort chain using random initialization')
fit = model.sample(data=f'{path}/data.json', output_dir=output_dir1,
                   refresh=1, iter_warmup=5000, iter_sampling=1000,
                   chains=chains, parallel_chains=chains,
                   threads_per_chain=threads_per_chain, max_treedepth=5,
                   metric='dense_e', save_profile=True, sig_figs=18,
                   show_console=True)
```
The Stan `.txt` output file shows:

```
method = sample (Default)
  sample
    num_samples = 1000 (Default)
    num_warmup = 5000
    save_warmup = 0 (Default)
    thin = 1 (Default)
    adapt
      engaged = 1 (Default)
      gamma = 0.050000 (Default)
      delta = 0.800000 (Default)
      kappa = 0.750000 (Default)
      t0 = 10.000000 (Default)
      init_buffer = 75 (Default)
      term_buffer = 50 (Default)
      window = 25 (Default)
      save_metric = 0 (Default)
    algorithm = hmc (Default)
      hmc
        engine = nuts (Default)
          nuts
            max_depth = 5
        metric = dense_e
        metric_file = (Default)
        stepsize = 1.000000 (Default)
        stepsize_jitter = 0.000000 (Default)
    num_chains = 8
id = 1 (Default)
data
  file = Subsets/Alkane_Primary alcohol/Include_clusters_False/Variance_known_True/rank_1/data.json
init = 2 (Default)
random
  seed = 96157
output
  file = /mnt/lustre/users/ghermanus/Hybrid PMF/Subsets/Alkane_Primary alcohol/Include_clusters_False/Variance_known_True/rank_1/Step1/Hybrid_PMF-20240521214809.csv
  diagnostic_file = (Default)
  refresh = 1
  sig_figs = 18
  profile_file = /mnt/lustre/users/ghermanus/Hybrid PMF/Subsets/Alkane_Primary alcohol/Include_clusters_False/Variance_known_True/rank_1/Step1/Hybrid_PMF-20240521214809-profile.csv
  save_cmdstan_config = 0 (Default)
num_threads = 24 (Default)

Gradient evaluation took 0.009228 seconds
1000 transitions using 10 leapfrog steps per transition would take 92.28 seconds.
Adjust your expectations accordingly!

Gradient evaluation took 0.00178 seconds
1000 transitions using 10 leapfrog steps per transition would take 17.8 seconds.
Adjust your expectations accordingly!

Gradient evaluation took 0.001501 seconds
1000 transitions using 10 leapfrog steps per transition would take 15.01 seconds.
Adjust your expectations accordingly!

Gradient evaluation took 0.001735 seconds
1000 transitions using 10 leapfrog steps per transition would take 17.35 seconds.
Adjust your expectations accordingly!

Gradient evaluation took 0.001593 seconds
1000 transitions using 10 leapfrog steps per transition would take 15.93 seconds.
Adjust your expectations accordingly!

Gradient evaluation took 0.001275 seconds
1000 transitions using 10 leapfrog steps per transition would take 12.75 seconds.
Adjust your expectations accordingly!

Gradient evaluation took 0.001483 seconds
1000 transitions using 10 leapfrog steps per transition would take 14.83 seconds.
Adjust your expectations accordingly!

Gradient evaluation took 0.00134 seconds
1000 transitions using 10 leapfrog steps per transition would take 13.4 seconds.
Adjust your expectations accordingly!
```
And the output from `show_console=True` is:

```
Evaluating the following conditions for the Hybrid Model:
Include clusters: False
Variance known: True
Lower rank of feature matrices: 1
Step1: Sampling sort chain using random initialization
method = sample (Default)
  sample
    num_samples = 1000 (Default)
    num_warmup = 5000
    save_warmup = 0 (Default)
    thin = 1 (Default)
    adapt
      engaged = 1 (Default)
      gamma = 0.050000 (Default)
      delta = 0.800000 (Default)
      kappa = 0.750000 (Default)
      t0 = 10.000000 (Default)
      init_buffer = 75 (Default)
      term_buffer = 50 (Default)
      window = 25 (Default)
      save_metric = 0 (Default)
    algorithm = hmc (Default)
      hmc
        engine = nuts (Default)
          nuts
            max_depth = 5
        metric = dense_e
        metric_file = (Default)
        stepsize = 1.000000 (Default)
        stepsize_jitter = 0.000000 (Default)
    num_chains = 8
id = 1 (Default)
data
  file = Subsets/Alkane_Primary alcohol/Include_clusters_False/Variance_known_True/rank_1/data.json
init = 2 (Default)
random
  seed = 96157
output
  file = /mnt/lustre/users/ghermanus/Hybrid PMF/Subsets/Alkane_Primary alcohol/Include_clusters_False/Variance_known_True/rank_1/Step1/Hybrid_PMF-20240521214809.csv
  diagnostic_file = (Default)
  refresh = 1
  sig_figs = 18
  profile_file = /mnt/lustre/users/ghermanus/Hybrid PMF/Subsets/Alkane_Primary alcohol/Include_clusters_False/Variance_known_True/rank_1/Step1/Hybrid_PMF-20240521214809-profile.csv
  save_cmdstan_config = 0 (Default)
num_threads = 24 (Default)

Gradient evaluation took 0.009228 seconds
1000 transitions using 10 leapfrog steps per transition would take 92.28 seconds.
Adjust your expectations accordingly!

Gradient evaluation took 0.00178 seconds
1000 transitions using 10 leapfrog steps per transition would take 17.8 seconds.
Adjust your expectations accordingly!

Gradient evaluation took 0.001501 seconds
1000 transitions using 10 leapfrog steps per transition would take 15.01 seconds.
Adjust your expectations accordingly!

Gradient evaluation took 0.001735 seconds
1000 transitions using 10 leapfrog steps per transition would take 17.35 seconds.
Adjust your expectations accordingly!

Gradient evaluation took 0.001593 seconds
1000 transitions using 10 leapfrog steps per transition would take 15.93 seconds.
Adjust your expectations accordingly!

Gradient evaluation took 0.001275 seconds
1000 transitions using 10 leapfrog steps per transition would take 12.75 seconds.
Adjust your expectations accordingly!

Gradient evaluation took 0.001483 seconds
1000 transitions using 10 leapfrog steps per transition would take 14.83 seconds.
Adjust your expectations accordingly!

Gradient evaluation took 0.00134 seconds
1000 transitions using 10 leapfrog steps per transition would take 13.4 seconds.
Adjust your expectations accordingly!
```
The standard error file shows:

```
21:47:31 - cmdstanpy - INFO - compiling stan file /home/ghermanus/lustre/tmphhck335m/tmpvaov4lbc.stan to exe file /mnt/lustre/users/ghermanus/Hybrid PMF/Subsets/Alkane_Primary alcohol/Include_clusters_False/Variance_known_True/rank_1/Hybrid_PMF
21:48:09 - cmdstanpy - INFO - compiled model executable: /mnt/lustre/users/ghermanus/Hybrid PMF/Subsets/Alkane_Primary alcohol/Include_clusters_False/Variance_known_True/rank_1/Hybrid_PMF
21:48:09 - cmdstanpy - INFO - CmdStan start processing
21:48:10 - cmdstanpy - INFO - CmdStan done processing
21:48:10 - cmdstanpy - ERROR - CmdStan error: terminated by signal 11 Unknown error -11
Traceback (most recent call last):
  File "/mnt/lustre/users/ghermanus/Hybrid PMF/Hybrid_PMF.py", line 147, in <module>
    fit = model.sample(data=f'{path}/data.json', output_dir=output_dir1,
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ghermanus/cmdstan_condaforge/lib/python3.12/site-packages/cmdstanpy/model.py", line 1136, in sample
    raise RuntimeError(msg)
RuntimeError: Error during sampling:

Command and output files:
RunSet: chains=8, chain_ids=[1, 2, 3, 4, 5, 6, 7, 8], num_processes=1
 cmd (chain 1):
    ['/mnt/lustre/users/ghermanus/Hybrid PMF/Subsets/Alkane_Primary alcohol/Include_clusters_False/Variance_known_True/rank_1/Hybrid_PMF', 'id=1', 'random', 'seed=96157', 'data', 'file=Subsets/Alkane_Primary alcohol/Include_clusters_False/Variance_known_True/rank_1/data.json', 'output', 'file=/mnt/lustre/users/ghermanus/Hybrid PMF/Subsets/Alkane_Primary alcohol/Include_clusters_False/Variance_known_True/rank_1/Step1/Hybrid_PMF-20240521214809.csv', 'profile_file=/mnt/lustre/users/ghermanus/Hybrid PMF/Subsets/Alkane_Primary alcohol/Include_clusters_False/Variance_known_True/rank_1/Step1/Hybrid_PMF-20240521214809-profile.csv', 'refresh=1', 'sig_figs=18', 'method=sample', 'num_samples=1000', 'num_warmup=5000', 'algorithm=hmc', 'engine=nuts', 'max_depth=5', 'metric=dense_e', 'adapt', 'engaged=1', 'num_chains=8']
 retcodes=[-11]
 per-chain output files (showing chain 1 only):
 csv_file:
    /mnt/lustre/users/ghermanus/Hybrid PMF/Subsets/Alkane_Primary alcohol/Include_clusters_False/Variance_known_True/rank_1/Step1/Hybrid_PMF-20240521214809_1.csv
 profile_file:
    /mnt/lustre/users/ghermanus/Hybrid PMF/Subsets/Alkane_Primary alcohol/Include_clusters_False/Variance_known_True/rank_1/Step1/Hybrid_PMF-20240521214809-profile_1.csv
 console_msgs (if any):
    /mnt/lustre/users/ghermanus/Hybrid PMF/Subsets/Alkane_Primary alcohol/Include_clusters_False/Variance_known_True/rank_1/Step1/Hybrid_PMF-20240521214809-stdout.txt
```
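The `retcodes=[-11]` entry can be decoded with the Python standard library: on POSIX systems a negative return code `-N` from a child process means it was killed by signal `N`. The small helper below (the name `describe_retcode` is an assumption made for illustration) maps the number to its symbolic name. It only names the signal and says nothing about why the process received it.

```python
import signal


def describe_retcode(retcode):
    """Translate a subprocess-style return code into a readable message."""
    if retcode < 0:
        # Negative codes mean "terminated by signal -retcode" on POSIX.
        sig = signal.Signals(-retcode)
        return f"terminated by signal {sig.value} ({sig.name})"
    return f"exited with status {retcode}"


print(describe_retcode(-11))  # terminated by signal 11 (SIGSEGV)
```

Signal 11 is `SIGSEGV`, a segmentation fault in the compiled model executable.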

Nothing indicates that PBS terminated the job either. I currently have the same code running on the cluster with different values of D; the failing case above uses D=1. The command `qstat -fx <JOBID>` yields the comment

comment = Job run at Tue May 21 at 21:47 on (cnode0897:ncpus=24:mem=1572864
    0kb) and finished

This indicates that neither the admins nor I terminated the job.

Running the job on my own device with the same data (and the same seed that failed) does not reproduce the error. I have attached a JSON file with the data used: data.json
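When comparing the two machines, one way to rule out a mismatch in the input file is to check it against the dimensions declared in the `data` block. The sketch below is a convenience check only (Stan does its own validation when it reads the data); the function name `check_data` and the tiny example dictionary are assumptions for illustration, not the contents of the attached data.json.

```python
def check_data(data):
    """Return a list of mismatches between `data` and the Stan data block."""
    problems = []
    n_known = data["N_known"]
    total = sum(data["N_points"])

    if len(data["N_points"]) != n_known:
        problems.append("N_points must have N_known entries")
    for name in ("x", "T", "y"):
        if len(data[name]) != total:
            problems.append(f"{name} must have sum(N_points) = {total} entries")
    if len(data["scaling"]) != 4:
        problems.append("scaling must have 4 entries")
    if len(data["v"]) != n_known:
        problems.append("v must have N_known entries")
    if len(data["Idx_known"]) != n_known:
        problems.append("Idx_known must have N_known rows")
    for row in data["Idx_known"]:
        # Idx_known selects columns of the D x N feature matrices (1-based).
        if len(row) != 2 or not all(1 <= idx <= data["N"] for idx in row):
            problems.append(f"Idx_known row {row} must hold two indices in 1..N")
    return problems


# Hypothetical miniature data set; load the real file with json.load instead.
example = {
    "N_known": 2, "N_points": [2, 1],
    "x": [0.1, 0.5, 0.9], "T": [298.15, 298.15, 313.15], "y": [10.0, 20.0, 5.0],
    "scaling": [1.0, 1.0, 1.0, 1.0], "a": 0.3, "grainsize": 1,
    "N": 3, "D": 1, "Idx_known": [[1, 2], [2, 3]], "v": [0.1, 0.1],
}
print(check_data(example))  # []
```

An empty list means the shapes agree with the declarations, which would point the investigation away from the data file and toward the environment on the cluster.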

Current Version:

cluster:

cmdstan 2.34.0 hff4ab46_0 conda-forge

cmdstanpy 1.2.2 pyhd8ed1ab_0 conda-forge