ucl-pond / pySuStaIn

Subtype and Stage Inference (SuStaIn) algorithm with an example using simulated data.
MIT License

Out of bounds issue in: self._optimise_parameters() #16

Closed · rperea14 closed this issue 3 years ago

rperea14 commented 3 years ago

Hi all, after I initialize the ZscoreSustain model, I run into the following index-out-of-bounds exception:

Finding ML solution to 1 cluster problem
/camhpc/mspaths/data/dev/pySuStaIn/pySuStaIn/ZscoreSustain.py:238: RuntimeWarning: invalid value encountered in true_divide
  p_perm_k_norm                       = p_perm_k_weighted / np.sum(p_perm_k_weighted, axis=(1,2), keepdims=True)
Traceback (most recent call last):
  File "try0_mspaths_sustain.py", line 151, in <module>
    sustain_input.run_sustain_algorithm()
  File "/camhpc/mspaths/data/dev/pySuStaIn/pySuStaIn/AbstractSustain.py", line 144, in run_sustain_algorithm
    ml_likelihood_mat_EM        = self._estimate_ml_sustain_model_nplus1_clusters(self.__sustainData, ml_sequence_prev_EM, ml_f_prev_EM) #self.__estimate_ml_sustain_model_nplus1_clusters(self.__data, ml_sequence_prev_EM, ml_f_prev_EM)
  File "/camhpc/mspaths/data/dev/pySuStaIn/pySuStaIn/AbstractSustain.py", line 552, in _estimate_ml_sustain_model_nplus1_clusters
    ml_likelihood_mat               = self._find_ml(sustainData)
  File "/camhpc/mspaths/data/dev/pySuStaIn/pySuStaIn/AbstractSustain.py", line 643, in _find_ml
    pool_output_list                = list(pool_output_list)
  File "/camhpc/mspaths/data/dev/pySuStaIn/pySuStaIn/AbstractSustain.py", line 676, in _find_ml_iteration
    _                               = self._perform_em(sustainData, seq_init, f_init)
  File "/camhpc/mspaths/data/dev/pySuStaIn/pySuStaIn/AbstractSustain.py", line 828, in _perform_em
    candidate_likelihood            = self._optimise_parameters(sustainData, current_sequence, current_f)
  File "/camhpc/mspaths/data/dev/pySuStaIn/pySuStaIn/ZscoreSustain.py", line 304, in _optimise_parameters
    this_S                      = this_S[0, :]
IndexError: index 0 is out of bounds for axis 0 with size 0

My current ZscoreSustain class initialization is as follows:

# Start pySuStaIn
import os
import numpy as np
from pySuStaIn.ZscoreSustain import ZscoreSustain

N = 4 # number of biomarkers
SuStaInLabels = ['Bio1', 'Bio2', 'Bio3', 'Bio4'] # biomarker labels

unt_data = np.vstack((biom_1, biom_2, biom_3, biom_4)) # each biomarker z-scored
data = np.transpose(unt_data)                          # data.shape returns (5123, 4)
Z_vals = np.array([[1, 2, 3]] * N)                     # z-score stage thresholds; Z_vals.shape returns (4, 3)
Z_max = np.array([np.max(biom_1), np.max(biom_2),
                  np.max(biom_3), np.max(biom_4)])     # Z_max.shape returns (4,)

# Snippet of my prepared data for SuStaIn:
N_S_gt = 3 # number of ground-truth subtypes
N_startpoints = 10
N_S_max = N_S_gt + 1
N_iterations_MCMC = int(1e4)
output_folder = os.path.join(os.path.dirname(__file__), 'rp_mspaths_sim')
if not os.path.isdir(output_folder):
    os.mkdir(output_folder)
dataset_name = 'sim'
sustain_input = ZscoreSustain(data,
                              Z_vals,
                              Z_max,
                              SuStaInLabels,
                              N_startpoints,
                              N_S_max,
                              N_iterations_MCMC,
                              output_folder,
                              dataset_name,
                              False)
sustain_input.run_sustain_algorithm()

After spending some time attempting to debug, I believe the issue is related to the following line returning an array of NaNs:

f_opt = (np.squeeze(sum(sum(p_perm_k_norm))) / sum(sum(sum(p_perm_k_norm)))).reshape(N_S, 1, 1)

    def _optimise_parameters(self, sustainData, S_init, f_init):
        # Optimise the parameters of the SuStaIn model

        M                                   = sustainData.getNumSamples()   #data_local.shape[0]
        N_S                                 = S_init.shape[0]
        N                                   = self.stage_zscore.shape[1]

        S_opt                               = S_init.copy()  # have to copy or changes will be passed to S_init
        f_opt                               = np.array(f_init).reshape(N_S, 1, 1)
        f_val_mat                           = np.tile(f_opt, (1, N + 1, M))
        f_val_mat                           = np.transpose(f_val_mat, (2, 1, 0))
        p_perm_k                            = np.zeros((M, N + 1, N_S))

        for s in range(N_S):
            p_perm_k[:, :, s]               = self._calculate_likelihood_stage(sustainData, S_opt[s])

        p_perm_k_weighted                   = p_perm_k * f_val_mat
        p_perm_k_norm                       = p_perm_k_weighted / np.sum(p_perm_k_weighted, axis=(1,2), keepdims=True)
        f_opt                               = (np.squeeze(sum(sum(p_perm_k_norm))) / sum(sum(sum(p_perm_k_norm)))).reshape(N_S, 1, 1)
        f_val_mat                           = np.tile(f_opt, (1, N + 1, M))
        f_val_mat                           = np.transpose(f_val_mat, (2, 1, 0))
        order_seq                           = np.random.permutation(N_S)  # this will produce different random numbers to Matlab
        ...
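
To illustrate what I think is going on, here is a standalone toy example (my own sketch, not pySuStaIn code): if every stage likelihood for a subject underflows to zero, the per-subject normalisation divides by zero and produces NaNs, along with the same true_divide RuntimeWarning as in the traceback above.

import numpy as np

M, N, N_S = 3, 4, 1                      # subjects, z-score events, subtypes
p_perm_k_weighted = np.random.rand(M, N + 1, N_S)
p_perm_k_weighted[0] = 0.0               # subject 0: all stage likelihoods underflowed

# Same normalisation as in _optimise_parameters; emits the same RuntimeWarning
p_perm_k_norm = p_perm_k_weighted / np.sum(p_perm_k_weighted, axis=(1, 2), keepdims=True)
print(np.isnan(p_perm_k_norm[0]).all())  # True -> the NaNs then propagate into f_opt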

Now, this issue occurs at a seemingly random iteration, as I have replicated by running my script multiple times. Any insights on how to fix or troubleshoot this problem? Thanks in advance.

sea-shunned commented 3 years ago

Hi!

This issue can occur when there is missing data. The easiest way to both check for and address this would be to use the ZScoreSustainMissingData variant that was recently merged/released.
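
Something along these lines should be enough to check, and to swap the variant in (a rough sketch; adjust the import to whatever module/class name your version uses):

import numpy as np

print(np.isnan(data).any())  # True would point to missing data as the culprit

# The missing-data variant takes the same constructor arguments as ZscoreSustain
# (hypothetical import path; check your local checkout):
from pySuStaIn.ZscoreSustainMissingData import ZScoreSustainMissingData

sustain_input = ZScoreSustainMissingData(data, Z_vals, Z_max, SuStaInLabels,
                                         N_startpoints, N_S_max, N_iterations_MCMC,
                                         output_folder, dataset_name, False)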

Would you be able to try using that instead and see if that fixes it?

ayoung11 commented 3 years ago

As well as checking whether you have any missing values in your dataset, it would be useful to know the range of the data for each biomarker and the values of Z_max you used. This error can also occur if there are values in your data that are much larger than Z_max.
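
For example, a quick sanity check along these lines (a sketch, reusing the data and Z_max from your snippet):

import numpy as np

# Per-biomarker range vs. the Z_max you passed in
print('min per biomarker:', data.min(axis=0))
print('max per biomarker:', data.max(axis=0))
print('Z_max:            ', Z_max)
print('max exceeds Z_max?', data.max(axis=0) > Z_max)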

rperea14 commented 3 years ago

Thank you @ayoung11, my problem was indeed related to data ranges. One of the columns in my data variable (i.e. one input feature) had a much larger range than the others. After normalizing this column, I was able to run the model successfully.
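
For anyone hitting the same error, the fix looked roughly like this (an illustrative sketch; j = 1 is a placeholder index, and in practice the mean and standard deviation should come from your control population, since SuStaIn expects z-scores relative to controls):

import numpy as np

# Re-standardise the wide-range column so it is on the same scale as the others
j = 1  # placeholder: index of the offending biomarker
data[:, j] = (data[:, j] - data[:, j].mean()) / data[:, j].std()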