ucl-pond / pySuStaIn

Subtype and Stage Inference (SuStaIn) algorithm with an example using simulated data.
MIT License

TypeError: 'int' object is not iterable in ZscoreSustain #17

Closed 88vikram closed 3 years ago

88vikram commented 3 years ago

Hi Leon, Peter, others,

I tried running SuStaIn on a sporadic AD dataset and it ran well for around 8 hours until it crashed with the following error. It seems like a corner-case scenario that doesn't occur very often. I would really appreciate your help in addressing this issue.

Splitting cluster 1 of 3
 + Resolving 2 cluster problem
 + Finding ML solution from hierarchical initialisation
- ML likelihood is [-4233.34529467]
Splitting cluster 2 of 3
 + Resolving 2 cluster problem
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-1-bce0b47cba4f> in <module>
     65                               dataset_name,False)
     66 
---> 67 sustain_input.run_sustain_algorithm()

~/anaconda3/lib/python3.8/site-packages/pySuStaIn/AbstractSustain.py in run_sustain_algorithm(self)
    142                 ml_sequence_mat_EM, \
    143                 ml_f_mat_EM,        \
--> 144                 ml_likelihood_mat_EM        = self._estimate_ml_sustain_model_nplus1_clusters(self.__sustainData, ml_sequence_prev_EM, ml_f_prev_EM) #self.__estimate_ml_sustain_model_nplus1_clusters(self.__data, ml_sequence_prev_EM, ml_f_prev_EM)
    145 
    146                 seq_init                    = ml_sequence_EM

~/anaconda3/lib/python3.8/site-packages/pySuStaIn/AbstractSustain.py in _estimate_ml_sustain_model_nplus1_clusters(self, sustainData, ml_sequence_prev, ml_f_prev)
    584 
    585                     print(' + Resolving 2 cluster problem')
--> 586                     this_ml_sequence_split, _, _, _, _, _ = self._find_ml_split(sustainData_i)
    587 
    588                     # Use the two subtype model combined with the other subtypes to

~/anaconda3/lib/python3.8/site-packages/pySuStaIn/AbstractSustain.py in _find_ml_split(self, sustainData)
    695 
    696         if ~isinstance(pool_output_list, list):
--> 697             pool_output_list                = list(pool_output_list)
    698 
    699         ml_sequence_mat                     = np.zeros((N_S, sustainData.getNumStages(), self.N_startpoints))

~/anaconda3/lib/python3.8/site-packages/pySuStaIn/AbstractSustain.py in _find_ml_split_iteration(self, sustainData, seed_num)
    740 
    741             temp_seq_init                   = self._initialise_sequence(sustainData)
--> 742             seq_init[s, :], _, _, _, _, _   = self._perform_em(temp_sustainData, temp_seq_init, [1])
    743 
    744         f_init                              = np.array([1.] * N_S) / float(N_S)

~/anaconda3/lib/python3.8/site-packages/pySuStaIn/AbstractSustain.py in _perform_em(self, sustainData, current_sequence, current_f)
    826             candidate_sequence,     \
    827             candidate_f,            \
--> 828             candidate_likelihood            = self._optimise_parameters(sustainData, current_sequence, current_f)
    829 
    830             HAS_converged                   = np.fabs((candidate_likelihood - current_likelihood) / max(candidate_likelihood, current_likelihood)) < 1e-6

~/anaconda3/lib/python3.8/site-packages/pySuStaIn/ZscoreSustain.py in _optimise_parameters(self, sustainData, S_init, f_init)
    237         p_perm_k_weighted                   = p_perm_k * f_val_mat
    238         p_perm_k_norm                       = p_perm_k_weighted / np.sum(p_perm_k_weighted, axis=(1,2), keepdims=True)
--> 239         f_opt                               = (np.squeeze(sum(sum(p_perm_k_norm))) / sum(sum(sum(p_perm_k_norm)))).reshape(N_S, 1, 1)
    240         f_val_mat                           = np.tile(f_opt, (1, N + 1, M))
    241         f_val_mat                           = np.transpose(f_val_mat, (2, 1, 0))

TypeError: 'int' object is not iterable

Thanks in advance.

Vikram

noxtoby commented 3 years ago

I've seen something like this before. My guess is that an array somewhere has collapsed to an int, perhaps because you've used too many subtypes?

If this is indeed the cause, then it would be good to handle this.

88vikram commented 3 years ago

Hi Neil, I used 4 subtypes, but the dataset is relatively small. I can try running it with 2 subtypes for now, but it would be nice if the code could handle this automatically when a user specifies too many subtypes.

noxtoby commented 3 years ago

Indeed @88vikram: we've also come across this issue when cross-validating, but it's not easily reproducible. I suspect that it's to do with the random partitions, which is related to my "too many subtypes" guess.

For now, all I can promise is that we're aware of it and will work to fix it, but I have no timeline for you.

sea-shunned commented 3 years ago

Thanks Vikram for raising this issue!

I've reproduced the error and traced it back. Neil is correct that it's related to the random partitions. In AbstractSustain._find_ml_split_iteration, if cluster_assignment contains only one unique value then sustainData.reindex(index_s) results in an empty array, which is then propagated forward and leads to the TypeError.
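
For concreteness, here is a minimal sketch of that failure mode (the shapes below are made up; only the empty subject axis matters): with zero rows, Python's builtin sum returns the int 0, and the next nested sum then raises exactly this TypeError.

import numpy as np

# Hypothetical shape (subjects, stages + 1, subtypes); an empty cluster means 0 subjects
p_perm_k_norm = np.zeros((0, 4, 2))

inner = sum(p_perm_k_norm)   # builtin sum over an empty first axis returns the int 0
sum(inner)                   # TypeError: 'int' object is not iterable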

In my scenario, cluster_assignment consisted of only 2 points, so this was inevitable. Currently, in AbstractSustain._estimate_ml_sustain_model_nplus1_clusters, it seems that only clusters with this_N_cluster > 1 are permitted to be split.

To fix this, one could either increase the cluster-size threshold or resample cluster_assignment until it contains more than one unique value. The latter is meant to happen (here), but a potential error is noted in a comment. The following code fixes that issue:

while min_N_cluster == 0:
    cluster_assignment = np.ceil(N_S * np.random.rand(sustainData.getNumSamples())).astype(int)
    # Count cluster sizes
    # Ignore 0s count with [1:]
    # Guarantee 1s and 2s counts with minlength=3
    cluster_sizes = np.bincount(cluster_assignment, minlength=3)[1:]
    # Get the minimum cluster size
    min_N_cluster = cluster_sizes.min()
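
As a self-contained illustration of what that loop guarantees (the function name and arguments below are hypothetical, not the actual pySuStaIn variables), the resampling simply rejects any draw that leaves a cluster empty:

import numpy as np

def random_partition(n_samples, n_clusters):
    # Draw 1-based cluster labels until every cluster is non-empty
    min_N_cluster = 0
    while min_N_cluster == 0:
        cluster_assignment = np.ceil(n_clusters * np.random.rand(n_samples)).astype(int)
        cluster_sizes = np.bincount(cluster_assignment, minlength=n_clusters + 1)[1:]
        min_N_cluster = cluster_sizes.min()
    return cluster_assignment

# With only 2 samples and 2 clusters, about half of the raw draws put both
# samples in the same cluster; the loop just redraws those.
print(random_partition(2, 2))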

Can merge that if Neil etc. happy with the fix.

ayoung11 commented 3 years ago

Thanks for finding the issue. I think lines 730-733 in AbstractSustain.py should be:

for s in range(1, N_S + 1):
    temp_N_cluster[s-1] = np.sum((cluster_assignment == s).astype(int), 0)
min_N_cluster = min(temp_N_cluster)
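
For reference, a quick standalone check with made-up labels (N_S = 2) shows this counts the same cluster sizes as the bincount version above:

import numpy as np

cluster_assignment = np.array([1, 2, 2, 1, 2])   # hypothetical 1-based labels
N_S = 2

temp_N_cluster = np.zeros(N_S)
for s in range(1, N_S + 1):
    temp_N_cluster[s - 1] = np.sum((cluster_assignment == s).astype(int), 0)

# Same counts as np.bincount(cluster_assignment, minlength=N_S + 1)[1:]
print(temp_N_cluster, min(temp_N_cluster))   # [2. 3.] 2.0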

It would be great if you could check whether that works on your test case and fixes your error, Vikram, and then I'll update the code.

sea-shunned commented 3 years ago

Apologies if it wasn't clear: the code in my previous comment implements that fix, but uses numpy functions instead of for loops (for efficiency).

ayoung11 commented 3 years ago

Thanks, I did see that; I just thought it was simpler to fix the missing index.

ayoung11 commented 3 years ago

Happy for you to update with your fix if you like.