Closed 88vikram closed 3 years ago
I've seen something like this. My guess: an array is an int, maybe because you've used too many subtypes?
If this is indeed the cause, then it would be good to handle this.
Hi Neil, I used 4 subtypes, but the dataset is relatively small. I can try to run it for 2 subtypes for now, but it would be nice if the code can handle it automatically when a user gives too many subtypes.
Indeed @88vikram, we've also come across this issue when cross-validating, but it's not easily reproducible. I suspect that it's to do with the random partitions, which is related to my "too many subtypes" guess.
For now, all I can promise is that we're aware of it and will work to fix it, but I have no timeline for you.
Thanks Vikram for raising this issue!
I've reproduced the error and traced it back. Neil is correct that it's related to the random partitions. In AbstractSustain._find_ml_split_iteration, if cluster_assignment consists of only 1 unique value then sustainData.reindex(index_s) results in an empty array, which is then propagated forward and leads to the TypeError.
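For anyone following along, here is a minimal sketch of how a one-sided partition produces an empty subset. It uses plain NumPy boolean indexing as a stand-in for sustainData.reindex(index_s); the array `data` and the hand-written `cluster_assignment` are hypothetical, and this is not the library's actual implementation.

```python
import numpy as np

# Hypothetical stand-in for the per-subject data held by sustainData.
data = np.arange(10).reshape(5, 2)

# A degenerate random partition: every subject lands in cluster 1.
cluster_assignment = np.array([1, 1, 1, 1, 1])

# Select the subjects assigned to each of the two candidate clusters.
subset_1 = data[cluster_assignment == 1]
subset_2 = data[cluster_assignment == 2]

print(subset_1.shape)  # (5, 2)
print(subset_2.shape)  # (0, 2): nothing left to fit for cluster 2
```

Any downstream step that expects at least one row in each subset would then be working on an empty array.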
In my scenario, cluster_assignment consisted of only 2 points, so this was inevitable. Currently, in AbstractSustain._estimate_ml_sustain_model_nplus1_clusters, it seems that any cluster with this_N_cluster > 1 is permitted to be split.
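A rough back-of-the-envelope check of why such a small cluster is fragile, assuming each point is assigned independently and with equal probability to one of two clusters (that is my reading of the random partition, not something confirmed from the code). The helper name `prob_one_cluster_empty` is made up for illustration.

```python
def prob_one_cluster_empty(n_points: int) -> float:
    """Chance that a uniform random 2-way split leaves one cluster empty."""
    # Either all points fall in cluster 1 or all fall in cluster 2,
    # each with probability (1/2) ** n_points.
    return 2 * 0.5 ** n_points

for n in (2, 5, 10):
    print(n, prob_one_cluster_empty(n))
```

Under that assumption a 2-point cluster gives a one-sided split half the time on each draw, so over many random restarts hitting it at least once is close to certain.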
To fix this, one could either increase the cluster size threshold, or resample cluster_assignment until it contains more than one unique value. The latter is meant to happen (here), but a potential error is noted in a comment. The following code fixes that issue:
while min_N_cluster == 0:
    cluster_assignment = np.ceil(N_S * np.random.rand(sustainData.getNumSamples())).astype(int)
    # Count cluster sizes
    # Ignore 0s count with [1:]
    # Guarantee 1s and 2s counts with minlength=3
    cluster_sizes = np.bincount(cluster_assignment, minlength=3)[1:]
    # Get the minimum cluster size
    min_N_cluster = cluster_sizes.min()
Can merge that if Neil etc. happy with the fix.
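If it helps with testing, below is a self-contained sketch of the same resampling idea that runs without a sustainData object. The assumptions are mine: `n_samples` stands in for sustainData.getNumSamples(), N_S is fixed at 2, and labels are drawn with rng.integers rather than np.ceil(N_S * np.random.rand(...)), so this is a swapped-in variant and not the exact code proposed above.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
N_S = 2        # number of clusters to split into (assumed)
n_samples = 2  # hypothetical stand-in for sustainData.getNumSamples()

min_N_cluster = 0
while min_N_cluster == 0:
    # Draw a label in {1, ..., N_S} for every sample (upper bound is exclusive).
    cluster_assignment = rng.integers(1, N_S + 1, size=n_samples)
    # minlength pads missing labels with a zero count; [1:] drops label 0.
    cluster_sizes = np.bincount(cluster_assignment, minlength=N_S + 1)[1:]
    # Keep resampling until every cluster has at least one member.
    min_N_cluster = cluster_sizes.min()

print(cluster_assignment, cluster_sizes)
```

With only 2 samples the loop can only exit once each cluster holds exactly one of them, which is the degenerate case from the report.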
Thanks for finding the issue. I think lines 730-733 in AbstractSuStaIn should be
for s in range(1, N_S + 1):
    temp_N_cluster[s-1] = np.sum((cluster_assignment == s).astype(int), 0)
min_N_cluster = min(temp_N_cluster)
Would be great if you can check if that works on your test case and fixes your error Vikram and I'll update the code.
Apologies if it wasn't clear, but the code in my previous comment does that fix, but using numpy functions to avoid for loops (for efficiency).
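A quick sanity check that the two approaches count the same thing, using a small hand-written `cluster_assignment` (hypothetical data, with N_S assumed to be 2):

```python
import numpy as np

N_S = 2
cluster_assignment = np.array([1, 2, 2, 1, 2, 2])

# Loop version: count the members of each cluster label 1..N_S.
temp_N_cluster = np.zeros(N_S, dtype=int)
for s in range(1, N_S + 1):
    temp_N_cluster[s - 1] = np.sum(cluster_assignment == s)

# Vectorised version: one bincount call, dropping the count for label 0.
cluster_sizes = np.bincount(cluster_assignment, minlength=N_S + 1)[1:]

print(temp_N_cluster, cluster_sizes)  # both [2 4]
```

Either form yields the same minimum cluster size, so the choice is mostly about readability versus avoiding the Python-level loop.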
Thanks, I did see that, just thought it was simpler to just fix the missing index.
Happy for you to update with your fix if you like.
Hi Leon, Peter, others,
I tried running Sustain on a sporadic AD dataset and it ran well for around 8 hours until it crashed with the following error. It seems like a corner case that doesn't occur very often. I would really appreciate your help in addressing this issue.
Thanks in advance.
Vikram