Open eann opened 6 years ago
Thanks for that find! Would you mind making a pull request for that fix?
Regarding duplicate events... that's a tricky issue. I've thought about it a lot. One very quick fix (but not so elegant) is to add some very small random (or deterministic) perturbation to the data in order to guarantee non-duplicates. This could either be done on input, or within the clustering itself.
Your thoughts?
Pull request done!
As for the duplicates, surely adding a small amount of noise would be a quick and automatic solution to the problem. It shouldn't technically affect the result of the clustering, and could be done just after the PCA in ms4. In this way all the spikes should equally feel the effect of the perturbation. Another way could be to simply eliminate duplicates just before running the clustering (e.g. using np.unique) but it's going to be slower and there is the problem of tracking the indices of duplicate spikes.
I don't know how feasible it is, but since the code is C++, another solution could be to add a really small amount of noise only when there are duplicate events. In order to do this, instead of aborting the function should raise an exception and the merge_test function call should lie in a try-catch block; if the code raises an exception because the average covariance matrix is non invertible the two covariance matrices are recomputed after noise is added. In this way only clusters having duplicate events are ever touched, at the cost of a slightly increased processing time (that shouldn't be an issue, as long as the duplicates occur with low probability).
Let me know your opinion!
ah, I like your idea of adding some noise only when there are duplicates. That actually could be done on the inner code, so the outer logic (and data) is not affected. BTW, this is only a problem during the parcelate phase (initialization). It should be a relatively quick fix, I believe, if done right.
There might also be another technique to solve the issue of duplicate events: when the software is incapable of computing the inverse of the covariance matrix it could switch to the Moore-Penrose inverse instead and use that in the following steps. Since it involved computing the SVD of the covariance matrix it's more expensive than simply adding noise, but it might be more elegant.
What do you think?
Hi Jeremy I was using mountainsort and I noticed that during the sorting of some data some of the processes seemed to hang without a reason. After some digging I found out that it was due to the presence of duplicate events and the subsequent abort. It took me some time to find out beacuse the relative error message never gets printed as it is written to stdout just before the thread aborts. I linked you below the line where this happens, I solved the issue (and determined the reason of my problems) by substituting printf(...) with fprintf(stderr, ...)
https://github.com/magland/isosplit5_python/blob/a56142d134f3d8707aa3418d310512e09cbe228f/src/isosplit5.cpp#L1036
By the way, how can I deal with the issue of duplicate events?
cheers, Eis