magland / isosplit5_python


Warning before abort never gets printed #1

Open eann opened 6 years ago

eann commented 6 years ago

Hi Jeremy, I was using MountainSort and I noticed that during the sorting of some data some of the processes seemed to hang for no apparent reason. After some digging I found out that it was due to the presence of duplicate events and the subsequent abort. It took me some time to track down because the corresponding error message never gets printed: it is written to stdout just before the thread aborts. I linked below the line where this happens. I solved the issue (and determined the cause of my problem) by substituting printf(...) with fprintf(stderr, ...).

https://github.com/magland/isosplit5_python/blob/a56142d134f3d8707aa3418d310512e09cbe228f/src/isosplit5.cpp#L1036

By the way, how can I deal with the issue of duplicate events?

cheers, Eis

magland commented 6 years ago

Thanks for that find! Would you mind making a pull request for that fix?

Regarding duplicate events... that's a tricky issue. I've thought about it a lot. One very quick fix (but not so elegant) is to add some very small random (or deterministic) perturbation to the data in order to guarantee non-duplicates. This could either be done on input, or within the clustering itself.
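For illustration, a minimal numpy sketch of what that perturbation could look like (the function name, array orientation, and scale are just assumptions, not anything from the actual code):

```python
import numpy as np

def perturb_features(features, rel_scale=1e-6, seed=0):
    """Add a tiny jitter so that no two events are exactly identical.

    `features` is assumed to be an (n_events, n_dims) array of PCA features;
    the jitter is scaled relative to the spread of each dimension so it
    cannot meaningfully influence the clustering.
    """
    rng = np.random.default_rng(seed)
    scale = rel_scale * features.std(axis=0, keepdims=True)
    return features + rng.normal(0.0, 1.0, features.shape) * scale
```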

Your thoughts?

eann commented 6 years ago

Pull request done!

As for the duplicates, adding a small amount of noise would certainly be a quick and automatic solution. It shouldn't materially affect the result of the clustering, and could be done just after the PCA in ms4; that way all the spikes would be equally affected by the perturbation. Another option would be to simply eliminate duplicates just before running the clustering (e.g. using np.unique), but that would be slower and raises the problem of tracking the indices of the duplicate spikes.
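Roughly what the np.unique route could look like (just a sketch; the array orientation and names are assumptions):

```python
import numpy as np

def deduplicate_events(features):
    """Drop exact duplicate events before clustering, keeping the mapping back.

    Returns the unique feature rows, the original indices of the kept events,
    and the inverse mapping, so each original event can be assigned the label
    of its unique representative after clustering.
    """
    unique_features, keep_idx, inverse = np.unique(
        features, axis=0, return_index=True, return_inverse=True)
    return unique_features, keep_idx, inverse

# After clustering the unique rows:
#   labels_full = labels_unique[inverse]
```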

I don't know how feasible it is, but since the code is C++, another solution could be to add a really small amount of noise only when there are duplicate events. In order to do this, instead of aborting, the function should throw an exception, and the merge_test call should sit in a try-catch block; if an exception is raised because the average covariance matrix is non-invertible, the two covariance matrices are recomputed after noise is added. In this way only clusters containing duplicate events are ever touched, at the cost of slightly increased processing time (which shouldn't be an issue, as long as duplicates occur with low probability).
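The actual change would of course live in the C++ code, but here is a rough Python sketch of the control flow I have in mind (all function and variable names are made up for illustration):

```python
import numpy as np

def safe_avg_inverse_covariance(x1, x2, noise_scale=1e-9, max_tries=3, seed=0):
    """Invert the averaged covariance of two clusters, perturbing only on failure.

    x1, x2: (n_events, n_dims) arrays for the two clusters being merge-tested.
    If the averaged covariance is singular (e.g. because of duplicate events),
    a tiny amount of noise is added to the two clusters and the covariances
    are recomputed; clusters without duplicates are never touched.
    """
    rng = np.random.default_rng(seed)
    for _ in range(max_tries):
        avg_cov = 0.5 * (np.cov(x1, rowvar=False) + np.cov(x2, rowvar=False))
        try:
            return np.linalg.inv(avg_cov)
        except np.linalg.LinAlgError:
            x1 = x1 + rng.normal(0.0, noise_scale, x1.shape)
            x2 = x2 + rng.normal(0.0, noise_scale, x2.shape)
    raise RuntimeError("covariance still singular after perturbation")
```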

Let me know your opinion!

magland commented 6 years ago

Ah, I like your idea of adding some noise only when there are duplicates. That could actually be done in the inner code, so the outer logic (and data) is not affected. BTW, this is only a problem during the parcelate phase (initialization). It should be a relatively quick fix, I believe, if done right.

eann commented 6 years ago

There might also be another way to solve the issue of duplicate events: when the software is unable to compute the inverse of the covariance matrix, it could switch to the Moore-Penrose pseudoinverse instead and use that in the following steps. Since it involves computing the SVD of the covariance matrix it's more expensive than simply adding noise, but it might be more elegant.
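A sketch of that fallback, again in numpy just to show the idea (np.linalg.pinv does essentially the same SVD-based computation internally, so it could be used directly):

```python
import numpy as np

def covariance_pseudoinverse(cov, rcond=1e-12):
    """Moore-Penrose pseudoinverse of a covariance matrix via SVD.

    Singular values below rcond * (largest singular value) are treated as
    zero, which is what makes this work even when duplicate events make
    the covariance matrix singular.
    """
    u, s, vt = np.linalg.svd(cov)
    cutoff = rcond * s.max()
    s_inv = np.where(s > cutoff, 1.0 / s, 0.0)
    return (vt.T * s_inv) @ u.T
```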

What do you think?