pyxem / orix

Analysing crystal orientations and symmetry in Python
https://orix.readthedocs.io
GNU General Public License v3.0

Physical meaning of eps parameter in DBScan #467

Open maclariz opened 9 months ago

maclariz commented 9 months ago

Does anyone have any insight on the physical meaning of the eps parameter in the DBSCAN algorithm and how this relates to angular spread in a cluster?

hakonanes commented 9 months ago

It's the minimal misorientation angle in radians between two points in a cluster. I.e. if two points are further away than eps, they cannot be part of the same cluster.

maclariz commented 9 months ago

So, I just had a look at the usage in the paper I just had reviewer comments on.

eps was set to 15 degrees (converted to radians).

But the 3 sigma value for the deviation of orientations within each cluster varied from 1.4 degrees to 4.2 degrees. That is, I calculated the standard deviation of the angle in the axis-angle pair between each quaternion in the cluster and the average orientation of the cluster, so (assuming a roughly normal spread) 99.7% of orientations in any cluster lie within somewhere between 1.4 and 4.2 degrees of the cluster average. In other words, the resulting clusters are far tighter than the eps threshold. Similar analysis could (and probably should) be done in other cases. I think eps has to be treated as a useful adjustable parameter, but it's better to then analyse the final results to see the real scatter in the data.
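A minimal sketch of the spread calculation described above, using NumPy only. The cluster here is synthetic (a toy noise model around the identity, not the data from the paper), crystal symmetry is ignored, and the simple renormalised quaternion mean is an assumption that is only reasonable for a tight cluster. The empirical 99.7th percentile is printed next to the 3 sigma value because angles to the mean are non-negative and need not be normally distributed.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic cluster: unit quaternions near the identity (toy noise model).
q = np.array([1.0, 0.0, 0.0, 0.0]) + 0.005 * rng.normal(size=(500, 4))
q /= np.linalg.norm(q, axis=1, keepdims=True)

# Put all quaternions in the same hemisphere, then average and renormalise.
# This simple mean is only adequate for a tight cluster (assumption).
q[q @ q[0] < 0] *= -1
q_mean = q.mean(axis=0)
q_mean /= np.linalg.norm(q_mean)

# Rotation angle (radians) between each orientation and the cluster mean.
angles = 2 * np.arccos(np.clip(np.abs(q @ q_mean), 0.0, 1.0))

three_sigma_deg = np.rad2deg(3 * angles.std())
p997_deg = np.rad2deg(np.percentile(angles, 99.7))
print(f"3 sigma of angle to mean: {three_sigma_deg:.2f} deg")
print(f"99.7th percentile of angle to mean: {p997_deg:.2f} deg")
```

Comparing either number with the eps used for clustering shows how much tighter the clusters are than the threshold.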

And looking again, the definition suggests that it is the maximal misorientation angle, not minimal.

Best wishes

Ian

hakonanes commented 9 months ago

You are right, it's the maximum distance (sklearn docs). Sorry for being too quick there.
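As a rough illustration of eps as a maximum misorientation angle in radians, the sketch below runs scikit-learn's DBSCAN on a precomputed matrix of pairwise angles. The data are synthetic, the helper `scatter_about` is hypothetical, and crystal symmetry is ignored; with real data a symmetry-aware distance matrix (e.g. one computed with orix) would take the place of `D`.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

def scatter_about(mean_q, sigma, n):
    """Synthetic unit quaternions close to mean_q (toy noise model, an assumption)."""
    q = np.asarray(mean_q, dtype=float) + sigma * rng.normal(size=(n, 4))
    return q / np.linalg.norm(q, axis=1, keepdims=True)

# Two synthetic orientation clusters, 60 degrees apart about the z axis.
half = np.deg2rad(60) / 2
q = np.vstack([
    scatter_about([1, 0, 0, 0], 0.005, 200),
    scatter_about([np.cos(half), 0, 0, np.sin(half)], 0.005, 200),
])

# Pairwise misorientation angle in radians, ignoring crystal symmetry.
# abs() treats q and -q as the same rotation.
D = 2 * np.arccos(np.clip(np.abs(q @ q.T), 0.0, 1.0))
np.fill_diagonal(D, 0.0)

# eps is the largest angle (radians) at which two points still count as neighbours.
eps = np.deg2rad(3)
labels = DBSCAN(eps=eps, min_samples=40, metric="precomputed").fit(D).labels_

n_clusters = len(set(labels) - {-1})  # label -1 marks noise
n_noise = int(np.sum(labels == -1))
print(f"eps = {eps:.3f} rad: {n_clusters} clusters, {n_noise} noise points")
```

Two points further apart than eps can still end up in the same cluster if a chain of core points connects them, so eps bounds the neighbour distance rather than the cluster diameter.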

CSSFrancis commented 9 months ago

I've found this wikipedia section to be quite useful: https://en.wikipedia.org/wiki/DBSCAN#Abstract_algorithm

Just a note that there are 2 parameters which are important for DBSCAN: the eps value, which is the maximum distance, and min_samples, which describes the minimum number of points that must lie within eps of a point (a circle, sphere or hypersphere, depending on your dimensionality) for that point to be identified as a core point. Once the core points are identified, core points within eps of each other are merged into clusters, and non-core points are added if they are within eps of a core point. In general min_samples should be greater than or equal to the number of dimensions + 1. Actually n+1 is a good place to start.

For eps, a couple of different estimation methods which might be of interest are described here: https://en.wikipedia.org/wiki/DBSCAN#Parameter_estimation. Additionally, methods like OPTICS, which tend to do better when clusters have varying densities, might be of use.
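One of the estimation approaches on that page is the k-distance heuristic: sort the distance from every point to its k-th nearest neighbour and look for a knee in the curve. A minimal NumPy sketch under stated assumptions follows: the data are synthetic, `scatter_about` is a hypothetical helper, crystal symmetry is ignored, and the knee still has to be judged by eye (e.g. by plotting `k_dist`).

```python
import numpy as np

rng = np.random.default_rng(2)

def scatter_about(mean_q, sigma, n):
    """Synthetic unit quaternions close to mean_q (toy noise model, an assumption)."""
    q = np.asarray(mean_q, dtype=float) + sigma * rng.normal(size=(n, 4))
    return q / np.linalg.norm(q, axis=1, keepdims=True)

half = np.deg2rad(60) / 2
q = np.vstack([
    scatter_about([1, 0, 0, 0], 0.005, 200),
    scatter_about([np.cos(half), 0, 0, np.sin(half)], 0.005, 200),
])

# Pairwise misorientation angle in radians, ignoring crystal symmetry.
D = 2 * np.arccos(np.clip(np.abs(q @ q.T), 0.0, 1.0))
np.fill_diagonal(D, 0.0)

# scikit-learn counts the point itself in min_samples, so after sorting each
# row (column 0 is the point itself) column min_samples - 1 holds the distance
# at which the point would just become a core point.
min_samples = 40
k_dist = np.sort(np.sort(D, axis=1)[:, min_samples - 1])

for p in (50, 90, 99):
    print(f"{p}th percentile of k-distance: {np.rad2deg(np.percentile(k_dist, p)):.2f} deg")
```

An eps a little above the knee of the sorted `k_dist` curve makes most points in dense regions core points while leaving sparse outliers as noise.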

pc494 commented 9 months ago

Hi Ian, sorry I'm a bit late to this one.

When we wrote Density Based Clustering ... Johnstone et al. we thought about this a bit and in the end decided to start with a number that accurately answers the question:

"Given our (perceived) measurement errors, what's the furthest apart two data points could be (i.e. the smallest rotation that maps A to B) where you could still convince us they belonged in the same cluster?"

which for our dataset was 0.05 radians (i.e. ~3 degrees) given the high-quality EBSD map we had at hand. Then run the algorithms, inspect the real/orientation space maps, and see where that leaves you.

15 degrees does seem a bit too generous though unless you're working with really really noisy data. You may well end up merging two physically distinct clusters if you're unlucky.

maclariz commented 9 months ago

@pc494 took a break for the holidays there. I initially took the naive view that a small misorientation should be sensible, but on some datasets I merely ended up with lots of clusters that were all versions of the same thing, with just minor misorientations between them. Even with the 15 degree criterion, this dataset splits some laths of the same orientation into two clusters, probably due to some sample bending. It is trivial to see in a pole figure that they belong to the same cluster, however. A minor issue which may influence this, but is unproven at this point, is that any orientation has two possible habit planes, and in real samples there could be a slight misorientation between two laths with the two different habit planes.

On this specific dataset I played with eps and found that reducing to 3 degrees (in radians) made no difference to results and exactly the same clusters were found. I played with min_samples and found that increasing this from 40 to 100 just deleted the smallest cluster and that lath, with no other change, so that was unhelpful. Decreasing from 40 to 10 found one extra cluster that was a variant indexing of an existing cluster with a slight tilt and just filled in a few points around the edges of one of the laths, so no significant benefit to overall interpretation.
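A parameter sweep of this kind can be sketched as below. The data are synthetic (two large clusters and one deliberately small one), `scatter_about` is a hypothetical helper, and crystal symmetry is ignored, so the numbers only illustrate the mechanics and say nothing about the dataset discussed here.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(3)

def scatter_about(mean_q, sigma, n):
    """Synthetic unit quaternions close to mean_q (toy noise model, an assumption)."""
    q = np.asarray(mean_q, dtype=float) + sigma * rng.normal(size=(n, 4))
    return q / np.linalg.norm(q, axis=1, keepdims=True)

c, s = np.cos(np.deg2rad(30)), np.sin(np.deg2rad(30))
q = np.vstack([
    scatter_about([1, 0, 0, 0], 0.005, 200),
    scatter_about([c, 0, 0, s], 0.005, 200),  # 60 degrees about z
    scatter_about([c, s, 0, 0], 0.005, 60),   # 60 degrees about x, small cluster
])

# Pairwise misorientation angle in radians, ignoring crystal symmetry.
D = 2 * np.arccos(np.clip(np.abs(q @ q.T), 0.0, 1.0))
np.fill_diagonal(D, 0.0)

n_clusters = {}
for eps_deg in (3, 15):
    for min_samples in (10, 40, 100):
        labels = DBSCAN(
            eps=np.deg2rad(eps_deg), min_samples=min_samples, metric="precomputed"
        ).fit(D).labels_
        n_clusters[(eps_deg, min_samples)] = len(set(labels) - {-1})
        print(f"eps = {eps_deg:2d} deg, min_samples = {min_samples:3d}: "
              f"{n_clusters[(eps_deg, min_samples)]} clusters, "
              f"{int(np.sum(labels == -1))} noise points")
```

In this toy case the clusters are tight and well separated, so eps of 3 or 15 degrees gives the same count, while min_samples = 100 removes the 60-point cluster because no point in it can have 100 neighbours.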

With the benefit of this, I will revise paper text slightly and update the eps parameter seeing as the smaller one works here.

Ian