That is what I was doing in my original clustering tests as well:
time_ind = np.multiply(integration_ind,np.power(2,dt_ind))
It's better to cluster candidates by the time of occurrence (with respect to the lowest dt), which would be integration*2^(dt_ind), since integration also scales with dt.
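For concreteness, a tiny worked example of that scaling (the integration and dt indexes here are made up):

```python
import numpy as np

# Made-up candidates: a candidate found at dt_ind=3 (downsampled by 2^3=8) at
# integration 10 corresponds to integration 80 at the native (dt_ind=0) resolution.
integration_ind = np.array([80, 10, 40, 81])
dt_ind = np.array([0, 3, 1, 0])

# Time of occurrence referenced to the lowest dt, as in the line above.
time_ind = np.multiply(integration_ind, np.power(2, dt_ind))
print(time_ind)  # [80 80 80 81] -- the first three now line up in time
```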
Added a `downsample` argument to `rfpipe.candidates.cluster_candidates`. The default is to downsample by 2, which should help cluster things that fall at pixel boundaries (especially spatially).
@KshitijAggarwal I am not getting better clustering with this downsampling feature. I wonder if you have an opinion.
I am implementing a simple integer division of the data array going into hdbscan. So where it has a `fit(data)` method, I am instead doing `fit(data//downsample)`. For `downsample=2`, it will do integer division, so `[0,1,2,3]` goes to `[0,0,1,1]`.
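A minimal sketch of what that looks like (the column layout and candidate values here are made up for illustration, not rfpipe's actual feature array):

```python
import numpy as np
import hdbscan

# Illustrative candidate features; columns here are assumed to be
# (l pixel, m pixel, dt index, integration index).
data = np.array([[100, 200, 0, 80],
                 [101, 201, 0, 80],
                 [100, 200, 1, 40],
                 [300, 150, 0, 12],
                 [301, 150, 0, 12],
                 [300, 151, 0, 13]])

downsample = 2  # the new argument defaults to 2

# Integer division before clustering, as described above: [0,1,2,3] -> [0,0,1,1].
clusterer = hdbscan.HDBSCAN(min_cluster_size=2, metric='hamming')
clusterer.fit(data // downsample)
print(clusterer.labels_)
```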
The problem is that I end up with many more clusters after downsampling. My naive assumption was that the values would be more similar and thus more clustered. The opposite seems to be true.
Any ideas why that happens?
I am not sure. Are you downsampling in time, frequency, or sky position? I would have expected it to generate fewer clusters (with respect to a larger-sized image) if the downsampling were done in sky position. If it is done in time and frequency, then I am not sure; maybe visualizing the clustering could help?
I tried two things: first, downsampling all the input values (pixel location, time, and DM), and second, downsampling pixel location only. In both cases, I get way more clusters than without downsampling.
I suspect that downsampling produces more candidates with identical locations, since the downsampling makes neighbors that are off by one unit (e.g., pixels 0 and 1) equal. That makes the hamming metric equal to 0, whereas before the smallest distance was probably 1.
I'm thinking that the clustering algorithm treats those with `distance=0` as more strongly clustered than ones with `distance=1`. That might split them up more than without downsampling.
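A quick check of that intuition with scipy's hamming distance (the candidate values are illustrative):

```python
import numpy as np
from scipy.spatial.distance import hamming

# Two neighboring candidates, identical except for a one-pixel offset.
a = np.array([100, 200, 0, 80])
b = np.array([101, 200, 0, 80])

# Hamming distance = fraction of coordinates that differ.
print(hamming(a, b))            # 0.25: one of four columns differs
print(hamming(a // 2, b // 2))  # 0.0:  100//2 == 101//2, so the pair collapses
```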
I see. That could be a reason. Do you happen to have some plots for the same? It would be easier to interpret from there ... otherwise I can submit the pull request with the `clustering_visualization` function to generate those.
Another idea is to play with the way the distances are calculated.
Currently we use `metric='hamming'`, because it seemed to work. We could use a different metric or calculate the distances ourselves. They could still be calculated with the hamming metric, but we could weight the columns to give more emphasis to some (e.g., x/y distance weighted less than DM or time).
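One way that could be sketched is to precompute a weighted hamming distance matrix and pass it to HDBSCAN with `metric='precomputed'` (the column layout and weights below are placeholders, not a proposed implementation):

```python
import numpy as np
import hdbscan
from scipy.spatial.distance import pdist, squareform

# Illustrative candidate features; columns assumed to be
# (l pixel, m pixel, dm index, time index).
data = np.array([[100, 200, 5, 80],
                 [101, 201, 5, 80],
                 [100, 200, 9, 12],
                 [300, 150, 2, 40],
                 [301, 151, 2, 40],
                 [300, 150, 2, 41]], dtype=float)

# Weighted hamming: de-emphasize the x/y columns relative to DM and time.
weights = np.array([0.5, 0.5, 1.0, 1.0])
dists = squareform(pdist(data, metric='hamming', w=weights))

# Feed the precomputed distances to HDBSCAN instead of the raw columns.
clusterer = hdbscan.HDBSCAN(min_cluster_size=2, metric='precomputed')
clusterer.fit(dists)
print(clusterer.labels_)
```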
I'm going to close this, since it really is a question of using the best metric. Rounding is just a workaround and it seems like a bad way to do it.
The time index of a candidate is scaled by dt. Sometimes candidates are found over a range of dt, and calculating an index to compare against dt=1 leads to errors on the order of dt (e.g., dt=8 means multiplying integration by 8 and comparing to candidates with dt=1). Clustering can miss candidates with large dt due to this misalignment in time. One possibility is to round indexes by factors of 2 or more (perhaps up to the maximum dt) so that more candidates share the same time index during clustering. This should only be done to help clustering group these events; we don't want to save rounded indexes.
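A rough sketch of that rounding idea (made-up indexes; the rounded values would only be used for clustering, not saved):

```python
import numpy as np

# Made-up candidates: one found at dt_ind=3, one at dt_ind=0, one at dt_ind=1.
integration_ind = np.array([80, 643, 12])
dt_ind = np.array([3, 0, 1])

# Time index referenced to the lowest dt.
time_ind = integration_ind * 2**dt_ind                 # [640, 643, 24]

# Round to the coarsest dt so candidates within one coarse sample share an index.
max_dt = 2**dt_ind.max()                               # 8 in this example
time_ind_clustering = (time_ind // max_dt) * max_dt
print(time_ind_clustering)                             # [640, 640, 24]
```

Here the dt=8 and dt=1 candidates end up with the same time index for clustering, even though their native-resolution indexes differ by a few samples.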