mpadge / spatialcluster

spatially-constrained clustering in R
https://mpadge.github.io/spatialcluster/

A problem in rcpp_slk #26

Closed geniusadventurer closed 1 year ago

geniusadventurer commented 2 years ago

Hi Mark (and such a coincidence that my English name is also Mark!) I'm trying to input my own data to the scl_redcap function, but it doesn't work. When I enter

scl <- scl_redcap (xy, dmat, ncl = 30, linkage = "single")

it shows:

Error in rcpp_slk(edges_all, edges_nn, shortest) : Index out of bounds: [index=5391684; extent=5391684].
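
One detail in this message is worth a quick check: the reported index and extent are both 5391684, which is exactly the number of entries in a 2322-by-2322 matrix. That is consistent with (but does not prove) a zero-based access one position past the end of a vector of length n^2. The thread does not confirm this diagnosis; the arithmetic is only a sketch:

```r
n <- 2322            # rows in xy and rows/columns in dmat, as reported in this issue
n_entries <- n^2     # number of entries in an n-by-n matrix

# Compare against the index/extent printed in the error message.
stopifnot (n_entries == 5391684)

# With zero-based indexing, valid positions run from 0 to n_entries - 1,
# so an index equal to the extent is exactly one past the end.
last_valid_index <- n_entries - 1
```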

I found a bug fix in #20 for rcpp_slk, but the fixed function still doesn't seem to work for my data. My xy has 2322 rows and 2 columns, and my dmat has 2322 rows and 2322 columns. I import both from csv files and convert them to matrices.

An interesting thing is that when I use random xy data, like this:

xy <- matrix (runif (2 * 2322), ncol = 2)

this function works. It also works when I use Python's numpy to generate a random 2322 × 2 matrix. But with my own data it fails, and I have no idea why. The points are the coordinates of some locations in a city. It's so weird!

I really need to use the REDCAP algorithm. So what's the problem and how can I fix it? Thank you!

mpadge commented 2 years ago

Thanks for asking @geniusadventurer! That should be pretty simple to solve, but i'd need your data to be able to investigate further. Can you either provide a link, or just bundle up the data and drag-and-drop here? Easiest would be to drop both your xy and dmat objects exactly as submitted to the scl_redcap routine. Then i'll investigate further.

geniusadventurer commented 2 years ago

The data is bigger than 52 MB so I put it on Google Drive. https://drive.google.com/drive/folders/1mXFuFL-RytnQ6KwzLCejb-A49JnHddOW?usp=sharing Can you download it? distance_matrix.csv contains a 2322 × 2322 matrix, and xy.csv contains a 2322 × 2 matrix whose first column is the x coordinate and second column the y coordinate. Neither file has a header or an index column. I converted them to matrices using the as.matrix function in R, like this:

xy <- read.csv('xy.csv', header=FALSE)
xy <- as.matrix(xy)
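
The same import can be extended to both files, with a few sanity checks before calling scl_redcap. This is a minimal sketch in base R only: the temporary files and the small n are stand-ins so the example runs on its own, and the checks are generic suggestions, not the cause of the error reported in this issue.

```r
# Stand-in files so the sketch is self-contained; in practice these would be
# the real 'xy.csv' and 'distance_matrix.csv'.
set.seed (1)
n <- 5
xy_file <- tempfile (fileext = ".csv")
dmat_file <- tempfile (fileext = ".csv")
write.table (matrix (runif (2 * n), ncol = 2), xy_file,
             sep = ",", row.names = FALSE, col.names = FALSE)
write.table (matrix (runif (n ^ 2), ncol = n), dmat_file,
             sep = ",", row.names = FALSE, col.names = FALSE)

# Read the header-less CSVs and convert to plain numeric matrices,
# dropping the V1, V2, ... column names that read.csv adds.
xy <- unname (as.matrix (read.csv (xy_file, header = FALSE)))
dmat <- unname (as.matrix (read.csv (dmat_file, header = FALSE)))

# Basic consistency checks between the two inputs.
stopifnot (
    is.numeric (xy), is.numeric (dmat),
    ncol (xy) == 2,               # x and y coordinates
    nrow (dmat) == ncol (dmat),   # square distance matrix
    nrow (xy) == nrow (dmat),     # one row of dmat per point
    !anyNA (xy), !anyNA (dmat)    # no missing values
)
```

If any check fails, stopifnot reports which condition was violated, which narrows down whether the problem lies in the import step or in the clustering call.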

mpadge commented 1 year ago

Thanks, should work now.

geniusadventurer commented 1 year ago

Thank you for fixing this problem, but there's another one when I use my data (the same data I put on Google Drive). Now the error is:

error in dplyr::bind_cols(tree_nodes(tree), xy) (redcap.R#19): Can't recycle '..1' (size 2317) to match '..2' (size 2322).

When I debug into tree_nodes, it shows that res has only 2317 rows, so it can't be matched against the 2322 rows of xy. Did you see this bug after fixing the previous problem? I'm using R 4.0.3 on a Windows 11 x64 PC.

mpadge commented 1 year ago

Hmmm. I don't see that, but it does make sense given the above commit. I'll re-open and look a bit further

mpadge commented 1 year ago

Thanks @geniusadventurer, that was actually a pretty critical bug that has now been fixed by the above commits. It all works well on my local machine - and is even notably faster than before those changes. Let me know how you go

mpadge commented 1 year ago

This is what the results on your data look like:

You'll have to make sense of that one :smile:

geniusadventurer commented 1 year ago

Now it also works well on my computer! Thank you so much!

mpadge commented 1 year ago

Thanks for providing an opportunity for me to dive back in to this package. I want to try to get it on CRAN as soon as i can.

geniusadventurer commented 1 year ago

Looking forward to it! This project is very helpful to my research. But I'm curious about this result: why don't the clusters look spatially contiguous? From my view, REDCAP is a spatially constrained clustering algorithm, or let's say a regionalization algorithm. Maybe it's due to my data lol

mpadge commented 1 year ago

Yeah, i was wondering that too, so looked a bit into your data. Spatially-delineated clusters will only really arise when the distances in dmat themselves have some kind of spatial structure, but in your case there is no relationship whatsoever between xy and dmat. So the clusters you see are determined very strongly by dmat, whatever that is, and so are really not spatially structured at all.
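
One rough way to check for that kind of relationship is to correlate the entries of dmat with the plain Euclidean distances between the points in xy. The sketch below is only an illustration in base R: the thread does not say how the data were actually examined, and dmat_spatial is a hypothetical matrix constructed to follow spatial distance for comparison.

```r
set.seed (1)
n <- 100
xy <- matrix (runif (2 * n), ncol = 2)

# Pairwise Euclidean distances between the points themselves.
d_xy <- as.matrix (dist (xy))

# Case 1: a dmat unrelated to xy (pure noise, as in the README example).
dmat_random <- matrix (runif (n ^ 2), ncol = n)

# Case 2: a hypothetical dmat that follows spatial distance plus some noise.
# (It may contain small negative values; it is used here only for the
# correlation check, not as clustering input.)
dmat_spatial <- d_xy + matrix (rnorm (n ^ 2, sd = 0.05), ncol = n)

# Compare off-diagonal entries only, since the diagonal of d_xy is all zero.
off_diag <- row (d_xy) != col (d_xy)
cor_random <- cor (d_xy [off_diag], dmat_random [off_diag])
cor_spatial <- cor (d_xy [off_diag], dmat_spatial [off_diag])
c (random = cor_random, spatial = cor_spatial)
```

A correlation near zero, as in the first case, suggests dmat carries no spatial structure, so the resulting clusters need not look spatially coherent when plotted.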

geniusadventurer commented 1 year ago

Maybe I need to check my data. Thank you for your explanation!

geniusadventurer commented 1 year ago

Well, I think there may be more problems... When I use the code you provide in README.md:

set.seed (1)
n <- 100
xy <- matrix (runif (2 * n), ncol = 2)
dmat <- matrix (runif (n ^ 2), ncol = n)
scl <- scl_redcap (xy, dmat, ncl = 8, linkage = "single")
plot (scl)

The result looks like this: [image: Rplot]

mpadge commented 1 year ago

@geniusadventurer Yeah, thanks. I had a deeper look yesterday, and also realised that something was awry there. I also found out exactly what the problem is, and will address it in #27. I'll ping you there so you'll be notified of the fix.