tidymodels / spatialsample

Create and summarize spatial resampling objects 🗺
https://spatialsample.tidymodels.org

Issue running spatial_clustering_cv as it requires Fortran #158

Closed · cthwe closed this 7 months ago

cthwe commented 8 months ago

I'm having a problem running spatial_clustering_cv on my dataset. The dataset contains ID, latitude, and longitude columns in the EPSG:4326 CRS. I converted it to an sf object with this code: cluster <- sf::st_as_sf(extracted, coords = c("lon_4326", "lat_4326"), crs = 4326)

However, I get this error when I run the spatial clustering CV: df_cluster <- spatial_clustering_cv(cluster, v = 5, cluster_function = "kmeans")

Error in do_one(nmeth) : long vectors (argument 1) are not supported in .Fortran
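For context, here's roughly what I'm running, with simulated coordinates standing in for my real data (the column names match my dataset; this small example runs fine, the error only shows up on the full data):

```r
library(sf)
library(spatialsample)

# stand-in for my real data: an ID plus lon/lat columns in EPSG:4326
extracted <- data.frame(
  ID       = 1:100,
  lon_4326 = runif(100, min = -10, max = 10),
  lat_4326 = runif(100, min = 40, max = 60)
)

cluster <- sf::st_as_sf(extracted, coords = c("lon_4326", "lat_4326"), crs = 4326)

df_cluster <- spatial_clustering_cv(cluster, v = 5, cluster_function = "kmeans")
```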

Also, is there a way to extract a list of IDs for each cluster?

mikemahoney218 commented 8 months ago

Hi @cthwe! Is there any chance you'd be able to provide a reprex for this issue? Unfortunately, with the information provided so far, I don't have a way to reproduce this bug and start figuring out what's going on.

My only guess is that this is a bug in kmeans() itself when given a large data set -- my napkin math suggests that if you've got more than ~46,000 rows in your data set, R wouldn't be able to hand the distance matrix for your data over to Fortran. If I'm right and this is a decent-sized data set, you might need to use a different cluster function or a different CV method.
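For what it's worth, here's roughly where that ~46,000 figure comes from, assuming the limiting factor is an n-by-n pairwise structure being handed to .Fortran, which caps out at 2^31 - 1 elements:

```r
# largest n for which an n x n structure stays under R's
# non-long-vector limit of 2^31 - 1 elements
floor(sqrt(2^31 - 1))
#> [1] 46340
```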

> Also, is there a way to extract a list of IDs for each cluster?

Not built-in, but check out the second code chunk here: https://github.com/tidymodels/spatialsample/issues/157#issuecomment-1924660480
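If it helps, here's a minimal sketch along the lines of that linked comment, assuming your ID column survives the st_as_sf() conversion:

```r
# `folds` is the rset returned by spatial_clustering_cv()
folds <- spatial_clustering_cv(cluster, v = 5)

# pull the ID values in the held-out (assessment) set of each fold
ids_per_fold <- lapply(folds$splits, function(split) rsample::assessment(split)$ID)
```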

cthwe commented 8 months ago

Hi @mikemahoney218, I reduced the dataset to 1,000 rows, ran the analysis, and it worked without issue, so you were right that the sample size is the problem. My full dataset has about 64,000 rows. Is there a way to make the code work on a large dataset? I really need to run it on the whole thing.

Thanks for the comment link. That seems to be what I was looking for.

mikemahoney218 commented 8 months ago

I'm having no luck reproducing this error on my computer, because even with 64 GB of RAM the function winds up running out of memory; it's going to be extremely resource-intensive for data this size. The bug you're hitting is in base R itself, so working around it would take some creativity. Are you able to use a different approach, like spatial_block_cv(), which shouldn't have the same issues?
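For example, something along these lines (just a sketch with the default settings; you may want to tune v or the other blocking arguments):

```r
library(spatialsample)

# block CV assigns folds by gridding the study area, so it never needs
# an n x n distance structure and should scale to the full data set
block_folds <- spatial_block_cv(cluster, v = 5)
```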

cthwe commented 7 months ago

spatial_block_cv should work for my analysis. Thanks for the help.

mikemahoney218 commented 7 months ago

Alright, glad to hear it. I'm going to close this issue, since I can't currently reproduce it (without just crashing R) and the actual bug appears not to be in spatialsample -- but if anyone reading this in the future has further questions (or fixes), please feel free to open an issue or PR!