Open soriadelva opened 5 months ago
This is a function I used to use (written by Jorge Assis):
spatial.autocorrelation.thinning <- function(occurrence.records, min.distance) {
  # occurrence.records: two columns (longitude, latitude); min.distance in km
  coordinates.t <- occurrence.records
  # Pairwise great-circle distances (km); requires the sp package
  space <- sp::spDists(as.matrix(coordinates.t), as.matrix(coordinates.t), longlat = TRUE)
  diag(space) <- NA
  reclass <- space <= min.distance
  # Keep only the upper triangle so each pair is counted once
  reclass[lower.tri(reclass, diag = TRUE)] <- NA
  # Keep a record only if no earlier record lies within min.distance
  v <- colSums(reclass, na.rm = TRUE) == 0
  coordinates.t <- coordinates.t[v, ]
  # Report the number of input records and the number kept
  cat("\n\n")
  cat(paste0("Input Records: ", nrow(occurrence.records)))
  cat("\n")
  cat(paste0("Final Records: ", nrow(coordinates.t)))
  # Return the thinned records
  return(coordinates.t)
}
I'm having issues with dplyr::group_by() %>% dplyr::summarise()
I had to include a parameter "n" to cope with species with a large number of occurrences (x), since the function creates a pairwise distance matrix (x rows by x columns). I've opted to work with chunks of size "n" within which the distances are compared. This disables dataset-wide comparison when the number of occurrences exceeds n,
but enables thinning even if the full distance matrix is too large to fit in RAM. For example, a dataset of 111000 occurrences yields a distance matrix of over 93 GB 😱.
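A minimal sketch of the chunking idea, not the actual data_thin() implementation: the names haversine_matrix(), thin_chunk() and thin_in_chunks() are hypothetical, the input is assumed to be a two-column longitude/latitude matrix, and a base R haversine distance stands in for sp::spDists() so the example has no dependencies. Each chunk only needs an n x n matrix (on the order of 8 * n^2 bytes for doubles) instead of one x by x matrix.

```r
# Great-circle distance matrix (km) between the rows of a lon/lat matrix,
# computed with the haversine formula (base R only).
haversine_matrix <- function(xy, radius_km = 6371) {
  lon <- xy[, 1] * pi / 180
  lat <- xy[, 2] * pi / 180
  dlon <- outer(lon, lon, "-")
  dlat <- outer(lat, lat, "-")
  a <- sin(dlat / 2)^2 + outer(cos(lat), cos(lat)) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Thin one chunk: drop every record that lies within min_distance (km)
# of a record that comes earlier in the same chunk.
thin_chunk <- function(xy, min_distance) {
  if (nrow(xy) < 2) return(xy)
  close <- haversine_matrix(xy) <= min_distance
  close[lower.tri(close, diag = TRUE)] <- NA
  keep <- colSums(close, na.rm = TRUE) == 0
  xy[keep, , drop = FALSE]
}

# Hypothetical chunked wrapper: split the records into blocks of at most
# n rows and thin each block separately. Pairs that fall in different
# blocks are never compared.
thin_in_chunks <- function(xy, min_distance, n = 1000) {
  xy <- as.matrix(xy)
  if (nrow(xy) == 0) return(xy)
  chunk_id <- ceiling(seq_len(nrow(xy)) / n)
  parts <- lapply(split(seq_len(nrow(xy)), chunk_id), function(idx) {
    thin_chunk(xy[idx, , drop = FALSE], min_distance)
  })
  do.call(rbind, parts)
}

# Four points on the equator: two pairs roughly 0.11 km apart.
pts <- cbind(lon = c(0, 0.001, 10, 10.001), lat = c(0, 0, 0, 0))
nrow(thin_in_chunks(pts, min_distance = 1, n = 4))  # one chunk: all pairs compared
nrow(thin_in_chunks(pts, min_distance = 1, n = 3))  # the last pair straddles two chunks
```

The second call illustrates the trade-off described above: with a smaller n, close pairs that end up in different chunks survive the thinning.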
I'm currently benchmarking to find the sweet spot between time needed and thinning achieved, to set as the default for n.
expected behaviour:
benchmark 1:
n_1000 <- system.time(df_thinned <- data_thin(df))
n_1000_nrow <- nrow(df_thinned)
n_2000 <- system.time(df_thinned <- data_thin(df, n = 2000))
n_2000_nrow <- nrow(df_thinned)
n_5000 <- system.time(df_thinned <- data_thin(df, n = 5000))
n_5000_nrow <- nrow(df_thinned)
n_10000 <- system.time(df_thinned <- data_thin(df, n = 10000))
n_10000_nrow <- nrow(df_thinned)
n_20000 <- system.time(df_thinned <- data_thin(df, n = 20000))
n_20000_nrow <- nrow(df_thinned)
benchmark <- data.frame(n = c(1000, 2000, 5000, 10000, 20000),
time = c(n_1000[3], n_2000[3], n_5000[3], n_10000[3], n_20000[3]),
nrow = c(n_1000_nrow, n_2000_nrow, n_5000_nrow, n_10000_nrow, n_20000_nrow))
ggplot2::ggplot(benchmark, ggplot2::aes(x = n, y = time)) +
ggplot2::geom_point() +
ggplot2::geom_line() +
ggplot2::labs(title = "Time to thin data", x = "n", y = "Time (s)")
ggplot2::ggplot(benchmark, ggplot2::aes(x = n, y = nrow)) +
ggplot2::geom_point() +
ggplot2::geom_line() +
ggplot2::labs(title = "Number of rows after thinning", x = "n", y = "Number of rows")
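The repeated timing calls above could also be written as a loop. This is only a sketch: benchmark_chunk_sizes() is a hypothetical helper, and it assumes the thinning function takes the data as its first argument and the chunk size as n, as data_thin() does in the calls above.

```r
# Hypothetical helper: time a thinning function for several chunk sizes.
# `thin_fun` is assumed to be callable as thin_fun(df, n = ...).
benchmark_chunk_sizes <- function(df, thin_fun,
                                  ns = c(1000, 2000, 5000, 10000, 20000)) {
  rows <- lapply(ns, function(n) {
    # elapsed wall-clock time in seconds (the [3] element used above)
    elapsed <- system.time(thinned <- thin_fun(df, n = n))[["elapsed"]]
    data.frame(n = n, time = elapsed, nrow = nrow(thinned))
  })
  do.call(rbind, rows)
}

# Stand-in thinning function, only here to make the example runnable:
# it simply keeps the first n rows.
dummy_thin <- function(df, n) df[seq_len(min(n, nrow(df))), , drop = FALSE]
res <- benchmark_chunk_sizes(data.frame(x = 1:50), dummy_thin, ns = c(10, 20))
```

Calling benchmark_chunk_sizes(df, data_thin) should then give a data frame with the same n, time and nrow columns that the plots above use.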
actual behaviour:
=> n == 2000 seems best; I'll now benchmark it over a number of iterations
n_2000_1 <- system.time(df_thinned <- data_thin(df, n = 2000))
n_2000_1_nrow <- nrow(df_thinned)
n_2000_2 <- system.time(df_thinned <- data_thin(df_thinned, n = 2000))
n_2000_2_nrow <- nrow(df_thinned)
n_2000_3 <- system.time(df_thinned <- data_thin(df_thinned, n = 2000))
n_2000_3_nrow <- nrow(df_thinned)
n_2000_4 <- system.time(df_thinned <- data_thin(df_thinned, n = 2000))
n_2000_4_nrow <- nrow(df_thinned)
n_2000_5 <- system.time(df_thinned <- data_thin(df_thinned, n = 2000))
n_2000_5_nrow <- nrow(df_thinned)
benchmark_2 <- data.frame(iter = c(1, 2, 3, 4, 5),
time = c(n_2000_1[3], n_2000_2[3], n_2000_3[3], n_2000_4[3], n_2000_5[3]),
nrow = c(n_2000_1_nrow, n_2000_2_nrow, n_2000_3_nrow, n_2000_4_nrow, n_2000_5_nrow))
ggplot2::ggplot(benchmark_2, ggplot2::aes(x = iter, y = time)) +
ggplot2::geom_point() +
ggplot2::geom_line() +
ggplot2::labs(title = "Time to thin data", x = "Iteration", y = "Time (s)")
ggplot2::ggplot(benchmark_2, ggplot2::aes(x = iter, y = nrow)) +
ggplot2::geom_point() +
ggplot2::geom_line() +
ggplot2::labs(title = "Number of rows after thinning", x = "Iteration", y = "Number of rows")
not much is gained by iterating the function:
however still need to do the following:
the function in action 😎
red = input, blue = thinned
data_thinning
Thin the occurrence data (i.e., remove occurrence records that lie closer to each other than a certain cut-off) to partly address sampling bias. We will need to define a rationale for the spatial cut-off to use. The output of this function is a dataset similar to the one generated by the data_download() function.
Checklist
Script in ./R, with the file name equal to the function name
#' on line 1 of your script
#' @author
#' @description
#' @param name
#' @returns
#' @examples
#' @export (#14)
Run usethis::use_package("packagename", min_version = TRUE) in the console for every package you use.
Run roxygen2::roxygenise() in the console
Run devtools::check() in the console (1)
(1) to the extent possible
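For illustration, a script header that follows the checklist could look roughly like the sketch below. The author name, the parameter descriptions, the default n = 1000 and the placeholder body are all assumptions; only the function name, the n argument and the roxygen tags come from this issue.

```r
#' Thin occurrence data
#'
#' @author Your Name
#' @description Remove occurrence records that lie closer to each other than
#'   a given cut-off distance, to partly address sampling bias.
#' @param df A data frame of occurrence records (placeholder description).
#' @param n Chunk size used for the pairwise distance comparison
#'   (the default shown here is an assumption).
#' @returns A data frame with the thinned occurrence records.
#' @examples
#' \dontrun{
#' df_thinned <- data_thin(df, n = 2000)
#' }
#' @export
data_thin <- function(df, n = 1000) {
  # Placeholder body: the real thinning logic goes here.
  df
}
```

Saved as ./R/data_thin.R, this keeps the file name equal to the function name, and roxygen2::roxygenise() can then generate the documentation from the header.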