Create function [data_thinning]

soriadelva commented 5 months ago

data_thinning

Thin the occurrence data, i.e. remove occurrence records that lie closer to each other than a certain cut-off distance, to partly address sampling bias. We will need to define a rationale for the spatial cut-off to use. The output of this function is a dataset similar to the one generated by the data_download() function.

Checklist

[x] create a new R file
[x] save the R file under ./R with the filename equal to the function name
[x] provide a function title with #' on line 1 of your script
[x] provide an author with #' @author
[x] provide a description with #' @description
[x] explain the input parameter(s) with #' @param name
[x] explain the output of the function with #' @returns
[x] provide at least 1 example of the function's use with #' @examples
[x] provide the export tag with #' @export (#14)
[x] run usethis::use_package("packagename", min_version = TRUE) in the console for every package you use.
[x] run roxygen2::roxygenise() in the console
[x] run devtools::check() in the console
[x] resolve any errors, warnings and notes¹
[x] create a pull request with @soriadelva or @SanderDevisscher and, optionally, other relevant users as reviewer.

¹as far as possible
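
As a minimal sketch of what the checklist above amounts to, the file ./R/data_thinning.R could start out roughly like this. The argument names `df` and `min_distance`, the author placeholder and the pass-through body are assumptions for illustration only, not the actual implementation.

```r
#' Thin occurrence data
#'
#' @author Your Name
#' @description Removes occurrence records that lie closer to each other than
#'   a given cut-off distance, to partly address sampling bias.
#' @param df A data frame of occurrence records (hypothetical argument name).
#' @param min_distance Cut-off distance below which records count as too
#'   close (hypothetical argument name).
#' @returns A dataset with the same structure as the input, with fewer rows.
#' @examples
#' \dontrun{
#' df_thinned <- data_thinning(df, min_distance = 1)
#' }
#' @export
data_thinning <- function(df, min_distance) {
  # Placeholder body: checks the input and returns it unchanged.
  stopifnot(is.data.frame(df), is.numeric(min_distance))
  df
}
```

Running roxygen2::roxygenise() on a file like this is what turns the #' comments into the help page and the NAMESPACE export.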

soriadelva commented 1 month ago

This is a function I used to use (written by Jorge Assis):

spatial.autocorrelation.thinning <- function(occurrence.records, min.distance) {

  # occurrence.records: two columns (longitude, latitude) in decimal degrees
  # min.distance: cut-off in km (spDists() returns km when longlat = TRUE)
  coordinates.t <- occurrence.records

  # Pairwise great-circle distances between all records (needs the sp package)
  space <- sp::spDists(as.matrix(coordinates.t), as.matrix(coordinates.t), longlat = TRUE)
  diag(space) <- NA

  # Flag pairs at or below the cut-off; keep only the upper triangle
  reclass <- space <= min.distance
  reclass[lower.tri(reclass, diag = TRUE)] <- NA

  # Keep a record only if no earlier record lies within the cut-off
  v <- colSums(reclass, na.rm = TRUE) == 0
  coordinates.t <- coordinates.t[v, ]

  # Report the number of input records and the number kept
  cat("\n\n")
  cat(paste0("Input Records: ", nrow(occurrence.records)))
  cat("\n")
  cat(paste0("Final Records: ", nrow(coordinates.t)))

  return(coordinates.t)

}
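
To see the thinning rule in isolation, here is a small base-R sketch of the same idea. It swaps sp::spDists() for a hand-written haversine distance (an assumption, made only so the example has no dependencies), and the names `haversine_matrix`, `thin_sketch` and the toy coordinates are made up for illustration.

```r
# Pairwise great-circle distances in km (haversine formula, base R only).
# Assumes column 1 is longitude and column 2 is latitude, in decimal degrees.
haversine_matrix <- function(xy, radius_km = 6371) {
  lon <- xy[, 1] * pi / 180
  lat <- xy[, 2] * pi / 180
  dlat <- outer(lat, lat, "-")
  dlon <- outer(lon, lon, "-")
  a <- sin(dlat / 2)^2 + outer(cos(lat), cos(lat)) * sin(dlon / 2)^2
  # pmin() guards against rounding slightly above 1 before asin()
  2 * radius_km * asin(pmin(sqrt(a), 1))
}

# Same rule as the function above: drop a record when any earlier record
# lies within the cut-off distance.
thin_sketch <- function(xy, min_distance_km) {
  d <- haversine_matrix(as.matrix(xy))
  too_close <- d <= min_distance_km
  too_close[lower.tri(too_close, diag = TRUE)] <- FALSE
  keep <- colSums(too_close) == 0
  xy[keep, , drop = FALSE]
}

# Toy data: three points on one parallel (about 0.7 km per 0.01 degree of
# longitude at this latitude) and one point far away.
pts <- data.frame(lon = c(4.35, 4.36, 4.40, 5.57),
                  lat = c(50.85, 50.85, 50.85, 50.63))

thinned_1km <- thin_sketch(pts, min_distance_km = 1)
thinned_3km <- thin_sketch(pts, min_distance_km = 3)
```

One property of this rule worth knowing: a record is dropped whenever any earlier record is within the cut-off, even if that earlier record is itself dropped. With the 3 km cut-off, the third point goes because it is close to the second, although the second has already been removed for being close to the first.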

SanderDevisscher commented 1 month ago

I'm having issues with dplyr::group_by() %>% dplyr::summarise()

SanderDevisscher commented 1 month ago

I had to include a parameter "n" to cope with species with a large number of occurrences (x), since the function creates a pairwise distance matrix (x rows by x columns). I've opted to work with chunks of size "n" within which the distances are compared. This disables dataset-wide comparison when the number of occurrences exceeds n, but it enables thinning even if the full distance matrix is too large to fit in RAM. For example, a dataset of 111000 occurrences yields a distance matrix of over 93 GB 😱.
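
A rough sketch of the chunking idea, under stated assumptions: chunks are taken as consecutive rows, distances are haversine distances in km computed in base R, and the names `dist_matrix_gb`, `thin_chunk` and `thin_in_chunks` are hypothetical. The actual data_thin() may split and compare the data differently.

```r
# Rough size in GB of a dense x-by-x matrix of 8-byte doubles.
dist_matrix_gb <- function(x) x^2 * 8 / 1e9

# Pairwise great-circle distances in km (haversine formula, base R only).
# Assumes column 1 is longitude and column 2 is latitude, in decimal degrees.
haversine_matrix <- function(xy, radius_km = 6371) {
  lon <- xy[, 1] * pi / 180
  lat <- xy[, 2] * pi / 180
  a <- sin(outer(lat, lat, "-") / 2)^2 +
    outer(cos(lat), cos(lat)) * sin(outer(lon, lon, "-") / 2)^2
  2 * radius_km * asin(pmin(sqrt(a), 1))
}

# Thin a single chunk: drop a record when an earlier record in the same
# chunk lies within the cut-off.
thin_chunk <- function(xy, min_distance_km) {
  too_close <- haversine_matrix(as.matrix(xy)) <= min_distance_km
  too_close[lower.tri(too_close, diag = TRUE)] <- FALSE
  xy[colSums(too_close) == 0, , drop = FALSE]
}

# Split the rows into consecutive chunks of at most n records, thin each
# chunk separately and stack the results. Only n-by-n matrices are built.
thin_in_chunks <- function(xy, min_distance_km, n = 1000) {
  if (nrow(xy) == 0) return(xy)
  chunk_id <- ceiling(seq_len(nrow(xy)) / n)
  pieces <- lapply(split(seq_len(nrow(xy)), chunk_id), function(idx) {
    thin_chunk(xy[idx, , drop = FALSE], min_distance_km)
  })
  do.call(rbind, unname(pieces))
}

# Toy data: three nearby points and one far away.
pts <- data.frame(lon = c(4.35, 4.36, 4.40, 5.57),
                  lat = c(50.85, 50.85, 50.85, 50.63))

whole   <- thin_in_chunks(pts, min_distance_km = 3, n = 4)  # one chunk
chunked <- thin_in_chunks(pts, min_distance_km = 3, n = 2)  # two chunks
```

On this toy data the chunked run keeps one record more than the single-chunk run, because the second and third points end up in different chunks and are never compared. That is the trade-off described above: a smaller n saves memory but thins less.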

I'm currently benchmarking to find the sweet spot between time needed and thinning achieved, to set as the default. Expected behaviour:

SanderDevisscher commented 1 month ago

benchmark 1:

n_1000 <- system.time(df_thinned <- data_thin(df))
n_1000_nrow <- nrow(df_thinned)
n_2000 <- system.time(df_thinned <- data_thin(df, n = 2000))
n_2000_nrow <- nrow(df_thinned)
n_5000 <- system.time(df_thinned <- data_thin(df, n = 5000))
n_5000_nrow <- nrow(df_thinned)
n_10000 <- system.time(df_thinned <- data_thin(df, n = 10000))
n_10000_nrow <- nrow(df_thinned)
n_20000 <- system.time(df_thinned <- data_thin(df, n = 20000))
n_20000_nrow <- nrow(df_thinned)

benchmark <- data.frame(n = c(1000, 2000, 5000, 10000, 20000),
                        time = c(n_1000[3], n_2000[3], n_5000[3], n_10000[3], n_20000[3]),
                        nrow = c(n_1000_nrow, n_2000_nrow, n_5000_nrow, n_10000_nrow, n_20000_nrow))

ggplot2::ggplot(benchmark, ggplot2::aes(x = n, y = time)) +
  ggplot2::geom_point() +
  ggplot2::geom_line() +
  ggplot2::labs(title = "Time to thin data", x = "n", y = "Time (s)")

ggplot2::ggplot(benchmark, ggplot2::aes(x = n, y = nrow)) +
  ggplot2::geom_point() +
  ggplot2::geom_line() +
  ggplot2::labs(title = "Number of rows after thinning", x = "n", y = "Number of rows")
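
The repeated timing calls above could also be folded into a small helper. This is only a sketch: `benchmark_thin` and the stand-in `fake_thin` are hypothetical names, and `fake_thin` merely mimics the call signature of data_thin() so that the example runs on its own.

```r
# Time a thinning function for several chunk sizes n and collect the
# elapsed time and the number of rows left after thinning.
benchmark_thin <- function(thin_fun, df, n_values) {
  rows <- lapply(n_values, function(n) {
    timing <- system.time(out <- thin_fun(df, n = n))
    data.frame(n = n, time = timing[["elapsed"]], nrow = nrow(out))
  })
  do.call(rbind, rows)
}

# Stand-in for data_thin(): keeps every second row and ignores n.
fake_thin <- function(df, n = 1000) {
  df[seq(1, nrow(df), by = 2), , drop = FALSE]
}

df_demo <- data.frame(x = 1:10)
benchmark_demo <- benchmark_thin(fake_thin, df_demo, c(1000, 2000, 5000))
```

Passing the real function, e.g. benchmark_thin(data_thin, df, c(1000, 2000, 5000, 10000, 20000)), should yield a data frame with the same columns as `benchmark` above, ready for the two plots.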

actual behaviour:

=> n == 2000 seems best; I'll now benchmark over a number of iterations

n_2000_1 <- system.time(df_thinned <- data_thin(df, n = 2000))
n_2000_1_nrow <- nrow(df_thinned)
n_2000_2 <- system.time(df_thinned <- data_thin(df_thinned, n = 2000))
n_2000_2_nrow <- nrow(df_thinned)
n_2000_3 <- system.time(df_thinned <- data_thin(df_thinned, n = 2000))
n_2000_3_nrow <- nrow(df_thinned)
n_2000_4 <- system.time(df_thinned <- data_thin(df_thinned, n = 2000))
n_2000_4_nrow <- nrow(df_thinned)
n_2000_5 <- system.time(df_thinned <- data_thin(df_thinned, n = 2000))
n_2000_5_nrow <- nrow(df_thinned)

benchmark_2 <- data.frame(iter = c(1, 2, 3, 4, 5),
                          time = c(n_2000_1[3], n_2000_2[3], n_2000_3[3], n_2000_4[3], n_2000_5[3]),
                          nrow = c(n_2000_1_nrow, n_2000_2_nrow, n_2000_3_nrow, n_2000_4_nrow, n_2000_5_nrow))

ggplot2::ggplot(benchmark_2, ggplot2::aes(x = iter, y = time)) +
  ggplot2::geom_point() +
  ggplot2::geom_line() +
  ggplot2::labs(title = "Time to thin data", x = "Iteration", y = "Time (s)")

ggplot2::ggplot(benchmark_2, ggplot2::aes(x = iter, y = nrow)) +
  ggplot2::geom_point() +
  ggplot2::geom_line() +
  ggplot2::labs(title = "Number of rows after thinning", x = "Iteration", y = "Number of rows")

SanderDevisscher commented 1 month ago

Not much is gained by iterating the function:

However, I still need to do the following:

[x] convert export to sf
[x] set fun = "sum" as default
[x] do last steps of checklist (roxygenise etc...)

SanderDevisscher commented 1 month ago

the function in action 😎

red = input, blue = thinned

inbo / ClimateCastR

Create function [data_thinning] #21
