ropensci / CoordinateCleaner

Automated flagging of common spatial and temporal errors in biological and palaeontological collection data, for the use in conservation, ecology and palaeontology.
https://docs.ropensci.org/CoordinateCleaner/

cc_outl looping over many species #15

Closed HMB3 closed 5 years ago

HMB3 commented 5 years ago

Hi,

My name is Hugh and I've got a big data set of occurrences from GBIF and ALA. There are about 3.8k species, so lots of points to clean!

I'm using the CleanCoordinates function with these settings:

```r
library(CoordinateCleaner)
library(dplyr)  # provides the %>% pipe used below

minages <- runif(250, 0, 65)
exmpl <- data.frame(species          = sample(letters, size = 250, replace = TRUE),
                    decimallongitude = runif(250, min = 42, max = 51),
                    decimallatitude  = runif(250, min = -26, max = -11),
                    min_ma           = minages,
                    max_ma           = minages + runif(250, 0.1, 65),
                    dataset          = "clean")

exmpl <- exmpl %>% timetk::tk_tbl()

FLAGS <- CleanCoordinates(exmpl,
                          capitals.rad = 0.12,
                          countrycheck = TRUE,
                          duplicates   = TRUE,
                          seas         = FALSE,
                          verbose      = FALSE)
```

However, running the spatial outlier detection here is a bit slow, because there are too many records.

So I'm running the outlier detection separately, one species at a time, in this form:

```r
library(dplyr)  # %>% and bind_rows()

SPAT.OUT <- as.character(unique(exmpl$species)) %>%

  lapply(function(x) {

    ## Records for the current species only
    f <- subset(exmpl, species == x)

    message("Running spatial outlier detection for ", x)
    message(nrow(f), " records for ", x)

    sp.flag <- cc_outl(f,
                       lon     = "decimallongitude",
                       lat     = "decimallatitude",
                       species = "species",
                       method  = "distance",
                       tdi     = 300,  ## get points 300km from other points?
                       value   = "flags",
                       verbose = FALSE)  ## logical, not the string "FALSE"

    ## Keep only the species name and the outlier flag for each record
    d <- cbind(searchTaxon = x,
               SPAT_OUT = sp.flag, f)[c("searchTaxon", "SPAT_OUT")]
    return(d)

  }) %>%

  bind_rows()
```

I can run this check for about 1000 species at a time, but it uses a lot of RAM (>60 GB) in some cases. It also doesn't seem to flag many records for some species.
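For context on why few records may be flagged: as the CoordinateCleaner documentation describes it, the "distance" method flags a record when its minimum distance to the nearest other record of the same species exceeds `tdi` (in km). A record that has even one neighbour within 300 km therefore passes, so small clusters of erroneous points are not caught. The base-R sketch below illustrates that rule only; it is not the package's implementation, and `haversine_km` and `flag_isolated` are hypothetical helper names.

```r
# Great-circle distance in km between one point and a vector of points
# (haversine formula on a sphere of radius 6371 km).
haversine_km <- function(lon1, lat1, lon2, lat2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Hypothetical helper: TRUE = record passes, FALSE = flagged as isolated.
# Works one record at a time, so it never holds a full n x n matrix.
flag_isolated <- function(lon, lat, tdi = 300) {
  n <- length(lon)
  if (n < 2) return(rep(TRUE, n))
  vapply(seq_len(n), function(i) {
    d <- haversine_km(lon[i], lat[i], lon[-i], lat[-i])
    min(d) <= tdi
  }, logical(1))
}

# Three records close together and one roughly 1000 km away
lon <- c(45.0, 45.1, 45.2, 50.9)
lat <- c(-20.0, -20.1, -19.9, -11.0)
flags <- flag_isolated(lon, lat, tdi = 300)
print(flags)
```

Only the fourth record is flagged here. Adding a second record next to it would make both pass, which is the behaviour to keep in mind when a species shows fewer flags than expected.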

Are there better settings I could use? I need to keep the format of looping over all the species.
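One way to keep the per-species loop while finding out which species drive the memory use is to tabulate record counts first. The sketch below assumes that the outlier test builds a full pairwise distance matrix of doubles for each species; that is an assumption about the implementation, not something confirmed in this thread. The data and the name `report` are made up for illustration.

```r
# Toy occurrence table: one very large species and two small ones
occ <- data.frame(
  species = rep(c("a", "b", "c"), times = c(5000, 200, 10)),
  stringsAsFactors = FALSE
)

# Records per species
counts <- table(occ$species)

# Estimated size of an n x n matrix of 8-byte doubles, in GB
# (assumption: one full pairwise matrix per species)
report <- data.frame(
  species = names(counts),
  n       = as.integer(counts),
  est_gb  = as.numeric(counts)^2 * 8 / 1e9,
  stringsAsFactors = FALSE
)

# Largest species first: these are the candidates to run on their own
report <- report[order(-report$n), ]
print(report)
```

Under that assumption the cost grows with the square of the record count, so a single species with tens of thousands of records could account for most of the RAM in a batch of 1000.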

Thanks very much - any advice is greatly appreciated!

h

HMB3 commented 5 years ago

Sorry I can't quite figure out the formatting :]