add sample data for running examples

bbest commented 2 years ago

Peter's example in iobis/notebook-diversity-indicators: index.Rmd#L23 uses the parquet file downloadable from https://obis.org/data/access/. The current file, obis_20220404.parquet, is 12.8 GB. We need something smaller for the vignette and function examples, something that can be loaded locally within the R package.

bbest commented 2 years ago

Hi @7yl4r or @albenson-usgs, should we pick a global subsample, a regional one, or perhaps both?

Running now to see if the raw data files could be small enough for GitHub (< 50 MB):

librarian::shelf(arrow, dplyr, here, readr)

occ <- open_dataset(here("data/obis_20220404.parquet"))
occ <- occ %>%
  select(decimalLongitude, decimalLatitude, species) %>%
  group_by(decimalLongitude, decimalLatitude, species) %>%
  collect() %>%
  summarize(records = n(), .groups = "drop") # drop grouping so slice() and sample_n() below act on the whole table
write_csv(occ, "data/occ.csv")

# write full occurrence dataset
write_parquet(occ, "data/occ.parquet") # 255 MB

# write subsampled global occurrence dataset
set.seed(42)
i <- sample(1:nrow(occ), 1000000)
occ %>% 
  slice(i) %>% 
  write_parquet("data/occ_1M.parquet") # ? MB

# write full regional occurrence dataset
# [Marine Regions · South Atlantic Ocean (IHO Sea Area)](https://marineregions.org/gazetteer.php?p=details&id=1914)
# Min. Lat  60° 0' 0" S (-60°)  
# Min. Long 69° 36' 3" W (-69.6008°)  
# Max. Lat  0° 4' 30.4" N (0.0751°)  
# Max. Long 20° 0' 32.6" E (20.0091°)  
occ %>% 
  filter(
    decimalLatitude  >= -60,
    decimalLongitude >= -69.6008,
    decimalLatitude  <= 0.0751,
    decimalLongitude <= 20.0091) %>% 
  sample_n(1000000) %>% 
  write_parquet("data/occ_SAtlantic.parquet") # ? MB

albenson-usgs commented 2 years ago

If it works to run globally, it seems like we should just do that and worry about subsampling only if we need to?

albenson-usgs commented 2 years ago

Sorry should have read the original comment first. I skipped to the one you tagged me in. If you need to subsample then I guess go ahead. I want to run the code locally on the full global dataset first to get a sense of what's happening. Some discussion going on in the room here points to doing a spatial subsample. Does that help?

bbest commented 2 years ago

Yeah, that's great, thanks @albenson-usgs!

I think this will work:

Global subsample: 1,000,000 records; 16.9 MB
South Atlantic Ocean (IHO Sea Area), chosen since it is somewhat sparsely populated and provides a latitudinal gradient: 1,014,006 records; 8 MB

librarian::shelf(arrow, dplyr, here, readr)

occ <- open_dataset(here("data/obis_20220404.parquet")) # 12.82 GB
occ <- occ %>%
  select(decimalLongitude, decimalLatitude, species) %>%
  group_by(decimalLongitude, decimalLatitude, species) %>%
  collect() %>%
  summarize(records = n())
write_csv(occ, "data/occ.csv")

occ <- occ %>% 
  ungroup()

# write full occurrence dataset
write_parquet(occ, "data/occ.parquet") # 27,965,153 × 4; 255 MB

# write subsampled global occurrence dataset
set.seed(42)
i <- sample(1:nrow(occ), 1000000)
occ %>% 
  slice(i) %>% 
  write_parquet("data/occ_1M.parquet") # 1,000,000 × 4; 16.9 MB

# write full regional occurrence dataset
# [Marine Regions · South Atlantic Ocean (IHO Sea Area)](https://marineregions.org/gazetteer.php?p=details&id=1914)
# Min. Lat  60° 0' 0" S (-60°)  
# Min. Long 69° 36' 3" W (-69.6008°)  
# Max. Lat  0° 4' 30.4" N (0.0751°)  
# Max. Long 20° 0' 32.6" E (20.0091°)  
occ %>% 
  filter(
    decimalLatitude  >= -60,
    decimalLongitude >= -69.6008,
    decimalLatitude  <= 0.0751,
    decimalLongitude <= 20.0091) %>% 
  write_parquet("data/occ_SAtlantic.parquet") # 1,014,006 × 4; 8 MB
#read_parquet("data/occ_SAtlantic.parquet")
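
The sizes noted in the comments above could also be checked programmatically against the 50 MB target mentioned earlier in this thread. The following is a minimal sketch using only base R. `check_size_limit()` is a hypothetical helper, not part of obisindicators, and it assumes 1 MB = 1024^2 bytes. The demo runs on a small temporary file so it works without the parquet files present.

```r
# Hypothetical helper (not part of obisindicators): report file sizes in MB
# and flag whether each file is under a size limit.
check_size_limit <- function(paths, limit_mb = 50) {
  size_mb <- file.size(paths) / 1024^2  # bytes -> MB; NA if a file is missing
  data.frame(
    file    = basename(paths),
    size_mb = round(size_mb, 1),
    ok      = !is.na(size_mb) & size_mb < limit_mb  # missing files are not ok
  )
}

# Demo on a small temporary file so the sketch runs anywhere
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:10), tmp, row.names = FALSE)
res <- check_size_limit(tmp)
print(res)
```

For the files written above, the call would presumably be `check_size_limit(c("data/occ_1M.parquet", "data/occ_SAtlantic.parquet"))`, run from the project root.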

bbest commented 2 years ago

Added with commit https://github.com/marinebon/obisindicators/commit/8d84de41fae523fbf60a8d8fb8d474daf21faea4

marinebon / obisindicators

add sample data for running examples #4