Closed by bbest 2 years ago
Hi @7yl4r or @albenson-usgs, should we pick a global subsample, a regional one, or perhaps both?
Running this now to see if the raw data files could be small enough for GitHub (< 50 MB):
librarian::shelf(arrow, dplyr, here, readr)
occ <- open_dataset(here("data/obis_20220404.parquet"))
occ <- occ %>%
  select(decimalLongitude, decimalLatitude, species) %>%
  group_by(decimalLongitude, decimalLatitude, species) %>%
  collect() %>%
  # drop the grouping so later row-index operations (e.g. slice) see all rows
  summarize(records = n(), .groups = "drop")
write_csv(occ, "data/occ.csv")
# write full occurrence dataset
write_parquet(occ, "data/occ.parquet") # 255 MB
# write subsampled global occurrence dataset
set.seed(42)
i <- sample(1:nrow(occ), 1000000)
occ %>%
  slice(i) %>%
  write_parquet("data/occ_1M.parquet") # ? MB
# write full regional occurrence dataset
# [Marine Regions · South Atlantic Ocean (IHO Sea Area)](https://marineregions.org/gazetteer.php?p=details&id=1914)
# Min. Lat 60° 0' 0" S (-60°)
# Min. Long 69° 36' 3" W (-69.6008°)
# Max. Lat 0° 4' 30.4" N (0.0751°)
# Max. Long 20° 0' 32.6" E (20.0091°)
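occ %>%
  filter(
    # columns are named decimalLatitude / decimalLongitude, not lat / lon
    decimalLatitude  >= -60,
    decimalLongitude >= -69.6008,
    decimalLatitude  <= 0.0751,
    decimalLongitude <= 20.0091) %>%
  sample_n(1000000) %>%
  # the piped data frame is already the first argument, so only pass the path
  write_parquet("data/occ_SAtlantic.parquet") # ? MB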
If it works to run globally, it seems like we should just do that and only worry about subsampling if we need to?
Sorry, I should have read the original comment first; I skipped to the one you tagged me in. If you need to subsample, then I guess go ahead. I want to run the code locally on the full global dataset first to get a sense of what's happening. Some discussion going on in the room here points to doing a spatial subsample. Does that help?
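For what it's worth, a spatial subsample could be as simple as a bounding-box filter. Here is a minimal sketch in base R (so it runs without extra packages) on a made-up data frame. The `toy` rows and the `in_bbox()` helper are hypothetical and only for illustration; the limits are the South Atlantic Ocean (IHO Sea Area) values from Marine Regions quoted earlier in the thread.

```r
# Hypothetical toy occurrences; column names follow the OBIS export used above.
toy <- data.frame(
  decimalLongitude = c(-30.5, 150.2, 10.0, -80.0),
  decimalLatitude  = c(-20.1,  -5.0, -45.3, -10.0),
  species          = c("sp_a", "sp_b", "sp_c", "sp_d"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE for rows inside the lon/lat box (inclusive bounds),
# FALSE for rows outside it or with missing coordinates.
in_bbox <- function(lon, lat, xmin, xmax, ymin, ymax) {
  !is.na(lon) & !is.na(lat) &
    lon >= xmin & lon <= xmax &
    lat >= ymin & lat <= ymax
}

# South Atlantic Ocean (IHO Sea Area) limits from Marine Regions
keep <- in_bbox(
  toy$decimalLongitude, toy$decimalLatitude,
  xmin = -69.6008, xmax = 20.0091,
  ymin = -60,      ymax = 0.0751
)
toy_satl <- toy[keep, ]
```

Note that a bounding box only approximates the sea-area polygon, so some records inside the box may fall outside the actual IHO area.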
Yeah, that's great, thanks @albenson-usgs!
I think this will work:
librarian::shelf(arrow, dplyr, here, readr)
occ <- open_dataset(here("data/obis_20220404.parquet")) # 12.82 GB
occ <- occ %>%
  select(decimalLongitude, decimalLatitude, species) %>%
  group_by(decimalLongitude, decimalLatitude, species) %>%
  collect() %>%
  summarize(records = n())
write_csv(occ, "data/occ.csv")
occ <- occ %>%
  ungroup()
# write full occurrence dataset
write_parquet(occ, "data/occ.parquet") # 27,965,153 × 4; 255 MB
# write subsampled global occurrence dataset
set.seed(42)
i <- sample(1:nrow(occ), 1000000)
occ %>%
  slice(i) %>%
  write_parquet("data/occ_1M.parquet") # 1,000,000 × 4; 16.9 MB
# write full regional occurrence dataset
# [Marine Regions · South Atlantic Ocean (IHO Sea Area)](https://marineregions.org/gazetteer.php?p=details&id=1914)
# Min. Lat 60° 0' 0" S (-60°)
# Min. Long 69° 36' 3" W (-69.6008°)
# Max. Lat 0° 4' 30.4" N (0.0751°)
# Max. Long 20° 0' 32.6" E (20.0091°)
occ %>%
  filter(
    decimalLatitude  >= -60,
    decimalLongitude >= -69.6008,
    decimalLatitude  <= 0.0751,
    decimalLongitude <= 20.0091) %>%
  write_parquet("data/occ_SAtlantic.parquet") # 1,014,006 × 4; 8 MB
#read_parquet("data/occ_SAtlantic.parquet")
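To check the written files against the size target without eyeballing them, something like the following might do. It is only a sketch: `size_mb()` and `fits_limit()` are hypothetical helper names, 1 MB is taken as 1e6 bytes, and the 50 MB default is the threshold mentioned at the top of this thread.

```r
# Hypothetical helper: file size in MB (1 MB taken as 1e6 bytes here).
size_mb <- function(path) file.size(path) / 1e6

# Hypothetical helper: TRUE if the file exists and is under the limit.
fits_limit <- function(path, limit_mb = 50) {
  file.exists(path) && size_mb(path) < limit_mb
}

# Demo on a throwaway file so the sketch runs anywhere.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:10), tmp, row.names = FALSE)
fits_limit(tmp)

# With the files written above, one might instead run:
# sapply(c("data/occ.parquet", "data/occ_1M.parquet",
#          "data/occ_SAtlantic.parquet"), fits_limit)
```

Going by the sizes noted in the comments above, the 1M global subsample (16.9 MB) and the South Atlantic file (8 MB) would pass, while the full aggregated file (255 MB) would not.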
Peter's example in iobis/notebook-diversity-indicators:index.Rmd#L23 uses the parquet file downloadable at https://obis.org/data/access/, currently obis_20220404.parquet, which is 12.8 GB. We need something else to use in the vignette and function examples that can be locally loaded within the R package.
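One common way to make a small file loadable from within an R package is to ship it under `inst/extdata` and look it up with `system.file()`. A minimal sketch, assuming that layout: the `extdata_path()` helper, the package name `"mypkg"`, and the file name in the commented usage are placeholders, not names from this repository.

```r
# Hypothetical helper: locate a file shipped under inst/extdata of a package,
# failing loudly if it is not installed there.
extdata_path <- function(file, package) {
  path <- system.file("extdata", file, package = package)
  if (path == "") {
    stop("'", file, "' not found in extdata of package '", package, "'")
  }
  path
}

# Looking up a file that does not exist shows the error message instead of
# silently returning an empty path.
msg <- tryCatch(
  extdata_path("nope.parquet", "stats"),
  error = function(e) conditionMessage(e)
)
print(msg)

# Hypothetical usage once a small parquet file ships with the package
# ("mypkg" and the file name are placeholders):
# occ <- arrow::read_parquet(extdata_path("occ_SAtlantic.parquet", "mypkg"))
```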