moiexpositoalonsolab / deepbiosphere

MIT License

Consider using the parquet snapshots of GBIF? #2

Open cboettig opened 2 weeks ago

cboettig commented 2 weeks ago

Fantastic project here, and congrats @leg2015 and team on the paper.

Just a note: GBIF's monthly snapshots are available as partitioned Parquet files from https://registry.opendata.aws/gbif/, and reading them directly can be faster than querying GBIF's own API.

e.g. in Python:

import ibis
gbif = ibis.read_parquet("s3://gbif-open-data-us-east-1/occurrence/2024-10-01/occurrence.parquet/**")
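
As a rough sketch of how the snapshot could then be filtered and aggregated without pulling the full table, something like the following might work. The helper names `snapshot_uri` and `top_species` are hypothetical, and the column names `countrycode` and `species` are assumptions about the snapshot schema, so check them against the actual files first.

```python
from datetime import date

# Bucket and path layout copied from the snippet above.
BUCKET = "gbif-open-data-us-east-1"


def snapshot_uri(snapshot: str) -> str:
    """Return the S3 glob for one snapshot, e.g. snapshot="2024-10-01"."""
    date.fromisoformat(snapshot)  # raises ValueError if this is not a valid ISO date
    return f"s3://{BUCKET}/occurrence/{snapshot}/occurrence.parquet/**"


def top_species(uri: str, country_code: str, limit: int = 10):
    """Hypothetical helper: most frequently recorded species in one country."""
    import ibis  # imported here so snapshot_uri works without ibis installed
    from ibis import _

    gbif = ibis.read_parquet(uri)  # builds a lazy table expression over the files
    query = (
        gbif.filter(_.countrycode == country_code)  # assumed column name
        .group_by("species")                        # assumed column name
        .aggregate(n=_.count())
        .order_by(ibis.desc("n"))
        .limit(limit)
    )
    return query.to_pandas()  # executes the query; needs network access to S3


# Example call (not run here because it reads from S3):
# top_species(snapshot_uri("2024-10-01"), "US")
```

Because the expression is only executed at `to_pandas()`, the filter and aggregation are handed to the query engine rather than applied after loading every row.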

Or in R

library(duckdbfs)
gbif <- open_dataset("s3://gbif-open-data-us-east-1/occurrence/2024-10-01/occurrence.parquet/**")

leg2015 commented 1 week ago

Hi Carl, that's great to hear GBIF has AWS and parquet integration now! That will definitely be worth building into the data generation pipeline 🙂