mekevans / forestTIME


Notes on gbifdb, AWS, remote/local access #11

Closed: diazrenata closed this 3 weeks ago

diazrenata commented 10 months ago

gbifdb uses duckdb to connect to either a locally downloaded snapshot or to an AWS S3 Bucket.
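A minimal sketch of the two access modes, based on my reading of the gbifdb docs (function and argument names may have changed in newer versions):

```r
library(gbifdb)
library(dplyr)

# Remote: stream the AWS-hosted parquet snapshot through duckdb,
# with no local download required.
gbif <- gbif_remote()

# Local: point at a previously downloaded snapshot instead.
# gbif <- gbif_local(dir = gbif_dir())

# Either way, the result is a lazy table you can query with dplyr verbs.
gbif |>
  filter(phylum == "Tracheophyta") |>
  count(year)
```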

I did some research into S3 buckets. They're not free, but they would be very cheap for our use case: there's a free tier that lasts 12 months, and after that costs are measured in cents (roughly 2 cents per GB-month of storage and half a cent — yes, half a cent — per 1,000 requests). I didn't do the math in detail because I think we're unlikely to go this particular route for long-term upkeep, but pricing is here: https://aws.amazon.com/s3/pricing/
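For a rough sense of scale, a back-of-envelope estimate using the rates quoted above (the storage size and request volume are my assumptions, not measurements):

```r
# Back-of-envelope monthly S3 cost for the forestTIME tables.
storage_gb      <- 2       # upper estimate of forestTIME table size
storage_rate    <- 0.02    # ~2 cents per GB-month (assumed)
requests_per_mo <- 10000   # guess at monthly request volume
request_rate    <- 0.005   # ~half a cent per 1,000 requests (assumed)

monthly <- storage_gb * storage_rate +
  (requests_per_mo / 1000) * request_rate
monthly  # on the order of a few cents per month
```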

This is leading me to look into other Amazon hosting services.

DynamoDB is a serverless NoSQL database. It's unclear to me whether it supports access by non-AWS entities or whether it's intended to integrate only with other AWS workflows. (Unlike S3 buckets, I see no obvious direct-access pipeline.)

SimpleDB is a free database; Amazon markets it as a way to manage metadata associated with S3 buckets. Neither of these is an obvious solution.

For what it's worth, a single download of a GBIF snapshot is estimated at 180 GB, while an early estimate of the forestTIME tables was 1-2 GB. So AWS may be far more infrastructure than this project needs.

gbifdb is also a really nice example of an R package (not just a script) handling downloads and remote connections using duckdb.

And: note that gbifdb does almost no data wrangling; instead it returns an object that you can work with via dplyr verbs, or, quoting the docs:

Users already familiar with SQL can construct SQL queries directly with DBI instead.
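Both routes run against the same duckdb connection, so the two user groups aren't mutually exclusive. A sketch with a toy table (the table and column names, "tree" and "dia", are hypothetical forestTIME placeholders):

```r
library(DBI)
library(duckdb)
library(dplyr)

con <- dbConnect(duckdb())
dbWriteTable(con, "tree", data.frame(cn = 1:3, dia = c(5, 12, 20)))

# dplyr route: lazy table, verbs translated to SQL under the hood
tbl(con, "tree") |>
  filter(dia > 10) |>
  collect()

# SQL route: the same query written directly against DBI
dbGetQuery(con, "SELECT * FROM tree WHERE dia > 10")

dbDisconnect(con, shutdown = TRUE)
```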

It seems to me that SQL + dplyr would together cover a large swath of the FIADB user base. Right?

(Does this need to converse with python as well?)

The database integration happens via duckdb, which handles both local and remote connections. I don't know of any other R package doing this, and ours would be the first (to my knowledge) using duckdb for remote access not hosted on AWS.
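The non-AWS remote piece is what duckdb's httpfs extension enables: parquet files on any plain HTTP(S) host can be queried directly. A sketch, assuming a hypothetical hosting URL (the path is a placeholder, not a real forestTIME location):

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb())

# httpfs lets duckdb range-read parquet over HTTP(S) — only the
# columns and row groups a query touches are actually transferred.
dbExecute(con, "INSTALL httpfs; LOAD httpfs;")

trees <- dbGetQuery(con, "
  SELECT *
  FROM read_parquet('https://example.org/forestTIME/tree.parquet')
  LIMIT 10
")

dbDisconnect(con, shutdown = TRUE)
```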

gbifdb interfaces to one massive table stored in parquet (columnar) format. (Columns here.) For forestTIME, we're talking at least 3 tables and probably more. (Side note: I really hate duplicating/redistributing data. Is there any possible way we could source from DataMart without reproducing the database?)
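Multiple tables don't change the storage model much: duckdb can export each table to its own parquet file, and a remote client can then query and join them by URL. A sketch with hypothetical table names:

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb())

# Toy stand-ins for the forestTIME tables (names are placeholders).
dbWriteTable(con, "tree", data.frame(tree_id = 1:2, plot_id = 1))
dbWriteTable(con, "plot", data.frame(plot_id = 1, state = "CO"))

# One parquet file per table, written to the working directory.
for (tab in c("tree", "plot")) {
  dbExecute(con, sprintf("COPY %s TO '%s.parquet' (FORMAT PARQUET)", tab, tab))
}

dbDisconnect(con, shutdown = TRUE)
```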