(Duplicate of https://github.com/livgust/macovidvaccines.com/issues/132, closing that one.)
I'm interested in working on this! But I won't have a decent chunk of time until later this week, so anyone can start brainstorming before me. @zpeyton shared this as a potential option: https://fauna.com/ But I also think we should look at AWS options, since we already use it elsewhere. Probably https://aws.amazon.com/rds/postgresql/ or https://aws.amazon.com/dynamodb/.
We should lay out the pros/cons (even just in this ticket is fine) of a couple options so we know we made an informed decision.
What would the database store?
The database would store the output of every scraper run, so that we can do aggregate reporting on scraper output beyond just the latest run.
Gotcha. I can put together a basic Postgres database and import the existing S3 data if that would be useful.
Regardless of what DB we adopt, we're going to have to import all the historical data. If you want to take the time to do so now (in such a way that you can easily update it again when we do the transition), and do the grunt work of figuring out the appropriate table schema, that seems like work that will be helpful regardless of whether we go with a relational DB or a NoSQL-whatever-it's-called, so I don't think the work would be wasted.
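For concreteness, here's a minimal sketch of what that backfill could look like with the `pg` package, assuming a hypothetical `scraper_runs` table that stashes each run's raw JSON in a `jsonb` column (the table, column, and field names are illustrative, not decided):

```js
// Hedged sketch of a Postgres backfill using the `pg` package.
// The `scraper_runs` table and its columns are hypothetical.
const { Pool } = require("pg");

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function backfill(runs) {
  // One row per scraper run; the raw output goes into a jsonb column so we
  // can defer detailed schema decisions until we know what reporting we need.
  await pool.query(`
    CREATE TABLE IF NOT EXISTS scraper_runs (
      id BIGSERIAL PRIMARY KEY,
      site_name TEXT NOT NULL,
      run_at TIMESTAMPTZ NOT NULL,
      output JSONB NOT NULL
    )
  `);
  for (const run of runs) {
    await pool.query(
      "INSERT INTO scraper_runs (site_name, run_at, output) VALUES ($1, $2, $3)",
      [run.siteName, run.timestamp, run.output]
    );
  }
}
```

A jsonb column keeps the import dumb and repeatable, which matters if we re-run it at transition time.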
In my opinion, something ready-made and scalable might be the right fit for a project at this stage, which is why I recommended https://fauna.com. We just throw in a few env vars and off we go. It will make adding new records really fast and we will be able to focus on writing/fixing scrapers, which seems like the highest priority based on the project backlog https://github.com/users/livgust/projects/2. Here are a few links that may help us get things going with minimal headaches:
@zpeyton Fauna looks like a great option. However, I'm concerned about the $25/month cost versus using DynamoDB in the existing AWS account where the site lives. Given the amount of data we're processing, DynamoDB would cost considerably less than what Fauna charges. With regard to security, the IAM role assumed by the scraper Lambda can be modified to authorize reads/writes against the DynamoDB table.
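For illustration, once the role allows `dynamodb:PutItem`, the write path from the scraper Lambda could be as small as this sketch using the AWS SDK's DocumentClient; the `scraper-output` table name and item attributes are hypothetical:

```js
// Hedged sketch: writing scraper output to DynamoDB from the scraper Lambda.
// Table name and attribute names are hypothetical.
const AWS = require("aws-sdk");

const docClient = new AWS.DynamoDB.DocumentClient();

async function saveScraperRun(siteName, output) {
  await docClient
    .put({
      TableName: "scraper-output",
      Item: {
        siteName, // partition key
        runTimestamp: new Date().toISOString(), // sort key
        output, // raw JSON blob from the scraper
      },
    })
    .promise();
}
```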
I agree with using a NoSQL DB over a traditional RDBMS, considering that the majority of the data we're processing is JSON blobs.
I'm happy to chat on this topic in real-time.
@kriation The cost difference is negligible when you look at how much time it will take to write the backend code and manage the infrastructure necessary for getting the data in and out. It's a data API, not just a database.
Out of the box GraphQL is pretty nice too https://docs.fauna.com/fauna/current/start/graphql
The equivalent with DynamoDB (API Gateway, Lambda, DynamoDB): https://www.serverless.com/blog/make-serverless-graphql-api-using-lambda-dynamodb
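For comparison, Fauna's hosted GraphQL endpoint can be hit with a plain HTTP POST and no extra infrastructure. This sketch assumes a GraphQL schema defining an `allSites` query has already been imported into the database; the query and field names are hypothetical:

```js
// Hedged sketch: querying Fauna's hosted GraphQL endpoint directly.
// Assumes a user-defined GraphQL schema (with an `allSites` query) has
// already been imported; field names are hypothetical.
const fetch = require("node-fetch");

async function listSites() {
  const res = await fetch("https://graphql.fauna.com/graphql", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.FAUNA_SECRET}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      query: "{ allSites { data { name } } }",
    }),
  });
  const { data, errors } = await res.json();
  if (errors) throw new Error(JSON.stringify(errors));
  return data.allSites.data;
}
```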
@zpeyton asked me to weigh in here.
I am partisan to Fauna for obvious reasons.
Nevertheless, I think the biggest difference for a project like this is the modeling flexibility. Dynamo is a key-value store at heart, and requires you to lay out your data according to its query patterns. This makes iterating on your app difficult. Fauna is document/relational and lets you store data how you please. You can query that data directly, or use indexes (which are really more like relational views) to optimize and transform the data to back specific UIs and queries.
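As a concrete sketch of those view-like indexes, here's roughly how one could be defined and queried with the faunadb Node driver; the `scraperRuns` collection, index name, and site name are illustrative:

```js
// Hedged sketch of a Fauna index acting like a relational view.
// Collection, index, and site names are illustrative.
const faunadb = require("faunadb");
const q = faunadb.query;

const client = new faunadb.Client({ secret: process.env.FAUNA_SECRET });

async function example() {
  // Index scraper runs by site name so we can query runs per site
  // without restructuring the documents themselves.
  await client.query(
    q.CreateIndex({
      name: "scraperRuns_by_site",
      source: q.Collection("scraperRuns"),
      terms: [{ field: ["data", "siteName"] }],
    })
  );

  // Fetch all runs for one site through the index.
  return client.query(
    q.Map(
      q.Paginate(q.Match(q.Index("scraperRuns_by_site"), "Gillette Stadium")),
      q.Lambda("ref", q.Get(q.Var("ref")))
    )
  );
}
```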
Fauna is also transactional, which simplifies development. Although you are scraping rather than managing bookings directly, at least for now, the "Ticketmaster" use case is kind of the canonical example of the value of transactions, because you can't have two people reserving the same slot. I imagine keeping the data consistent will be useful even for scraping. DynamoDB transactions come at a performance and $ cost and do not offer the same level of flexibility.
Finally, Fauna can be queried directly from mobile and web apps, which eliminates the need for a user-facing application server to secure and query DynamoDB. I think this would be a big benefit for this project.
Let us know how we can help; this is an awesome project. I recently went through the MA vaccination process and it was a real mess.
Thanks so much @evan! We'll certainly reach out for more info or assistance if we decide to use Fauna.
@evan Thank you again for taking time yesterday to speak with me. After we spoke, I played around with the faunadb node package and got the basics of what we need done in probably a total of 3 hours. See here: https://github.com/livgust/covid-vaccine-scrapers/compare/fauna. It handles the backfill we need and lists data by site. Kudos on a great product and on your fundraising round.
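For anyone skimming that branch, the core write path with the faunadb package is roughly this shape (the collection name and document fields below are placeholders, not necessarily what the branch uses):

```js
// Hedged sketch of the backfill write path with the faunadb Node package.
// Collection name and document shape are placeholders.
const faunadb = require("faunadb");
const q = faunadb.query;

const client = new faunadb.Client({ secret: process.env.FAUNA_SECRET });

// Store one document per scraper run; Fauna keeps the JSON as-is.
function saveScraperRun(run) {
  return client.query(
    q.Create(q.Collection("scraperRuns"), {
      data: {
        siteName: run.siteName,
        timestamp: run.timestamp,
        output: run.output,
      },
    })
  );
}
```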
After discussing this with @zpeyton and @livgust yesterday, we're going to move forward with Fauna, with the understanding that we can always switch to something else if it's not meeting our needs. See our wiki for reasoning here: https://github.com/livgust/covid-vaccine-scrapers/wiki/DB-Tradeoffs
The first step (which I'm working on now) is to read/write to the Collections that Liv has created. I'll use Zach's code as a jumping off point.
I broke this big ticket into multiple smaller ones here: https://github.com/livgust/covid-vaccine-scrapers/milestone/1
I'll close this one in favor of that. But while I have your attention, I would appreciate some eyes on the first PR for this migration! https://github.com/livgust/covid-vaccine-scrapers/pull/150
A proper database would make it much easier to build the features we have planned: data visibility, eligibility notifications, and availability notifications.
Let's pick a database, then design and implement its structure. This would ideally include a backfill of the historical data.