(Duplicate of https://github.com/livgust/macovidvaccines.com/issues/132, closing that one.)
I'm interested in working on this! But I won't have a decent chunk of time until later this week, so anyone can start brainstorming before me. @zpeyton shared this as a potential option: https://fauna.com/ But I also think we should look at AWS options, since we already use it elsewhere. Probably https://aws.amazon.com/rds/postgresql/ or https://aws.amazon.com/dynamodb/.
We should lay out the pros/cons (even just in this ticket is fine) of a couple options so we know we made an informed decision.
What would the database store?
The database would store the output of every scraper run, so that we can do aggregate reporting on scraper output beyond just the latest run.
Gotcha. I can put together a basic Postgres database and import the existing S3 data if that would be useful.
Regardless of what DB we adopt, we're going to have to import all the historical data. If you want to take the time to do so now (in such a way that you can easily update it again when we do the transition), and do the grunt work of figuring out the appropriate table schema, that seems like work that will be helpful regardless of whether we go with a relational DB or a NoSQL-whatever-it's-called, so I don't think the work would be wasted.
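For concreteness, here's a minimal sketch of what that backfill could look like with the `pg` package, assuming a hypothetical `scraper_runs` table that stashes each run's raw JSON in a `jsonb` column (the table, column, and field names are illustrative, not decided):

```js
// Hedged sketch of a Postgres backfill using the `pg` package.
// The `scraper_runs` table and its columns are hypothetical.
const { Pool } = require("pg");

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function backfill(runs) {
  // One row per scraper run; the raw output goes into a jsonb column so we
  // can defer detailed schema decisions until we know what reporting we need.
  await pool.query(`
    CREATE TABLE IF NOT EXISTS scraper_runs (
      id BIGSERIAL PRIMARY KEY,
      site_name TEXT NOT NULL,
      run_at TIMESTAMPTZ NOT NULL,
      output JSONB NOT NULL
    )
  `);
  for (const run of runs) {
    await pool.query(
      "INSERT INTO scraper_runs (site_name, run_at, output) VALUES ($1, $2, $3)",
      [run.siteName, run.timestamp, run.output]
    );
  }
}
```

A jsonb column keeps the import dumb and repeatable, which matters if we re-run it at transition time.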
In my opinion, something ready-made and scalable might be the right fit for a project at this stage, which is why I recommended https://fauna.com. We just throw in a few env vars and off we go. It will make adding new records really fast and we will be able to focus on writing/fixing scrapers, which seems like the highest priority based on the project backlog https://github.com/users/livgust/projects/2. Here are a few links that may help us get things going with minimal headaches:
@zpeyton Fauna looks like a great option. However, I'm concerned about the $25/month cost versus using DynamoDB in the existing AWS account where the site lives. Given the amount of data we're processing, DynamoDB would cost considerably less than what Fauna charges. With regard to security, the IAM role assumed by the scraper Lambda can be modified to authorize reads/writes against the DynamoDB table.
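For illustration, once the role allows `dynamodb:PutItem`, the write path from the scraper Lambda could be as small as this sketch using the AWS SDK's DocumentClient; the `scraper-output` table name and item attributes are hypothetical:

```js
// Hedged sketch: writing scraper output to DynamoDB from the scraper Lambda.
// Table name and attribute names are hypothetical.
const AWS = require("aws-sdk");

const docClient = new AWS.DynamoDB.DocumentClient();

async function saveScraperRun(siteName, output) {
  await docClient
    .put({
      TableName: "scraper-output",
      Item: {
        siteName, // partition key
        runTimestamp: new Date().toISOString(), // sort key
        output, // raw JSON blob from the scraper
      },
    })
    .promise();
}
```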
I agree with using a NoSQL DB over a traditional RDBMS, considering that the majority of the data we're processing is JSON blobs.
I'm happy to chat on this topic in real-time.
@kriation The cost difference is negligible when you look at how much time it will take to write the backend code and manage the infrastructure necessary for getting the data in and out. It's a data API, not just a database.
Out of the box GraphQL is pretty nice too https://docs.fauna.com/fauna/current/start/graphql
The equivalent with DynamoDB (API Gateway, Lambda, DynamoDB): https://www.serverless.com/blog/make-serverless-graphql-api-using-lambda-dynamodb
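For comparison, Fauna's hosted GraphQL endpoint can be hit with a plain HTTP POST and no extra infrastructure. This sketch assumes a GraphQL schema defining an `allSites` query has already been imported into the database; the query and field names are hypothetical:

```js
// Hedged sketch: querying Fauna's hosted GraphQL endpoint directly.
// Assumes a user-defined GraphQL schema (with an `allSites` query) has
// already been imported; field names are hypothetical.
const fetch = require("node-fetch");

async function listSites() {
  const res = await fetch("https://graphql.fauna.com/graphql", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.FAUNA_SECRET}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      query: "{ allSites { data { name } } }",
    }),
  });
  const { data, errors } = await res.json();
  if (errors) throw new Error(JSON.stringify(errors));
  return data.allSites.data;
}
```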
@zpeyton asked me to weigh in here.
I am partisan to Fauna for obvious reasons.
Nevertheless, I think the biggest difference for a project like this is the modeling flexibility. Dynamo is a key-value store at heart, and requires you to lay out your data according to its query patterns. This makes iterating on your app difficult. Fauna is document/relational and lets you store data how you please. You can query that data directly, or use indexes (which are really more like relational views) to optimize and transform the data to back specific UIs and queries.
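As a concrete sketch of those view-like indexes, here's roughly how one could be defined and queried with the faunadb Node driver; the `scraperRuns` collection, index name, and site name are illustrative:

```js
// Hedged sketch of a Fauna index acting like a relational view.
// Collection, index, and site names are illustrative.
const faunadb = require("faunadb");
const q = faunadb.query;

const client = new faunadb.Client({ secret: process.env.FAUNA_SECRET });

async function example() {
  // Index scraper runs by site name so we can query runs per site
  // without restructuring the documents themselves.
  await client.query(
    q.CreateIndex({
      name: "scraperRuns_by_site",
      source: q.Collection("scraperRuns"),
      terms: [{ field: ["data", "siteName"] }],
    })
  );

  // Fetch all runs for one site through the index.
  return client.query(
    q.Map(
      q.Paginate(q.Match(q.Index("scraperRuns_by_site"), "Gillette Stadium")),
      q.Lambda("ref", q.Get(q.Var("ref")))
    )
  );
}
```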
Fauna is also transactional, which simplifies development. Although you are scraping rather than managing bookings directly, at least for now, the "Ticketmaster" use case is kind of the canonical example of the value of transactions, because you can't have two people reserving the same slot. I imagine keeping the data consistent will be useful even for scraping. DynamoDB transactions come at a performance and $ cost and do not offer the same level of flexibility.
Finally, Fauna can be queried directly from mobile and web apps, which eliminates the need for a user-facing application server to secure and query DynamoDB. I think this would be a big benefit for this project.
Let us know how we can help; this is an awesome project. I recently went through the MA vaccination process and it was a real mess.
Thanks so much @evan! We'll certainly reach out for more info or assistance if we decide to use Fauna.
@evan Thank you again for taking time yesterday to speak with me. After we spoke, I played around with the faunadb node package and got the basics of what we need done in probably a total of 3 hours. See here: https://github.com/livgust/covid-vaccine-scrapers/compare/fauna. It handles the backfill we need and lists data by site. Kudos on a great product and on your fundraising round.
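For anyone skimming that branch, the core write path with the faunadb package is roughly this shape (the collection name and document fields below are placeholders, not necessarily what the branch uses):

```js
// Hedged sketch of the backfill write path with the faunadb Node package.
// Collection name and document shape are placeholders.
const faunadb = require("faunadb");
const q = faunadb.query;

const client = new faunadb.Client({ secret: process.env.FAUNA_SECRET });

// Store one document per scraper run; Fauna keeps the JSON as-is.
function saveScraperRun(run) {
  return client.query(
    q.Create(q.Collection("scraperRuns"), {
      data: {
        siteName: run.siteName,
        timestamp: run.timestamp,
        output: run.output,
      },
    })
  );
}
```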
After discussing this with @zpeyton and @livgust yesterday, we're going to move forward with Fauna, with the understanding that we can always switch to something else if it's not meeting our needs. See our wiki for reasoning here: https://github.com/livgust/covid-vaccine-scrapers/wiki/DB-Tradeoffs
The first step (which I'm working on now) is to read/write to the Collections that Liv has created. I'll use Zach's code as a jumping off point.
I broke this big ticket into multiple smaller ones here: https://github.com/livgust/covid-vaccine-scrapers/milestone/1
I'll close this one in favor of that. But while I have your attention, I would appreciate some eyes on the first PR for this migration! https://github.com/livgust/covid-vaccine-scrapers/pull/150
A proper database would make it much easier to build the features we have planned: data visibility, eligibility notifications, and availability notifications.
Let's pick a database, then design and implement its structure. This would ideally include a backfill of the historical data.