rivernews / media-literacy

Exploring possibilities of technology empowering media literacy

POC - Scraper on State machine #1

Closed rivernews closed 3 years ago

rivernews commented 3 years ago

AWS provides a great built-in async worker mechanism, and there's a lot of potential to save cost. Without it, we have to keep servers running all day, or at least run a managed EKS cluster, not to mention manage the distributed workload ourselves and maybe even build our own management portal.

Ideally, we can run scrapers on a state machine. But there's some research we need to do before we confirm it's worth carrying out.

Cost Research

AWS on-demand billing - charge per execution

Is the state machine's billing similar to the Lambda / serverless pay-per-use model, and can it save us a lot of money compared to self-hosting on EKS while building a workload management system on our own?
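One way to frame this question is a break-even calculation between pay-per-use billing and an always-on cluster. The sketch below is only arithmetic: the function names are hypothetical, and the two `ASSUMED_*` prices are illustrative placeholders that should be checked against the current AWS pricing pages for the region in use. It also ignores the Lambda compute time behind each step and any worker nodes in the cluster.

```python
HOURS_PER_MONTH = 730  # average hours in a month, used for always-on costs

# Illustrative placeholder prices (assumptions, verify before relying on them).
ASSUMED_PRICE_PER_TRANSITION = 0.000025  # USD per state transition
ASSUMED_CLUSTER_HOURLY_RATE = 0.10       # USD per hour for an always-on cluster


def pay_per_use_monthly_cost(executions_per_month, transitions_per_execution,
                             price_per_transition):
    """Monthly cost when billed per state transition."""
    return executions_per_month * transitions_per_execution * price_per_transition


def always_on_monthly_cost(hourly_rate, hours=HOURS_PER_MONTH):
    """Monthly cost of something that runs all day regardless of workload."""
    return hourly_rate * hours


def break_even_executions(hourly_rate, transitions_per_execution,
                          price_per_transition):
    """Executions per month at which both billing models cost the same."""
    cost_per_execution = transitions_per_execution * price_per_transition
    return always_on_monthly_cost(hourly_rate) / cost_per_execution


if __name__ == "__main__":
    # Hypothetical workload: 1,000 scraper runs a month, 4 states each.
    usage = pay_per_use_monthly_cost(1_000, 4, ASSUMED_PRICE_PER_TRANSITION)
    fixed = always_on_monthly_cost(ASSUMED_CLUSTER_HOURLY_RATE)
    limit = break_even_executions(ASSUMED_CLUSTER_HOURLY_RATE, 4,
                                  ASSUMED_PRICE_PER_TRANSITION)
    print(f"pay-per-use: ${usage:.2f}, always-on: ${fixed:.2f}, "
          f"break-even at about {limit:,.0f} executions per month")
```

If the expected number of scraper runs sits far below the break-even point, the pay-per-use model is the cheaper one under these assumptions. The comparison changes if the cluster is already paid for by other projects.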

Pro

Cons

Reusing existing k8s cluster

We have an existing k8s cluster, so we're already paying for it. If the current cluster can fully cover this project's needs, then it probably makes more sense to just run our own servers in it. But of course, if we plan to migrate everything to Lambda and eventually let go of the k8s cluster, then this is not an issue.

Need to consider CI/CD platforms as well

But if the workload is small... GitHub Actions or another CI/CD platform is probably a better fit, and it's free of charge for public repositories (private repositories get a limited free quota).

Cons

If it's really cheaper... wait, is deeper coupling with AWS a good move?

However, another perspective: if we really want to shift gears, including our future skillset, toward all-cloud computing and tighter coupling with one cloud provider, then we might even want to migrate the iriversland and appl-tracky backends to serverless! (Thank god the frontend probably doesn't need much change.)

Of course, it's debatable whether we really want to commit ourselves entirely to one cloud provider's serverless architecture. Any single provider is limited and can rarely compete with the open source world in terms of community (Stack Overflow contributions), documentation (some popular open source tools are maintained by really passionate people), and user (developer) friendliness. Cloud providers usually prioritize their business needs when it comes to tooling, whereas open source tools receive feedback from the entire community and can end up more developer-centered.

That said, why not a hybrid solution? People are now decoupling serverless into an architecture where you can swap different cloud providers in and out behind it, called "vendor agnostic" serverless architecture. If you're using SAM, then yes, it's tied to AWS, with its limited documentation, limited online community, and limited motivation to help developers beyond its business needs. But with a vendor-agnostic serverless framework, or even something like Terraform, another open source tool (though maintained by a private company), you gain the benefits of great tooling, better documentation, and a larger online community.

Open discussion: workload orchestration, async processing

Additional Keywords: workflow engine, workflow execution engine, long-running workflows

rivernews commented 3 years ago

POC

Run a scraper in a Step Functions state machine, scrape the HTML, and store it on S3. The goal is to learn the basics of creating a step function and its required resources using Terraform.

Also see Terraform official doc.
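A minimal sketch of what this POC could look like, assuming the scraper runs as a single Lambda task inside the state machine. All names here (`scrape_to_s3`, `build_state_machine_definition`, the `raw-html/` key prefix, the `url` and `bucket` input fields, the `ScrapePage` state, the retry settings) are hypothetical choices for illustration, not something this repo defines. The S3 client is passed in so the scraping logic can be exercised without AWS credentials.

```python
import hashlib
import json
import urllib.request
from datetime import datetime, timezone


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes."""
    request = urllib.request.Request(url, headers={"User-Agent": "media-literacy-poc"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def scrape_to_s3(url, bucket, s3_client, fetch=fetch_html, now=None):
    """Fetch `url`, store the HTML in `bucket`, and return the object key."""
    now = now or datetime.now(timezone.utc)
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    key = f"raw-html/{now:%Y-%m-%d}/{digest}.html"  # hypothetical key layout
    s3_client.put_object(Bucket=bucket, Key=key, Body=fetch(url),
                         ContentType="text/html")
    return key


def lambda_handler(event, context):
    """Lambda entry point; the state machine input arrives as `event`."""
    import boto3  # imported lazily so the helpers above work without the AWS SDK

    key = scrape_to_s3(event["url"], event["bucket"], boto3.client("s3"))
    return {"bucket": event["bucket"], "key": key}


def build_state_machine_definition(lambda_arn):
    """Return an Amazon States Language definition with one scraper task."""
    definition = {
        "Comment": "POC: scrape one page and store the HTML on S3",
        "StartAt": "ScrapePage",
        "States": {
            "ScrapePage": {
                "Type": "Task",
                "Resource": lambda_arn,
                # Assumed retry policy; tune or drop as needed.
                "Retry": [{
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 5,
                    "MaxAttempts": 2,
                    "BackoffRate": 2.0,
                }],
                "End": True,
            }
        },
    }
    return json.dumps(definition, indent=2)
```

The JSON string returned by `build_state_machine_definition` is the kind of document that Terraform's `aws_sfn_state_machine` resource takes in its `definition` argument. In practice it may be simpler to write the definition directly in Terraform with `jsonencode`.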

rivernews commented 3 years ago

POC

Rethinking this - we want to be event-driven:

Reference

rivernews commented 3 years ago
[image]