mediacloud / backend

Media Cloud is an open source, open data platform that allows researchers to answer quantitative questions about the content of online media.
http://www.mediacloud.org
GNU Affero General Public License v3.0

serverless? #142

Closed AlJohri closed 6 years ago

AlJohri commented 7 years ago

With AWS Lambda now supporting Python 3.6, I was wondering if you've considered taking portions of mediacloud and porting to a serverless architecture? For example just the downloading of story html.

I imagine the cost savings would be in order(s) of magnitude.
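To make the proposal concrete, here is a minimal sketch of what "just the downloading of story html" might look like as a Lambda-style function. It is not taken from the mediacloud codebase: the `url` key in the event and the `fetch` parameter are hypothetical names chosen for illustration, and a real version would also need to write the HTML somewhere (for example S3) and handle errors and retries.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Download a URL and return the response body decoded as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def handler(event, context, fetch=fetch_html):
    """Lambda-style entry point.

    Assumption: the event is a dict with a 'url' key (hypothetical schema).
    The 'fetch' argument exists only so the download can be swapped out
    in tests; Lambda itself passes just (event, context).
    """
    url = event["url"]
    html = fetch(url)
    return {"url": url, "length": len(html)}
```

The function keeps no state between invocations, which is what would let many copies run in parallel on demand.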

hroberts commented 7 years ago

We have considered and continue to consider this. We actually ran media cloud on ec2 for the first couple of years of existence. And we actually store all of our raw html in s3 now.

Our experience way back when was that it was very hard to get good database performance when wrestling with the opaque I/O system provided by the virtual machine. Database performance is much better in the cloud now, but I suspect this would still be a significant problem. Our current postgres server runs on an enclosure of 10-20 fast, server-quality SSDs with mostly consistent and transparent performance characteristics, which is hard to reproduce in the cloud. We have the capability to move basically any part of the system off to the cloud, and we find that performance generally suffers for our database systems when we do that, even when paying for very expensive cloud servers (and the database stuff is where the majority of our server resources go).

We already pay for hosting and bandwidth at MIT through our overhead, so the economics for local hosting make more sense for us. And we have invested in expensive (for us) servers that are mostly a sunk cost at this point but also have the feature that they will continue to run fine even if our currently generous funding level is not reproduced at some point. For a simple price point, last week we had to run our solr cluster in the cloud for one week to do some maintenance, and it cost us $170 / day. That would be cheaper if we made long term commitments, but it is still a huge ongoing cost to have to cover as an academic research project.

I don't have any experience with AWS Lambda, but our various data processing daemons consume a lot of CPU and memory resources, and when we have priced out reproducing the work we do on our local servers, the price tag has always come out very large. At some point you end up having to pay for computing cycles.

-hal

-- Hal Roberts Fellow Berkman Klein Center for Internet & Society Harvard University

AlJohri commented 7 years ago

I see - that makes sense, I didn't realize it was hosted locally. I had assumed it was on EC2.

I mentioned AWS Lambda because it basically spins up an instance when it starts a task, keeps it alive or "hot" for some period of time in case it needs to perform other tasks, and then spins down. In terms of cost, you only get charged for the compute time you actually use. This can be much cheaper than having an EC2 instance with a worker listening to a queue, since there you are also paying for the worker's idle time.

This is a good explanation:

Lambda is a relatively new service (launched at end of 2014) that offers a different type of compute abstraction: A user-defined function that can perform a small operation, where AWS manages provisioning and scheduling how it is run.

What does “serverless” mean? This idea of using Lambda for application logic has grown to be called serverless since you don't explicitly manage any server instances, as you would with EC2. This term is a bit confusing since the functions themselves do of course run on servers managed by AWS.

https://github.com/open-guides/og-aws#lambda

Moving from EC2 to Lambda (for some tasks) may be much cheaper, but I'm not sure how it compares to local hosting. In addition, this assumes that the worker processes / tasks have some amount of idle time, which may not be the case here.
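The idle-time argument can be sketched as a small cost model. Everything numeric here is an assumption: the function names are made up for illustration, and the prices in the usage note are placeholders, not actual AWS or MIT hosting rates. The model also ignores database and storage costs entirely.

```python
SECONDS_PER_HOUR = 3600


def always_on_cost(hourly_rate, hours):
    """Cost of a worker that is billed whether it is busy or idle."""
    return hourly_rate * hours


def pay_per_use_cost(rate_per_gb_second, memory_gb, busy_seconds):
    """Cost when only the time spent running tasks is billed."""
    return rate_per_gb_second * memory_gb * busy_seconds


def break_even_utilization(hourly_rate, rate_per_gb_second, memory_gb):
    """Fraction of time a worker must be busy for the two models to cost the same.

    Below this fraction, pay-per-use is cheaper; above it, the always-on
    worker is cheaper. A value above 1.0 means pay-per-use is cheaper
    even for a worker that is never idle.
    """
    per_use_hourly = rate_per_gb_second * memory_gb * SECONDS_PER_HOUR
    return hourly_rate / per_use_hourly
```

For example, with a placeholder always-on rate of 0.05 per hour and a placeholder pay-per-use rate of 0.00005 per GB-second at 1 GB of memory, `break_even_utilization(0.05, 0.00005, 1.0)` gives the busy fraction at which the two options cost the same. A worker that is busy more than that fraction of the time would be cheaper to keep always on, which is why the idle-time assumption matters.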

AlJohri commented 7 years ago

Another good read: https://martinfowler.com/articles/serverless.html

Serverless architectures refer to applications that significantly depend on third-party services (known as Backend as a Service or "BaaS") or on custom code that's run in ephemeral containers (Function as a Service or "FaaS"), the best known vendor host of which currently is AWS Lambda. By using these ideas, and by moving much behavior to the front end, such architectures remove the need for the traditional 'always on' server system sitting behind an application. Depending on the circumstances, such systems can significantly reduce operational cost and complexity at a cost of vendor dependencies and (at the moment) immaturity of supporting services.

AlJohri commented 7 years ago

Default concurrent execution limit on Lambda raised to 1000: https://aws.amazon.com/about-aws/whats-new/2017/05/aws-lambda-raises-default-concurrent-execution-limit/

hi @hroberts, from your last comment I think you only priced running mediacloud on local servers vs remote servers (EC2). I think you still might want to consider lambda. I don't think this comment holds true in the case of serverless:

At some point you end up having to pay for computing cycles.

If you have some data you can make public regarding the cost, I'd be happy to run the numbers as well.

hroberts commented 7 years ago

Always happy to look again. The vast majority of the hard work our servers are doing is in the postgres and solr databases. The serverless stuff could help us run the various collection and processing jobs, but resource-wise those are by far the easiest for us to handle. The much harder thing is making the databases performant and cost effective. It could be that we would save money by moving that database stuff to the cloud, but it would be a heavy lift.

-hal

rahulbot commented 6 years ago

Closing as something to keep in mind, but take no action on.