tradle / tradleconf

CLI for managing your Tradle MyCloud instance
MIT License

Improve working with disabled MyClouds #27

Open martinheidegger opened 2 years ago

martinheidegger commented 2 years ago

When a MyCloud is disabled, the CLI currently shows errors on every new operation, as in #26, because the underlying code cannot look up information about a MyCloud in the disabled state.

Currently re-enabling is a multi-step manual process that involves going into the AWS console.

I am thinking of 3 steps to improve the situation:

  1. If the cli lambda returns an empty result, we should use an AWS command to see if the lambda in question has been disabled, and if it has: throw a "mycloud disabled" error
  2. Find a way to avoid the manual steps when re-enabling a MyCloud
  3. If the cli runs into a "mycloud disabled" error, it should offer the user a prompt, asking if they want to re-enable it.
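
Steps 1 and 3 above could be sketched roughly as follows, in TypeScript. This is only a sketch under assumptions: every name here (`MyCloudDisabledError`, `Deps`, `runCommand`, `runWithReenablePrompt`) is hypothetical and not existing tradleconf code, and it assumes that a disabled lambda shows up as a reserved concurrency of 0, which would need to be checked against what `tradleconf disable` actually does. The AWS calls are injected as plain functions so the flow can be tried without AWS access.

```typescript
// Sketch only: all names below are hypothetical, not part of tradleconf.

/** Error thrown when the MyCloud appears to be disabled. */
class MyCloudDisabledError extends Error {
  functionName: string

  constructor(functionName: string) {
    super(`MyCloud appears to be disabled (checked ${functionName})`)
    // Keeps `instanceof` working when compiled to older JS targets.
    Object.setPrototypeOf(this, MyCloudDisabledError.prototype)
    this.name = 'MyCloudDisabledError'
    this.functionName = functionName
  }
}

/** Injected dependencies (stand-ins for the real AWS and prompt calls). */
interface Deps {
  invokeCli: (command: string) => Promise<string | undefined>
  getReservedConcurrency: (functionName: string) => Promise<number | undefined>
  confirm: (question: string) => Promise<boolean>
  enable: () => Promise<void>
}

/** ASSUMPTION: a reserved concurrency of 0 is what "disabled" looks like. */
function looksDisabled(reservedConcurrency: number | undefined): boolean {
  return reservedConcurrency === 0
}

/** Step 1: turn an empty result into a specific error where possible. */
async function runCommand(deps: Deps, functionName: string, command: string): Promise<string> {
  const result = await deps.invokeCli(command)
  if (result !== undefined && result !== '') return result
  if (looksDisabled(await deps.getReservedConcurrency(functionName))) {
    throw new MyCloudDisabledError(functionName)
  }
  throw new Error(`Empty result from ${functionName}`)
}

/** Step 3: on a "disabled" error, offer to re-enable and retry once. */
async function runWithReenablePrompt(deps: Deps, functionName: string, command: string): Promise<string> {
  try {
    return await runCommand(deps, functionName, command)
  } catch (err) {
    if (!(err instanceof MyCloudDisabledError)) throw err
    const yes = await deps.confirm('This MyCloud is disabled. Re-enable it?')
    if (!yes) throw err
    await deps.enable()
    return runCommand(deps, functionName, command)
  }
}
```

Step 2 (avoiding the manual console steps) would live behind the injected `enable` function, which is deliberately left open here.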
urbien commented 2 years ago

I agree with point 3, which is easy to fix, but how big of a project is the rest? I suggest trying a different take first. Why do we even have disable function? It is because of Lambda cold starts: it takes AWS about 10 seconds to start a Lambda that was not accessed recently, that is, one that is not warm. You can think of it this way:

So we are running a special Lambda that constantly pings the other Lambdas so that they stay warm. This is an expensive process (@spwilko, is it $50 a month?), so tradleconf disable stops it. Also, disable stops the scheduler Lambda that runs every minute and runs various jobs that cost us money.

The alternative solution for the warmup is a configuration option released by AWS called "provisioned concurrency". We can play with it and see how much it costs per month; we may then not need the disable function at all, so there would be no need to fix it. But we may need to have a "slow down" feature, which will decrease the provisioned concurrency and decrease the rate at which jobs run to once a day.
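
To get a feel for "how much it costs per month" before experimenting, a small estimator like the one below might help. It is a sketch, not a pricing reference: it assumes the fixed provisioned-concurrency charge scales with configured concurrency, configured memory, and time, and it takes the per-GB-second price as an input because that price varies by region and must be looked up. The names (`ProvisionedConfig`, `monthlyProvisionedCost`) and the price used in the usage example are hypothetical placeholders.

```typescript
// Sketch only: names are hypothetical; the price must come from AWS pricing.

const SECONDS_PER_DAY = 24 * 60 * 60

interface ProvisionedConfig {
  functionName: string
  provisionedConcurrency: number // instances kept initialized
  memoryGb: number // configured memory, in GB
}

/**
 * Estimates the fixed charge for keeping functions provisioned.
 * Request and execution charges come on top and are not included.
 */
function monthlyProvisionedCost(
  configs: ProvisionedConfig[],
  pricePerGbSecond: number,
  days: number = 30
): number {
  const seconds = days * SECONDS_PER_DAY
  return configs.reduce(
    (total, c) => total + c.provisionedConcurrency * c.memoryGb * seconds * pricePerGbSecond,
    0
  )
}

// Usage with a PLACEHOLDER price (not an actual AWS price):
const placeholderPrice = 0.000004
const estimate = monthlyProvisionedCost(
  [{ functionName: 'onMessage', provisionedConcurrency: 1, memoryGb: 0.5 }],
  placeholderPrice
)
console.log(`Estimated fixed cost per 30 days: $${estimate.toFixed(2)}`)
```

A "slow down" feature would correspond to lowering `provisionedConcurrency` in such a configuration.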

Hybrid strategy - dormant, but not cold

  1. Keep one Lambda (onMessage) provisioned at low concurrency (saving money) and start warming up the other Lambdas upon the first customer request. At this point, increase the frequency of scheduler jobs from once a day to once a minute. This fits the pattern of mostly idle MyClouds with occasional testing. After testing has ended, as evidenced by an hour without activity on the onMessage Lambda, we switch back to dormant.
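
The dormant/active switching described above can be pictured as a two-state machine. The following TypeScript is a minimal sketch under assumptions: the names (`CloudState`, `onMessage`, `onTick`, `schedulerIntervalMs`) are hypothetical, and "an hour" is read as one hour since the last message reached onMessage.

```typescript
// Sketch only: all names are hypothetical illustrations of the idea.

type Mode = 'dormant' | 'active'

interface CloudState {
  mode: Mode
  lastMessageAt: number | undefined // epoch milliseconds
}

const IDLE_TIMEOUT_MS = 60 * 60 * 1000 // one hour without messages

// Scheduler job frequency per mode: once a day vs. once a minute.
const SCHEDULER_INTERVAL_MS: Record<Mode, number> = {
  dormant: 24 * 60 * 60 * 1000,
  active: 60 * 1000
}

/** A customer request wakes the MyCloud up (warming would start here). */
function onMessage(state: CloudState, now: number): CloudState {
  return { mode: 'active', lastMessageAt: now }
}

/** Periodic check: fall back to dormant after an hour of silence. */
function onTick(state: CloudState, now: number): CloudState {
  const idleFor = state.lastMessageAt === undefined ? 0 : now - state.lastMessageAt
  if (state.mode === 'active' && idleFor >= IDLE_TIMEOUT_MS) {
    return { mode: 'dormant', lastMessageAt: state.lastMessageAt }
  }
  return state
}

/** How often scheduler jobs should run in the current mode. */
function schedulerIntervalMs(state: CloudState): number {
  return SCHEDULER_INTERVAL_MS[state.mode]
}
```

Keeping the transitions as pure functions makes the timing rules easy to test separately from the code that would actually warm Lambdas or reschedule jobs.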

MicroVM Snapshotting - future solution for "cold start"

In cloudpal, we plan to use the MicroVM snapshotting mechanism to avoid cold starts. Snapshotting has been slowly productized for the Firecracker MicroVM (which underlies AWS Lambda) and is now quite reliable. But it is not clear when AWS is going to start using it, as there is also a challenge of uniqueness / randomness: each restored snapshot is identical (problem described here).

Snapshotting can be further significantly improved by a super-awesome OS paging mechanism called REAP. Compared to baseline snapshotting, REAP slashes the cold-start delays by 3.7x. It is tested with the help of Hive. REAP was a research project and at this point is not in active development.

But there is more recent work, called SnapFaaS, that analyzes the limitations of REAP and offers an alternative approach that claims significant improvement over REAP. Still, according to the paper, the lower bound for this optimization is a 15ms cold start. Language-specific sandboxing runtimes, like WASM, can, as the above paper states, achieve 10-20 microsecond cold starts, and some CDNs already have such runtimes in production. But they are not generic (we can't run there unless we rewrite MyCloud in Rust) and, more importantly, they provide a much lower level of protection from the host (cloud provider). SnapFaaS seems to be in active development at Princeton but still needs to solve the randomness problem.

spwilko commented 2 years ago

Lambda costs for a single mycloud seem to be < $0.15 per day. Here's a breakdown for South America for 1 day:

- Total cost ($): 0.96
- DynamoDB ($): 0.35
- S3 ($): 0.21
- Key Management Se... ($): 0.17
- Lambda ($): 0.13
- CloudWatch ($): 0.1
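
For comparison with the "$50 a month" question earlier in the thread, the one-day figures above can be projected to a month. The TypeScript below is just that arithmetic, assuming (hypothetically) that a single sampled day is representative and that a month has 30 days; the helper names are made up for the example.

```typescript
// Sketch only: projects the one-day sample above; assumes the day is typical.

const dailyCostUsd: Record<string, number> = {
  'DynamoDB': 0.35,
  'S3': 0.21,
  'Key Management Se...': 0.17, // service name is truncated in the source
  'Lambda': 0.13,
  'CloudWatch': 0.1
}

/** Rounds to whole cents to hide floating-point noise. */
function roundToCents(amount: number): number {
  return Math.round(amount * 100) / 100
}

/** Sums all per-service daily costs. */
function totalPerDay(costs: Record<string, number>): number {
  return roundToCents(Object.values(costs).reduce((sum, c) => sum + c, 0))
}

/** Linear projection of a daily cost over a number of days. */
function projected(perDay: number, days: number = 30): number {
  return roundToCents(perDay * days)
}

console.log(totalPerDay(dailyCostUsd)) // 0.96, matching the reported total
console.log(projected(dailyCostUsd['Lambda'])) // 3.9
console.log(projected(totalPerDay(dailyCostUsd))) // 28.8
```

On this sample, Lambda would come to roughly $3.90 per 30 days and the whole MyCloud to roughly $28.80.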

martinheidegger commented 2 years ago

> but how big of a project is the rest?

The suggested points build upon each other and can be completed step by step. They are small steps, each easy to do. Good for people starting with mycloud/tradleconf development.

In the meantime I also thought of further ways to improve this situation beyond the initial tasks:

Offtopic: Slowdown / Why disable?
> we may need to have a "slow down" feature

"Slow down" is a performance knob. I think that is an interesting thought but I think it should be additional/separate to a disable switch. After all: when I re-enable a MyCloud I want it to run in the same configuration as when it was disabled before. We can make a different issue for this?!

> Why do we even have disable function?

While the cost reasons are valid, I think there are two other valid reasons:

- To disable a mycloud in "emergency": e.g. if one of the enabled products is destructive or before an update is available for a very problematic security issue.
- Scheduled tasks can be not just expensive but also annoying (i.e. sending out emails) and pausing scheduled tasks may be a comfortable thing to do.

The discussion on "why disable" or "how to disable" is something we should have but maybe let's do that in a separate issue?