GE: Maintenance Windows

joefoxton commented 4 years ago

We should receive advance warnings about updates
We should have the opportunity to opt into various updates
How do we “pause” workloads, and ensure continuity after maintenance?

zaibon commented 4 years ago

We should receive advance warnings about updates

What updates are you talking about? The 0-OS upgrade system is a rolling release. There is nothing for a farmer to choose from. All nodes on the grid receive updates automatically. Updates don't interrupt workloads.

We should have the opportunity to opt into various updates

Are you talking about new features you would like to test before they are released on the production network? If so, you can run a couple of nodes on testnet to try upcoming features, or even on devnet if you really want to be at the edge of what is being built, but keep in mind that devnet can be relatively unstable.

joefoxton commented 4 years ago

Yes, the testnet solution would work for us.

However, the main issues being addressed in this ticket are:

Concern about changes being made by ThreeFold tech that affect live customer workloads
Concern that hardware maintenance will knock out live customer workloads

We are open to any ideas on how to allay these concerns.

zaibon commented 4 years ago

Concern about changes being made by ThreeFold tech that affect live customer workloads

Regarding this, the simple answer is that we aim not to create such events. The goal is to have workloads left alone as much as possible. Now, we know this is not a perfect world and things can happen. So today, if we know there is going to be a change that affects live workloads, I will always announce it on the forum before it gets published on mainnet, so the community has time to organize. I think it's a bit like the problem blockchain projects have when they make a protocol change.

Concern that hardware maintenance will knock out live customer workloads

Regarding this, to me this is more a matter between the farmer and their customers. The farmer would need to communicate with customers so they can plan a migration or anything else they require. This has nothing to do with 0-OS or ThreeFold IMO.

joefoxton commented 4 years ago

@zaibon I appreciate the realism here. There will be downtime. So we do appreciate that you would manage this as a scheduled outage, with fair warning and information about the impact. Thank you!

For us, email+telegram notifications are preferred as they are more 'alerting' than a forum post, which can easily be lost. Is it possible to be notified of these maintenance windows outside of the forum?

Re. Point 2 (Self Healing)... I do think there is responsibility for 0-OS here. Firstly, we are invested in the idea of self-managing, self-healing technology, which has been the promise of ThreeFold for some time. This was not originally posed as a responsibility of the farmer. Indeed, we have it plastered all over our website...

We read this promise as meaning that there are inbuilt mechanisms to capture state of a workload on failure/crash, and restore it cleanly, as transparently to the grid user as possible, such that another node can pick up the workload from the blockchain. This creates 'concerns' at the node level, 0-OS level, Jumpscale, and software level, all of which must be considered to come through on the promise.

To get more specific, we need to make sure the right signals are properly delivered & handled at all levels. This is mandatory for K8s clean running (i.e. SIGTERM when scaling down), as well as handling unexpected failure conditions. This can't be left to a dApp developer, or chalked up to the idea that all software should be stateless. Even "stateless" software requires in-memory state, ain't no way around it. If we expect important workloads to run on our grid, then these workloads must be well educated on the signal handling and atomic transactional controls they will need to implement for self-healing to work. Moreover, S3 (min.io) and K8s will be expecting these signals and are not implicitly stateless.

Indeed, 0-OS would be responsible for sending these signals to all constituent processes, so the signals can be cleanly handled, state preserved, and self-healing / restart of workloads executed seamlessly. Please correct me if I'm off anywhere here!

Self-healing means that this state is somehow preserved. At best, a software developer can only handle signals elegantly, and ensure all transactions that involve changing data synchronously are handled atomically. However, they can't be asked to handle hard stops. This is where the self-healing promise should come into play at the 0-OS level.
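To make the developer's side of this concrete, here is a minimal Python sketch of what "handle signals elegantly" plus "atomic" state changes could look like inside a workload. It is a generic illustration only, not anything 0-OS provides: the names `run`, `save_state_atomically`, `load_state`, `install_handlers` and the `checkpoint.json` file are all hypothetical. Note what it cannot do: on a hard stop (SIGKILL, power loss) the handler never runs, so the workload can only resume from its last checkpoint, which is exactly the gap described above.

```python
import json
import os
import signal
import tempfile
import threading

# Set by the signal handler; the worker loop checks it between units of work.
shutdown_requested = threading.Event()


def request_shutdown(signum, frame):
    """Signal handler: only record the request; cleanup happens in the loop."""
    shutdown_requested.set()


def install_handlers():
    """Register the handler for SIGTERM (e.g. K8s scale-down) and SIGINT."""
    signal.signal(signal.SIGTERM, request_shutdown)
    signal.signal(signal.SIGINT, request_shutdown)


def save_state_atomically(state, path):
    """Write to a temp file in the same directory, then rename over the target.

    os.replace is atomic on a single filesystem, so a reader sees either the
    old checkpoint or the new one, never a half-written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise


def load_state(path):
    """Return the last checkpoint, or a fresh state if none exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"processed": 0}


def run(path, max_items=None, pause=0.0):
    """Hypothetical worker loop: resume from the checkpoint and process items
    until a shutdown is requested or max_items is reached."""
    state = load_state(path)
    while not shutdown_requested.is_set():
        if max_items is not None and state["processed"] >= max_items:
            break
        state["processed"] += 1  # stand-in for one unit of real work
        save_state_atomically(state, path)
        shutdown_requested.wait(pause)  # returns early if shutdown arrives
    return state
```

A workload would call `install_handlers()` once at startup and then `run("checkpoint.json", pause=1.0)`. After a SIGTERM the loop exits at the next check with the checkpoint already on disk, and a restarted instance (on this node or another with access to the same storage) picks up from the stored count.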

To sum up, I'd say that if there are no self-healing provisions at the 0-OS level, then we can't claim self-healing publicly. Otherwise we are making claims that developers on our capacity would need to fulfil, which is disingenuous... We'd be making claims on their behalf...

Please fill me in if I'm missing something.

zaibon commented 4 years ago

For us, email+telegram notifications are preferred as they are more 'alerting' than a forum post,

Email might be possible since all farmers have their email address configured in their farm. Telegram, though, is out of the picture since nothing in the system tracks those.

Regarding self-healing, I want to stress that I'm only talking about the 0-OS layer, AKA the capacity layer. The self-healing capabilities lie in the autonomous layer, which lives on top of the capacity layer. I'm not able to provide details on that because I'm not involved there.

joefoxton commented 4 years ago

Ok fantastic. Email would be perfect. I can use email to trigger Telegram & Slack alerts using Zapier, so there is no excuse for missing the alert :)

I hear you on the self-healing. I guess our concern remains… that in-flight workloads are not protected, and not self-healing as of now, and we can't guarantee uptime to our customers. This severely limits the kinds of workloads we can sell. I'll take it up with Weynand.

joefoxton commented 4 years ago

@zaibon @weynandkuijpers My esteemed colleagues at Green Edge have informed me that they were previously promised the following, over the past months and years.

Direct contact between Customer and farmer is not needed and not wanted
Maintenance does not affect the workload because of the fault tolerance of ZeroOs

What is your current position on these promises?

gneumann333 commented 4 years ago

For us, email+telegram notifications are preferred as they are more 'alerting' than a forum post,

Email might be possible since all farmers have their email address configured in their farm. Telegram, though, is out of the picture since nothing in the system tracks those.

Three thoughts on this:

1. The 3SDK / Farmerbot is the main way for farmers and capacity users to interact with the grid, so information about updates and changes should be displayed there.

2. Trying not to influence or interrupt workloads during updates is honorable, but not always possible. After one or two failures due to updates or changes that were not announced, users and farmers will suspect that any problem they see is due to TF "doing something again on the backend".

That would greatly affect the perceived reliability of the grid, so you NEED to inform about updates.

Also, can you point me to the current ZeroOs development roadmap, please?

joefoxton commented 4 years ago

Just following up on our questions above...

joefoxton commented 4 years ago

Closing this, but would appreciate answers on the above items re:

Direct contact between Customer and farmer is not needed and not wanted
Maintenance does not affect the workload because of the fault tolerance of ZeroOs

threefoldtech / home

GE: Maintenance Windows #795