rkhamis commented 5 years ago

Issue migrated from https://api.github.com/repos/Jumpscale/lib9/issues/298, opened by @zaibon

The capacity database should be able to give the percentage of uptime of a node over a month.

Proposal:

tX: time of update of the uptime
actual_uptime: uptime received from the 0-OS node

then we can compute the percentage of uptime like:

expected_uptime = tX - tX_prev  # number of seconds between two uptime updates (tX_prev is tX-1)
if actual_uptime < expected_uptime:
    uptime += actual_uptime  # the node was down for part of the interval, so count only the time it was up
elif actual_uptime > expected_uptime:
    uptime += expected_uptime  # this is a strange case, not sure it can happen, except if the node tries to fake its uptime
else:
    uptime += actual_uptime

# then at the end of the month
percentage = 100 * uptime / number_of_seconds_in_this_month
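
The proposal above can be sketched as a small runnable function. This is only an illustration under assumptions: the function name, the `(tX, actual_uptime)` report format, and the use of a calendar month are hypothetical and not part of the capacity database. Each interval is credited with at most its own length, which covers all three branches of the pseudocode at once.

```python
import calendar
from typing import Iterable, Tuple


def monthly_uptime_percentage(
    reports: Iterable[Tuple[int, int]], year: int, month: int
) -> float:
    """Estimate the uptime percentage of one node over a calendar month.

    `reports` is assumed to be a list of (tX, actual_uptime) pairs sorted
    by tX, both expressed in seconds (hypothetical format).
    """
    uptime = 0
    previous_t = None
    for t_x, actual_uptime in reports:
        if previous_t is not None:
            expected_uptime = t_x - previous_t
            # Never credit more than the interval length, even if the
            # node reports a larger (or faked) uptime.
            uptime += min(actual_uptime, expected_uptime)
        previous_t = t_x
    seconds_in_month = calendar.monthrange(year, month)[1] * 86400
    return 100.0 * uptime / seconds_in_month
```

The first report only sets the starting point, so time before it is not credited.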

rkhamis commented 5 years ago

commented by @zaibon This actually won't work, because it only checks whether the node is up, not whether it is actually reachable over the network. So we need an external monitoring tool that checks that the node is both up and reachable.

rkhamis commented 5 years ago

commented by @muhamadazmy @zaibon I think what we need is more of a heartbeat. Not sure though how this should be done for millions of machines without a centralized monitor.

rkhamis commented 5 years ago

*commented by @muhamadazmy

Suggestion 1

All nodes connect to a well-known, defined (maybe even public) monitoring ZeroTier network. No node access is possible over this network.
Nodes send heartbeats upstream.
Heartbeats are aggregated by a (yet to be defined) node in the same logical location as a set of nodes.
This monitor node (maybe it is elected somehow to take the honor of aggregating the heartbeats) should calculate the uptime of nearby nodes and pass it up the tree until it reaches the infrastructure nodes (the ThreeFold backbone, if I may call it that).
The uptime won't be real time.

This suggestion defines new terms that are not really part of the system at the moment, as you can see, which we can discuss further. These include the ThreeFold backbone (or infrastructure nodes) and the local monitoring node(s) for a region. *
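
As a rough illustration of how a monitor node might turn heartbeats into uptime, here is a minimal sketch. Everything in it is an assumption: the `HeartbeatMonitor` class, the 300-second interval, and the 60-second grace period are hypothetical, and the election of the monitor node and the ZeroTier transport are left out entirely.

```python
from collections import defaultdict
from typing import Dict, List


class HeartbeatMonitor:
    """Hypothetical aggregator that derives uptime from heartbeat timestamps."""

    def __init__(self, interval: int = 300, grace: int = 60) -> None:
        self.interval = interval  # expected seconds between heartbeats (assumed)
        self.grace = grace        # tolerated lateness in seconds (assumed)
        self._beats: Dict[str, List[int]] = defaultdict(list)

    def record(self, node_id: str, timestamp: int) -> None:
        """Store one heartbeat received from a node."""
        self._beats[node_id].append(timestamp)

    def uptime_seconds(self, node_id: str) -> int:
        """Sum the gaps between consecutive heartbeats that arrived on time."""
        beats = sorted(self._beats[node_id])
        total = 0
        for previous, current in zip(beats, beats[1:]):
            gap = current - previous
            # A gap longer than interval + grace means at least one missed
            # heartbeat; this sketch counts the whole gap as downtime.
            if gap <= self.interval + self.grace:
                total += gap
        return total
```

Counting a late gap entirely as downtime is a deliberately pessimistic simplification. A monitor higher up the tree could sum these per-node totals without seeing the individual heartbeats.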

rkhamis commented 5 years ago

*commented by @zaibon I like this idea. I don't think we need to do the aggregation per location just yet. The grid is small enough for all the nodes to send their own heartbeats to the "monitor node".

@muhamadazmy can you elaborate on how you would design the heartbeat and the monitoring node, please?*

rkhamis commented 5 years ago

*commented by @muhamadazmy If we don't need aggregation per location just yet, we can actually use the same capacity registration endpoint we have now: https://capacity.threefoldtoken.com

My idea is as follows:

Capacity registration should happen more frequently (maybe every 5 minutes), not every 2 hours.
On capacity registration, an influx series should be updated, which should include the farmer id and node id (or a similar identification for the node: MAC, IP, ZT id, etc.).
Once the data is in influxdb, we can actually aggregate and plot the following metrics:
- Total farmer capacity (now), and a graph of capacity changes over time
- Total registered grid capacity (now), and a graph of grid capacity changes
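
To make the influx series idea concrete, here is a sketch that formats one capacity registration as an InfluxDB line-protocol string, with the farmer id and node id as tags. The measurement name `node_capacity` and the field names `cpu_cores` and `memory_gb` are hypothetical placeholders, not the names used by the capacity website, and only integer fields are handled.

```python
from typing import Dict


def _escape(value: str) -> str:
    """Escape the characters that are special in line-protocol keys and tags."""
    for char in (",", "=", " "):
        value = value.replace(char, "\\" + char)
    return value


def capacity_point(
    farmer_id: str, node_id: str, fields: Dict[str, int], timestamp_ns: int
) -> str:
    """Build one line-protocol entry for a capacity registration.

    The measurement name and field names are assumptions for illustration.
    """
    tags = f"farmer_id={_escape(farmer_id)},node_id={_escape(node_id)}"
    # Integer field values carry an "i" suffix in line protocol.
    values = ",".join(f"{_escape(key)}={value}i" for key, value in sorted(fields.items()))
    return f"node_capacity,{tags} {values} {timestamp_ns}"
```

With the farmer id stored as a tag, the per-farmer and grid-wide totals could then be computed by grouping on that tag at query time.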

@zaibon and I did not agree on whether we should use a push or a pull mechanism to collect the capacity heartbeat, though.

A more serious problem with either approach is trusting the reported capacity. Since the protocol is open, anyone can start registering fake capacity. We can of course defer this issue, but it may be a good idea to at least figure out a plan so that it is considered before designing the capacity reporting/heartbeats. *

rkhamis commented 5 years ago

*commented by @zaibon Let's stick with the push approach; we already have all the infrastructure in place to do this. OK, so if we go down that path, here are the tasks that I can already see we'll have to do:

[x] make the registration of the capacity happen more often (https://github.com/zero-os/0-templates/commit/141cdb4489762fcf492b25ac76ebe1207e6e24ae#diff-f480ff2159beef48225b4e750fcb9e00R23)
[x] update the capacity website so it writes data into influxdb when capacity is registered/updated
[ ] write code that computes/plots the farmer capacity, node uptime, etc.
[ ] update the capacity website UI to show the farm capacity graphs*

rkhamis commented 5 years ago

*commented by @andhartl Just look at my farm Maisaval 2: It has the correct geo ip on the node view but no location on the farmer view. There should be the same location on the farmer view. And if it is the wrong one you can correct it on the farmer view manually.

On 21. Jul 2018, at 06:12, Christophe de Carvalho notifications@github.com wrote:

Well, if it's not shown in the farmer page, that means that the farm has no location set. The nodes are automatically located using geo-IP, but the farmer can overwrite that location if needed.

*

despiegk commented 5 years ago

I think we are making this too complex. For now, measuring locally is OK: just the uptime measured while the node is up, independent of its connection to the internet. We can fix that later.
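
For the local measurement, one possible sketch reads the kernel's uptime counter. It assumes the node runs Linux and exposes `/proc/uptime`, whose first value is the number of seconds since boot. Both function names are hypothetical, and this is not how 0-OS is known to report uptime.

```python
def parse_proc_uptime(content: str) -> float:
    """Return the seconds since boot from the content of /proc/uptime.

    The file holds two numbers: seconds since boot, then idle seconds.
    """
    return float(content.split()[0])


def read_local_uptime(path: str = "/proc/uptime") -> float:
    """Read the local uptime in seconds (Linux only, assumed environment)."""
    with open(path) as handle:
        return parse_proc_uptime(handle.read())
```

The value returned by `read_local_uptime()` could serve as the `actual_uptime` sent at each capacity registration.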

threefoldtecharchive / jumpscale9_lib

grid capacity: create algorithm to compute percentage of uptime over a month #40
