maidsafe / safe_network

Autonomi combines the spare capacity of everyday devices to form a new, autonomous, data and communications layer of the Internet
http://autonomi.com

feature: add metrics for chunk and system storage space #360

Open happybeing opened 1 year ago

happybeing commented 1 year ago

The logs used to include the following metrics which I displayed in vdash and think would be useful to have again. So I wonder if they can be added to the metrics module in safenode:

I can't get these from the local system because there may be multiple safenode processes running, and vdash monitors multiple nodes by displaying any one of any number of available safenode.log files.

happybeing commented 1 year ago

@joshuef I'm replying here to your request for feedback on logfile messages, to keep things in one place. The following are the message strings I currently match for the given metrics; the OP suggests additional metrics that we had at one stage and that might be useful to have again.

PUT: "Wrote record to disk"
GET: "Retrieved record from disk"
REGISTER EDIT: "Editing Register success!" (untested)
ERROR: any logfile message of type 'ERROR'
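As a rough illustration of the matching described above, here is a minimal Python sketch that counts those message strings in a stream of log lines. The metric labels (`put`, `get`, `register_edit`, `error`) and the idea that the log level appears as its own whitespace-delimited token are assumptions made for the example; this is not how vdash or safenode actually implement it.

```python
from collections import Counter

# Substrings quoted in this thread. The keys are illustrative labels only,
# not identifiers taken from safenode or vdash.
MARKERS = {
    "put": "Wrote record to disk",
    "get": "Retrieved record from disk",
    "register_edit": "Editing Register success!",
}


def count_metrics(lines):
    """Count marker hits and ERROR-level lines in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        for name, marker in MARKERS.items():
            if marker in line:
                counts[name] += 1
        # Assumption: the level is a standalone token such as "ERROR".
        if "ERROR" in line.split():
            counts["error"] += 1
    return counts
```

Passing an open file object (for example one safenode.log) as `lines` would give per-node counts, since each node writes its own logfile.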

I have a crude mechanism for categorising node status as: Stopped|Connecting|Connected|Disconnected. I don't know if a node could provide a more definitive state message as a periodic output to the logfile (not just on change), but if so that would improve accuracy.

I'm not sure what would be helpful to add in other areas, either for devs (if you use vdash at all) or for users interested in monitoring the working and performance of their nodes. It would be nice to find some metrics that at least reflect the ongoing activity of the different network features related to each node: things like register edits, CRDT and DBC related activity, and of course node earnings!

Of course I'm open to any suggestions for things that your team would like.

joshuef commented 1 year ago

space used for chunks space used for registers

Would be hard as they're all records now (hard as in reading from disk). So I'd be inclined to just leave that to a sys level check of the record_store dir. Not sure how you feel about that?
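A sys level check along these lines could be as small as summing file sizes under each node's record_store directory. The sketch below is only an assumption about how a monitoring tool might do it; the directory location is not specified in this thread, so the caller has to supply the path for each node.

```python
from pathlib import Path


def dir_size_bytes(root):
    """Sum the sizes of all regular files under root, recursively."""
    total = 0
    for path in Path(root).rglob("*"):
        try:
            if path.is_file():
                total += path.stat().st_size
        except OSError:
            # A record may be removed between listing and stat; skip it.
            continue
    return total
```

Because the total covers all records together, it cannot split chunk space from register space, which matches the point that they are all records now.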

We have eg: https://github.com/maidsafe/safe_network/blob/main/sn_node/src/log_markers.rs#L21

Connecting would just be everything before that. Though perhaps we can add an initial message when we start attempting to connect to the first peers (could be done via https://github.com/maidsafe/safe_network/issues/518)

could provide a more definitive state message as a periodic output to the logfile

Hmm, outwith of connecting/stopped it would (in theory) just always be connected. Not sure if that's that useful? (Or are you imagining more states?) We have some kbucket logs of peer counts that may be more granular about the state of connectivity: https://github.com/maidsafe/safe_network/blob/main/sn_networking/src/event.rs#L458

Can't say we as a team use vdash as yet. Everything is headless and we're just looking at grabbing basic stats of nodes to determine any major issues that may be in play thus far.

happybeing commented 1 year ago

sys level check of the record_store dir. Not sure how you feel about that?

Yes, whatever is easy and useful.

For state: starting,connecting(ed) etc I'm envisaging it as a proxy for "things seem to be ok, or not". So losing connectivity, being stopped etc. or for anything that might reasonably happen that the operator might want to know about.

So while connected a periodic "all ok" type message could be logged and I would flag it as a problem if this wasn't seen for too long. As well as showing any other states beforehand.
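To make the idea concrete, here is a minimal sketch of the staleness check I have in mind. The function name and the five minute default are placeholders chosen for illustration, and it assumes the dashboard records the timestamp of the most recent "all ok" message it has parsed.

```python
from datetime import datetime, timedelta


def is_stale(last_ok, now, max_silence=timedelta(minutes=5)):
    """Return True if no "all ok" message was seen within max_silence.

    last_ok is the timestamp of the most recent "all ok" message, or None
    if none has been seen yet. The 5 minute default is an arbitrary
    placeholder, not a value from safenode.
    """
    if last_ok is None:
        return True
    return now - last_ok > max_silence
```

A sensible threshold would depend on how often the node logs the message, so it would probably be a small multiple of that interval.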

I'm really not sure what is best here, but think it is useful to have something in the dashboard that the operator can look at and instantly go, oh, that's not right.

Showing number of peers sounds good. Any other suggestions welcome as I don't spend much time analysing logs or thinking about them ATM.

I can work with what we have but wanted to see if you thought it worthwhile exposing more general state-like info.

Thanks for looking at this.

joshuef commented 1 year ago

For state: starting,connecting(ed) etc I'm envisaging it as a proxy for "things seem to be ok, or not". So losing connectivity, being stopped etc. or for anything that might reasonably happen that the operator might want to know about.

For the mo, I think the kbucket logs are a proxy there. If we lose everything / are in decline something is up. As we get to know the network, the kbucket may fluctuate a bit, but really should not be decreasing in peer count. Anyone out should be replaced as long as the network is healthy eg.

happybeing commented 1 year ago

So if I display peer count and a max peer count, maybe red on some condition.
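A sketch of one such condition, purely as a starting point: track the current and maximum peer counts and go red when the count hits zero or falls below some fraction of the maximum. The class name and the 0.5 fraction are hypothetical, and the peer counts are assumed to come from parsing the kbucket log lines.

```python
class PeerCountTracker:
    """Track current and maximum peer counts parsed from a node's log."""

    def __init__(self, drop_fraction=0.5):
        self.current = 0
        self.max_seen = 0
        # Hypothetical threshold: alarm when below this share of the max.
        self.drop_fraction = drop_fraction

    def update(self, peers):
        """Record the latest peer count."""
        self.current = peers
        self.max_seen = max(self.max_seen, peers)

    def is_alarming(self):
        """Return True if the display should turn red."""
        if self.max_seen == 0:
            # No peers seen yet, so the node is presumably still connecting.
            return False
        return (
            self.current == 0
            or self.current < self.max_seen * self.drop_fraction
        )
```

Small fluctuations stay green with this rule, and only a sustained drop relative to the best count seen so far is flagged.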

What would you suggest?

joshuef commented 9 months ago

There is a new NetworkInfo struct coming from libp2p which we now log on peer/connection changes. That's a nice summary that may be useful?