spacemeshos / SMIPS

Spacemesh Improvement Proposals
https://spacemesh.io
Creative Commons Zero v1.0 Universal

Time synchronization hardening #55

Open dshulyak opened 2 years ago

dshulyak commented 2 years ago

Overview

Spacemesh clients must have synchronized clocks in order to participate in the network. Block contextual validity depends on the time. Participating in the Hare consensus requires nodes to send messages in specific time windows.

We use NTP to detect local clock drift and require the node operator to configure the clock correctly, usually by running NTP software. Otherwise the clock will drift too far over time because of frequency error. Network Time with a Consensus on Clock, Part 4 estimates that the frequency error of a clock can reach 0.75s over 12h. A single clock can therefore be off by up to 1.5s after 24h, so two nodes drifting in opposite directions can differ by up to 3s after 24h.

Goals and motivation

NTP might be compromised and we want to be able to detect if such an event happens.

High-level design

We will compare our local clock with our peers' clocks. If the difference is within a certain bound, we assume that NTP and our peers are honest. If the clocks are too far apart, then either the NTP source or the peers are reporting incorrect time.

In such an event the operator will be notified and will have to manually check that the NTP source he is using provides accurate time.

Proposed implementation

Periodically, and immediately when the node starts, we will run a round for collecting samples. Rounds are short and timed: a round runs every 1h, and responses must be received within a 5s window.

During the round, we will collect samples from multiple peers. Outgoing neighbors will be prioritized, but if we can't establish connections with the target number of outgoing neighbors after 5m, any peer will be used.

If a round (5s) finishes without collecting the minimum number of responses, we will retry the round with exponential backoff.

At the start of the round the node saves its local clock reading as t1 and sends a request to a peer. The peer saves the arrival time of the request as t2, then sends a response containing t2 and t3, where t3 is read from the peer's local clock when the response is sent. When the response is received, we read t4 from the local clock.

Multiple timestamps are required to estimate the round trip time and remove it from the estimated offset. To compute the offset we first calculate rtt:

rtt = (t4 - t1) - (t3 - t2)
offset = t2 - t1 - rtt/2

The offsets are sorted and the median is selected. If the absolute value of the median offset is larger than 1s, an error will be logged. After 10 consecutive errors (10h) we will stop the node. The node operator is assumed to be monitoring the node. Once the NTP warning is resolved (either by changing NTP servers or by verifying that they provide accurate time), the operator will have to send an API request to the spacemesh node to reset the error counter.

Implementation plan

Network protocol

type Request struct {
    ID uint64
}

type Response struct {
    ID uint64
    ReceiveTimestamp uint64
    SendTimestamp uint64
}

API

Simple API handler for NodeService that will reset the error counter.

Questions

What if there are not enough peers?

We should require a minimum number of samples. For devnet this number will be really low, for testnet slightly higher, and even higher for mainnet.

We won't be able to complete the round if the minimum number of samples is not collected during the timesync round.

Dependencies and interactions

Stakeholders & Reviewers

@antonlerner @dshulyak

Testing and performance

Alternatives

Weighting samples by importance of the peer

In the NEM protocol each sample is weighted by the importance of the peer. Importance is computed from blockchain data.

This assumes that the blockchain is synced first, and has the downside of revealing the connection between p2p identity and miner identity.

Measure gossip delay using special role (ping responder)

Consensus participants may select a peer as a ping responder. Such peer will be selected using VRF. The problem with this approach is that we need to have synchronized times on all nodes on the network, not only consensus participants.

avive commented 2 years ago

I have some questions about the motivation and effectiveness of this smip that I hope can be clarified a bit before this goes into development:

  1. How realistic is the motivation that quorum of { Apple, Google, NIST, ntp.org Microsoft } NTP servers will be compromised for more than a short period of time, when Windows and macOS systems are configured by default to use NTP to sync computer clocks? Seems to me like basic modern Internet infrastructure that is highly maintained and likely >1B devices depend on it.

  2. Assuming quorum of the above NTP servers doesn't agree on Internet time - what exactly are we expecting the node operator to do in order to keep the node functional so we don't need to shut it down? Somehow provide the node with an Internet time measurement? Where is he supposed to get it from?

  3. I think that what we want is a fully automatic algorithm such as the one that Parity implemented in Substrate and panic nodes if that algorithm fails because it is not clear to me how an operator intervention can help in case of time fault. So the motivation to avoid reliance on a small number of centralized public Internet time servers is a good one but the solution should be to use a decentralized p2p algorithm to establish consensus on time on the network. We know these kinds of algorithms exist, and are used in production in p2p networks such as Polkadot. I recommend doing more research into their algorithm and evaluate if it is satisfactory for our quite similar requirements. If this was done then please explain why a fallback to human-intervention is needed and how will it work.

  4. Assuming the node operator is supposed to somehow find a good Internet time source and provide it to node - isn't it better to do this just by adding support to custom NTP servers in the config file instead of doing this via the api? Seems more logical to me. So, a node operator who doesn't want to trust 5 ntp time sources can set a config file with 20 Internet time servers and his node will use the average time of results from these. Seems a much simpler solution to the problem of small number of NTP servers being compromised.

My argument is as follows:

  1. NTP is a decentralized time protocol. If we keep relying on this protocol for Internet time then we can work to make it more robust to compromise of some servers at any given time. Node operator intervention should be possible by modifying the list of NTP servers to use in node config file.
  2. If we argue that we don't want to rely on the NTP protocol for Internet time then we should build our own replacement that is based on node's notion of Internet time similarly to what Parity did for Polkadot and this algorithm should be automatic. When it fails (due to exceptional conditions) the node should panic and no node intervention is needed.
dshulyak commented 2 years ago

How realistic is the motivation that quorum of { Apple, Google, NIST, ntp.org Microsoft } NTP servers will be compromised for more than a short period of time, when Windows and macOS systems are configured by default to use NTP to sync computer clocks? Seems to me like basic modern Internet infrastructure that is highly maintained and likely >1B devices depend on it.

As I understand NTP is not a secure protocol, so attacks on the client itself are possible (https://www.cs.bu.edu/~goldbe/papers/NTPattack.pdf). If they are still possible, it doesn't matter how protected the server is. I don't know what the operator is supposed to do in this case, to be honest.

Assuming quorum of the above NTP servers doesn't agree on Internet time - what exactly are we expecting the node operator to do in order to keep the node functional so we don't need to shut it down? Somehow provide the node with an Internet time measurement? Where is he supposed to get it from?

I assumed that he will have to change the set of NTP servers, if it will be obvious that he was provided with an invalid time.

I recommend doing more research into their algorithm and evaluate if it is satisfactory for our quite similar requirements. If this was done then please explain why a fallback to human-intervention is needed and how will it work.

there was a discussion on a research forum

Assuming the node operator is supposed to somehow find a good Internet time source and provide it to node - isn't it better to do this just by adding support to custom NTP servers in the config file instead of doing this via the api? Seems more logical to me. So, a node operator who doesn't want to trust 5 ntp time sources can set a config file with 20 Internet time servers and his node will use the average time of results from these. Seems a much simpler solution to the problem of small number of NTP servers being compromised.

an operator will have to add good ntp servers using a config file for NTPD (or similar software). But API call or restart of the spacemesh client is required to let us know that it was a false alarm, and actually, peers are providing us with invalid time. Or the problem is fixed and we can re-validate the clock.

avive commented 2 years ago

I think that the implementation should include having an ntp servers list in the config file just like we have for the forked node for Chinese users that @lrettig created and adding a good list of sources for testnet config files. Doing so will make it unnecessary to have this special node release for Chinese users and we can just make available a different config file for these users for testnets and mainnet that has NTP servers that work behind the great firewall.

I still don't see much sense in providing an api to reset the error proposed here because dealing with this error condition should involve the user changing the ntp server in the config file (or for advanced home users running ntpd on linux systems) and restarting the node. So I think that the remedy for the node getting out of time by the proposed new time checking logic should be to panic the node with an error message that instructs the user to modify the ntp servers in his config file, check his OS computer clock (and consider syncing it with an OS provider time server) and start it again.

dshulyak commented 2 years ago

We allow to set NTP servers in our config file. But they are not used to adjust the time. We are using them to check that the time from these servers is in sync with system time (probably adjusted by NTPD). Do you suggest to adjust time ourselves, without using NTPD and such?

What if the error is not because of the NTP servers, but because of the malicious peers? In such case node will have to be restarted with an option to disable this sanity check using peers. And unless we want to disable it permanently it would be better to use API for this kind of stuff.

avive commented 2 years ago

We allow to set NTP servers in our config file. But they are not used to adjust the time. We are using them to check that the time from these servers is in sync with system time (probably adjusted by NTPD). Do you suggest to adjust time ourselves, without using NTPD and such?

I think we should recommend to all node users to sync their OS clock with their OS provider time server and this will ensure vast majority of nodes will have good Internet time. This check is periodic and should adjust the time automatically for vast majority of users. I need to double check but I suspect that Internet time sync in macOS and Windows 10 is on by default.

What if the error is not because of the NTP servers, but because of the malicious peers? In such case node will have to be restarted with an option to disable this sanity check using peers. And unless we want to disable it permanently it would be better to use API for this kind of stuff.

So how does it look from the user-perspective? node has identified a time issue due to bad peers and we are asking the user to click a button in the client to trigger an api call? Can you please clarify so we can conceptualize this? What is the message that is displayed to the user? What should be the button label?

dshulyak commented 2 years ago

So how does it look from the user-perspective? node has identified a time issue due to bad peers and we are asking the user to click a button in the client to trigger an api call? Can you please clarify so we can conceptualize this? What is the message that is displayed to the user? What should be the button label?

yes, if we are talking about UI it would be a pop-up with a warning: "time is not in sync with peers. please check your NTP settings/time provider". I actually don't know what would be the label for the button, maybe "Resolved", meaning that it should be clicked that the issue was resolved by removing invalid servers or ignoring the issue if the operator is certain that the time on his end is correct.

avive commented 2 years ago

yes, if we are talking about UI it would be a pop-up with a warning: "time is not in sync with peers. please check your NTP settings/time provider". I actually don't know what would be the label for the button, maybe "Resolved", meaning that it should be clicked that the issue was resolved by removing invalid servers or ignoring the issue if the operator is certain that the time on his end is correct.

So to clarify, are we expecting the user to sync his system clock with Internet time? If that's the case then it is okay but if we expect the user to change the NTP servers in the node's config file then it is a different thing - we need to provide ui for that in Smapp because smapp users are not expected to be able to read/write json. We need to add NTP servers configuration support to Smapp settings to properly support this.

Also, how will this condition be reported by the API? We have a pattern for node errors streams that stateful clients are supposed to subscribe to. So, part of the implementation should be to add this error as a node error to the stream with information that will allow special handling in the client. e.g. via agreed upon error status codes. I don't think that another api end-point is needed just for this error case.

dshulyak commented 2 years ago

So to clarify, are we expecting the user to sync his system clock with Internet time? If that's the case then it is okay but if we expect the user to change the NTP servers in the node's config file then it is a different thing - we need to provide ui for that in Smapp because smapp users are not expected to be able to read/write json. We need to add NTP servers configuration support to Smapp settings to properly support this.

I don't think that we need to change the config for SMAPP. The user is responsible for configuring the system clock (this is the case even now), he can do it in any way that works for him (e.g. Internet time, ntpd, other daemon that uses NTP).

Also, how will this condition be reported by the API? We have a pattern for node errors streams that stateful clients are supposed to subscribe to. So, part of the implementation should be to add this error as a node error to the stream with information that will allow special handling in the client. e.g. via agreed upon error status codes. I don't think that another api end-point is needed just for this error case.

we don't need an additional API to report the error. But we need another API call, because we don't know where the problem was (with the user's system clock, or with the peers' clocks). So after the user resolves the problem this API call must be triggered, and we will take appropriate action (e.g. blacklisting peers that report invalid time).

avive commented 2 years ago

I don't think that we need to change the config for SMAPP. The user is responsible for configuring the system clock (this is the case even now), he can do it in any way that works for him (e.g. Internet time, ntpd, other daemon that uses NTP).

However, he may want to change the NTP servers config so we need to expose it in the Smapp settings anyhow.

we don't need an additional API to report the error. But we need another API call, because we don't know where the problem was (with the user's system clock, or with the peers' clocks). So after the user resolves the problem this API call must be triggered, and we will take appropriate action (e.g. blacklisting peers that report invalid time).

Ok, thanks for the clarification, this makes sense.

tal-m commented 2 years ago

The system clock is problematic, because it also requires the system timezone to be correctly set. I would recommend we support NTP directly in the node, with the system clock as a fallback if the NTP servers can't be reached.

Do Chinese users have a problem resolving pool.ntp.org? This service provides addresses that should be resolved dynamically based on geolocation and give a nearby NTP server. In any case, I think we can support most users using a small list of NTP servers (which users would not need to change), such that a few of them are supposed to be reachable from any location. The node can ping all of them in parallel and use the median of returned results.

To make things less likely to fail for the users due to network reachability , we could also use HTTP time as a source; i.e., have a list of http servers that we query and use the server time returned in the header. This can be done over https to popular servers (e.g., google.com, microsoft.com, etc.), so will likely not be blocked.

HTTP has much lower precision than NTP, but we don't actually need 1s precision: we can allow much larger time differences with peers as long as the time difference+network delay is less than our \delta parameter. These are parameters for which we really want empirical data in order to set them. For example, if it turns out that the http time offset is almost always under 5 seconds, and the network delay is under 30, we can set \delta to 35 and allow up to 5 seconds disagreement with peers until we pause the node and ask for user assistance.

Regarding what we ask the user to do: I don't think we want to expose the user to NTP settings. Instead, I propose that we just ask the user something like "Is your timezone XXX?" and "Please fill in the local time __:__:__ and press the button OK when the time is exactly correct". (the message should make it clear that we care about seconds, not just hour/minute.)

Once the user sets the time, we can drop incorrect peers and/or time servers. We can then record the offset from the system time, and mark the system time as a trusted source (until the next boot, system time update or X time has elapsed, after which we revert to our default behavior).

avive commented 2 years ago

Regarding what we ask the user to do: I don't think we want to expose the user to NTP settings. Instead, I propose that we just ask the user something like "Is your timezone XXX?" and "Please fill in the local time __:__:__ and press the button OK when the time is exactly correct". (the message should make it clear that we care about seconds, not just hour/minute.)

I find it hard to understand it.

tal-m commented 2 years ago

It is clear that the time-zone is available from the OS and is correct unless the user is doing advanced weird thing on his pc. I've never seen any desktop app that asks for your timezone. It is an OS level settings for desktop users.

First, desktops have their timezones wrong in many cases -- automatic installs might set a wrong default timezone, users might move and forget to update the timezone etc. What's more, different operating systems store the system time differently --- in some cases the system time is local time, in some cases it's UTC. If we want the user to confirm the time, we have to confirm the timezone as well.

Note that this dialog only occurs when something very weird is happening; conditioned on our time being different than our peers, I think it's more likely that the user has an incorrect time configuration than that they are being attacked.

Are you serious about asking the user for local time where he supposes to click 'ok' quickly? I'm afraid this is not a good user experience and not really needed. It is much easier and better to tell the user to sync his OS time using one checkbox in the OS date-time settings (it is pretty much the same across desktop linux, macOS and Windows 10) and we pick the time from OS after he's done. it is a very reasonable requirement to ask users to sync their OS clock if we detect it is out of sync.

The exact time is necessary because we need sub-minute precision; if network time sources aren't available, this is the only option left. I think it is reasonable to fall back to the system time if NTP doesn't match peers, but the system time does. In this case we might not need to pause the node, just give a warning message.

Regarding NTP settings - we will need to do this to support Chinese users. I never liked the idea of having a special build for behind-the-firewall users although it is good we are making an effort to support them. The advanced settings should allow the user to add NTP servers and remove existing ones - it is basically an editor for non-technical users for the NTP servers section of the config file. Once we have this we can ship only 1 global release of smapp and instruct people behind the firewall to modify ntp servers in smapp. In the future, we can support a China-specific config file in the discovery service instead of having 1 global config file (what we have now). Once we do that, there will also be no need for Chinese users to edit NTP servers in settings - they just need to tell Smapp to change the config file to the behind-the-firewall one - a one-click operation.

I agree that we don't want users to manually have to specify NTP servers. But I don't think we need different config files for different areas (for this purpose, at least). Why can't we just include in the list of NTP servers ones that work in China? The nodes in China will not get responses from servers outside, but will have enough responses from accessible servers to work, while the nodes outside may or may not get responses from the China-accessible servers, but it doesn't really matter.

Also, the http fall-back I suggested should work from anywhere, and my guess is that it will have high enough precision for our purposes.

avive commented 2 years ago

First, desktops have their timezones wrong in many cases -- automatic installs might set a wrong default timezone, users might move and forget to update the timezone etc. What's more, different operating systems store the system time differently --- in some cases the system time is local time, in some cases it's UTC. If we want the user to confirm the time, we have to confirm the timezone as well.

I disagree with this argument. What evidence do you have to support it? I think it is quite reasonable to assume that 90% of the users have their OS clock and time zone synced with the OS provider time server. The point is that this is a classic operating system setting and not an individual desktop app setting. My evidence: >80% of users never change the defaults and I believe the default for both Windows and Mac is to get time settings from the OS maker's Internet time server. I will double check.

Note that this dialog only occurs when something very weird is happening; conditioned on our time being different than our peers, I think it's more likely that the user has an incorrect time configuration than that they are being attacked.

Fine, so the instructions should be to fix the OS time (it is an OS settings) and not input time manually into Smapp (a desktop app). On Windows, if user tries to do this and fails then the OS gives an error message. In other words, it is designed to deal with it already.

The exact time is necessary because we need sub-minute precision; if network time sources aren't available, this is the only option left. I think it is reasonable to fall back to the system time if NTP doesn't match peers, but the system time does. In this case we might not need to pause the node, just give a warning message.

What is the probability of time.apple.com and time.microsoft.com not available as a resource for a computer that has an Internet connection more than few hours in a year? I argue it is negligible. I bet their global uptime over the last few years is very high. The point is that system time should match ntp time - it is synced via NTP by the OS. In addition, both macOS and Windows support syncing time and time zone from multiple time servers - not just from one server. Pretty sure ubuntu supports this as well but only checked on these two OSes.

To summarize, what I argue for in terms of a reasonable and satisfactory user experience for smapp when the node detects that the system clock may be off is to display a modal dialog box and prompt the user to sync their computer clock via the OS date-time settings and to click OK when done, but not to prompt for manual time input in the app. When they click OK we call the new api method suggested in this SMIP. On the node level, the node should report this error via the NodeService::ErrorStream so clients such as smapp or other clients can choose how to alert the user about it.

tal-m commented 2 years ago

I disagree with this argument. What evidence do you have to support it? I think it is quite reasonable to assume that 90% of the users have their OS clock and time zone synced with OS provider time server

You're looking at it the wrong way. These mechanisms only come into effect when something goes wrong. So we don't care about the 90% (or even 99%) of the time when things are ok. For example, suppose you find a hiker on a hilltop in the wilderness who appears to have been electrocuted. What is the probability that the hiker was struck by lightning? You don't care about the fact that 99.999% of hikers don't get struck by lightning, because 99.99% of hikers also don't end up electrocuted in the wilderness. Conditioned on being electrocuted in the wilderness, the probability that the hiker was struck by lightning might be quite high.

Getting back to time servers, if we look only at the probability of "honest mistakes", it's probably true that the major internet time servers have a very high uptime. But if we're thinking of adversarial attack, this might not be the case --- the protocols for time syncing are not hardened against malicious attacks: it might be that taking time.microsoft.com down globally is hard, but it's probably much easier to take over an ISP and hijack connections from computers that connect through it.

The problem with just asking the user to sync the computer clock and tell smapp when they've done so is that users typically don't verify that their clock is synced with high accuracy. Consider an attack of the type described above, in which some ISP hijacks time server connections, and changes the time by only 30 seconds. In this case, a user who syncs the time probably won't notice that their time is incorrect (the OS often doesn't even display the seconds). If we ask the user to sync time and press "OK", we can't trust that the system time is accurate enough to decide which of our peers are sending us incorrect time values.

Having said that, if you can suggest a nicer UX that would still ensure we can trust the user-accepted time with 1-2s accuracy, that would be great (in any case it's probably a good idea to do some user testing for this type of UX).

Finally, I don't think we should limit our design to desktop OSes that are in constant use --- I think it's quite likely that people will use old computers to run a node, or run it on headless servers; these are even more likely to have an incorrectly set system timezone. These types of nodes might not even be running smapp, but they should still be running the same node, with the same guarantees for time synchronization integrity. (This includes large-scale miners, who almost certainly will be running in headless mode -- and who could have a much worse effect on the network than individual small miners if their time synchronization is successfully hacked)

tal-m commented 2 years ago

@avive: The whole point of this SMIP is dealing with edge cases. But I think it is a good idea to separate the UX/UI issues from the node operation and API.

Node operation

I think the following is reasonable (basically what @dshulyak proposed, with some extra fallbacks):

  1. Query the median time of peers as the "sanity check" against which our actual clock is measured.
  2. Query NTP servers (from a fixed list that includes at least a few servers that should be available from every geographic location). Verify median time returned against median of peers. If they match --- we're done, we have our time. Otherwise,
  3. Query HTTPS servers (again from a fixed list) and use the Date header returned by the server to estimate server time. If the median matches the median peer time, we're done. Otherwise,
  4. Query the local system time (converting to UTC, of course). If it matches the median time of peers, we're done. Otherwise,
  5. Pause any time-sensitive node operations and notify the user that there is a problem (either there's an attack on the node's time protocol, or a majority of its peers are malicious).
    • Case 4(a): If user does not respond, continue to occasionally repeat these operations (with node still paused); if time sync check is successful then resume normal operations.
    • Case 4(b): If user responds and gives current time manually, set this as the correct time, and treat all peers with different times as malicious. In this case we can mark the offset from the system clock and use this as a "trusted" time source for a period of time.

We could also always run (1), (2) and (3) in parallel, and complain if there are inconsistencies (without pausing operations). We can also use just (1) and (3).

Node API

The node should generate notification messages if it needs to use any of the fallbacks (the UI may choose not to display them prominently to user, as long as we didn't reach step (4) ).

The pause notification (we've reached step (4)) is urgent, so might justify a separate API callback. For case 4(b) the node could support an API call that manually sets the node time (and sets this time to be trusted). This API could also specify how long the manual time value should remain trusted (e.g., after x hours, the standard time sync check mechanism starts working again).

UI Questions

The most difficult problem we need to solve is that if the external time (NTP/HTTPS/system) doesn't match the median peer time, then all we know is that one of the following two bad events occurred:

a. The external time source is bad or b. A majority of the peers are bad.

Unfortunately, we can't distinguish these two cases. So we can't just tell the user to "fix their OS time" because in case (b) their OS time is fine (and the correct action is to drop the malicious peers).

As I see it, our choices in terms of UI are: (1) give up on trying to figure out which is the case, and just tell the user "we're paused until it sorts itself", or (2) we can allow the user to set/verify the correct external time manually (in which case we can take the appropriate action.)

Your point about mobile phones not showing seconds is a good one! I hadn't thought of that. So perhaps the initial UI can just use option (1).

dshulyak commented 2 years ago

Query NTP servers (from a fixed list that includes at least a few servers that should be available from every geographic location). Verify median time returned against median of peers. If they match --- we're done, we have our time. Otherwise,

I would rather remove this from spacemesh client. Every OS comes with an NTP daemon that can sync time reliably. Other popular blockchain clients (bitcoin, ethereum, zcash) require time to be weakly synchronized, at this point any operator must be familiar with this requirement. One google search is sufficient to understand how to deal with time synchronization on any OS. Also, on some OS there are options to run NTS clients (authenticated extension over NTP) that will provide additional security. Managing NTP time configuration on our own feels like a significant step back in terms of UX.

If we will remove it, users behind firewall won't have to adjust the list of NTP servers that spacemesh is using. I am pretty sure that in China they have servers inside the country that provide the correct time, and users already have them in system config.

We won't be able to notice that the time is incorrect if both system NTP and peers give us the same, but invalid, time. This should be possible only if we are using NTP without NTS, are eclipsed, and the adversary controls the ISP. But checking some (maybe the same) NTP servers ourselves will do nothing in this case.

As I see it, our choices in terms of UI are: (1) give up on trying to figure out which is the case, and just tell the user "we're paused until it sorts itself", or (2) we can allow the user to set/verify the correct external time manually (in which case we can take the appropriate action.)

I think we can trust user to verify that they have a correct local clock. It can be done using any popular website such as time.is. If it matches the local clock, they will hit a button (or send an API request) and it will notify spacemesh client that the issue is resolved on their end. If it doesn't match they will have to install or update the configuration for NTP daemon, and then hit a button.

The problem with just asking the user to sync the computer clock and tell smapp when they've done so is that users typically don't verify that their clock is synced with high accuracy. Consider an attack of the type described above, in which some ISP hijacks time server connections, and changes the time by only 30 seconds. In this case, a user who syncs the time probably won't notice that their time is incorrect (the OS often doesn't even display the seconds). If we ask the user to sync time and press "OK", we can't trust that the system time is accurate enough to decide which of our peers are sending us incorrect time values.

But we can force them to verify it, using a warning or the crash of the node as the last resort. It is not reasonable to expect that they will ignore everything; after all, if the user isn't interested, why would they try to set up spacemesh client in the first place? OS indeed doesn't show time with seconds, but for a technically savvy user it is not a problem to use a command-line tool such as date. For smapp users maybe we can print time in the corner of the app.

antonlerner commented 2 years ago

I would rather remove this from spacemesh client. Every OS comes with an NTP daemon that can sync time reliably. Other popular blockchain clients (bitcoin, ethereum, zcash) require time to be weakly synchronized, at this point any operator must be familiar with this requirement. One google search is sufficient to understand how to deal with time synchronization on any OS. Also, on some OS there are options to run NTS clients (authenticated extension over NTP) that will provide additional security. Managing NTP time configuration on our own feels like a significant step back in terms of UX.

I agree with Dmitry on this one. I see no point in querying NTP servers ourselves when all modern OSs already do it.

I do think we can provide relevant error messages when we discover time is different and offer some mitigations for example, if we notice system clock is different than node clocks for a while we can put out an error message saying that the node is now paused from participating in consensus until time is synced. we can show system time and node time so that the user will see the difference and offer some remediations such as:

  1. Show user how to adjust system time on his os
  2. allow user to purge peers list and acquire a new one from bootstrap

tal-m commented 2 years ago

I would rather remove this from spacemesh client. Every OS comes with an NTP daemon that can sync time reliably. Other popular blockchain clients (bitcoin, ethereum, zcash) require time to be weakly synchronized, at this point any operator must be familiar with this requirement. One google search is sufficient to understand how to deal with time synchronization on any OS. Also, on some OS there are options to run NTS clients (authenticated extension over NTP) that will provide additional security. Managing NTP time configuration on our own feels like a significant step back in terms of UX.

If we will remove it, users behind firewall won't have to adjust the list of NTP servers that spacemesh is using. I am pretty sure that in China they have servers inside the country that provide the correct time, and users already have them in system config.

First of all, I think this is a go-spacemesh issue, not a client issue. I don't see why the NTP config has to appear in the UX at all. (E.g., we can include both Chinese and non-Chinese NTP servers in the config, and the node will try all of them and use the ones that respond, with HTTPS and/or system time fallbacks as I suggested above). Most users won't know or care which mechanism is being used.

Using our own NTP client in go-spacemesh is very lightweight and would reduce the effect of OS-specific idiosyncrasies on our core code. I also think we can't trust the OS timezone to be correctly configured --- my guess is that anything that requires user intervention will be incorrectly set in at least 10% of the cases. (like 72% of statistics, of course, this number was pulled out of a hat, but I stand by the qualitative assessment). I'm also not sure that we're considering all the platforms on which people might want to run the node --- while modern desktop OSs usually do use network time servers, it makes sense for people to run a node on an old computer, or (in the future) even on something like a router. In these cases it's more likely that system time would be incorrect.

I do think we can provide relevant error messages when we discover time is different and offer some mitigations for example, if we notice system clock is different than node clocks for a while we can put out an error message saying that the node is now paused from participating in consensus until time is synced. we can show system time and node time so ...

Nice idea on the UX front! If we show both peer node and UX times with seconds, if there's a 30-second difference it will be clear to users that they need to check a second-accurate external time source.

avive commented 2 years ago

It is clear to me from the conversation on this SMIP and different ideas proposed in comments that we need to hash them out in order to get to a spec that is acceptable to all stakeholders and that also considers the UX implications of this feature. Technically, this SMIP should be WIP and only be numbered and be ready for development once these issues have been sorted out and should not go to development before that. @antonlerner

dshulyak commented 2 years ago

When we were discussing this problem with @antonlerner the goal was to remove reliance on centralized servers from the go-spacemesh while preserving the same guarantees. Using two clock sources (system clock and peers clock) accomplishes this goal.

As I see it, the only point of contention is whether we should use a system clock or query NTP ourselves. The argument against using system time is:

For using system time:

First of all, I think this is a go-spacemesh issue, not a client issue. I don't see why the NTP config has to appear in the UX at all. (E.g., we can include both Chinese and non-Chinese NTP servers in the config, and the node will try all of them and use the ones that respond, with HTTPS and/or system time fallbacks as I suggested above). Most users won't know or care which mechanism is being used.

i think the only way not to expose users to this problem is to have a trustless time synchronization protocol. with trust assumptions, we will have to rely on the user when the signals from sources are different. We will have to provide some UX for NTP configurations if we can't trust NTP. If we can - what's the point of multiple sources?

dshulyak commented 2 years ago

I guess if we agree that using (non-authenticated) centralized servers is not an issue then we can always use time from NTP servers, and fallback to system time/google HTTP server time only if all NTP servers are unreachable. This seems to be a weaker guarantee compared to collecting time from peers, as in this case adversary will be able to control node time by MITM on a local router for example.

antonlerner commented 2 years ago

It is clear to me from the conversation on this SMIP and different ideas proposed in comments that we need to hash them out in order to get to a spec that is acceptable to all stakeholders and that also considers the UX implications of this feature. Technically, this SMIP should be WIP and only be numbered and be ready for development once these issues have been sorted out and should not go to development before that. @antonlerner

To my understanding, we always wanted to not rely on a single centralised source of time, this requirement has been brought up by research many times before and users have also asked about our decision to use centralised NTP and the security effect this might have on the nodes. We've had plans to replace NTP for a couple of years now with full support from research team. I agree we need to further think about how to display the different errors we might have in time sync, and this can be discussed further but in my opinion should not stop the development of the decentralised time protocol. What do you think? @avive @tal-m

lrettig commented 2 years ago

Bitcoin has a simple "network-adjusted time" protocol that works like this:

"Network-adjusted time" is the median of the timestamps returned by all nodes connected to you... Whenever a node connects to another node, it gets a UTC timestamp from it, and stores its offset from node-local UTC. The network-adjusted time is then the node-local UTC plus the median offset from all connected nodes. Network time is never adjusted more than 70 minutes from local system time, however. (Block timestamp)

What's wrong with doing the same thing, for now?

dshulyak commented 2 years ago

@lrettig we don't want to trust peers. bitcoin actually has a bug, where it will stop adjusting time when it collected 199 samples, and developers think that it is actually better not to remove that bug, because it will be very easy to manipulate the protocol. https://github.com/bitcoin/bitcoin/issues/4521

the proposed solution is simple (use two sources, and notify a user if they don't match). so it is not about the technical simplicity. it is about the UX, when time from NTP doesn't match the time from peers (or NTP is not available). this will probably be a very rare problem and not worth overthinking.

let's keep NTP in client and just allow user to disable it if it doesn't work (firewall, better clock, centralized servers issues, or other unknowns). so if user prefers not to mess with NTP re-configuration he won't have to. at the same time for majority of users it will never be an issue.

lrettig commented 2 years ago

we don't want to trust peers

we already rely upon an honest-majority assumption in many, many places throughout the protocol. if we want to relax this assumption, we have much bigger issues to deal with than just time sync. I think it's fine to rely on majority of peers to report time honestly (e.g., taking median as Bitcoin does).

this will probably be a very rare problem and not worth overthinking.

agree, this is my point too - let's not spend too much time on this right now

I think the simplest path forward here is to keep doing what we're doing, i.e., rely upon system time and ask user to keep NTP sync turned on at the OS level, but perform the NTP time sync check more robustly than we currently do: check more NTP servers, and do the check many times (using exponential backoff) before failing.

dshulyak commented 2 years ago

we already rely upon an honest-majority assumption in many, many places throughout the protocol. if we want to relax this assumption, we have much bigger issues to deal with than just time sync.

i don't think that any protocol relies on a majority of peers anywhere, right? there is a difference between relying on majority of peers and majority of miners (smeshers). peers are cheap and you can't assume anything about who you are connected to.

agree, this is my point too - let's not spend too much time on this right now

it is not necessarily only about now. it is more about what is the appropriate solution in the long term. once it is known it can be implemented when there is a time for it.

doing both NTP check ourselves and asking user to configure a system NTP makes the least sense to me tbh. we don't gain much from doing this. and we already can use peers as a sanity check for system time.

lrettig commented 2 years ago

i don't think that any protocol relies on a majority of peers anywhere, right? there is a difference between relying on majority of peers and majority of miners (smeshers). peers are cheap and you can't assume anything about who you are connected to.

With a large enough set of peers to choose among (i.e., a very large number of synced, discoverable nodes), and random (or near-enough random) selection among them, I think mathematically this amounts to the same thing: if a majority of all nodes are honest, then with very high probability we are guaranteed to end up with a majority-honest set of peers. @tal-m can confirm.

doing both NTP check ourselves and asking user to configure a system NTP makes the least sense to me tbh. we don't gain much from doing this. and we already can use peers as a sanity check for system time.

So then let's continue to rely on system time (which should be synced using NTP) and use peers as a sanity check as you propose.

avive commented 2 years ago

I'd like to clarify the UX.

When node identifies a time problem (by using the new mechanism and peers) it should notify clients via our standard ErrorStream with a specific error status code so clients may handle this error.

Smapp Handling

When Smapp receives the error (identified by unique error code) it should present a modal dialog box (so users can't miss it) advising the user to check his OS clock and not just display it as a warning in the network screen - as user may not actively access this screen frequently.

The modal screen will include a link to a page in the smapp guide (currently testnet guide) where we recommend how to fix common NTP problems.

Most users don't know that but in both Windows and macOS they can add multiple time servers instead of one to improve resiliency.

We will also recommend to check vs https://time.is/ - I wasn't familiar with this nice service until @dshulyak mentioned it above.

Another recommendation we'll display is for the user to compare his OS time against his mobile phone time. It is almost certain that the user has a mobile phone and that his mobile phone is getting its time from the GPS cell tower time and not his home ISP. As a side note, it is a bit hard to make iOS display seconds (I guess it wastes battery when the display needs to update every second) but possible - we will also explain how to do that.

The page will also display a list of common NTP servers including NTP servers that should work behind the firewall for Chinese users.

Other Clients Handling

When user is running the node directly as a separate process, he's responsible to monitor its operation with an API client and observe errors received via the API. In addition and as an example, we will add a feature to smrepl to subscribe to error stream and display errors received from node on this stream. In smrepl, when this error is received we will recommend users to read the info on the help page on the site by including a URL in the error display.

peterbourgon commented 2 years ago

Spacemesh clients must have synchronized clocks in order to participate in the network.

As far as I am aware, it is literally impossible for nodes in a distributed system to have synchronized clocks, if the system should also maintain liveness and/or consistency guarantees. This is the reason Vector clocks et al. were invented! How can your system ensure this invariant is upheld?