nuagenetworks / nuage-metroae

Nuage Networks Metro Automation Engine
http://devops.nuagenetworks.net
Apache License 2.0

VSD NTP sync timeout mechanism is unreliable #373

Closed jbemmel closed 6 years ago

jbemmel commented 7 years ago

I am unable to use Metro for a customer deployment because, in this environment, NTP sync takes longer than Metro is willing to wait. Installation of the VSDs fails, and the whole thing falls apart.

Instead of giving up after a fixed number of retries, the script should wait indefinitely (or at least for a very long time) before exiting. 4 retries with a timeout of 5 seconds each is not enough.

ghost commented 7 years ago

@jbemmel what is very, very long? 5 minutes? An hour?

aenertia commented 7 years ago

Can you solve this by adding local clock fudge lines to the default deployed ntp.conf, and adding a tlsdate script to fetch the time over HTTPS initially?

I.e.:

server 127.127.1.0
fudge 127.127.1.0 stratum 10

Tlsdate is on GitHub. I use it in heavily firewalled customer environments to get an accurate time source from any reachable HTTPS server.
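For illustration, here is a minimal Python sketch of how the two local clock lines could be added to an ntp.conf text and taken out again later. It is not part of Metro: the function names, the marker comment, and the `ntp.example.com` server in the usage lines are all hypothetical, and only the `server`/`fudge` lines come from this thread.

```python
# The two lines suggested in this thread (local clock as a stratum-10 fallback).
LOCAL_CLOCK_LINES = [
    "server 127.127.1.0",
    "fudge 127.127.1.0 stratum 10",
]

# Hypothetical marker comment, used only to make the edit idempotent.
MARKER = "# temporary local clock fallback"


def add_local_clock(conf_text: str) -> str:
    """Append the fallback lines once; return the text unchanged if present."""
    if MARKER in conf_text:
        return conf_text
    if conf_text and not conf_text.endswith("\n"):
        conf_text += "\n"
    return conf_text + "\n".join([MARKER, *LOCAL_CLOCK_LINES]) + "\n"


def remove_local_clock(conf_text: str) -> str:
    """Drop the marker and fallback lines, keeping every other line."""
    drop = {MARKER, *LOCAL_CLOCK_LINES}
    kept = [line for line in conf_text.splitlines() if line.strip() not in drop]
    return "\n".join(kept) + ("\n" if kept else "")


# Usage with a placeholder upstream server (an assumption, not a real config).
original = "server ntp.example.com iburst\n"
patched = add_local_clock(original)
restored = remove_local_clock(patched)
```

Keeping the removal as its own function means the fallback can be stripped after the install, so it does not stay in the deployed configuration.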


jbemmel commented 7 years ago

Yes, the fudge entries would "solve" it, but we'd have to remove them once the install completes. I thought about that, but didn't have the time (or an appropriate setup) to modify the Ansible scripts. It would also hide any issues with the NTP servers. It is better to detect those during install, so I'm not a fan of completely bypassing the check.

I have now changed my container to always copy the Ansible sources to a persistent subdir, so that expert users can modify them as needed.

I think simply changing 4 retries to 10 might do the trick, though I haven't had a chance to test what would work. NTP does sync eventually, just not within 20 seconds.
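To make the retry arithmetic concrete, here is a rough Python sketch of a bounded wait. It is not the Ansible code Metro uses; `wait_for_sync`, `make_fake_check`, and the "syncs on the 7th poll" scenario are assumptions for illustration. The wait is bounded at roughly retries × delay seconds, so about 4 × 5 = 20 seconds today and about 10 × 5 = 50 seconds with the proposed change.

```python
import time
from typing import Callable


def wait_for_sync(
    check: Callable[[], bool],
    retries: int = 10,
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll check() up to `retries` times, pausing `delay` seconds between polls."""
    for attempt in range(1, retries + 1):
        if check():
            return True
        if attempt < retries:
            sleep(delay)  # no need to sleep after the final failed poll
    return False


def make_fake_check(succeed_on: int) -> Callable[[], bool]:
    """Hypothetical stand-in for a real sync check: True from the Nth call on."""
    calls = {"n": 0}

    def check() -> bool:
        calls["n"] += 1
        return calls["n"] >= succeed_on

    return check


def no_sleep(_seconds: float) -> None:
    """Skip real sleeping so the example runs instantly."""


# A slow environment where sync is only reached on the 7th poll.
gave_up = wait_for_sync(make_fake_check(7), retries=4, sleep=no_sleep)
synced = wait_for_sync(make_fake_check(7), retries=10, sleep=no_sleep)
```

In this made-up scenario the 4-retry wait gives up while the 10-retry wait succeeds. Unlike the fudge approach, the check still fails if NTP never syncs.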


jbemmel commented 7 years ago

PR https://github.com/nuagenetworks/nuage-metro/pull/375

We found that enabling ntpdate on the VSD was needed in this environment to make NTP sync after a VSD reboot. The patch also increases the retries from 4 to 10.
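For anyone who wants to check the sync state by hand after a reboot, here is a small Python sketch. It assumes the check is based on `ntpq -p` output, where a leading `*` marks the peer ntpd has selected to sync to. How Metro itself detects sync is not shown in this thread, and the sample outputs below are made up for illustration, not captured from this environment.

```python
def has_sync_peer(ntpq_output: str) -> bool:
    """Return True if any peer line carries the '*' (selected peer) tally code."""
    return any(line.startswith("*") for line in ntpq_output.splitlines())


# Illustrative `ntpq -p` style output: one selected peer, one candidate.
SYNCED_SAMPLE = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*ntp1.example.c 192.0.2.10       2 u   33   64  377    0.512    0.120   0.034
+ntp2.example.c 192.0.2.11       2 u   30   64  377    0.640   -0.210   0.051
"""

# Illustrative output shortly after a reboot: peers listed, none selected yet.
UNSYNCED_SAMPLE = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 ntp1.example.c .INIT.          16 u    -   64    0    0.000    0.000   0.000
 ntp2.example.c .INIT.          16 u    -   64    0    0.000    0.000   0.000
"""

synced_now = has_sync_peer(SYNCED_SAMPLE)
synced_after_reboot = has_sync_peer(UNSYNCED_SAMPLE)
```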
