Production setup -Devices show up on main page, but accessing them shows a grey page and cannot perform any action on it.

sj2208 commented 8 years ago

@sorccu

I have deployed STF in production mode on CoreOS. After starting all the services, all devices show up and are accessible. After 24 hours or so, only a few of the devices still show on screen and the rest go into a disconnected state (although the devices are still connected to the same machines). When I access a device in this state it shows a grey screen, and performing any action (for example, taking a screenshot) throws an error. I have attached screenshots.

When I restart the services, all devices show up again and are accessible. Can you please guide me on where the issue might be?

sorccu commented 8 years ago

Looks like you've changed the STF logo (visible before you edited), meaning that you've modified the source code. Could be an issue you've created yourself.

I would suspect incorrect configuration.

sj2208 commented 8 years ago

The only change was the logo, made to check that I could deploy changes with Docker. When the units start, all devices show up correctly. But after some time they go into some kind of zombie state.

vbanthia-zz commented 8 years ago

Most probably it's related to a TCP connection timeout. It depends on how your local network is configured.

One solution is to set the ZMQ_TCP_KEEPALIVE and ZMQ_TCP_KEEPALIVE_IDLE environment variables on your providers.

sj2208 commented 8 years ago

@vbanthia - Do I have to pass them to the providers only? Does the unit file below look okay?

[Unit]
Description=STF app
After=rethinkdb-proxy-28015.service
BindsTo=rethinkdb-proxy-28015.service

[Service]
EnvironmentFile=/etc/environment
TimeoutStartSec=0
Restart=always
ExecStartPre=/usr/bin/docker pull openstf/stf:latest
ExecStartPre=-/usr/bin/docker kill %p-%i
ExecStartPre=-/usr/bin/docker rm %p-%i
ExecStart=/usr/bin/docker run --rm \
  --name %p-%i \
  --link rethinkdb-proxy-28015:rethinkdb \
  -e "SECRET=YOUR_SESSION_SECRET_HERE" \
  -e "ZMQ_TCP_KEEPALIVE =1" \
  -e "ZMQ_TCP_KEEPALIVE_IDLE =30000" \
  -p %i:3000 \
  openstf/stf:latest \
  stf app --port 3000 \
  --auth-url https://stf.example.org/auth/mock/ \
  --websocket-url https://stf.example.org/
ExecStop=-/usr/bin/docker stop -t 10 %p-%i

vbanthia-zz commented 8 years ago

If you are running all the other STF microservices (stf-app, stf-auth, ...) on the same machine, then you only need to add these variables to the provider. If you are using CoreOS + Fleet, they might be running on different machines, in which case you will also have to add these variables to the other services that use ZMQ sockets.

Basically you will have to make sure that all the TCP connections between your servers do not die.
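To illustrate what "keeping the connections alive" means at the socket level, here is a minimal Python sketch using only the standard library. It assumes that ZMQ_TCP_KEEPALIVE and ZMQ_TCP_KEEPALIVE_IDLE end up as the SO_KEEPALIVE and TCP_KEEPIDLE socket options; the helper name `enable_keepalive` and the 300-second value are illustrative only and are not taken from STF.

```python
import socket


def enable_keepalive(sock, idle_seconds):
    """Turn on TCP keepalive and, where supported, set the idle time before probes."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # TCP_KEEPIDLE is exposed on Linux; other platforms may name it differently,
    # so it is only set when the constant is available.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)
    return sock


# 300 is an illustrative idle time in seconds, not a recommendation.
sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM), 300)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
print("keepalive enabled:", keepalive_on)
sock.close()
```

With keepalive on, the OS sends probes on an otherwise idle connection, which is what stops intermediate network devices from silently dropping it.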

Unit file looks okay to me.

vbanthia-zz commented 8 years ago

Looks like you have added an extra space in -e "ZMQ_TCP_KEEPALIVE =1". It might not work because of that space. Change -e "ZMQ_TCP_KEEPALIVE =1" to -e "ZMQ_TCP_KEEPALIVE=1".

Also, always put source code in a markdown code block for better readability.
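A small Python sketch of why the stray space can matter. It assumes the KEY=VALUE string is split on the first "=" without trimming whitespace; that is an assumption about how the flag is handled, not something verified against Docker, and `parse_env_flag` is a hypothetical helper.

```python
def parse_env_flag(flag):
    """Split a KEY=VALUE string on the first '=' without trimming whitespace."""
    key, _, value = flag.partition("=")
    return key, value


flags = ["ZMQ_TCP_KEEPALIVE =1", "ZMQ_TCP_KEEPALIVE_IDLE=30000"]
env = dict(parse_env_flag(flag) for flag in flags)

# The first key keeps its trailing space, so a lookup by the intended name misses it.
print(list(env))
print("ZMQ_TCP_KEEPALIVE" in env)
```

Under that assumption, a process that looks up ZMQ_TCP_KEEPALIVE would not find the variable, and keepalive would stay off.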

sj2208 commented 8 years ago

Thanks @vbanthia. I am using CoreOS + Fleet.

I have made the following changes:

-e "ZMQ_TCP_KEEPALIVE=1" \
-e "ZMQ_TCP_KEEPALIVE_IDLE=30000" \

to the unit files below:

stf-triproxy-app
stf-triproxy-dev
stf-websocket
stf-provider@
stf-processor@

The setup works absolutely fine. But when I change "ZMQ_TCP_KEEPALIVE_IDLE=30000" to "ZMQ_TCP_KEEPALIVE_IDLE=600000", all of the above units start failing with the error Invalid argument (tcp.cpp:121).

Is there a workaround, or is there a maximum limit on the value?
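A likely explanation, assuming the services run on Linux and the value is passed through to the TCP_KEEPIDLE socket option: the Linux kernel rejects TCP_KEEPIDLE values above 32767 seconds (MAX_TCP_KEEPIDLE) with EINVAL, which is reported as "Invalid argument". The Python sketch below checks that range and, where TCP_KEEPIDLE is available, asks the kernel directly. The function names are illustrative.

```python
import errno
import socket

# Upper bound Linux applies to TCP_KEEPIDLE, in seconds (MAX_TCP_KEEPIDLE).
MAX_TCP_KEEPIDLE = 32767


def keepidle_in_range(seconds):
    """Pure range check mirroring the Linux limit."""
    return 1 <= seconds <= MAX_TCP_KEEPIDLE


def kernel_accepts_keepidle(seconds):
    """Try a real setsockopt call; return None if TCP_KEEPIDLE is unavailable."""
    if not hasattr(socket, "TCP_KEEPIDLE"):
        return None
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, seconds)
            return True
        except OSError as exc:
            if exc.errno == errno.EINVAL:
                return False
            raise


for value in (30000, 600000):
    print(value, keepidle_in_range(value), kernel_accepts_keepidle(value))
```

By this check, 30000 is inside the accepted range and 600000 is not, which matches the behaviour described above.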

sj2208 commented 8 years ago

@vbanthia - please take a look when you get some time.

mitchtech commented 8 years ago

Same issue here on my production deployment. Over time (usually a few hours), devices become inaccessible through the Control screen, but still show up as accessible in the main Devices list. This issue is repeatable on both the master and v2.0.0 versions.

Adding the ZMQ_TCP_KEEPALIVE=1 and ZMQ_TCP_KEEPALIVE_IDLE=30000 environment variables to the systemd unit files did not seem to remedy the situation.

Not sure if this helps to diagnose the issue, but restarting only the provider services is sufficient for me to fully restore device access, only to have the timeout occur again several hours later.

vbanthia-zz commented 8 years ago

By definition, TCP_KEEPIDLE is:

The time (in seconds) the connection needs to remain idle before TCP starts sending keepalive probes, if the socket option SO_KEEPALIVE has been set on this socket.

Can you guys try again with ZMQ_TCP_KEEPALIVE_IDLE=300?
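Since the value is in seconds, it can help to see the values mentioned in this thread as durations. A quick Python conversion (the `idle_times` name is only for illustration):

```python
from datetime import timedelta

# Values of ZMQ_TCP_KEEPALIVE_IDLE mentioned in this thread, in seconds.
idle_times = {s: str(timedelta(seconds=s)) for s in (300, 30000, 600000)}

for seconds, duration in idle_times.items():
    print(f"ZMQ_TCP_KEEPALIVE_IDLE={seconds} -> idle for {duration} before probes start")
```

So 30000 means a connection must sit idle for over eight hours before the first probe is sent, while 300 starts probing after five minutes.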

mitchtech commented 8 years ago

Looking good so far passing 6 hours uptime with ZMQ_TCP_KEEPALIVE_IDLE=300.

By now, without any direct interaction with STF, I would expect to see at least one provider start to become unresponsive. I'll follow up with on-going feedback since the issue intermittently took as long as ~48 hours idle to occur, but I expect this resolves the issue for me, thanks!

sj2208 commented 7 years ago

Tried with value ZMQ_TCP_KEEPALIVE_IDLE=300 and it failed after approx 8 hrs.

mitchtech commented 7 years ago

Four days uptime, all devices and providers still responsive. This definitely fixed the issue for me!

@sj2208 if the connection is still failing, I would try further reducing the value of ZMQ_TCP_KEEPALIVE_IDLE. Presumably the connection fails due to idle timeout; this will reduce the time period before the keepalive probes will start to fire to maintain the connection.

sj2208 commented 7 years ago

Yep this is working. Thank you guys :) 👍 @vbanthia @mitchtech @sorccu

deg0nz commented 5 years ago

Just want to confirm here, that this solved our issue too. Device providers are online for a loooong time now :)

FRodero commented 4 years ago

Hi!

How could I add environment variables on my STF system? Could you help me?

Thanks in advance!

openstf / stf

Production setup -Devices show up on main page, but accessing them shows a grey page and cannot perform any action on it. #342