sj2208 closed this issue 7 years ago
Looks like you've changed the STF logo (visible before you edited), meaning that you've modified the source code. Could be an issue you've created yourself.
I would suspect incorrect configuration.
The only change was the logo, made to check that I could build and deploy changes with Docker. When the units start, all devices show up correctly, but after some time they go into a kind of zombie state.
Most probably it's related to a TCP connection timeout. It depends on how your local network is configured.
One way to solve this is to set the `ZMQ_TCP_KEEPALIVE` and `ZMQ_TCP_KEEPALIVE_IDLE` environment variables on your providers.
@vbanthia - Do I have to pass them to the providers only? Does the unit file below look okay?
    [Unit]
    Description=STF app
    After=rethinkdb-proxy-28015.service
    BindsTo=rethinkdb-proxy-28015.service

    [Service]
    EnvironmentFile=/etc/environment
    TimeoutStartSec=0
    Restart=always
    ExecStartPre=/usr/bin/docker pull openstf/stf:latest
    ExecStartPre=-/usr/bin/docker kill %p-%i
    ExecStartPre=-/usr/bin/docker rm %p-%i
    ExecStart=/usr/bin/docker run --rm \
      --name %p-%i \
      --link rethinkdb-proxy-28015:rethinkdb \
      -e "SECRET=YOUR_SESSION_SECRET_HERE" \
      -e "ZMQ_TCP_KEEPALIVE =1" \
      -e "ZMQ_TCP_KEEPALIVE_IDLE =30000" \
      -p %i:3000 \
      openstf/stf:latest \
      stf app --port 3000 \
      --auth-url https://stf.example.org/auth/mock/ \
      --websocket-url https://stf.example.org/
    ExecStop=-/usr/bin/docker stop -t 10 %p-%i
If you are running all the other STF microservices (stf-app, stf-auth, ...) on the same machine, then you only need to add these variables to the provider. If you are using CoreOS + Fleet, they might be running on different machines, in which case you will have to add these variables to the other services that use ZMQ sockets too.
Basically you will have to make sure that all the TCP connections between your servers do not die.
Unit file looks okay to me.
Looks like you have added an extra space in `-e "ZMQ_TCP_KEEPALIVE =1"`. It might not work because of that space. Change `-e "ZMQ_TCP_KEEPALIVE =1"` -> `-e "ZMQ_TCP_KEEPALIVE=1"`.
Also, always write source code in markdown code block for better visibility.
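A stray space like this is easy to miss in a long `docker run` line. As a rough sketch (the helper name `find_spaced_env_flags` and the example file name are made up for illustration, not part of STF or Docker), a `grep` over the unit file can flag any `-e "NAME =value"` or `-e "NAME= value"` assignments:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (with line numbers) every docker `-e "NAME=value"`
# flag in a file that has whitespace directly before or after the "=".
# Exits non-zero when nothing suspicious is found (plain grep behaviour).
find_spaced_env_flags() {
  grep -nE -- '-e "[A-Za-z_][A-Za-z0-9_]*[[:space:]]+=|-e "[A-Za-z_][A-Za-z0-9_]*=[[:space:]]' "$1"
}

# Example (file name is an assumption):
#   find_spaced_env_flags stf-app@.service
```

If it prints nothing, the assignments are at least free of this particular mistake; it does not check anything else about the unit file.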
Thanks @vbanthia. I am using CoreOS + fleet.
I have added `-e "ZMQ_TCP_KEEPALIVE=1" \` and `-e "ZMQ_TCP_KEEPALIVE_IDLE=30000" \` to the following unit files: stf-triproxy-app, stf-triproxy-dev, stf-websocket, stf-provider@, stf-processor@.
The setup works absolutely fine. But when I change `ZMQ_TCP_KEEPALIVE_IDLE=30000` to `ZMQ_TCP_KEEPALIVE_IDLE=600000`, all of the above units start failing with the error `Invalid argument (tcp.cpp:121)`.
Is there a workaround, or is there a maximum limit on the value?
@vbanthia - please check if you get some time.
Same issue here on my production deployment. Over time (usually a few hours), devices become inaccessible through the Control screen, but still show up as accessible in the main Devices list. This issue is repeatable on both the master and v2.0.0 versions.
Adding the `ZMQ_TCP_KEEPALIVE=1` and `ZMQ_TCP_KEEPALIVE_IDLE=30000` environment variables to the systemd unit files did not seem to remedy the situation.
Not sure if this helps to diagnose the issue, but restarting only the provider services is sufficient for me to fully restore device access, although the timeout occurs again several hours later.
By definition, `TCP_KEEPIDLE` is:
> The time (in seconds) the connection needs to remain idle before TCP starts sending keepalive probes, if the socket option SO_KEEPALIVE has been set on this socket.

Can you guys try again with `ZMQ_TCP_KEEPALIVE_IDLE=300`?
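Since the value is in seconds, a quick range check may also explain the earlier `Invalid argument` error. The sketch below assumes a Linux host, where, as far as I know, the kernel only accepts `TCP_KEEPIDLE` values from 1 to 32767 seconds. `keepalive_idle_ok` is a hypothetical helper, not part of STF or ZeroMQ.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the argument is a whole number of seconds
# inside the range assumed for TCP_KEEPIDLE on Linux (1..32767).
keepalive_idle_ok() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
  esac
  [ "$1" -ge 1 ] && [ "$1" -le 32767 ]
}

# Compare the three values mentioned in this thread.
for value in 300 30000 600000; do
  if keepalive_idle_ok "$value"; then
    echo "$value: in range ($((value / 60)) minutes idle before probes start)"
  else
    echo "$value: out of range"
  fi
done
```

Under that assumption, 30000 is accepted but means more than eight hours of idle time before the first probe, while 600000 falls outside the range and would be rejected when the socket option is set.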
Looking good so far, passing 6 hours uptime with `ZMQ_TCP_KEEPALIVE_IDLE=300`.
By now, without any direct interaction with STF, I would expect to see at least one provider start to become unresponsive. I'll follow up with on-going feedback since the issue intermittently took as long as ~48 hours idle to occur, but I expect this resolves the issue for me, thanks!
Tried with value ZMQ_TCP_KEEPALIVE_IDLE=300 and it failed after approx 8 hrs.
Four days uptime, all devices and providers still responsive. This definitely fixed the issue for me!
@sj2208 if the connection is still failing, I would try further reducing the value of `ZMQ_TCP_KEEPALIVE_IDLE`. Presumably the connection fails due to an idle timeout; a lower value shortens the idle period before keepalive probes start firing to maintain the connection.
Yep this is working. Thank you guys :) 👍 @vbanthia @mitchtech @sorccu
Just want to confirm here, that this solved our issue too. Device providers are online for a loooong time now :)
Hi!
How could I add environment variables on my STF system? Could you help me?
Thanks in advance!
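Going by the unit file posted earlier in this thread, the variables are passed as `-e` flags to `docker run`. A minimal sketch, reusing that `stf app` command with the value that worked for others here; the container name `stf-app` and host port `3100` are placeholders (the unit file uses `%p-%i` and `%i`), and the same two flags would go on the other services that use ZMQ sockets:

```shell
# Sketch only: the `stf app` command from earlier in the thread, with the two
# keepalive variables added. Note there are no spaces around "=".
# `stf-app` and `3100` are placeholder values, not required names.
docker run --rm \
  --name stf-app \
  --link rethinkdb-proxy-28015:rethinkdb \
  -e "SECRET=YOUR_SESSION_SECRET_HERE" \
  -e "ZMQ_TCP_KEEPALIVE=1" \
  -e "ZMQ_TCP_KEEPALIVE_IDLE=300" \
  -p 3100:3000 \
  openstf/stf:latest \
  stf app --port 3000 \
  --auth-url https://stf.example.org/auth/mock/ \
  --websocket-url https://stf.example.org/
```

If you run STF from systemd unit files, the same `-e` lines go inside the `ExecStart=` command of each unit.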
@sorccu
I have deployed STF in production mode on CoreOS. After starting all the services, all devices show up and are accessible. After 24 hours or so, only a few of the devices still show on screen and the rest go into a disconnected state (although the devices are still connected to the same machines). When I access a device in this state, it shows a grey screen, and performing any action (e.g. taking a screenshot) throws an error. I have attached screenshots.
When I restart the services, all devices show up again and are accessible. Can you please guide me on where the issue could be?