osrf / srcsim

Space Robotics Challenge

java.io.IOException: Too many open files #134

Closed osrf-migration closed 7 years ago

osrf-migration commented 7 years ago

Original report (archived issue) by dan (Bitbucket: dan77062).


If I do a lot of walking (for example, walking to the solar panel table, backing away from it, and moving to the solar panel electrical connection table), I get a Java error and Val stops responding:


publishing and latching message for 3.0 seconds
Apr 15, 2017 10:06:37 PM org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink
WARNING: Failed to accept a connection.
java.io.IOException: Too many open files
    at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
    at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
    at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
    at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink$Boss.run(NioServerSocketPipelineSink.java:238)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
osrf-migration commented 7 years ago

Original comment by Louise Poubel (Bitbucket: chapulina, GitHub: chapulina).


osrf-migration commented 7 years ago

Original comment by Rud Merriam (Bitbucket: rmerriam).


I've seen this too. It occurs when the simulation has been open for a long time with a lot of testing that sends messages to the robot. I think closing publishers and subscribers at the end of a run helps, but during testing you often terminate early, so the C++ destructors (in my case) that handle the closing never run.
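
As an illustration (a sketch, not my actual code, with a placeholder node name and topic): installing a custom SIGINT handler that calls ros::shutdown() keeps the cleanup path even when a test is interrupted with Ctrl-C.

#include <ros/ros.h>
#include <std_msgs/String.h>
#include <signal.h>

// Custom SIGINT handler: request a clean roscpp shutdown so that publishers
// and subscribers are unregistered before the process exits.
void onSigint(int)
{
  ros::shutdown();
}

int main(int argc, char** argv)
{
  // NoSigintHandler lets us install the handler above instead of the default.
  ros::init(argc, argv, "clean_shutdown_example", ros::init_options::NoSigintHandler);
  ros::NodeHandle nh;
  signal(SIGINT, onSigint);

  // Example publisher; a real node would create its IHMC publishers here.
  ros::Publisher pub = nh.advertise<std_msgs::String>("chatter", 1);

  ros::Rate rate(10);
  while (ros::ok())
  {
    std_msgs::String msg;
    msg.data = "hello";
    pub.publish(msg);
    ros::spinOnce();
    rate.sleep();
  }
  // After ros::shutdown() the loop exits and the NodeHandle/Publisher
  // destructors run, closing their connections.
  return 0;
}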

osrf-migration commented 7 years ago

Original comment by dan (Bitbucket: dan77062).


I may have figured it out. When a web service has established connections that time out, but the service has connection leaks that leave those connections open, the JVM eventually throws this exception. This is an especially bad bug for our competition because the low bandwidth will make systems especially vulnerable.

So the question now is this: if a run crashes due to a java.io.IOException: Too many open files caused by a connection leak bug in the sim, whose fault is that? Do we get a free reset under the rule that covers the sim crashing for reasons not caused by the team (in this case, connection leaks in the sim code)?

osrf-migration commented 7 years ago

Original comment by dan (Bitbucket: dan77062).


With the isolated task 2 world, this error stops me from getting the solar panel onto the table with the connector. Even after optimizing the number of message calls, the error pops up after I have gone to the rover, picked up the panel, walked to the work table, and am about to put the panel on the table. Rebooting immediately prior to the run does not help.

So now it is a blocker to progress.

osrf-migration commented 7 years ago

Original comment by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).


@dan77062 what are you referring to when you say "a web service with established connections that time out but the service has connection leaks that leave the connections open"? Are you implying there is a leak in the IHMC code? If so, we should report it at:

I can try to reproduce the bug if you can provide a script of commands. For example, would walking forward with one step per message in an infinite ground plane reproduce this bug?

osrf-migration commented 7 years ago

Original comment by dan (Bitbucket: dan77062).


I think that would probably produce the message. It happens for me after walking to the solar panel rover, turning, walking to the panel, doing some arm work, backing away, turning, walking to the electric panel area, and turning again; just as I walk up to the electric panel, the message reliably appears and Val freezes. It doesn't seem to matter whether I actually pick up the solar panel or not.

I don't know what is causing the messages. I was reading about Linux traffic control, and one page mentioned this error being associated with connection leaks, exacerbated by slow connections, so I thought I'd mention that possibility.

osrf-migration commented 7 years ago

Original comment by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).


How are you sending messages? Are you using rospy or roscpp? It would be helpful to use similar code to try to reproduce this.

osrf-migration commented 7 years ago

Original comment by dan (Bitbucket: dan77062).


Almost all are roscpp, but some are rospy.

osrf-migration commented 7 years ago

Original comment by dan (Bitbucket: dan77062).


Any progress on this? It remains a blocker for us in task 2.

osrf-migration commented 7 years ago

Original comment by Jeremy White (Bitbucket: knitfoo).


Just want to chime in and say that I see what dan describes. This is entirely with the Python bindings. My tests may not be quite as pure; I had a few additional publish/subscribe loops in there. But I'm subscribing to a handful of topics and publishing to a handful, and I did that maybe three times. It seems highly improbable that my code exhausted a 1024-file-descriptor limit.

osrf-migration commented 7 years ago

Original comment by Jeremy White (Bitbucket: knitfoo).


Actually, to be more precise: I'm seeing connection refused messages; I have not (yet) seen the java.io.IOException. So I may be conflating two issues; it just seems suspiciously similar.

My connection refused messages look like this:

[rosmaster.master][INFO] 2017-04-29 13:46:59,581: publisherUpdate[/ihmc_ros/valkyrie/control/pelvis_orientation_trajectory] -> http://127.0.0.1:43115/
[rosmaster.threadpool][ERROR] 2017-04-29 13:46:59,582: Traceback (most recent call last):
  File "/opt/ros/indigo/lib/python2.7/dist-packages/rosmaster/threadpool.py", line 218, in run
    result = cmd(*args)
  File "/opt/ros/indigo/lib/python2.7/dist-packages/rosmaster/master_api.py", line 208, in publisher_update_task
    xmlrpcapi(api).publisherUpdate('/master', topic, pub_uris)
  File "/usr/lib/python2.7/xmlrpclib.py", line 1233, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib/python2.7/xmlrpclib.py", line 1587, in __request
    verbose=self.__verbose
  File "/usr/lib/python2.7/xmlrpclib.py", line 1273, in request
    return self.single_request(host, handler, request_body, verbose)
  File "/usr/lib/python2.7/xmlrpclib.py", line 1301, in single_request
    self.send_content(h, request_body)
  File "/usr/lib/python2.7/xmlrpclib.py", line 1448, in send_content
    connection.endheaders(request_body)
  File "/usr/lib/python2.7/httplib.py", line 1013, in endheaders
    self._send_output(message_body)
  File "/usr/lib/python2.7/httplib.py", line 864, in _send_output
    self.send(msg)
  File "/usr/lib/python2.7/httplib.py", line 826, in send
    self.connect()
  File "/usr/lib/python2.7/httplib.py", line 807, in connect
    self.timeout, self.source_address)
  File "/usr/lib/python2.7/socket.py", line 571, in create_connection
    raise err
error: [Errno 111] Connection refused

osrf-migration commented 7 years ago

Original comment by Rud Merriam (Bitbucket: rmerriam).


This may be an interaction within the competition software and have nothing to do with competitors' software. I had been running some simple chest trajectory code while sorting out our TF2 stuff, but nothing of mine was running when I walked away for a few hours. I came back to find these errors.

osrf-migration commented 7 years ago

Original comment by dan (Bitbucket: dan77062).


In task 3, the same error stops me after going up the steps, passing through the door, and turning to go into the hab. So now it is a blocker for both task 2 and task 3.

osrf-migration commented 7 years ago

Original comment by Tom Tseemceeb Xiong (Bitbucket: Tom_Xiong).


Yep, same here. My errors are generated over time, not from running complex code.

osrf-migration commented 7 years ago

Original comment by Steven Gray (Bitbucket: stgray).


Seeing the same 'Failed to accept a connection' errors here. At that point I am unable to connect to any ROS topics published by the robot. I had been running for 1:43:00 real time, 48:30:00 sim time, and brought my system up and down about 20 times in that interval, so lots of subscribing.

May 07, 2017 11:10:34 PM org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink
WARNING: Failed to accept a connection.
java.io.IOException: Too many open files
    at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
    at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
    at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
    at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink$Boss.run(NioServerSocketPipelineSink.java:238)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
osrf-migration commented 7 years ago

Original comment by Steven Gray (Bitbucket: stgray).


@scpeters Here's an easy way to reproduce the Java IOException in under 30 seconds: launch 6+ instances of the test node below, which subscribes to most of the Valkyrie topics. It doesn't matter whether I just use rosrun and cause the previous instance to shut down each time I start a new one, or whether I launch instances with anonymous IDs. The exact number of instances needed seems to vary between the test runs I've tried; sometimes it's as low as 6, other times it's 10 or more.

#include <ros/ros.h>
#include <std_msgs/Float64MultiArray.h>
#include <sensor_msgs/JointState.h>
#include <nav_msgs/Odometry.h>

#include <ihmc_msgs/FootstepDataListRosMessage.h>
#include <ihmc_msgs/HeadTrajectoryRosMessage.h>
#include <ihmc_msgs/PelvisHeightTrajectoryRosMessage.h>
#include <ihmc_msgs/PelvisTrajectoryRosMessage.h>
#include <ihmc_msgs/FootTrajectoryRosMessage.h>
#include <ihmc_msgs/HandTrajectoryRosMessage.h>
#include <ihmc_msgs/ChestTrajectoryRosMessage.h>
#include <ihmc_msgs/WholeBodyTrajectoryRosMessage.h>
#include <ihmc_msgs/ArmTrajectoryRosMessage.h>
#include <ihmc_msgs/NeckTrajectoryRosMessage.h>
#include <ihmc_msgs/StopAllTrajectoryRosMessage.h>
#include <ihmc_msgs/PauseWalkingRosMessage.h>
#include <ihmc_msgs/AbortWalkingRosMessage.h>
#include <ihmc_msgs/GoHomeRosMessage.h>
#include <ihmc_msgs/FootstepStatusRosMessage.h>
#include <ihmc_msgs/HighLevelStateRosMessage.h>
#include <ihmc_msgs/WalkingStatusRosMessage.h>

void jsCallback(const sensor_msgs::JointStateConstPtr &js){};
void fsCallback(const ihmc_msgs::FootstepStatusRosMessageConstPtr &fs){};
void hlsCallback(const ihmc_msgs::HighLevelStateRosMessageConstPtr &hls){};
void pCallback(const nav_msgs::OdometryConstPtr &p){};
void wsCallback(const ihmc_msgs::WalkingStatusRosMessage &ws){};

int main(int argc, char** argv) {

  ros::init(argc, argv, "val_network_test");
  ros::NodeHandle nh;

  // Advertise topics
  ros::Publisher footstep_pub_ = nh.advertise<ihmc_msgs::FootstepDataListRosMessage>("/ihmc_ros/valkyrie/control/footstep_list", 1);
  ros::Publisher head_traj_pub_ = nh.advertise<ihmc_msgs::HeadTrajectoryRosMessage>( "/ihmc_ros/valkyrie/control/head_trajectory", 1);
  ros::Publisher pelvis_height_pub_ = nh.advertise<ihmc_msgs::PelvisHeightTrajectoryRosMessage >( "/ihmc_ros/valkyrie/control/pelvis_height_trajectory", 1);
  ros::Publisher pelvis_traj_pub_ = nh.advertise<ihmc_msgs::PelvisTrajectoryRosMessage>( "/ihmc_ros/valkyrie/control/pelvis_trajectory", 1);
  ros::Publisher foot_traj_pub_ = nh.advertise<ihmc_msgs::FootTrajectoryRosMessage>( "/ihmc_ros/valkyrie/control/foot_trajectory", 1);
  ros::Publisher hand_traj_pub_ = nh.advertise<ihmc_msgs::HandTrajectoryRosMessage >( "/ihmc_ros/valkyrie/control/hand_trajectory", 1);
  ros::Publisher chest_traj_pub_ = nh.advertise<ihmc_msgs::ChestTrajectoryRosMessage>( "/ihmc_ros/valkyrie/control/chest_trajectory", 1);
  ros::Publisher wholebody_traj_pub_ = nh.advertise<ihmc_msgs::WholeBodyTrajectoryRosMessage>( "/ihmc_ros/valkyrie/control/whole_body_trajectory", 1);
  ros::Publisher arm_traj_pub_ = nh.advertise<ihmc_msgs::ArmTrajectoryRosMessage>( "/ihmc_ros/valkyrie/control/arm_trajectory", 1);
  ros::Publisher neck_traj_pub_ = nh.advertise<ihmc_msgs::NeckTrajectoryRosMessage>( "/ihmc_ros/valkyrie/control/neck_trajectory", 1);
  ros::Publisher stop_all_pub_ = nh.advertise<ihmc_msgs::StopAllTrajectoryRosMessage >( "/ihmc_ros/valkyrie/control/stop_all_trajectories", 1);
  ros::Publisher walk_pause_pub_ = nh.advertise<ihmc_msgs::PauseWalkingRosMessage>( "/ihmc_ros/valkyrie/control/pause_walking", 1);
  ros::Publisher walk_abort_pub_ = nh.advertise<ihmc_msgs::AbortWalkingRosMessage>( "/ihmc_ros/valkyrie/control/abort_walking", 1);
  ros::Publisher go_home_pub_ = nh.advertise<ihmc_msgs::GoHomeRosMessage>( "/ihmc_ros/valkyrie/control/go_home", 1);
  ros::Publisher left_hand_pub_ = nh.advertise<std_msgs::Float64MultiArray>( "/left_hand_position_controller/command", 1);
  ros::Publisher right_hand_pub_ = nh.advertise<std_msgs::Float64MultiArray>( "/right_hand_position_controller/command", 1);

  ros::Subscriber joint_state_sub_ = nh.subscribe("/ihmc_ros/valkyrie/output/joint_states", 1, jsCallback);
  ros::Subscriber footstep_sub_ = nh.subscribe("/ihmc_ros/valkyrie/output/footstep_status", 1, fsCallback);
  ros::Subscriber state_sub_ = nh.subscribe("/ihmc_ros/valkyrie/output/high_level_state", 1, hlsCallback);
  ros::Subscriber pose_sub_ = nh.subscribe("/ihmc_ros/valkyrie/output/robot_pose", 1, pCallback);
  ros::Subscriber walk_status_sub_ = nh.subscribe("/ihmc_ros/valkyrie/output/walking_status", 1, wsCallback);

  ros::AsyncSpinner spinner(2);
  spinner.start();
  ros::waitForShutdown();
}
osrf-migration commented 7 years ago

Original comment by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).


@stgray thanks for the example code! I added ros::init_options::AnonymousName to the ros::init call so I could more easily open multiple instances of the node. I've opened about 20 instances of the node and haven't noticed any IOExceptions yet.
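
For reference, the anonymized init call is roughly this one-line change (a sketch, not the exact code I ran):

  // AnonymousName appends a unique suffix to the node name, so several copies
  // of the node can run at the same time without shutting each other down.
  ros::init(argc, argv, "val_network_test", ros::init_options::AnonymousName);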

The only weird thing I've seen was that, before I added AnonymousName, each copy of the node that I opened would shut down the others:

[ WARN] [1494525173.813713585, 57.514000000]: Shutdown request received.
[ WARN] [1494525173.813810064, 57.514000000]: Reason given for shutdown: [new node registered with same name]

After doing that a few times, I noticed the following console error:

May 11, 2017 5:53:30 PM org.jboss.netty.util.internal.SharedResourceMisuseDetector
WARNING: You are creating too many HashedWheelTimer instances.  HashedWheelTimer is a shared resource that must be reused across the application, so that only a few instances are created.

but I haven't seen any consequences of that yet.

osrf-migration commented 7 years ago

Original comment by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).


I have noticed problems when opening too many of those nodes. Sometimes my footstep commands get ignored, and when I opened far too many, I couldn't even do a rostopic echo. I still haven't seen the IOException, though.

osrf-migration commented 7 years ago

Original comment by Steven Gray (Bitbucket: stgray).


@scpeters

I just reproduced the issue on a second machine with SRCSim. Here's video of it happening. Note that I had to run the launch file twice: started it once, killed it, and restarted it. The IOExceptions appear in the terminal where I launch the sim, not the one where I launch my other code. Note that this is with a fresh instance of the sim and no other code running. Here's the launch file I use.

<?xml version="1.0" standalone="yes"?>
<launch>
  <node name="$(anon r5_comms1)" pkg="ws_controls" type="network_stress" />
  <node name="$(anon r5_comms2)" pkg="ws_controls" type="network_stress" />
  <node name="$(anon r5_comms3)" pkg="ws_controls" type="network_stress" />
  <node name="$(anon r5_comms4)" pkg="ws_controls" type="network_stress" />
  <node name="$(anon r5_comms5)" pkg="ws_controls" type="network_stress" />
  <node name="$(anon r5_comms6)" pkg="ws_controls" type="network_stress" />
  <node name="$(anon r5_comms7)" pkg="ws_controls" type="network_stress" />
  <node name="$(anon r5_comms8)" pkg="ws_controls" type="network_stress" />
</launch>
osrf-migration commented 7 years ago

Original comment by Jeremy White (Bitbucket: knitfoo).


I did some naive forensics on this a while back; a stock Ubuntu system has a hard file limit of 4096. I noticed that one of the Java processes would fork in such a way that a pair of the Java processes would end up with a limit of only 2048 file handles. Further, its startup state was about 1800 open files, and it would fairly quickly rise above that and presumably run out.

I've added a .conf file to /etc/security/limits.d/ that contains:

@ros hard nofile 8192
@ros soft nofile 8192

After that, the two Java processes end up with 4096 each. And since making that change, I've broadly had pretty good stability on the longer runs.

That could just be coincidence (I've made other changes), but it could be useful. It certainly shouldn't hurt anything to add that to the srcsim image.
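
For anyone who wants to double-check what a given process actually inherited, here is a small standalone sketch (not part of srcsim) that prints the soft and hard descriptor limits:

#include <cstdio>
#include <sys/resource.h>

// Print the file-descriptor limits this process inherited, to confirm that
// the limits.d change is in effect for newly launched processes.
int main()
{
  struct rlimit rl;
  if (getrlimit(RLIMIT_NOFILE, &rl) == 0)
  {
    std::printf("soft: %llu  hard: %llu\n",
                (unsigned long long)rl.rlim_cur,
                (unsigned long long)rl.rlim_max);
  }
  return 0;
}

For processes that are already running (like the Java ones), /proc/<pid>/limits shows the same information.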

osrf-migration commented 7 years ago

Original comment by Steven Gray (Bitbucket: stgray).


@knitfoo thanks for the suggestion. I tried it but can still replicate the issue. I modified /etc/security/limits.d/ros-rtprio.conf from step 11 of the setup guide so it looks like:

@ros - rtprio 99
@ros hard nofile 8192
@ros soft nofile 8192

and verified that ulimit -n showed 8192. I then ran my test script again and (after somewhere between 8 and 16 instances) got a number of new Java exceptions, plus the same 'too many open files' exception.

May 11, 2017 11:30:45 PM org.jboss.netty.channel.DefaultChannelPipeline
WARNING: An exception was thrown by a user handler while handling an exception event ([id: 0x0868b217] EXCEPTION: org.jboss.netty.channel.ChannelException: Failed to create a selector.)
org.ros.exception.RosRuntimeException: org.jboss.netty.channel.ChannelException: Failed to create a selector.
    at org.ros.internal.transport.ConnectionTrackingHandler.exceptionCaught(ConnectionTrackingHandler.java:81)
    at org.jboss.netty.channel.Channels.fireExceptionCaught(Channels.java:533)
    at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink.connect(NioClientSocketPipelineSink.java:166)
    at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:111)
    at org.jboss.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:55)
    at org.jboss.netty.channel.Channels.connect(Channels.java:642)
    at org.jboss.netty.channel.AbstractChannel.connect(AbstractChannel.java:207)
    at org.jboss.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:230)
    at org.jboss.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:183)
    at org.ros.internal.transport.tcp.TcpClient.connect(TcpClient.java:101)
    at org.ros.internal.transport.tcp.TcpClientManager.connect(TcpClientManager.java:69)
    at org.ros.internal.node.topic.DefaultSubscriber.addPublisher(DefaultSubscriber.java:150)
    at org.ros.internal.node.topic.UpdatePublisherRunnable.run(UpdatePublisherRunnable.java:73)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.jboss.netty.channel.ChannelException: Failed to create a selector.
    at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.register(NioClientSocketPipelineSink.java:212)
    at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink.connect(NioClientSocketPipelineSink.java:161)
    ... 13 more
Caused by: java.io.IOException: Too many open files
    at sun.nio.ch.IOUtil.makePipe(Native Method)
    at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:65)
    at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:36)
    at java.nio.channels.Selector.open(Selector.java:227)
    at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.register(NioClientSocketPipelineSink.java:210)
    ... 14 more

May 11, 2017 11:30:46 PM org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink
WARNING: Failed to accept a connection.
java.io.IOException: Too many open files
    at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
    at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
    at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
    at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink$Boss.run(NioServerSocketPipelineSink.java:238)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
osrf-migration commented 7 years ago

Original comment by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).


I've submitted an upstream issue, since the ros API launch file causes over 1.5 million files to be opened by java on my machine, and I'm not sure why.

osrf-migration commented 7 years ago

Original comment by Steven Gray (Bitbucket: stgray).


Yikes, good catch. Just verified this on my machine with lsof | wc -l

osrf-migration commented 7 years ago

Original comment by Nate Koenig (Bitbucket: Nathan Koenig).


This issue is out of our hands, and it won't be resolved in time for the final competition. This is unfortunate, and we are sorry about this problem.

Rather than attempt to work around it by changing the competition structure, we will reset teams whenever this (or other problems) arise during the finals.

I understand the frustration that this may cause, and I hope that everyone can work with us over the next two weeks.

osrf-migration commented 7 years ago

Original comment by Víctor López (Bitbucket: Victor Lopez).


Rather than attempt to work around it by changing the competition structure, we will reset teams whenever this (or other problems) arise during the finals.

Without a score penalty, right? What about the time?

Also, this might be the cause of some of the performance problems; I see a lot of system CPU usage when running the simulation.

osrf-migration commented 7 years ago

Original comment by dan (Bitbucket: dan77062).


I know this is a wontfix, but I've been running 0.6 locally and I think this is happening more often than it was before. I can generally run the sim for about 15 minutes before seeing the error. Has anyone else noticed an uptick in this error? Cloudsim is still on 0.5 (or was yesterday), so it's unclear whether it will be affected as well.

osrf-migration commented 7 years ago

Original comment by Jeremy White (Bitbucket: knitfoo).


I don't think we've seen an uptick, but we haven't been running against the cloud image much. I haven't thought to check whether the cloudsim images have an override in /etc/security/limits.d/ to give us a buffer. Has anyone checked on that?

osrf-migration commented 7 years ago

Original comment by Nate Koenig (Bitbucket: Nathan Koenig).


There will be no penalty in score or time when this occurs. All log files will be kept so that we (and you) can review the final scores.

Note that the scores that appear on cloudsim will not be accurate after a reset.

osrf-migration commented 7 years ago

Original comment by Balazs Kollar (Bitbucket: BalazsKollar).


I think there's a silent variant of this error, but I'm only guessing. When I kill my Python code with Ctrl-C, it looks like it's leaking subscribers/publishers, and after starting and killing it 5-6 times it will not get a connection and will be "waiting for a subscriber" forever. If I create a shutdown hook and unregister() everything before exiting, I get the too many open files error after the same 5-6 tries, and the sim becomes unresponsive just like in the first case. Of course I might be misunderstanding something, but this might be important if a team has to request a reset because the sim is not responding.
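
For what it's worth, the roscpp analogue of that kind of shutdown hook would look roughly like this (a sketch only; my actual code is rospy, which uses unregister() instead):

#include <ros/ros.h>
#include <sensor_msgs/JointState.h>

void jsCallback(const sensor_msgs::JointStateConstPtr&) {}

int main(int argc, char** argv)
{
  ros::init(argc, argv, "explicit_cleanup_example");
  ros::NodeHandle nh;

  ros::Subscriber sub =
      nh.subscribe("/ihmc_ros/valkyrie/output/joint_states", 1, jsCallback);

  ros::Duration(5.0).sleep();  // stand-in for the real work

  // Explicitly tear down the subscription before exiting, instead of relying
  // on the destructor that runs only if the process shuts down cleanly.
  sub.shutdown();
  return 0;
}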