ros-infrastructure / ros_buildfarm

ROS buildfarm based on Docker
Apache License 2.0

Jobs timing out when trying to ssh at cleanup stage #466

Closed by tfoote 7 years ago

tfoote commented 7 years ago

I've seen several jobs hang with this sort of error, where it appears the SSH connection fails. One example:

http://build.ros.org:8080/view/Queue/job/Kbin_dj_dJ64__grid_map_visualization__debian_jessie_amd64__binary/36/console

00:11:23.242 SSH: Connecting from host [ip-172-31-7-50]
00:11:23.245 SSH: Connecting with configuration [repo] ...
02:00:01.258 Build timed out (after 120 minutes). Marking the build as failed.
02:00:01.263 ERROR: null
02:00:01.263 Build step 'Send files or execute commands over SSH' changed build result to FAILURE
02:00:01.263 Build step 'Send files or execute commands over SSH' marked build as failure
02:00:01.264 $ ssh-agent -k
02:00:01.265 SSH: Caught exception [Failed to read file - filename [/var/lib/jenkins/.ssh/id_rsa] (relative to JENKINS_HOME if not absolute). Message: [java.lang.InterruptedException]] Sleeping for [5,000]ms before trying again
02:00:01.276 unset SSH_AUTH_SOCK;
02:00:01.276 unset SSH_AGENT_PID;
02:00:01.276 echo Agent pid 2392 killed;

http://build.ros.org:8080/view/Queue/job/Kbin_uX32__grid_map_octomap__ubuntu_xenial_i386__binary/1/console
http://build.ros.org:8080/view/Queue/job/Kbin_uX32__grid_map_visualization__ubuntu_xenial_i386__binary/35/console
http://build.ros.org:8080/view/Queue/job/Kbin_dj_dJ64__grid_map_rviz_plugin__debian_jessie_amd64__binary/40/console
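For anyone triaging similar console output, the stall in the excerpt above shows up as a long gap between two consecutive timestamped lines. The following is a minimal sketch of how that gap could be located automatically. It assumes each line starts with an `HH:MM:SS.mmm` prefix that never wraps around, as in the excerpt. The helper names `parse_elapsed` and `largest_gap` are hypothetical and not part of ros_buildfarm or Jenkins.

```python
import re
from datetime import timedelta

# Matches the time prefix used in the console excerpt, e.g. "00:11:23.245 ".
STAMP = re.compile(r"^(\d+):(\d{2}):(\d{2})\.(\d{3})\s")


def parse_elapsed(line):
    """Return the time prefix of a console line as a timedelta, or None if absent."""
    match = STAMP.match(line)
    if match is None:
        return None
    hours, minutes, seconds, millis = (int(group) for group in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds, milliseconds=millis)


def largest_gap(lines):
    """Return (gap, line_before, line_after) for the longest pause between
    consecutive timestamped lines, or None if fewer than two lines have a prefix."""
    best = None
    prev_time = prev_line = None
    for line in lines:
        current = parse_elapsed(line)
        if current is None:
            continue  # skip lines without a time prefix
        if prev_time is not None:
            gap = current - prev_time
            if best is None or gap > best[0]:
                best = (gap, prev_line, line)
        prev_time, prev_line = current, line
    return best


# First four lines of the console excerpt quoted in this issue.
console = """\
00:11:23.242 SSH: Connecting from host [ip-172-31-7-50]
00:11:23.245 SSH: Connecting with configuration [repo] ...
02:00:01.258 Build timed out (after 120 minutes). Marking the build as failed.
02:00:01.263 ERROR: null
""".splitlines()

gap, before, after = largest_gap(console)
print(gap)
print(before)
```

On the excerpt, the longest pause (roughly 1 hour 48 minutes) falls immediately after the `SSH: Connecting with configuration [repo] ...` line, which matches the observation that the job sits in the SSH step until the 120-minute build timeout fires.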

Potentially all builds on this computer? http://build.ros.org:8080/computer/ip-172-31-7-50.us-west-1.compute.internal/

http://build.ros.org:8080/computer/ip-172-31-7-50.us-west-1.compute.internal/builds


nuclearsandwich commented 7 years ago

Disconnecting the agent and restarting the jenkins-slave service has apparently resolved the issue.

Once is a fluke, twice is a bug.