ros2 / build_farmer


macOS mini1 and mini2 not staying online #258

Closed nuclearsandwich closed 4 years ago

nuclearsandwich commented 4 years ago

During the week of ROSCon, we wiped the CI nodes mini1 and mini2 and installed macOS Mojave. Since the reinstall, neither machine has reliably stayed on the CI cluster.

On Oct 31, mini1 experienced up to 5% packet loss compared to my workstation on the other end of the office, which I thought could be related to the connection reliability. Since then I've relocated mini1 and the packet loss is down, but the disconnects continue.
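
For anyone repeating the packet-loss comparison, here is a minimal sketch of pulling the loss percentage out of a `ping` summary line so the two machines can be compared over time. The sample line and the function name `parse_packet_loss` are illustrative assumptions, not output captured from mini1.

```python
import re


def parse_packet_loss(ping_summary: str) -> float:
    """Return the packet-loss percentage found in a ping summary line."""
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", ping_summary)
    if match is None:
        raise ValueError("no packet-loss figure found in ping output")
    return float(match.group(1))


# Hypothetical summary line in the style printed by ping; not real data.
sample = "100 packets transmitted, 95 packets received, 5.0% packet loss"
loss = parse_packet_loss(sample)
print(f"packet loss: {loss}%")
```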

Since then I've tried

Each connection failure in the agent logs is preceded by a java.lang.NoClassDefFoundError for either hudson/util/ProcessTree or jenkins/util/java/JavaUtils, but browsing Jenkins issues hasn't yielded paydirt yet.
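
A rough sketch of how the agent log could be scanned for those errors, to check which class is missing before each disconnect. The log lines below are made up for illustration; only the two class names come from the logs described above, and `find_class_errors` is a hypothetical helper.

```python
import re
from typing import Iterable, List, Tuple

# Matches the standard Java message "java.lang.NoClassDefFoundError: <class>".
ERROR_RE = re.compile(r"java\.lang\.NoClassDefFoundError: (\S+)")


def find_class_errors(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, missing class) for each NoClassDefFoundError."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = ERROR_RE.search(line)
        if match:
            hits.append((number, match.group(1)))
    return hits


# Hypothetical agent log excerpt; not copied from mini1 or mini2.
sample_log = [
    "INFO: Connected",
    "Exception: java.lang.NoClassDefFoundError: hudson/util/ProcessTree",
    "SEVERE: connection terminated",
    "Exception: java.lang.NoClassDefFoundError: jenkins/util/java/JavaUtils",
]
errors = find_class_errors(sample_log)
for number, missing in errors:
    print(f"line {number}: missing {missing}")
```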

j-rivero commented 4 years ago

My first try was to take the machine out of the ROS buildfarm and connect it to build.osrfoundation.org to see if we could get more information from it. I launched the agent directly from the command line using the java -jar agent.jar ... invocation.

Unfortunately the errors persisted in the same way. I could not find anything in any log that points to the root cause.

rotu commented 4 years ago

Is this an example of the type of failure due to the server going offline?

https://ci.ros2.org/job/ci_osx/8271/console

15:31:50       Start  6: xmllint
15:31:50 
15:31:50 6: Test command: /Users/osrf/jenkins-agent/workspace/ci_osx/venv/bin/python3 "-u" "/Users/osrf/jenkins-agent/workspace/ci_osx/ws/install/ament_cmake_test/share/ament_cmake_test/cmake/run_test.py" "/Users/osrf/jenkins-agent/workspace/ci_osx/ws/build/rmw/test_results/rmw/xmllint.xunit.xml" "--package-name" "rmw" "--output-file" "/Users/osrf/jenkins-agent/workspace/ci_osx/ws/build/rmw/ament_xmllint/xmllint.txt" "--command" "/Users/osrf/jenkins-agent/workspace/ci_osx/ws/install/ament_xmllint/bin/ament_xmllint" "--xunit-file" "/Users/osrf/jenkins-agent/workspace/ci_osx/ws/build/rmw/test_results/rmw/xmllint.xunit.xml"
15:31:50 6: Test timeout computed to be: 60
15:31:50 6: -- run_test.py: invoking following command in '/Users/osrf/jenkins-agent/workspace/ci_osx/ws/src/ros2/rmw/rmw':
15:31:50 6:  - /Users/osrf/jenkins-agent/workspace/ci_osx/ws/install/ament_xmllint/bin/ament_xmllint --xunit-file /Users/osrf/jenkins-agent/workspace/ci_osx/ws/build/rmw/test_results/rmw/xmllint.xunit.xml
15:32:08 FATAL: command execution failed
15:32:08 java.nio.channels.ClosedChannelException
15:32:08    at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:209)
15:32:08    at org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:221)
15:32:08    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:816)
15:32:08    at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287)
15:32:08    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:181)
15:32:08    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.switchToNoSecure(SSLEngineFilterLayer.java:283)
15:32:08    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:503)
15:32:08    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:248)
15:32:08    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:200)
15:32:08    at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doCloseSend(SSLEngineFilterLayer.java:213)
15:32:08    at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doCloseSend(ProtocolStack.java:784)
15:32:08    at org.jenkinsci.remoting.protocol.ApplicationLayer.doCloseWrite(ApplicationLayer.java:172)
15:32:08    at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer$ByteBufferCommandTransport.closeWrite(ChannelApplicationLayer.java:314)
15:32:08    at hudson.remoting.Channel.close(Channel.java:1450)
15:32:08    at hudson.remoting.Channel.close(Channel.java:1403)
15:32:08    at hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:824)
15:32:08    at hudson.slaves.SlaveComputer.access$100(SlaveComputer.java:107)
15:32:08    at hudson.slaves.SlaveComputer$2.run(SlaveComputer.java:733)
15:32:08    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
15:32:08    at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
15:32:08    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
15:32:08    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
15:32:08    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
15:32:08    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
15:32:08    at java.lang.Thread.run(Thread.java:748)
15:32:08 Caused: java.io.IOException: Backing channel 'JNLP4-connect connection from 70-35-50-58.static.wiline.com/70.35.50.58:49667' is disconnected.

rotu commented 4 years ago

Here's another build that looks like wonky network stuff

https://ci.ros2.org/job/ci_osx/8347/consoleText

---
Finished <<< rosgraph_msgs [1min 13s]
Starting >>> std_msgs
--- output: zstd_vendor
Not searching for unused variables given on the command line.
-- The C compiler identification is AppleClang 9.0.0.9000039
-- The CXX compiler identification is AppleClang 9.0.0.9000039
-- Check for working C compiler: /usr/local/opt/ccache/libexec/cc
-- Check for working C compiler: /usr/local/opt/ccache/libexec/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/local/opt/ccache/libexec/c++
-- Check for working CXX compiler: /usr/local/opt/ccache/libexec/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found ament_cmake: 0.8.1 (/Users/osrf/jenkins-agent/workspace/ci_osx/ws/install/ament_cmake/share/ament_cmake/cmake)
-- Found PythonInterp: /Users/osrf/jenkins-agent/workspace/ci_osx/venv/bin/python3 (found suitable version "3.7.6", minimum required is "3") 
-- Using PYTHON_EXECUTABLE: /Users/osrf/jenkins-agent/workspace/ci_osx/venv/bin/python3
-- Configuring done
-- Generating done
-- Build files have been written to: /Users/osrf/jenkins-agent/workspace/ci_osx/ws/build/zstd_vendor
Scanning dependencies of target zstd-1.4.4
[ 12%] Creating directories for 'zstd-1.4.4'
[ 25%] Performing download step (download, verify and extract) for 'zstd-1.4.4'
-- Downloading...
   dst='/Users/osrf/jenkins-agent/workspace/ci_osx/ws/build/zstd_vendor/zstd-1.4.4-prefix/src/v1.4.4.zip'
   timeout='60 seconds'
-- Using src='https://github.com/facebook/zstd/archive/v1.4.4.zip'
-- [download 100% complete]
-- Retrying...
-- Using src='https://github.com/facebook/zstd/archive/v1.4.4.zip'
-- [download 100% complete]
-- Retry after 5 seconds (attempt #2) ...
-- Using src='https://github.com/facebook/zstd/archive/v1.4.4.zip'
-- [download 100% complete]
-- [download 0% complete]

-- [download 1% complete]
-- [download 2% complete]
-- [download 3% complete]
-- [download 4% complete]
-- [download 5% complete]
-- [download 6% complete]
-- [download 7% complete]
-- [download 8% complete]
-- [download 9% complete]
-- Retry after 5 seconds (attempt #3) ...
-- Using src='https://github.com/facebook/zstd/archive/v1.4.4.zip'
-- [download 100% complete]
-- [download 0% complete]
-- [download 1% complete]
-- [download 2% complete]
-- [download 3% complete]
-- [download 4% complete]
-- [download 5% complete]
-- [download 6% complete]
-- [download 7% complete]
-- [download 8% complete]
-- [download 9% complete]
-- [download 10% complete]
-- [download 11% complete]
-- [download 12% complete]
-- [download 13% complete]
-- Retry after 15 seconds (attempt #4) ...
-- Using src='https://github.com/facebook/zstd/archive/v1.4.4.zip'
-- [download 0% complete]
-- [download 1% complete]
-- [download 2% complete]
-- [download 3% complete]
-- [download 4% complete]
-- [download 5% complete]
-- [download 6% complete]
-- [download 7% complete]
-- [download 8% complete]
-- [download 9% complete]
-- [download 10% complete]
-- Retry after 60 seconds (attempt #5) ...
-- Using src='https://github.com/facebook/zstd/archive/v1.4.4.zip'
-- [download 100% complete]
-- [download 0% complete]
-- [download 1% complete]
-- [download 2% complete]
-- [download 3% complete]
-- [download 4% complete]
-- [download 5% complete]
-- [download 6% complete]
-- [download 7% complete]
-- [download 8% complete]
-- [download 9% complete]
-- [download 10% complete]
CMake Error at zstd-1.4.4-stamp/download-zstd-1.4.4.cmake:159 (message):
  Each download failed!

    error: downloading 'https://github.com/facebook/zstd/archive/v1.4.4.zip' failed
         status_code: 28
         status_string: "Timeout was reached"
         log:
         --- LOG BEGIN ---
           Trying 192.30.255.112...

  TCP_NODELAY set

  Connected to github.com (192.30.255.112) port 443 (#0)

  ALPN, offering http/1.1

  Cipher selection:
  ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH

  successfully set certificate verify locations:

    CAfile: /etc/ssl/cert.pem
    CApath: none

  TLSv1.2 (OUT), TLS handshake, Client hello (1):

  [213 bytes data]

  TLSv1.2 (IN), TLS handshake, Server hello (2):

  [108 bytes data]

  TLSv1.2 (IN), TLS handshake, Certificate (11):

  [3085 bytes data]

  TLSv1.2 (IN), TLS handshake, Server key exchange (12):

  [300 bytes data]

  TLSv1.2 (IN), TLS handshake, Server finished (14):

  [4 bytes data]

  TLSv1.2 (OUT), TLS handshake, Client key exchange (16):

  [37 bytes data]

  TLSv1.2 (OUT), TLS change cipher, Client hello (1):

  [1 bytes data]

  TLSv1.2 (OUT), TLS handshake, Finished (20):

  [16 bytes data]

  TLSv1.2 (IN), TLS change cipher, Client hello (1):

  [1 bytes data]

  TLSv1.2 (IN), TLS handshake, Finished (20):

  [16 bytes data]

  SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256

  ALPN, server accepted to use http/1.1

  Server certificate:

   subject: businessCategory=Private Organization; jurisdictionCountryName=US; jurisdictionStateOrProvinceName=Delaware; serialNumber=5157550; C=US; ST=California; L=San Francisco; O=GitHub, Inc.; CN=github.com
   start date: May  8 00:00:00 2018 GMT
   expire date: Jun  3 12:00:00 2020 GMT
   subjectAltName: host "github.com" matched cert's "github.com"
   issuer: C=US; O=DigiCert Inc; OU=www.digicert.com; CN=DigiCert SHA2 Extended Validation Server CA
   SSL certificate verify ok.

  GET /facebook/zstd/archive/v1.4.4.zip HTTP/1.1

  Host: github.com

  User-Agent: curl/7.51.0

  Accept: */*

  HTTP/1.1 302 Found

  date: Fri, 17 Apr 2020 00:22:58 GMT

  content-type: text/html; charset=utf-8

  server: GitHub.com

  status: 302 Found

  vary: X-PJAX, Accept-Encoding, Accept, X-Requested-With

  location: https://codeload.github.com/facebook/zstd/zip/v1.4.4

  cache-control: max-age=0, private

  strict-transport-security: max-age=31536000; includeSubdomains; preload

  x-frame-options: deny

  x-content-type-options: nosniff

  x-xss-protection: 1; mode=block

  expect-ct: max-age=2592000,
  report-uri="https://api.github.com/_private/browser/errors"

  content-security-policy: default-src 'none'; base-uri 'self';
  block-all-mixed-content; connect-src 'self' uploads.github.com
  www.githubstatus.com collector.githubapp.com api.github.com
  www.google-analytics.com github-cloud.s3.amazonaws.com
  github-production-repository-file-5c1aeb.s3.amazonaws.com
  github-production-upload-manifest-file-7fdce7.s3.amazonaws.com
  github-production-user-asset-6210df.s3.amazonaws.com cdn.optimizely.com
  logx.optimizely.com/v1/events wss://live.github.com; font-src
  github.githubassets.com; form-action 'self' github.com gist.github.com;
  frame-ancestors 'none'; frame-src render.githubusercontent.com; img-src
  'self' data: github.githubassets.com identicons.github.com
  collector.githubapp.com github-cloud.s3.amazonaws.com
  *.githubusercontent.com; manifest-src 'self'; media-src 'none'; script-src
  github.githubassets.com; style-src 'unsafe-inline' github.githubassets.com

  Content-Length: 118

  X-GitHub-Request-Id: D5E6:9ADE:AAC19:E88A4:5E98F6E1

  Ignoring the response-body

  [118 bytes data]

  Connection #0 to host github.com left intact

  Issue another request to this URL:
  'https://codeload.github.com/facebook/zstd/zip/v1.4.4'

    Trying 192.30.255.121...

  TCP_NODELAY set

  Connected to codeload.github.com (192.30.255.121) port 443 (#1)

  ALPN, offering http/1.1

  Cipher selection:
  ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH

  successfully set certificate verify locations:

    CAfile: /etc/ssl/cert.pem
    CApath: none

  TLSv1.2 (OUT), TLS handshake, Client hello (1):

  [222 bytes data]

  TLSv1.2 (IN), TLS handshake, Server hello (2):

  [108 bytes data]

  TLSv1.2 (IN), TLS handshake, Certificate (11):

  [2851 bytes data]

  TLSv1.2 (IN), TLS handshake, Server key exchange (12):

  [300 bytes data]

  TLSv1.2 (IN), TLS handshake, Server finished (14):

  [4 bytes data]

  TLSv1.2 (OUT), TLS handshake, Client key exchange (16):

  [37 bytes data]

  TLSv1.2 (OUT), TLS change cipher, Client hello (1):

  [1 bytes data]

  TLSv1.2 (OUT), TLS handshake, Finished (20):

  [16 bytes data]

  TLSv1.2 (IN), TLS change cipher, Client hello (1):

  [1 bytes data]

  TLSv1.2 (IN), TLS handshake, Finished (20):

  [16 bytes data]

  SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256

  ALPN, server accepted to use http/1.1

  Server certificate:

   subject: C=US; ST=California; L=San Francisco; O=GitHub, Inc.; CN=*.github.com
   start date: Jul  8 00:00:00 2019 GMT
   expire date: Jul 16 12:00:00 2020 GMT
   subjectAltName: host "codeload.github.com" matched cert's "*.github.com"
   issuer: C=US; O=DigiCert Inc; OU=www.digicert.com; CN=DigiCert SHA2 High Assurance Server CA
   SSL certificate verify ok.

  GET /facebook/zstd/zip/v1.4.4 HTTP/1.1

  Host: codeload.github.com

  User-Agent: curl/7.51.0

  Accept: */*

  HTTP/1.1 200 OK

  Access-Control-Allow-Origin: https://render.githubusercontent.com

  Content-Disposition: attachment; filename=zstd-1.4.4.zip

  Content-Security-Policy: default-src 'none'; style-src 'unsafe-inline';
  sandbox

  Content-Type: application/zip

  ETag: W/"c0404a1b438a4549e83b7323dadd897d3cf234e6fe6eb9101c5fcdb277420dc7"

  Strict-Transport-Security: max-age=31536000

  Vary: Authorization,Accept-Encoding

  X-Content-Type-Options: nosniff

  X-Frame-Options: deny

  X-XSS-Protection: 1; mode=block

  Date: Fri, 17 Apr 2020 00:23:00 GMT

  X-Varnish: 864982496

  Age: 0

  Via: 1.1 varnish (Varnish/6.0)

  X-Cache: MISS

  X-Cache-Hits: 0

  Accept-Ranges: bytes

  Transfer-Encoding: chunked

  X-GitHub-Request-Id: D5E7:1BBA:1162E:2F991:5E98F6E2

  [633 bytes data]

  [1370 bytes data]

  [54 bytes data]

  [1370 bytes data]

  [1370 bytes data]

  [1370 bytes data]

...

 [1370 bytes data]

  [1370 bytes data]

  [1370 bytes data]

  [1370 bytes data]

  [1370 bytes data]

  Operation timed out after 57526 milliseconds with 229741 out of 2234160
  bytes received

  stopped the pause stream!

  Closing connection 1

  TLSv1.2 (OUT), TLS alert, Client hello (1):

  [2 bytes data]

         --- LOG END ---

make[2]: *** [zstd-1.4.4-prefix/src/zstd-1.4.4-stamp/zstd-1.4.4-download] Error 1
make[1]: *** [CMakeFiles/zstd-1.4.4.dir/all] Error 2
make: *** [all] Error 2
---
Failed   <<< zstd_vendor    [ Exited with code 2 ]
Aborted  <<< action_msgs
Aborted  <<< std_msgs
Aborted  <<< rcl_interfaces
Summary: 151 packages finished [11min 39s]
  1 package failed: zstd_vendor
  3 packages aborted: action_msgs rcl_interfaces std_msgs
  7 packages had stderr output: foonathan_memory_vendor qt_gui_cpp rcl_logging_spdlog rviz_rendering rviz_rendering_tests tracetools zstd_vendor
  138 packages not processed
<== '. ../venv/bin/activate && . "/Applications/rti_connext_dds-5.3.1/resource/scripts/rtisetenv_x64Darwin16clang8.0.bash" && /Users/osrf/jenkins-agent/workspace/ci_osx/venv/bin/colcon build --base-paths "src" --build-base "build" --install-base "install" --event-handlers console_cohesion+ console_package_list+ --cmake-args -DBUILD_TESTING=ON --no-warn-unused-cli -DINSTALL_EXAMPLES=OFF -DSECURITY=ON' exited with return code '2'
Build step 'Execute shell' marked build as failure
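
The numbers in the curl log above give a sense of how slow the link was. This is a back-of-the-envelope sketch using only the figures from the "Operation timed out" line and the 60-second timeout printed by the download step; the variable names are illustrative.

```python
# Figures copied from the log above.
bytes_received = 229741   # "229741 out of 2234160 bytes received"
bytes_total = 2234160
elapsed_ms = 57526        # "Operation timed out after 57526 milliseconds"
timeout_s = 60            # timeout='60 seconds'

# Observed transfer rate in bytes per second.
observed_rate = bytes_received / (elapsed_ms / 1000)

# Rate that would have been needed to finish within the timeout.
required_rate = bytes_total / timeout_s

# Share of the archive that arrived, and how long the whole download
# would take at the observed rate.
fraction_done = bytes_received / bytes_total
seconds_needed = bytes_total / observed_rate

print(f"observed: {observed_rate / 1000:.1f} kB/s")
print(f"required: {required_rate / 1000:.1f} kB/s")
print(f"received: {fraction_done:.1%}")
print(f"time needed at observed rate: {seconds_needed / 60:.1f} min")
```

The received share works out to roughly 10%, which agrees with the last "[download 10% complete]" line before the failure.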

nuclearsandwich commented 4 years ago

Recent activity hasn't included any of the problems identified in this issue and I think it can be closed.