radical-collaboration / hpc-workflows

NSF16514 EarthCube Project - Award Number:1639694

EnTK stuck in a loop after getting resource assignment? #130

Closed lsawade closed 3 years ago

lsawade commented 3 years ago

Hi,

I tried to run a simple job after updating EnTK, but it seems to get stuck. This job worked previously. I ran tail -f on radical.log, and it seemed to be looping.

Log: /home/lsawade/simple_entk_specfem/re.session.traverse.princeton.edu.lsawade.018635.0001/radical.log

Sandbox: /scratch/gpfs/lsawade/radical.pilot.sandbox/re.session.traverse.princeton.edu.lsawade.018635.0001

I tested afterwards with a smaller assignment and a Hello World script, and ran into the same issue. The job is launched on the cluster and gets the appropriate resources, but the stages and tasks are never submitted.

andre-merzky commented 3 years ago

Hey @lsawade - would you mind attaching a tarball of the pilot sandbox to this ticket? Thank you!

lsawade commented 3 years ago

Sorry! Should have done this immediately! --> sandbox.tar.zip

andre-merzky commented 3 years ago

np!

The code hangs while trying to create an ssh tunnel, or, more precisely, while trying to find an open port for creating the tunnel. The host to tunnel to is listed as 10.33.24.12 - is that the expected target IP? Is that IP reachable from the login node?

From bootstrap_0.out:

0.0000,tunnel_setup_start,bootstrap_0,MainThread,pilot.0000,PMGR_ACTIVE_PENDING,
# -------------------------------------------------------------------
# Setting up forward tunnel to 129.114.17.185:27017.

################################################################################
## Searching for available TCP port for tunnel in range 23000..23100.
## Found available port: 23000
ssh -o StrictHostKeyChecking=no -x -a -4 -T -N -L 127.0.0.1:23000:129.114.17.185:27017 -p 22 traverse.princeton.edu
1.0000,tunnel_setup_stop,bootstrap_0,MainThread,pilot.0000,PMGR_ACTIVE_PENDING,
app tunnel addr : 10.33.24.12

1.0000,tunnel_setup_start,bootstrap_0,MainThread,pilot.0000,PMGR_ACTIVE_PENDING,
# -------------------------------------------------------------------
# Setting up forward tunnel to 10.33.24.12.

################################################################################
## Searching for available TCP port for tunnel in range 23000..23100.

The first tunnel is for MongoDB, and tries to set up a tunnel for 129.114.17.185:27017 via host traverse.princeton.edu. The second tunnel tries to connect to 10.33.24.12, using the same tunnel host (traverse.princeton.edu) - and hangs.
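For anyone wanting to rule out the port search itself on the login node, here is a minimal Python sketch of such a search. It is not the bootstrapper's actual code: the function name `find_free_port` and the bind-based probing strategy are assumptions for illustration, and only the range 23000..23100 comes from the log above.

```python
import socket

def find_free_port(start=23000, stop=23100, host="127.0.0.1"):
    """Return the first port in [start, stop] that can be bound on host, or None.

    Hypothetical helper: it mirrors the range printed in bootstrap_0.out,
    not the logic the bootstrapper really uses.
    """
    for port in range(start, stop + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # port is in use or not bindable, try the next one
            return port
    return None

if __name__ == "__main__":
    port = find_free_port()
    if port is None:
        print("no free port in range")
    else:
        print("free port:", port)
```

If this returns quickly with a port while the bootstrapper still hangs at the same step, the problem is more likely in how the second tunnel target is handled than in port availability.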

10.x.y.z looks like a private network - please check if that is indeed the host you want to connect to (presumably to fetch data?).
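A quick way to confirm which of the two addresses in the log is private is Python's standard `ipaddress` module. This is only a sketch; the helper name `is_private` is made up for illustration.

```python
import ipaddress

def is_private(addr):
    """True if addr falls in a private range such as 10.0.0.0/8."""
    return ipaddress.ip_address(addr).is_private

# The two tunnel targets that appear in bootstrap_0.out.
for addr in ("10.33.24.12", "129.114.17.185"):
    print(addr, "private" if is_private(addr) else "public")
```

A private address is only reachable from inside the cluster network, so whether the tunnel can work depends on where the ssh command is run from.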

lee212 commented 3 years ago

Hi Andre, I am in the ssh terminal and I can confirm it is traverse:

[hyungrol@traverse configs]$ ifconfig
em1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.33.24.12  netmask 255.255.252.0  broadcast 10.33.27.255
        inet6 fe80::a94:efff:fe80:6f0a  prefixlen 64  scopeid 0x20<link>
        ether 08:94:ef:80:6f:0a  txqueuelen 1000  (Ethernet)
        RX packets 79083350  bytes 20750833221 (19.3 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 38994474  bytes 4270363985 (3.9 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 80
...
[hyungrol@traverse configs]$ cat /etc/hosts
# HEADER: This file was autogenerated at 2020-12-17 16:34:25 -0500
# HEADER: by puppet.  While it can still be managed manually, it
# HEADER: is definitely not recommended.
...
10.33.24.12     traverse
...

Can this be a temporal failure?

lsawade commented 3 years ago

See #126, maybe I'm misunderstanding how I am supposed to do this. I concur with Hyungro, the IP selected is local to the cluster and I can ssh to it.

andre-merzky commented 3 years ago

Can this be a temporal failure?

I doubt it, as one second before we were still able to create a tunnel to the MongoDB host, also via Traverse.

I agree that this overlaps with #126 - @mtitov , can you please look into this? Thanks!

lsawade commented 3 years ago

Problem was solved. There were issues setting up the tunnel as mentioned in #126, but the problem was solved by not using a tunnel. Please refer to #126!