perfsonar / mesh-config

Centralized configuration framework for measurement points and GUIs
Apache License 2.0

Meshconfig: "Can't find pScheduler or BWCTL on" using bind options #71

Open igarny opened 7 years ago

igarny commented 7 years ago

Hi Andy,

The meshconfig service reports that it is unable to establish a connection with the remote participant, but I can capture that communication on the wire. As a result meshconfig refuses to create a task, although I am able to perform the measurement successfully by hand.

meshconfig-tasks configuration on the lead participant server (psmp-lhc-bw-01-ams-nl-v4):

bind_address   psmp-lhc-bw-01-ams-nl-v4.geant.net
local_lead_bind_address   psmp-lhc-bw-01-ams-nl-v4.geant.net
<test>
    added_by_mesh   1
    description   LHC IPv4 throughput testing
    <schedule>
        random_start_percentage   10
        type   regular_intervals
        interval   10800
    </schedule>
    <parameters>
        tool   iperf3
        omit_interval   5
        type   bwctl
        duration   30
        force_ipv4   1
        send_only   1
    </parameters>
    target   ps01-nl.geant.net
    target   psmp-lhc-bw-01-gen-ch-v4.geant.net
    target   psmp-lhc-bw-01-lon-uk-v4.geant.net
    target   psmp-lhc-bw-01-fra-de-v4.geant.net
    target   psmp-lhc-bw-01-par-fr-v4.geant.net
    local_address   psmp-lhc-bw-01-ams-nl-v4.geant.net
    <created_by>
        name   LHC Mesh v1
        agent_type   remote-mesh
        uri   http://prod-psma-gn-01-buc-ro.geant.net/mesh/lhcmesh.json
    </created_by>
</test>

On each meshconfig service restart I receive the following events in the log, one for each of the servers in the meshconfig tasks configuration:

2017/06/02 15:18:15 (23008) WARN> perfsonar_meshconfig_agent:430 main:: - Problem determining which pscheduler to submit test to for deletion, skipping test: 400 BAD REQUEST: Can't find pScheduler or BWCTL on psmp-lhc-bw-01-lon-uk-v4.geant.net

2017/06/02 15:18:15 (23008) WARN> perfsonar_meshconfig_agent:430 main:: - Problem determining which pscheduler to submit test to for deletion, skipping test: 400 BAD REQUEST: Can't find pScheduler or BWCTL on psmp-lhc-bw-01-ams-nl-v4.geant.net
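To see at a glance which hosts trigger this warning, the agent log can be filtered with a few lines of Python. This is only a sketch, assuming the message text shown above; the function name `hosts_with_lookup_failure` is made up for illustration, and the pattern tolerates a line wrap inside the word "pScheduler".

```python
import re

# Matches the host named at the end of the warning. \s* tolerates a wrap
# inside "pScheduler", which can appear when the log is pasted.
WARNING_RE = re.compile(r"Can't find pS\s*cheduler or BWCTL on (\S+)")


def hosts_with_lookup_failure(log_text):
    """Return the distinct hosts named in the warning, in order of first appearance."""
    hosts = []
    for match in WARNING_RE.finditer(log_text):
        host = match.group(1)
        if host not in hosts:
            hosts.append(host)
    return hosts
```

Feeding it the contents of the meshconfig agent log after a restart gives the list of hosts for which the lookup failed.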

I cannot see why a test would need to be submitted for psmp-lhc-bw-01-ams-nl-v4.geant.net itself, but in any case its pScheduler API responds:

psmp-lhc-bw-01-ams-nl-v4 > curl -k https://psmp-lhc-bw-01-ams-nl-v4.geant.net/pscheduler
"This is the pScheduler API server on psmp-lhc-bw-01-ams-nl-v4.geant.net (psmp-lhc-mgmt-01-ams-nl.geant.net)."

I have checked the traffic on the "bind" interface, and each meshconfig service restart produces the following exchange (DNS resolves correctly):

15:15:02.608637 IP 62.40.126.163.48322 > 62.40.126.193.443: Flags [S], seq 607792981, win 17920, options [mss 8960,sackOK,TS val 3027837875 ecr 0,nop,wscale 13], length 0
15:15:02.616334 IP 62.40.126.193.443 > 62.40.126.163.48322: Flags [S.], seq 1971223091, ack 607792982, win 17896, options [mss 8960,sackOK,TS val 1063649455 ecr 3027837875,nop,wscale 13], length 0
15:15:02.616383 IP 62.40.126.163.48322 > 62.40.126.193.443: Flags [.], ack 1, win 3, options [nop,nop,TS val 3027837883 ecr 1063649455], length 0
15:15:02.616489 IP 62.40.126.163.48322 > 62.40.126.193.443: Flags [P.], seq 1:248, ack 1, win 3, options [nop,nop,TS val 3027837883 ecr 1063649455], length 247
15:15:02.624271 IP 62.40.126.193.443 > 62.40.126.163.48322: Flags [.], ack 248, win 3, options [nop,nop,TS val 1063649463 ecr 3027837883], length 0
15:15:02.629109 IP 62.40.126.193.443 > 62.40.126.163.48322: Flags [P.], seq 1:1524, ack 248, win 3, options [nop,nop,TS val 1063649468 ecr 3027837883], length 1523
15:15:02.629153 IP 62.40.126.163.48322 > 62.40.126.193.443: Flags [.], ack 1524, win 3, options [nop,nop,TS val 3027837895 ecr 1063649468], length 0
15:15:02.632526 IP 62.40.126.163.48322 > 62.40.126.193.443: Flags [P.], seq 248:374, ack 1524, win 3, options [nop,nop,TS val 3027837899 ecr 1063649468], length 126
15:15:02.641075 IP 62.40.126.193.443 > 62.40.126.163.48322: Flags [P.], seq 1524:1782, ack 374, win 3, options [nop,nop,TS val 1063649480 ecr 3027837899], length 258
15:15:02.641560 IP 62.40.126.163.48322 > 62.40.126.193.443: Flags [P.], seq 374:592, ack 1782, win 3, options [nop,nop,TS val 3027837908 ecr 1063649480], length 218
15:15:02.650571 IP 62.40.126.193.443 > 62.40.126.163.48322: Flags [P.], seq 1782:2036, ack 592, win 3, options [nop,nop,TS val 1063649490 ecr 3027837908], length 254
15:15:02.650610 IP 62.40.126.193.443 > 62.40.126.163.48322: Flags [P.], seq 2036:2067, ack 592, win 3, options [nop,nop,TS val 1063649490 ecr 3027837908], length 31
15:15:02.650616 IP 62.40.126.193.443 > 62.40.126.163.48322: Flags [F.], seq 2067, ack 592, win 3, options [nop,nop,TS val 1063649490 ecr 3027837908], length 0
15:15:02.651472 IP 62.40.126.163.48322 > 62.40.126.193.443: Flags [R.], seq 592, ack 2068, win 4, options [nop,nop,TS val 3027837918 ecr 1063649490], length 0
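For readers who do not want to walk through the capture by eye, the flags can be summarised programmatically. The sketch below assumes tcpdump's default text output as pasted above; the function names are illustrative only. On this capture it shows a complete three-way handshake, payload in both directions, then a FIN from 62.40.126.193 answered by a RST from 62.40.126.163.

```python
import re

# Source, destination, TCP flags and payload length from one tcpdump text line.
PACKET_RE = re.compile(r"IP (\S+) > (\S+): Flags \[([A-Z.]+)\].*length (\d+)")


def summarize(capture_text):
    """Return a list of (source, flags, payload_length) tuples, one per packet."""
    packets = []
    for line in capture_text.splitlines():
        match = PACKET_RE.search(line)
        if match:
            src, _dst, flags, length = match.groups()
            packets.append((src, flags, int(length)))
    return packets


def handshake_completed(packets):
    """True if the first three packets are SYN, SYN-ACK, ACK."""
    return [flags for _src, flags, _len in packets[:3]] == ["S", "S.", "."]


def closing_packets(packets):
    """Return (source, flags) for every packet carrying FIN or RST."""
    return [(src, flags) for src, flags, _len in packets if "F" in flags or "R" in flags]
```

A completed handshake plus application data means the connection to port 443 itself is fine, so the "Can't find pScheduler" error is not a plain reachability failure on this path.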

I am able to test the pScheduler connection manually:

psmp-lhc-mgmt-01-ams-nl ~]$ curl --interface 62.40.126.163 -k https://psmp-lhc-bw-01-lon-uk-v4.geant.net/pscheduler
"This is the pScheduler API server on psmp-lhc-bw-01-lon-uk-v4.geant.net (psmp-lhc-mgmt-01-lon-uk.geant.net)."

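The same check can be scripted, which helps when comparing a bound request against an unbound one. This is a minimal sketch using only the Python standard library, not what meshconfig does internally: `source_address` plays the role of `curl --interface`, and certificate verification is switched off to mirror `curl -k`. The function names are illustrative.

```python
import http.client
import ssl


def make_bound_connection(host, source_ip, timeout=10):
    """Build (without opening) an HTTPS connection that will originate from source_ip."""
    context = ssl.create_default_context()
    # Mirror `curl -k`: skip certificate checks. check_hostname must be
    # disabled before verify_mode can be set to CERT_NONE.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return http.client.HTTPSConnection(
        host, 443, timeout=timeout,
        source_address=(source_ip, 0),  # port 0 lets the OS pick the source port
        context=context,
    )


def fetch_pscheduler_banner(host, source_ip):
    """GET /pscheduler from host using source_ip; return (status, body)."""
    conn = make_bound_connection(host, source_ip)
    try:
        conn.request("GET", "/pscheduler")
        response = conn.getresponse()
        return response.status, response.read().decode("utf-8", "replace")
    finally:
        conn.close()
```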
I am able to run the measurement through pScheduler manually:

psmp-lhc-mgmt-01-ams-nl ~]$ pscheduler task --debug --lead-bind psmp-lhc-bw-01-ams-nl-v4.geant.net throughput --omit PT5S --duration PT30S --source psmp-lhc-bw-01-ams-nl-v4.geant.net --ip-version 4 --dest psmp-lhc-bw-01-lon-uk-v4.geant.net --parallel 1
2017-06-02T13:50:38 Debug signal ignored; already not debugging
2017-06-02T13:50:38 Debug discontinued
2017-06-02T13:50:38 Assistance is from localhost
2017-06-02T13:50:38 Forcing default slip of PT5M
2017-06-02T13:50:38 Converting to spec via https://localhost/pscheduler/tests/throughput/spec
Submitting task...
2017-06-02T13:50:38 Fetching participant list
2017-06-02T13:50:38 Spec is: {"source": "psmp-lhc-bw-01-ams-nl-v4.geant.net", "ip-version": 4, "dest": "psmp-lhc-bw-01-lon-uk-v4.geant.net", "duration": "PT30S", "omit": "PT5S", "parallel": 1, "schema": 1}
2017-06-02T13:50:39 Got participants: {u'participants': [u'psmp-lhc-bw-01-ams-nl-v4.geant.net', u'psmp-lhc-bw-01-lon-uk-v4.geant.net']}
2017-06-02T13:50:39 Lead is psmp-lhc-bw-01-ams-nl-v4.geant.net
2017-06-02T13:50:39 Pinging https://psmp-lhc-bw-01-ams-nl-v4.geant.net/pscheduler/
2017-06-02T13:50:39 psmp-lhc-bw-01-ams-nl-v4.geant.net is up
2017-06-02T13:50:39 Posting task to https://psmp-lhc-bw-01-ams-nl-v4.geant.net/pscheduler/tasks
2017-06-02T13:50:39 Data is {"test": {"type": "throughput", "spec": {"source": "psmp-lhc-bw-01-ams-nl-v4.geant.net", "ip-version": 4, "dest": "psmp-lhc-bw-01-lon-uk-v4.geant.net", "duration": "PT30S", "omit": "PT5S", "parallel": 1, "schema": 1}}, "schedule": {"slip": "PT5M"}, "lead-bind": "psmp-lhc-bw-01-ams-nl-v4.geant.net", "schema": 1}
Task URL:
https://psmp-lhc-bw-01-ams-nl-v4.geant.net/pscheduler/tasks/9d97ef86-4d17-410a-b306-38309dd21f75
2017-06-02T13:50:46 Posted https://psmp-lhc-bw-01-ams-nl-v4.geant.net/pscheduler/tasks/9d97ef86-4d17-410a-b306-38309dd21f75
Running with tool 'iperf3'
Fetching first run...
2017-06-02T13:50:46 Fetching https://psmp-lhc-bw-01-ams-nl-v4.geant.net/pscheduler/tasks/9d97ef86-4d17-410a-b306-38309dd21f75/runs/first
2017-06-02T13:50:48 Handing off: pscheduler watch --format text/plain --debug https://psmp-lhc-bw-01-ams-nl-v4.geant.net/pscheduler/tasks/9d97ef86-4d17-410a-b306-38309dd21f75
2017-06-02T13:50:49 Debug signal ignored; already not debugging
2017-06-02T13:50:49 Debug discontinued
2017-06-02T13:50:49 Fetching https://psmp-lhc-bw-01-ams-nl-v4.geant.net/pscheduler/tasks/9d97ef86-4d17-410a-b306-38309dd21f75

Next scheduled run:
https://psmp-lhc-bw-01-ams-nl-v4.geant.net/pscheduler/tasks/9d97ef86-4d17-410a-b306-38309dd21f75/runs/91d0405b-ed08-4215-906e-687da95afaa2
Starts 2017-06-02T13:50:56Z (~6 seconds)
Ends   2017-06-02T13:51:40Z (~43 seconds)
Waiting for result...

* Stream ID 4
Interval       Throughput     Retransmits    Current Window
0.0 - 1.0      7.84 Gbps      0              12.38 MBytes    (omitted)
1.0 - 2.0      9.90 Gbps      0              12.38 MBytes    (omitted)
2.0 - 3.0      9.90 Gbps      0              12.38 MBytes    (omitted)
3.0 - 4.0      9.90 Gbps      0              12.38 MBytes    (omitted)
4.0 - 5.0      9.90 Gbps      0              12.38 MBytes    (omitted)
0.0 - 1.0      9.90 Gbps      0              12.38 MBytes
1.0 - 2.0      9.90 Gbps      0              12.38 MBytes
2.0 - 3.0      9.90 Gbps      0              12.38 MBytes
3.0 - 4.0      9.90 Gbps      0              12.38 MBytes
4.0 - 5.0      9.90 Gbps      0              12.38 MBytes
5.0 - 6.0      9.90 Gbps      0              12.38 MBytes
6.0 - 7.0      9.90 Gbps      0              12.38 MBytes
7.0 - 8.0      9.90 Gbps      0              12.38 MBytes
8.0 - 9.0      9.90 Gbps      0              12.38 MBytes
9.0 - 10.0     9.90 Gbps      0              12.38 MBytes
10.0 - 11.0    9.90 Gbps      0              12.38 MBytes
11.0 - 12.0    9.89 Gbps      0              12.62 MBytes
12.0 - 13.0    9.90 Gbps      0              12.62 MBytes
13.0 - 14.0    9.90 Gbps      0              12.62 MBytes
14.0 - 15.0    9.91 Gbps      0              12.62 MBytes
15.0 - 16.0    9.90 Gbps      0              12.62 MBytes
16.0 - 17.0    9.90 Gbps      0              12.63 MBytes
17.0 - 18.0    9.90 Gbps      0              12.63 MBytes
18.0 - 19.0    9.90 Gbps      0              12.63 MBytes
19.0 - 20.0    9.90 Gbps      0              12.63 MBytes
20.0 - 21.0    9.90 Gbps      0              12.63 MBytes
21.0 - 22.0    9.90 Gbps      0              12.63 MBytes
22.0 - 23.0    9.90 Gbps      0              12.67 MBytes
23.0 - 24.0    9.90 Gbps      0              12.67 MBytes
24.0 - 25.0    9.90 Gbps      0              12.68 MBytes
25.0 - 26.0    9.90 Gbps      0              12.88 MBytes
26.0 - 27.0    9.90 Gbps      0              12.88 MBytes
27.0 - 28.0    9.91 Gbps      0              12.88 MBytes
28.0 - 29.0    9.90 Gbps      0              12.88 MBytes
29.0 - 30.0    9.90 Gbps      0              12.88 MBytes

Summary
Interval       Throughput     Retransmits
0.0 - 30.0     9.90 Gbps      0

No further runs scheduled.

igarny commented 7 years ago

This is a strange case. The meshconfig service appears to operate correctly while running as a background service, but not on service restarts. I have also noticed that on service restarts meshconfig uses the default gateway/management interface, whereas as a background daemon it uses the -tasks.conf file and the specified bind options. So my measurements now work, not from the service restart itself, but from the subsequent refreshes.
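One way to confirm that an unbound request would leave via the management interface is to ask the kernel which source address it picks by default for a given destination and compare it with the configured bind address. The sketch below uses the common connected-UDP-socket trick, which sends no packets; it is IPv4 only, and the function names are illustrative rather than part of meshconfig.

```python
import socket


def default_source_address(dest_host, dest_port=443):
    """Return the local IPv4 address the kernel would use for traffic to dest_host."""
    dest_ip = socket.gethostbyname(dest_host)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # connect() on a UDP socket only selects a route and a source
        # address; nothing is transmitted.
        sock.connect((dest_ip, dest_port))
        return sock.getsockname()[0]


def default_route_matches_bind(dest_host, bind_ip):
    """True if unbound traffic to dest_host would already leave from bind_ip."""
    return default_source_address(dest_host) == bind_ip
```

If `default_route_matches_bind` returns False for a target, any request made without the bind options (as appears to happen on restart) originates from a different address than the one set in `bind_address`.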