ledatelescope / bifrost

A stream processing framework for high-throughput applications.
BSD 3-Clause "New" or "Revised" License
66 stars 29 forks

Takes a long time to destroy UDPTransmit Instances, locking up pipeline #191

Closed jack-h closed 1 year ago

jack-h commented 1 year ago

What I wanted to do: Be able to dynamically change the destination of UDP output products from a bifrost pipeline.

What I implemented: There seems to be no way to change the destination after a UDPTransmit instance has been created, so I destroy instances and create new ones as commanded by an asynchronous control mechanism (asynchronous in the sense that it is driven by callback functions from an etcd client watching for key changes).
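That control pattern can be sketched with the standard library alone. Everything here is a placeholder: `Sender` stands in for bifrost's UDPTransmit, and `on_key_change` stands in for the etcd watch callback; neither is a real bifrost or etcd API.

```python
import queue

class Sender:
    """Hypothetical stand-in for a UDPTransmit bound to one destination."""
    def __init__(self, addr):
        self.addr = addr

current = {"sender": Sender(("10.0.0.1", 10000))}
updates = queue.Queue()

def on_key_change(new_addr):
    # In the real system this is driven by an etcd watch callback.
    updates.put(new_addr)

def apply_updates():
    # Called from the pipeline side: destroy the old instance, then
    # build a replacement pointing at the new destination.
    while not updates.empty():
        addr = updates.get()
        current["sender"] = None
        current["sender"] = Sender(addr)

on_key_change(("10.0.0.2", 20000))
apply_updates()
print(current["sender"].addr)
```

The slow step in practice is the `current["sender"] = None` line, which is where the old instance actually gets torn down.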

What I expected to happen: Creating new UDPTransmits would be quick, with no adverse effects beyond some overhead in the bifrost output block responsible for using these instances.

What happened: On creating and destroying 16 UDPTransmit instances (the system outputs 16 beam streams to different destinations) I experienced catastrophic packet loss from the UDP capture blocks at the start of the pipeline. This seemed to be the case even when the block within which the UDPTransmit instances were being used was not attached to the upstream pipeline buffer rings. I.e. [my interpretation of] the issue is that manipulating the UDPTransmits is locking up bifrost, not simply that the bifrost block using them is now too slow.

Trying to recreate the issue:

import sys
import time
import numpy as np

from bifrost.packet_writer import UDPTransmit
from bifrost.udp_socket import UDPSocket
from bifrost.address import Address

# a class setup not wildly different from the bifrost block I'm using
class Foo():
    def __init__(self, n):
        self.n = n
        self.ss = [UDPSocket() for _ in range(self.n)]
        self.udts = [None for _ in range(self.n)]
    def update_udts(self):
        dts = np.zeros(self.n)
        ttot0 = time.time()
        for i in range(self.n):
            self.udts[i] = None # Don't time this call within the loop
            t0 = time.time()
            self.ss[i] = UDPSocket()
            self.ss[i].connect(Address('0.0.0.0', 10000))
            self.udts[i] = UDPTransmit('pbeam1_96', sock=self.ss[i], core=4)
            t1 = time.time()
            dts[i] = t1 - t0
        ttot1 = time.time()
        return ttot1 - ttot0, dts

n = int(sys.argv[1])
nrun = 2
print("Instantiating class with %d UDT instances" % n)
f = Foo(n)

for run in range(nrun):
    print("Run %d\n" % run)
    tot, dts = f.update_udts()
    print("Total time: %f" % tot)
    print("Total time instantiating UDTs: %f" % dts.sum())
    for i in range(f.n):
        print("Time for UDT %d: %f" % (i, dts[i]))
    print('\n')
    time.sleep(2)

Running this with python my_udt_test.py 16 yields (emphasis added by me):

(casa) ubuntu@lxdlwagpu01:~/jh_tmp$ python my_udt_test.py 16
Instantiating class with 16 UDT instances
Run 0

Total time: 0.643402
Total time instantiating UDTs: 0.643228 <------- Instantiating UDPTransmits is where all the time is spent
Time for UDT 0: 0.139594
Time for UDT 1: 0.033591
Time for UDT 2: 0.033564
Time for UDT 3: 0.033702
Time for UDT 4: 0.033505
Time for UDT 5: 0.034062
Time for UDT 6: 0.034107
Time for UDT 7: 0.033277
Time for UDT 8: 0.033673
Time for UDT 9: 0.033496
Time for UDT 10: 0.033374
Time for UDT 11: 0.033660
Time for UDT 12: 0.033546
Time for UDT 13: 0.033156
Time for UDT 14: 0.033490
Time for UDT 15: 0.033430

Run 1

Total time: 1.411572
Total time instantiating UDTs: 0.128019 <----------- The total time above isn't dominated by instantiating UDTs, it's dominated by discarding existing ones
Time for UDT 0: 0.025518
Time for UDT 1: 0.007234
Time for UDT 2: 0.006501
Time for UDT 3: 0.006519
Time for UDT 4: 0.006783
Time for UDT 5: 0.006414
Time for UDT 6: 0.006738
Time for UDT 7: 0.006897
Time for UDT 8: 0.006784
Time for UDT 9: 0.008422
Time for UDT 10: 0.006521
Time for UDT 11: 0.006826
Time for UDT 12: 0.006597
Time for UDT 13: 0.006775
Time for UDT 14: 0.006719
Time for UDT 15: 0.006769

In my loaded pipeline, I see calls to replace one UDPTransmit with another take 100-200ms; replacing 16 takes in excess of 2 seconds. So far I've (kind of) worked around the issue by accepting that bifrost will take a hit with each UDPTransmit change, and staggering the changes with sleeps so as to allow the pipeline to recover. This mostly works, but it would be nice if these operations were inherently less disruptive to the rest of the processing. (Or this could all be side-stepped by allowing an existing UDPTransmit to be re-associated with a new socket / different address after creation.)
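The staggered-replacement workaround boils down to something like the following (a pure-Python sketch; `make_sender` is a placeholder for constructing a real UDPTransmit, not a bifrost API):

```python
import time

def staggered_replace(n, make_sender, gap_s=0.5):
    # Swap in senders one at a time, sleeping between swaps so the
    # upstream capture blocks have time to recover from each hit.
    senders = []
    for i in range(n):
        senders.append(make_sender(i))
        time.sleep(gap_s)
    return senders

# Placeholder factory standing in for UDPTransmit construction.
senders = staggered_replace(3, lambda i: "sender-%d" % i, gap_s=0.01)
print(senders)
```

With 16 instances and a gap long enough for the pipeline to recover, the full changeover stretches out over several seconds, which is why a non-blocking retarget of an existing instance would be preferable.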

The possibility obviously remains that I'm doing something dumb somewhere. I am using an LWA-OVRO fork of bifrost -- https://github.com/realtimeradio/bifrost/tree/474d87016b9a7a3d17f5dd83bf8e3b976c987f4a

jaycedowell commented 1 year ago

I just tried this on ibverb-support on my machine and I see:

Instantiating class with 16 UDT instances
Run 0

Total time: 0.267902
Total time instantiating UDTs: 0.267809
Time for UDT 0: 0.129802
Time for UDT 1: 0.009711
Time for UDT 2: 0.009018
Time for UDT 3: 0.009418
Time for UDT 4: 0.009021
Time for UDT 5: 0.009435
Time for UDT 6: 0.008696
Time for UDT 7: 0.009337
Time for UDT 8: 0.009065
Time for UDT 9: 0.009408
Time for UDT 10: 0.008790
Time for UDT 11: 0.009536
Time for UDT 12: 0.008848
Time for UDT 13: 0.009591
Time for UDT 14: 0.008955
Time for UDT 15: 0.009178

Run 1

Total time: 0.079599
Total time instantiating UDTs: 0.079002
Time for UDT 0: 0.024792
Time for UDT 1: 0.003384
Time for UDT 2: 0.003409
Time for UDT 3: 0.003973
Time for UDT 4: 0.003538
Time for UDT 5: 0.003499
Time for UDT 6: 0.003788
Time for UDT 7: 0.003637
Time for UDT 8: 0.003673
Time for UDT 9: 0.003894
Time for UDT 10: 0.003575
Time for UDT 11: 0.003361
Time for UDT 12: 0.003744
Time for UDT 13: 0.003481
Time for UDT 14: 0.003546
Time for UDT 15: 0.003709

If I try your fork I get:

Instantiating class with 16 UDT instances
Run 0

Total time: 0.186030
Total time instantiating UDTs: 0.185925
Time for UDT 0: 0.051361
Time for UDT 1: 0.009504
Time for UDT 2: 0.008832
Time for UDT 3: 0.009131
Time for UDT 4: 0.008820
Time for UDT 5: 0.009109
Time for UDT 6: 0.009128
Time for UDT 7: 0.009078
Time for UDT 8: 0.008773
Time for UDT 9: 0.009378
Time for UDT 10: 0.008594
Time for UDT 11: 0.009174
Time for UDT 12: 0.008662
Time for UDT 13: 0.009045
Time for UDT 14: 0.008568
Time for UDT 15: 0.008769

Run 1

Total time: 0.377750
Total time instantiating UDTs: 0.095538
Time for UDT 0: 0.025517
Time for UDT 1: 0.004659
Time for UDT 2: 0.004635
Time for UDT 3: 0.004750
Time for UDT 4: 0.004712
Time for UDT 5: 0.004597
Time for UDT 6: 0.004830
Time for UDT 7: 0.004692
Time for UDT 8: 0.004628
Time for UDT 9: 0.004654
Time for UDT 10: 0.004840
Time for UDT 11: 0.004716
Time for UDT 12: 0.004529
Time for UDT 13: 0.004838
Time for UDT 14: 0.004477
Time for UDT 15: 0.004464

which is similar to what you are seeing.

If I distill this down into something easier to digest:

                Reported      ibverb-support  rtr fork
Run 0 - Total   0.643402 s    0.267902 s      0.186030 s
Run 0 - Init    0.643228 s    0.267809 s      0.185925 s
Total/Init      1             1               1
Run 1 - Total   1.411572 s    0.079599 s      0.377750 s
Run 1 - Init    0.128019 s    0.079002 s      0.095538 s
Total/Init      11            1               4
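The Total/Init ratios in the last row follow directly from the Run 1 timings above; a ratio near 1 means instantiation dominates the run, while a large ratio means most of the time goes to tearing down the old instances:

```python
# (total, init) times in seconds for Run 1, taken from the table above.
run1 = {
    "Reported":       (1.411572, 0.128019),
    "ibverb-support": (0.079599, 0.079002),
    "rtr fork":       (0.377750, 0.095538),
}
for name, (total, init) in run1.items():
    print("%s: Total/Init = %d" % (name, round(total / init)))
# Reported: Total/Init = 11
# ibverb-support: Total/Init = 1
# rtr fork: Total/Init = 4
```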

That makes me think that we fixed this problem (whatever its root really was) in ibverb-support. I would suggest updating your fork, but then you would have to contend with the new build system. It might be easier to cherry-pick changes from src and see where that gets you.

jack-h commented 1 year ago

Oh, excellent. I'll have a poke around!


jack-h commented 1 year ago

I see the same improvement on my machine when switching branches, so I'll attempt to either merge or cherry-pick as you suggest.

As an aside, with the ibverb-support branch the above script doesn't exit on my office machine. It hangs, and then Ctrl-C results in

^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/jackh/py3-venv/lib/python3.6/site-packages/bifrost/telemetry/__init__.py", line 158, in send
    timeout=TELEMETRY_TIMEOUT)
  File "/usr/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.6/urllib/request.py", line 526, in open
    response = self._open(req, data)
  File "/usr/lib/python3.6/urllib/request.py", line 544, in _open
    '_open', req)
  File "/usr/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.6/urllib/request.py", line 1368, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "/usr/lib/python3.6/urllib/request.py", line 1325, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "/usr/lib/python3.6/http/client.py", line 1285, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1331, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1280, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1046, in _send_output
    self.send(msg)
  File "/usr/lib/python3.6/http/client.py", line 984, in send
    self.connect()
  File "/usr/lib/python3.6/http/client.py", line 1438, in connect
    super().connect()
  File "/usr/lib/python3.6/http/client.py", line 956, in connect
    (self.host,self.port), self.timeout, self.source_address)
  File "/usr/lib/python3.6/socket.py", line 713, in create_connection
    sock.connect(sa)

Not sure what exactly the telemetry is trying to do, but my machine has working internet.

jaycedowell commented 1 year ago

The telemetry endpoint is down. It should have exited on its own after a couple of minutes of trying.

jack-h commented 1 year ago

Ah OK. I gave it a reasonable number of seconds, but not a reasonable number of minutes.