ledatelescope / bifrost

A stream processing framework for high-throughput applications.
BSD 3-Clause "New" or "Revised" License
64 stars 29 forks source link

Unified packet capture interface, take 2 #166

Open jaycedowell opened 2 years ago

jaycedowell commented 2 years ago

This PR provides a unified packet capture interface that makes it easier to add new packet formats to Bifrost and to read packet formats from the network (unicast or multicast) or from disk. This also includes a generalization of the packet writer interface. This PR changes the Bifrost API in a couple of ways:

bifrost.udp_capture.UDPCapture is now bifrost.packet_capture.UDPCapture there is a new PacketCaptureCallback class that wraps the callbacks bifrost.udp_transmit is now bifrost.packet_writer

This should also address the segfault I was running into in #137.

codecov-commenter commented 2 years ago

Codecov Report

Attention: Patch coverage is 67.88618% with 395 lines in your changes are missing coverage. Please review.

Project coverage is 67.77%. Comparing base (4fadfa5) to head (0c8f19b). Report is 849 commits behind head on master.

:exclamation: Current head 0c8f19b differs from pull request most recent head a360f0b. Consider uploading reports for the commit a360f0b to get more accurate results

Files Patch % Lines
python/bifrost/rdma.py 0.00% 122 Missing :warning:
python/bifrost/telemetry/__init__.py 34.11% 112 Missing :warning:
python/bifrost/version/__main__.py 0.00% 28 Missing :warning:
python/bifrost/telemetry/__main__.py 0.00% 15 Missing :warning:
python/bifrost/portaudio.py 20.00% 12 Missing :warning:
python/bifrost/psrdada.py 0.00% 12 Missing :warning:
python/bifrost/libbifrost.py 67.64% 11 Missing :warning:
python/bifrost/pipeline.py 82.45% 10 Missing :warning:
python/bifrost/DataType.py 76.31% 9 Missing :warning:
python/bifrost/sigproc2.py 70.96% 9 Missing :warning:
... and 27 more
Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #166 +/- ## ========================================== + Coverage 61.62% 67.77% +6.15% ========================================== Files 64 66 +2 Lines 5300 5741 +441 ========================================== + Hits 3266 3891 +625 + Misses 2034 1850 -184 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

jaycedowell commented 2 years ago

Working on the mac support reminded me of something. When I first started working on this new interface I had a "sniff" mode which used linux's raw packet interface with a promiscuous NIC to capture packets destined for a different host. I originally did this because that is how we were running one of our data modes at LWA1. I've since moved on to multicast and I now wonder if "sniff" is still a feature worth having.

jaycedowell commented 2 years ago

There is only one failure in test_udp_io for MacOS now but there are intermittent failures for Ubuntu. That's strange. I'm not sure if this is really better or not.

jaycedowell commented 2 years ago

MacOS is better now but there still seems to be intermittent problems. They sometimes happen when I test locally but it isn't clear why.

league commented 2 years ago

Just a note that I'm able to reproduce this error on my little MacBook Air:

FAIL: test_read_drx_single (test_udp_io.UDPIOTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/league/p/bifrost/test/test_udp_io.py", line 408, in test_read_drx_single
    np.testing.assert_equal(final[i,...], data[i,...])
  File "/nix/store/ycp9cnz64wf6xm0f5i1giwdacj6v89n5-python3.8-numpy-1.21.2/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 345, in assert_equal
    return assert_array_equal(actual, desired, err_msg, verbose)
  File "/nix/store/ycp9cnz64wf6xm0f5i1giwdacj6v89n5-python3.8-numpy-1.21.2/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 934, in assert_array_equal
    assert_array_compare(operator.__eq__, x, y, err_msg=err_msg,
  File "/nix/store/ycp9cnz64wf6xm0f5i1giwdacj6v89n5-python3.8-numpy-1.21.2/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 844, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Arrays are not equal

Mismatched elements: 3871 / 8192 (47.3%)
 x: ndarray([[( 65,), (-46,), (-18,), ..., (-49,), ( 63,), ( 18,)],
         [( 65,), (-47,), ( -2,), ..., (-34,), ( 79,), (  2,)]],
        dtype=[('re_im', 'i1')])
 y: ndarray([[( 64,), (-30,), (-34,), ..., (-49,), ( 62,), ( 34,)],
         [( 49,), (-63,), ( -2,), ..., (-34,), ( 79,), (  2,)]],
        dtype=[('re_im', 'i1')])

Is that the only Mac error you're seeing on this branch now?

jaycedowell commented 1 year ago

That's interesting. I just pulled down the recent changes and gave it a shot on my laptop again. The first time I ended up with no failures but this warning:

test_read_multicast (test_udp_io.UDPIOTest) ... Exception in thread Thread-6:
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/python@3.9/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/opt/homebrew/Cellar/python@3.9/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/jaycedowell/CodeSafe/bifrost/test/test_udp_io.py", line 179, in main
    for iseq in self.ring.read(guarantee=True):
  File "/Users/jaycedowell/CodeSafe/bifrost/python/bifrost/ring.py", line 116, in read
    with ReadSequence(self, which=whence, guarantee=guarantee) as cur_seq:
  File "/Users/jaycedowell/CodeSafe/bifrost/python/bifrost/ring.py", line 255, in __init__
    _check(_bf.bfRingSequenceOpenEarliest(self.obj, ring.obj, guarantee))
  File "/Users/jaycedowell/CodeSafe/bifrost/python/bifrost/libbifrost.py", line 117, in _check
    raise EndOfDataStop('BF_STATUS_END_OF_DATA')
bifrost.libbifrost.EndOfDataStop: BF_STATUS_END_OF_DATA
ok

Second run did not show that warnings and there we still no failures. Third run failed with:

======================================================================
FAIL: test_read_multicast (test_udp_io.UDPIOTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/jaycedowell/CodeSafe/bifrost/test/test_udp_io.py", line 545, in test_read_multicast
    np.testing.assert_equal(final[i,...], data[i,...])
  File "/opt/homebrew/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 345, in assert_equal
    return assert_array_equal(actual, desired, err_msg, verbose)
  File "/opt/homebrew/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 934, in assert_array_equal
    assert_array_compare(operator.__eq__, x, y, err_msg=err_msg,
  File "/opt/homebrew/lib/python3.9/site-packages/numpy/testing/_private/utils.py", line 844, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Arrays are not equal

Mismatched elements: 15872 / 16384 (96.9%)
 x: ndarray([[( 47, -10), (-23,  27), (-13, -29), ..., (  7,  30),
          (-38, -19), ( 50,  -2)],
         [(  7,   0), (  7,   0), (  7,   0), ..., (  7,   0),...
 y: ndarray([[( 47, -10), (-23,  27), (-13, -29), ..., (  7,  30),
          (-38, -19), ( 50,  -2)],
         [( 50,   4), (-41,  17), ( 10, -29), ..., (-16,  28),...

----------------------------------------------------------------------
Ran 10 tests in 14.530s

FAILED (failures=1)

Fourth run also failed as the same place. Fifth , sixth, and seventh runs were all ok. Eighth failed. I don't really get it.

league commented 1 year ago

Oof. I've seen the EndOfDataStop('BF_STATUS_END_OF_DATA') intermittently too, on Linux, although the test apparently passes. On Mac, it seems that I reliably fail test_read_drx_single. Earlier today, I think I saw that one fail on Linux too, but it was only once. test_read_multicast hasn't failed for me before.

I guess I've been gravitating toward a timing/race condition type of hypothesis. I notice in the send loops in the tests, there's a time.sleep. Is it just acting as a yield to give the other thread(s) a chance to run? But the buffer should be blocking… right? Both on write-when-full or read-when-empty? Or are we non-blocking and assuming the size of the buffer and thread fairness will make it mostly okay. I was going to play with the amounts of data (and/or size of buffer) to see if I can get it to fail more or less often.

jaycedowell commented 1 year ago

I usually use a time.sleep in these contexts to get a more realistic packet/data rate coming out but these look like they might be being used as a crude rate limiter.

As for blocking/non-blocking the only thing that should block would be the ring that connects something like DRXReader with AccumulateOp. Each packet sending call should be returning as soon as the packets are handed off to the kernel. Your race condition idea might be on track. If you look at my last failure the first few elements in the comparison seem to be ok and then the end is all (7, 0).

jaycedowell commented 1 year ago

~FWIW I tried to increase the sleeps in the test suite to 10 ms but that didn't change anything for me. test_read_multicast still fails.~

Nevermind, I seem to have only changes a few of the sleep to 10 ms.

jaycedowell commented 1 year ago

Going to 5 ms for sleep and changing the multicast address to 224.0.0.251 might have done something. I've had eight successful tests and no failures.

jaycedowell commented 1 year ago

Bummer. Interesting that the CI fails on a test that I haven't had a problem with locally.

league commented 1 year ago

Yeah I was just wondering if a CI environment is more or less consistent/reliable when it comes to timing issues. Or neither, but just different. FWIW, I just pulled in the 5ms change, ran it on Linux, and still got EndOfDataStop on test_read_multicast – is this essentially a ring-buffer underflow happening?

test_read_drx (test_udp_io.UDPIOTest) ... ok
test_read_drx_single (test_udp_io.UDPIOTest) ... ok
test_read_multicast (test_udp_io.UDPIOTest) ... Exception in thread Thread-6:
Traceback (most recent call last):
  File "/nix/store/wl02plhc6zf84m6x9984l42wnnnbly5m-python3-3.9.6/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/nix/store/wl02plhc6zf84m6x9984l42wnnnbly5m-python3-3.9.6/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/home/league/p/bifrost/test/test_udp_io.py", line 179, in main
    for iseq in self.ring.read(guarantee=True):
  File "/nix/store/zk2rimkn7wdj51dsncsq78f7ynbwbz39-python3-3.9.6-env/lib/python3.9/site-packages/bifrost/ring.py", line 116, in read
    with ReadSequence(self, which=whence, guarantee=guarantee) as cur_seq:
  File "/nix/store/zk2rimkn7wdj51dsncsq78f7ynbwbz39-python3-3.9.6-env/lib/python3.9/site-packages/bifrost/ring.py", line 255, in __init__
    _check(_bf.bfRingSequenceOpenEarliest(self.obj, ring.obj, guarantee))
  File "/nix/store/zk2rimkn7wdj51dsncsq78f7ynbwbz39-python3-3.9.6-env/lib/python3.9/site-packages/bifrost/libbifrost.py", line 117, in _check
    raise EndOfDataStop('BF_STATUS_END_OF_DATA')
bifrost.libbifrost.EndOfDataStop: BF_STATUS_END_OF_DATA
ok
test_read_pbeam (test_udp_io.UDPIOTest) ... ok
test_read_tbn (test_udp_io.UDPIOTest) ... ok
test_write_drx (test_udp_io.UDPIOTest) ... ok
test_write_drx_single (test_udp_io.UDPIOTest) ... ok
test_write_multicast (test_udp_io.UDPIOTest) ... ok
test_write_pbeam (test_udp_io.UDPIOTest) ... ok
test_write_tbn (test_udp_io.UDPIOTest) ... ok

----------------------------------------------------------------------
Ran 10 tests in 32.492s
jaycedowell commented 1 year ago

If I remember correctly EndOfDataStop is raised when you've reached the end of a sequence and there isn't a new sequence to read from.

Another thing interesting about the CI failures is that it looks like the data at the end of the array are ok. We're already skipping the first frame worth of data in the comparison. Would skipping the first two do anything? Is this mis-match along a tuning boundary instead?

jaycedowell commented 1 year ago

So much for that idea. It's really hard to test this when it now only shows up on CI for me.

Also, maybe we should do something about all those numpy deprecation warnings someday.

jaycedowell commented 1 year ago

What?

league commented 1 year ago

Yeah, I'm failing to understand something. I think this mod is related to what your expansion did:

        ## Reduce to match the capture block size
        print(final.shape, data.shape) # (147, 2, 4096) (128, 4, 4096)
        data = data[:final.shape[0],:2,:]
        final = final[:data.shape[0],:2,:]
        self.assertEqual(final.shape, data.shape) # they are same shape
        print(final.shape) # (128, 2, 4096)
        # Loop passes for each i>0 in first axis
        for i in range(1, data.shape[0]):
            print(i)
            np.testing.assert_equal(final[i,...], data[i,...])
        # But slice of first axis fails
        np.testing.assert_equal(final[1:,...], data[1:,...])

I use 2 slice statements to ensure same shape. Then if I loop 1..127 and assert equal, they all pass.

But a slice fed to assert_equal, which I thought should be effectively the same as the loop, fails: Mismatched elements: 40 / 1040384 (0.00384%)

jaycedowell commented 1 year ago

What if you throw in a couple of copies? Say:

data = data[:final.shape[0],:2,:]

becomes

data = data[:final.shape[0],:2,:].copy()

?

jaycedowell commented 1 year ago

I just got a failure for this more recent test suite on test_read_drx. Maybe it's not fixed.

jaycedowell commented 1 year ago

Added a reference to #196 so that we know to update the documentation.

jaycedowell commented 5 months ago

When I make it through https://github.com/ledatelescope/bifrost/issues/226 I'll convert this to a real PR.