ledatelescope / bifrost

A stream processing framework for high-throughput applications.
BSD 3-Clause "New" or "Revised" License

Filling BFArray via buffer argument does not consistently work #231

Closed: dentalfloss1 closed this issue 5 months ago

dentalfloss1 commented 7 months ago

In the current version of test_udp_io.py in the ibverb-support branch, we fill bfarrays like so:

final = bf.ndarray(shape=(final.shape[0],4,4096), dtype='ci4', buffer=final.ctypes.data)

This method produces a bfarray that occasionally does not contain the correct data. It's possible this is platform dependent, as these tests tend not to fail on Mac. The following method seems to work consistently:

final = bf.ndarray(final, dtype='ci4')

Attached is a test in which the first method fails on our local Ubuntu-based machine while the second method passes: redo_test_udp_io.py.txt
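One possibility worth ruling out, offered as an assumption rather than anything confirmed in this thread: `final.ctypes.data` is a plain Python integer, so it carries no reference to the NumPy array that owns the memory. If the name `final` is rebound in the same statement, the original array may be freed while its address is still in use. The NumPy-only sketch below illustrates that lifetime effect on CPython. `address_outlives_owner` is a hypothetical helper and no Bifrost call is involved.

```python
import gc
import weakref

import numpy as np


def address_outlives_owner():
    """Return True if the array is freed while its address is still held."""
    data = np.arange(16, dtype=np.uint8)
    owner = weakref.ref(data)    # lets us observe when the array is freed
    address = data.ctypes.data   # plain int, holds no reference to `data`
    data = address               # rebinding the name drops the last reference
    gc.collect()                 # not needed on CPython, but harmless
    return owner() is None and isinstance(address, int)


if __name__ == "__main__":
    print(address_outlives_owner())
```

If this is a factor, keeping the source array bound to a separate name for as long as the pointer-backed array is in use would be one way to test it.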

jaycedowell commented 6 months ago

@dentalfloss1 One thing I noticed when looking into this was that AccumulateOp is directly saving the ring's contents to final, i.e., https://github.com/ledatelescope/bifrost/blob/ibverb-support/test/test_udp_io.py#L184 That's probably a bad idea since the ring's data could be getting overwritten or destroyed when the pipeline finishes. It would probably be better to save a copy of idata instead. I don't think that this is the root cause of this issue but it could be a contributing factor.
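A minimal NumPy-only sketch of the hazard described above, assuming the ring behaves like a buffer that is reused for every span. The names `accumulate`, `ring`, and `spans` are illustrative and are not taken from the actual AccumulateOp.

```python
import numpy as np


def accumulate(ring, spans, keep_copy):
    """Write each span into the shared `ring` buffer and record what was seen."""
    saved = []
    for span in spans:
        ring[:] = span  # the same memory is reused for every span
        # Saving `ring` itself stores a reference to memory that keeps changing;
        # saving `ring.copy()` stores a snapshot of the current contents.
        saved.append(ring.copy() if keep_copy else ring)
    return saved


spans = [np.full(4, i, dtype=np.int8) for i in range(3)]
by_reference = accumulate(np.zeros(4, dtype=np.int8), spans, keep_copy=False)
by_copy = accumulate(np.zeros(4, dtype=np.int8), spans, keep_copy=True)

# Every entry saved by reference shows only the last span written.
print([a.tolist() for a in by_reference])
# Entries saved by copy preserve each span as it was written.
print([a.tolist() for a in by_copy])
```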

jaycedowell commented 6 months ago

I'm thinking through this some more. For the first part in AccumulateOp we have something like:

import numpy as np
final = []
for i in range(100):
    final.append( np.random.rand(500) )

for f in final:
    print(f.__array_interface__['data'][0])

On my Ubuntu desktop the values printed out increment by 4016 while on my Mac it's 4096. For reference the default Bifrost alignment is 4096.

For the next part in the test suite we do something like:

f = np.array(final)
print(f.__array_interface__['data'][0])

On my desktop I get something that is aligned at 16 while my Mac still goes to an alignment of 4096.

Then we do a transpose (which is probably just a view, with no data moved) and a copy:

g = f.transpose(1,0).copy()
print(g.__array_interface__['data'][0])

This time on my desktop I get something aligned at 32 while the Mac still ends up at 4096.

My guess is that using bf.ndarray(buffer=...) is sensitive to how the provided buffer is aligned. I'm not really sure what the mechanism would be, though. I just don't see something like that in the code.
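To make the alignment checks above repeatable, the address can be reduced to a single number. This is a small sketch using only NumPy. `pointer_alignment` and `is_aligned` are hypothetical helper names, and the 4096 default simply mirrors the Bifrost alignment quoted earlier.

```python
import numpy as np


def pointer_alignment(array):
    """Largest power of two that divides the array's data address."""
    address = array.__array_interface__['data'][0]
    return address & -address  # isolates the lowest set bit


def is_aligned(array, boundary=4096):
    """True if the array's data address is a multiple of `boundary`."""
    return array.__array_interface__['data'][0] % boundary == 0


if __name__ == "__main__":
    f = np.array([np.random.rand(500) for _ in range(100)])
    g = f.transpose(1, 0).copy()
    for name, arr in (("f", f), ("g", g)):
        print(name, pointer_alignment(arr), is_aligned(arr))
```

The printed values will differ between platforms and allocators, which is the behavior under discussion here.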

If you build Bifrost with an alignment of 16 instead of 4096 do these failures disappear?
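A complementary experiment that does not require rebuilding: copy the data into a NumPy buffer whose address is forced onto a 4096-byte boundary before handing its pointer to `bf.ndarray(buffer=...)`. The sketch below covers only the NumPy side. `aligned_copy` is a hypothetical helper, and whether this changes the test outcome is untested here.

```python
import numpy as np


def aligned_copy(array, boundary=4096):
    """Copy `array` into a new buffer whose address is a multiple of `boundary`."""
    array = np.ascontiguousarray(array)
    # Over-allocate so an aligned start address is guaranteed to fit.
    raw = np.empty(array.nbytes + boundary, dtype=np.uint8)
    offset = (-raw.ctypes.data) % boundary
    out = raw[offset:offset + array.nbytes].view(array.dtype).reshape(array.shape)
    out[...] = array
    return out  # `out` keeps `raw` alive through its base reference


if __name__ == "__main__":
    g = aligned_copy(np.random.rand(500, 100))
    print(g.ctypes.data % 4096)
```

The caller still needs to keep the returned array alive for as long as the pointer is in use.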

jaycedowell commented 6 months ago

I tried the test suggested above and it does... something. I still get failures on my desktop but now they are things like a comparison with an array full of zeros. That's believable if you assume some packets are getting dropped for whatever reason.

I'm still not convinced that I'm seeing the whole picture.

jaycedowell commented 6 months ago

And why are all of the self-hosted tests failing now with

checking for valid CUDA architectures... found: 50 52 53 60 61 62 70 72 75 80 86 87 89 90
configure: error: failed to find any
checking which CUDA architectures to target... 
Error: Process completed with exit code 1.

?

Update: This has been resolved.

jaycedowell commented 6 months ago

Maybe this was all a problem with how we were directly saving the ring's contents rather than a copy. After refactoring the disk and UDP I/O tests in ibverb-support the problem seems to have largely gone away. I do occasionally see a test failure but it looks like it's comparing zeros (dropped packets) against real data.
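To separate dropped packets from genuine data mismatches, a comparison could skip frames that arrived as all zeros. The sketch below assumes frames are indexed along the first axis and that a dropped packet shows up as an entirely zero frame. `compare_ignoring_dropped` is a hypothetical helper, not part of the test suite.

```python
import numpy as np


def compare_ignoring_dropped(received, expected):
    """Compare frame by frame, skipping frames that are all zero in `received`.

    Returns (frames_match, number_of_dropped_frames).
    """
    received = np.asarray(received)
    expected = np.asarray(expected)
    # A frame counts as dropped when every element in it is zero.
    dropped = ~received.reshape(received.shape[0], -1).any(axis=1)
    frames_match = np.array_equal(received[~dropped], expected[~dropped])
    return frames_match, int(dropped.sum())


if __name__ == "__main__":
    expected = np.arange(1, 13).reshape(3, 4)
    received = expected.copy()
    received[1] = 0  # simulate one dropped frame
    print(compare_ignoring_dropped(received, expected))
```

One caveat: a frame whose expected contents really are all zeros would also be skipped, so test data for this approach should avoid all-zero frames.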