Closed dentalfloss1 closed 7 months ago
@dentalfloss1 One thing I noticed when looking into this was that AccumulateOp is directly saving the ring's contents to final (see https://github.com/ledatelescope/bifrost/blob/ibverb-support/test/test_udp_io.py#L184 ). That's probably a bad idea since the ring's data could be getting overwritten or destroyed when the pipeline finishes. It would probably be better to save a copy of idata instead. I don't think that this is the root cause of this issue but it could be a contributing factor.
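To illustrate the concern, here is a minimal sketch using plain NumPy only. The name ring_span is a hypothetical stand-in for the memory a ring hands out; it is not the Bifrost ring API. It shows how appending the array itself keeps a reference to memory that is later overwritten, while appending a copy keeps a snapshot:

```python
import numpy as np

# Hypothetical stand-in for a ring span; plain NumPy, not the Bifrost API.
ring_span = np.zeros(4, dtype=np.int32)

saved_views = []
saved_copies = []
for gulp in range(3):
    ring_span[:] = gulp + 1                # the "ring" memory is overwritten each gulp
    saved_views.append(ring_span)          # stores a reference to the same memory
    saved_copies.append(ring_span.copy())  # stores an independent snapshot

# The first saved "view" now reflects the last write; the copy does not.
views_first = saved_views[0].tolist()
copies_first = saved_copies[0].tolist()
print(views_first, copies_first)
```

If idata behaves like ring_span here, appending idata.copy() instead of idata would avoid this kind of aliasing.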
I'm thinking through this some more. For the first part in AccumulateOp we have something like:
import numpy as np
final = []
for i in range(100):
    final.append(np.random.rand(500))
for f in final:
    # Address of the first byte of each array's data buffer
    print(f.__array_interface__['data'][0])
On my Ubuntu desktop the values printed out increment by 4016 while on my Mac it's 4096. For reference the default Bifrost alignment is 4096.
For the next part in the test suite we do something like:
f = np.array(final)
print(f.__array_interface__['data'][0])
On my desktop I get something that is aligned at 16 while my Mac still goes to an alignment of 4096.
Then we do a transpose (which should only return a view without moving any data) and a copy:
g = f.transpose(1,0).copy()
print(g.__array_interface__['data'][0])
This time on my desktop I get something aligned at 32 while the Mac still ends up at 4096.
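For anyone repeating these measurements, a small helper makes the alignment explicit instead of reading it off the raw addresses. This is only a sketch; data_address and largest_pow2_alignment are names made up here, and the 4096 cap simply mirrors the default Bifrost alignment mentioned above:

```python
import numpy as np

def data_address(arr):
    """Address of the first byte of a NumPy array's data buffer."""
    return arr.__array_interface__['data'][0]

def largest_pow2_alignment(address, limit=4096):
    """Largest power of two, capped at limit, that evenly divides address."""
    align = 1
    while align < limit and address % (align * 2) == 0:
        align *= 2
    return align

f = np.array([np.random.rand(500) for _ in range(100)])
g = f.transpose(1, 0).copy()
for name, arr in (("f", f), ("g", g)):
    print(name, largest_pow2_alignment(data_address(arr)))
```

The printed values depend on the platform and allocator, so none are listed here.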
My guess is that using bf.ndarray(buffer=...) is sensitive to how the provided buffer is aligned. I'm not really sure what the mechanism would be, though. I just don't see something like that in the code.
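One way to probe this guess from the Python side might be to feed the same data from NumPy buffers whose alignment is controlled. The sketch below over-allocates and slices to reach a chosen address; aligned_empty is a hypothetical helper, not something in Bifrost or NumPy, and whether bf.ndarray(buffer=...) reacts to the difference is exactly the open question:

```python
import numpy as np

def aligned_empty(nbytes, alignment=4096, offset=0):
    """Return a uint8 view whose data address is `offset` bytes past a
    multiple of `alignment`. The view keeps the backing allocation alive."""
    raw = np.empty(nbytes + alignment + offset, dtype=np.uint8)
    raw_addr = raw.__array_interface__['data'][0]
    start = (-raw_addr) % alignment + offset  # bytes to skip to reach the target address
    return raw[start:start + nbytes]

page_aligned = aligned_empty(1000)             # address is a multiple of 4096
off_by_16 = aligned_empty(1000, offset=16)     # deliberately 16 bytes past a 4096 boundary
print(page_aligned.__array_interface__['data'][0] % 4096,
      off_by_16.__array_interface__['data'][0] % 4096)
```

Copying the test data into each of these before wrapping it would show whether failures track the alignment of the source buffer.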
If you build Bifrost with an alignment of 16 instead of 4096 do these failures disappear?
I tried the test suggested above and it does... something. I still get failures on my desktop but now they are things like a comparison with an array full of zeros. That's believable if you assume some packets are getting dropped for whatever reason.
I'm still not convinced that I'm seeing the whole picture.
And why are all of the self-hosted tests failing now with
checking for valid CUDA architectures... found: 50 52 53 60 61 62 70 72 75 80 86 87 89 90
configure: error: failed to find any
checking which CUDA architectures to target...
Error: Process completed with exit code 1.
?
Update: This has been resolved.
Maybe this was all a problem with how we were directly saving the ring's contents rather than a copy. After refactoring the disk and UDP I/O tests in ibverb-support the problem seems to have largely gone away. I do occasionally see a test failure but it looks like it's comparing zeros (dropped packets) against real data.
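To separate the dropped-packet failures from genuine data mismatches, a diagnostic along these lines could help. It is a sketch that assumes a dropped packet shows up as an entirely zero frame along the first axis; compare_ignoring_dropped is a hypothetical helper, not part of the test suite:

```python
import numpy as np

def compare_ignoring_dropped(received, expected):
    """Compare frame by frame, skipping frames that are all zero in received.

    Returns (frames_match, number_of_all_zero_frames)."""
    flat = received.reshape(received.shape[0], -1)
    dropped = ~flat.any(axis=1)  # True where a whole frame is zero
    frames_match = np.array_equal(received[~dropped], expected[~dropped])
    return frames_match, int(dropped.sum())

expected = np.arange(1, 13).reshape(3, 4)
received = expected.copy()
received[1] = 0  # simulate one dropped frame
result = compare_ignoring_dropped(received, expected)
print(result)
```

A nonzero drop count with matching remaining frames would point at packet loss rather than corrupted data.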
In the current version of test_udp_io.py in the ibverb-support branch, we fill bfarrays like so:
final = bf.ndarray(shape=(final.shape[0],4,4096), dtype='ci4', buffer=final.ctypes.data)
This method produces a bfarray that occasionally does not contain the correct data. It's possible this is platform dependent as these tests tend not to fail on Mac. The following method seems to work consistently:
final = bf.ndarray(final, dtype='ci4')
Attached is a test in which the first method fails on our local Ubuntu-based machine while the second method passes: redo_test_udp_io.py.txt