ledatelescope / bifrost

A stream processing framework for high-throughput applications.
BSD 3-Clause "New" or "Revised" License

Travis-ci Heisenbugs #63

Closed MilesCranmer closed 7 years ago

MilesCranmer commented 7 years ago

I'm seeing the following error in roughly half of the Travis builds. I can't reproduce it locally.

  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "build/bdist.linux-x86_64/egg/bifrost/block.py", line 443, in main
    unpacked_data = ispan.data_view(self.dtype)
  File "build/bdist.linux-x86_64/egg/bifrost/ring.py", line 300, in data_view
    buffer=data_buffer, dtype=dtype)
ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.

MilesCranmer commented 7 years ago

Here is another Travis CI failure that is similarly intermittent and that I also can't reproduce locally:

FAIL: test_data_sizes (test_block.TestFFTBlock)
Test that different number of bits give correct throughput size
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_block.py", line 286, in test_data_sizes
    self.assertEqual(number_fftd, number_copied)
AssertionError: 163072 != 113920

MilesCranmer commented 7 years ago

I should note that I can't reproduce this locally with CUDA-enabled Bifrost. I have not tried yet with CUDA disabled. It might be that this issue is created somehow when you set the NOCUDA flag.

MilesCranmer commented 7 years ago

I should also note that number_copied in the second Travis failure changes from run to run (among the runs where the failure occurs).

MilesCranmer commented 7 years ago

Update: I ran the test suite twice in a local CPU-only Bifrost Docker container, and all (CPU) tests passed. I do not know why Travis is having difficulty with this.

MilesCranmer commented 7 years ago

Apparently there is a way to run a Travis instance locally: https://quay.io/organization/travisci. I will try this.

benbarsdell commented 7 years ago

Not sure if it's relevant in this case, but one way I found to debug/induce race conditions is to add a time.sleep(random.random()) into the middle of the TransformBlock definition.
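
A minimal sketch of that idea, not Bifrost code: the function run_counter, its parameters, and the shared counter are all hypothetical and stand in for whatever state a block shares between threads. A random sleep between the read and the write widens the window in which another thread can interleave, so a lost-update race that rarely shows up becomes easy to trigger.

```python
import random
import threading
import time


def run_counter(num_threads=4, increments=20, max_delay=0.0):
    """Increment a shared counter from several threads WITHOUT a lock.

    Hypothetical illustration only. If max_delay > 0, each thread sleeps
    for a random time between reading and writing the counter, which
    makes lost updates far more likely.
    """
    state = {"count": 0}

    def worker():
        for _ in range(increments):
            current = state["count"]  # read the shared value
            if max_delay > 0:
                # Random pause: other threads can run here and change count.
                time.sleep(random.random() * max_delay)
            state["count"] = current + 1  # write back, possibly clobbering

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["count"]
```

Calling run_counter(max_delay=0.001) and comparing the result with num_threads * increments would typically show a shortfall, while the same call with max_delay=0.0 usually hides the race. The exact shortfall varies from run to run, much like number_copied above.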

MilesCranmer commented 7 years ago

The plot thickens: a moment ago, I reproduced the array sizing error locally on my MacBook. This error occurs on every execution on this machine, rather than ~1/2 the time.

benbarsdell commented 7 years ago

FWIW I get the FFT failure sometimes on my machine.

MilesCranmer commented 7 years ago

Was it with the CPU version of Bifrost? With the GPU version, the unit tests seem to pass reliably for me.

MilesCranmer commented 7 years ago

Update: I have finally gotten a local Travis CI instance up and running. All tests pass, every time. I still have not been able to reproduce these errors locally.

benbarsdell commented 7 years ago

Closing this as tests seem to be stable now. I believe these issues were solved by a combination of fixing bugs and skipping flaky tests. A couple of the relevant commits: https://github.com/ledatelescope/bifrost/commit/a5448da899fc56eb891f09d61b7498d96da8cb6f https://github.com/ledatelescope/bifrost/commit/fa6b40b09b6014ef446147111e3edc2fafed1724
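
For readers unfamiliar with skipping flaky tests, here is one way it can be done with the standard unittest module. This is a sketch under assumptions and not necessarily what the commits above do: the test class and test name are hypothetical, and it relies on Travis CI exporting the environment variable TRAVIS=true.

```python
import os
import unittest

# Assumption: the CI environment exports TRAVIS=true.
ON_TRAVIS = os.environ.get("TRAVIS") == "true"


class TestFlakyExample(unittest.TestCase):
    """Hypothetical test case; not part of the Bifrost test suite."""

    @unittest.skipIf(ON_TRAVIS, "intermittent on Travis; see issue #63")
    def test_throughput(self):
        # Placeholder assertion standing in for the real throughput check.
        self.assertEqual(2 + 2, 4)
```

The test still runs locally, so a genuine regression is not hidden on developer machines, while the CI build stops failing on the intermittent case.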