The two "first attempts" are identical, and two "second attempts" are identical. So, it's not stochastic/random -- we're arriving at 2 distinct results in a reliable fashion.
could there be some cached data within the VM that are reused upon re-running the test? (that doesn't entirely explain the behaviour, but might be a starting point for debugging)
> could there be some cached data within the VM that are reused upon re-running the test?

Quite possibly! But the strange thing is, the test passes the first time on `master`, without ever needing to restart.
Maybe it's some arcane combination of both the changes in #3591 and some difference caused by re-running the test, then... :thinking:
I added some debugging commands to try and get to the bottom of it... however, on that commit (c6d1665), the CI run passed on the first try. I'm going to try re-running it one more time (with the debugging commands) to see if I can get lucky and catch a failure.
EDIT: This time, the second attempt failed. I've updated the summary in the issue description. But, this would suggest that re-attempts/caching don't play a role in whether it fails or not.
Thanks for the very thorough and useful summary @joshuacwnewton. I'll see if I can make any progress.
I ran it on `master` on my laptop and on `joplin` at the same time, and got one failure and one success:
So it's nothing to do with #3591.
Since we've got both a failure and a success on local machines, we can compute checksums for the processed data using:
Checksums for failed test (run on local machine)
```console
(venv_sct) [kousu@requiem spinalcordtoolbox]$ find sct_example_data/ -type f | sort | xargs sha256sum | tee failure.sha256sums
```
Checksums for passed test (run on joplin)
```console
(venv_sct) u108545@joplin:~/spinalcordtoolbox$ find sct_example_data/ -type f | sort | xargs sha256sum | tee success.sha256sums
```
Then compute the diff between the checksums to see if/how the generated files differ:
Almost every file is different. But we can look at the files generated by `batch_processing.sh` that were identical between runs, so we can get a sense of when the two runs diverged:
If we're lucky, maybe the trace of `batch_processing.sh` is identical up to the last of these created files, and everything after that is where things went wrong.
I think I can trace backwards: I can sort the inputs and outputs by creation time, and then, by combining that with a comparison of the hashes from the failing run against the passing run, I can isolate where the first divergence crept in, and from there hopefully isolate which line in `batch_processing.sh` is the source.
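Here's a rough sketch of that idea (hypothetical script; it assumes the `failure.sha256sums` and `success.sha256sums` manifests from above, and that the generated files are still on disk so their mtimes can stand in for creation time):

```python
# find_divergence.py -- hypothetical helper: report the earliest-created file
# whose checksum differs between the failing and passing runs.
import os

def load_manifest(path):
    """Parse `sha256sum` output lines ("<digest>  <filename>") into {filename: digest}."""
    entries = {}
    with open(path) as f:
        for line in f:
            if line.strip():
                digest, name = line.split(maxsplit=1)
                entries[name.strip()] = digest
    return entries

failure = load_manifest("failure.sha256sums")
success = load_manifest("success.sha256sums")
common = [name for name in failure if name in success and os.path.exists(name)]

# Sort by mtime as an approximation of creation order (good enough for a
# pipeline that writes each file once), then stop at the first mismatch.
for name in sorted(common, key=os.path.getmtime):
    if failure[name] != success[name]:
        print("first divergence:", name)
        break
```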
From this investigation, it looks like the first file that differs between runs is `t2_seg.nii.gz`, which must have been generated from this line:
Is there some secret stochasticity inside of `deepseg`?
I reran the processing on `joplin` again and it passed a second time.
However, when we compare both `joplin` successes, we see the same pattern as failure-vs-success, so there must be noise in the signal:
`.gz` files can embed timestamps in their headers; that's probably the major source of variation here.
I can confirm that the timestamps are causing a lot of extra diffs here:
However, my attempts at removing the timestamps are still showing a lot of diffs, even between the two successful cases:
OK, I've been able to figure out a way to account for the timestamps:
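I'm not certain this is the exact approach used here, but one simple way to take the gzip header (and the mtime embedded in it) out of the comparison is to hash the decompressed payload instead of the `.gz` file itself:

```python
# gz_checksum.py -- hash the *decompressed* contents of each .nii.gz passed on
# the command line, so the mtime stored in the gzip header can't cause
# spurious differences between otherwise-identical files.
import gzip
import hashlib
import sys

for path in sys.argv[1:]:
    with gzip.open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    print(f"{digest}  {path}")
```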
So, it looks like the actual first output file that differs (accounting for header timestamps) is `./t2/warp_curve2straight.nii.gz`, which must be generated by:
which I guess must be
FWIW, this command is also what stood out to me! (Note each of the "diff (registration)" links in the issue description -- they show how the batch_processing.sh logs differ for that command specifically between passing/failing runs.)
Okay, let's focus on that then. I injected debugging code into `batch_processing.sh` on the branch `ng/3597-debug`, which makes snapshots of the files that have been produced between each command.
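(The branch has the real implementation; as a rough sketch of the idea, a small helper like this hypothetical `snapshot.py` can be invoked between commands, e.g. `python snapshot.py after_registration`, to record a checksum manifest after each step.)

```python
# snapshot.py -- hypothetical helper: write a checksum manifest of every file
# under the current directory, tagged with a step name given on the command line.
import hashlib
import pathlib
import sys

step = sys.argv[1] if len(sys.argv) > 1 else "step"
with open(f"snapshot_{step}.sha256sums", "w") as out:
    for path in sorted(pathlib.Path(".").rglob("*")):
        # Skip earlier snapshot manifests so they don't pollute later snapshots.
        if path.is_file() and path.suffix != ".sha256sums":
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            out.write(f"{digest}  {path}\n")
```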
Then I ran it a bunch of times on my laptop and on joplin:
tl;dr: The output files produced on a single machine were always consistent between runs; however, the output files differ between machines!

So, it's some kind of weird hardware thing??? And the randomness we're seeing in CI is partly fed by GitHub running things on different physical machines?
There is this potentially hardware-sensitive warning in the log:
> To avoid intensity overflow due to convertion to `uint8`, intensity will be rescaled to the maximum quantization scale
I'm playing with some more-specific debugging code to try and pinpoint exactly where in `sct_register_to_template` the outputs diverge, but so far it hasn't been much help.
I think the divergence must be as soon as:
Interestingly, the values found on `joplin` and on my laptop also match the passing/failing values from the diffs of the CI logs. So, I think we can confirm that there really are only 2 outcomes, and that these outcomes are dependent on hardware.
OK, I think the bug happens specifically in the `resample_file` function:
Here's a comparison of hardware specs between `joplin` and my laptop, btw:
OK, I think I've narrowed it down to the `nibabel.processing.resample_from_to` function called inside the `resample_file` function, here:
After this has been called, a bunch of off-by-one errors appear in the data array:
This feels like a rounding error, as if the FPU is rounding up on some systems and rounding down on others.
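As a toy illustration (values and dtype made up) of why that would give exactly two outcomes: when the interpolated values land essentially on an integer boundary, a last-bit difference in the floating-point result flips which integer they round to:

```python
import numpy as np

# Two "machines" computing the same interpolation, differing only in the last bits.
machine_a = np.array([126.49999999999997, 57.50000000000001])
machine_b = np.array([126.50000000000003, 57.49999999999999])

print(np.rint(machine_a).astype(np.int16))  # [126  58]
print(np.rint(machine_b).astype(np.int16))  # [127  57]  <- off by one wherever a value straddles .5
```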
Alright, I've written a script that reproduces the issue:
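(The actual `rep.py` is on the `ng/3597-debug` branch; roughly speaking, a minimal reproduction of this kind just resamples one of the generated images with nibabel and fingerprints the result. The input path and resampling target below are placeholders, not necessarily what the real script uses.)

```python
# rep_sketch.py -- hypothetical reproduction: resample an image with nibabel
# and print a fingerprint of the output array, to compare across machines.
import hashlib

import nibabel as nib
import numpy as np
from nibabel.processing import resample_from_to

src = nib.load("t2/t2.nii.gz")                     # placeholder input
target = (src.shape, np.eye(4))                    # same shape, identity affine (illustrative only)
resampled = resample_from_to(src, target, order=1)

data = np.asanyarray(resampled.dataobj)
print(data.dtype, hashlib.sha256(data.tobytes()).hexdigest())
```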
So, it looks like a successful run involves rounding up, while a failing run involves rounding down, hence there being exactly 2 outcomes.
@joshuacwnewton, could you try running this script from the `ng/3597-debug` branch, then see if your results match mine?
```console
$ sha256sum ./t2/warp_curve2straight.nii.gz
1d33dc99b15f732d5a2722e548d12413698af9c3047bdc9aa78d3fc66ea68607  ./t2/warp_curve2straight.nii.gz
```
That said, nibabel hasn't had a release in a year, so it's strange that anything would have changed. But maybe numpy changed something?
> **Warning:** We recommend you use the `get_fdata` method instead of the `get_data` method, because it is easier to predict the return data type. `get_data` will be deprecated around November 2019 and removed around November 2021.
Still, I tried this suggestion in 270642e3c54c1fb3f302a83cc7596474f3938882, but it didn't fix it; it just made the numbers come out as floats:
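That part is expected, for what it's worth: `get_fdata()` always returns floating-point data (float64 by default), whereas the deprecated `get_data()` could hand back the on-disk integer dtype when no scaling was applied. For example (path is just an example):

```python
import numpy as np
import nibabel as nib

img = nib.load("t2/t2_seg.nii.gz")   # example path
fdata = img.get_fdata()              # always floating point (float64 by default)
raw = np.asanyarray(img.dataobj)     # keeps the on-disk dtype when no scaling applies
print(fdata.dtype, raw.dtype)
```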
I guess the next thing to do is to submit a bug to nibabel and, using `rep.py`, start poking at that function by adding `sha256sum`s around each `np.array` it produces, to see where the divergence is.
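For that instrumentation, a hash of the raw array bytes makes a convenient fingerprint to sprinkle through the code (hypothetical helper):

```python
import hashlib
import numpy as np

def array_checksum(arr: np.ndarray) -> str:
    """Fingerprint an array's contents so intermediate results can be compared across machines."""
    # tobytes() returns a C-ordered copy, so the digest depends only on the
    # values and dtype, not on the array's memory layout.
    return hashlib.sha256(np.asarray(arr).tobytes()).hexdigest()

# e.g. drop lines like this after each intermediate array:
# print("step 3:", array_checksum(data))
```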
In PR https://github.com/spinalcordtoolbox/spinalcordtoolbox/pull/3591, we started experiencing some strange behavior for the `test-batch-processing.yml` workflow:

My instinct was that this was a one-off glitch, since at first, restarting the run would fix the issue.
However, this has happened three separate times:
Some observations:
The outputs shown above are identical between runs, so I don't think the changes in #3591 are specifically to blame.