Reproducibility tests failed on macOS

ghost commented 1 year ago

Update: They actually fail on all platforms. See the output from pytest -k 'repro' --runxfail. (Fixed on Linux and Windows platforms in 85eeb31f79f198918b7d5595a5c101d3141212f5.)

Currently we mark reproducibility tests as "xfail" on macOS. See: https://github.com/maccmu/macposts/blob/9148b06406980384b41f36e96021c7f66add7bed/tests/test_dta.py#L8-L11

Currently on macOS:

============================= test session starts ==============================
platform darwin -- Python 3.12.3, pytest-8.2.0, pluggy-1.5.0
rootdir: /Users/runner/work/macposts/macposts
configfile: pyproject.toml
testpaths: tests, examples
plugins: nbval-0.11.0
collected 23 items

tests/test_dta.py Xx..                                                   [ 17%]
tests/test_graph.py ...                                                  [ 30%]
tests/test_mcdta.py xx..                                                 [ 47%]
tests/test_utils.py .                                                    [ 52%]
examples/mcdta-7link.ipynb ...........                                   [100%]

=================== 19 passed, 3 xfailed, 1 xpassed in 5.00s ===================

psychogeekir commented 1 year ago

Do you mean it happened in test_mcdta.py?

ghost commented 1 year ago

No. This one on the network_7link fixture:

https://github.com/maccmu/macposts/blob/9148b06406980384b41f36e96021c7f66add7bed/tests/test_dta.py#L8-L27

psychogeekir commented 6 months ago

Will the pull request #61 help resolve this issue?

ghost commented 6 months ago

Unfortunately no. See https://github.com/maccmu/macposts/actions/runs/8738763208/job/23978496228#step:6:20

If you have a macOS environment (I do not), it might be better to check there.

psychogeekir commented 6 months ago

I found these sources

https://stefanoborini.com/c-stdlib-stdsrandstdrand-are-not-repeatable-across-platforms/ https://stackoverflow.com/questions/64680033/rand-behaves-differently-between-macos-and-linux

Perhaps we can use random instead of cstdlib? Just a thought.

ghost commented 6 months ago

I am not sure if this is really the cause -- only one of the four tests fails which is just too deterministic if it is caused by rand. The symptom here is quite similar to the one fixed by e481254. So I think it might be due to differences in undefined/implementation-defined behaviors in the standard library (libc++ on macOS vs libstdc++ on GNU/Linux), or some use-after-use or uninitialized memory blocks.

Anyway, I do agree that it is a good idea to avoid rand, which is often of low quality. I was considering a custom PCG generator, which is easy to implement and fast, and its quality should suffice for our use cases. Using <random> is also good because of the better integration with STL. However, please be sure to use and maintain some global random state as in https://github.com/maccmu/macposts/blob/main/macposts/_ext/utils.h, so that we can explicitly control it.

psychogeekir commented 6 months ago

I tested this network_7link with test_dta.py on the macOS. I printed out all in_ccs of all runs. Interestingly, their number of rows can differ, which seems to suggest that the DNL runs end at different intervals since total_interval = -1.

Screenshot 2024-04-18 at 8 22 45 PM

psychogeekir commented 6 months ago

Just a quick question, in testing multiclass cases, why do we only compare the first rows of the CC, instead of the entire CC matrices as in single-class cases? The first row means the CC at interval 0, which are usually 0s.

https://github.com/maccmu/macposts/blob/0a1250e785773efb62861f604b7b321f89aeb1a0/tests/test_mcdta.py#L23-L26

https://github.com/maccmu/macposts/blob/0a1250e785773efb62861f604b7b321f89aeb1a0/tests/test_dta.py#L22-L23

If I remove [1] here, of all four tests, only network_3link can succeed on the macOS. I didn't test them on other platforms.

ghost commented 6 months ago

It was an oversight. Corrected in 4a93e403434cc8eaaae0a890ab9fc89f58ca4fba. Thanks for catching this!

It is actually a severer issue than I thought, and perhaps it has been there since 046f43b45872e708ff86cef009e8a0ebd20c36df. (I can confirm that before the merging all of the four tests worked on GNU/Linux at least.)

ghost commented 6 months ago

I will have a look at this now that I can observe the failures in my local environment.

psychogeekir commented 6 months ago

In the single-class network_7link case, if I change adaptive_ratio to 0 or 1, it can succeed. So it might have some issues with the hybrid routing.

ghost commented 6 months ago

Commit 85eeb31f79f198918b7d5595a5c101d3141212f5 fixed this on Linux and Windows. On macOS three tests still fail. I do not have a macOS environment. Maybe you could have a look at it?

psychogeekir commented 6 months ago

Still no good on macOS. I will take a look.

maccmu / macposts

Reproducibility tests failed on macOS #28