Closed by jjc2718 3 years ago
Isn't set intersection an operation with a unique output? Is the issue that, because you're casting the sets to lists later on, the order of the resulting index isn't deterministic?
Yeah, I'm pretty sure casting the sets to lists is causing the problem. Even the reindex calls directly after the lines I changed were returning different output each time. I assume Pandas internally casts the sets to lists, then reorders the resulting dataframe using the list order.
Recently I noticed that running cross-validation multiple times with the same seed gives different results. This PR fixes that by switching from Python set intersections to Pandas' built-in index intersections (the latter return the shared labels in the same order every time). I've checked that cross-validation results are now repeatable.
I'm also planning to add some regression tests to make sure this isn't happening in the future, but those will be part of my next (larger) PR.
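For reference, here is a minimal sketch of the change described above. The dataframes, column names, and sample IDs are hypothetical stand-ins, not the code touched by this PR. One likely reason the set-based version varies is that the iteration order of a Python set of strings depends on string hashing, which is randomized per interpreter run unless `PYTHONHASHSEED` is fixed.

```python
import pandas as pd

# Hypothetical data: sample IDs and columns are illustrative only.
expression = pd.DataFrame(
    {"gene_a": [1.0, 2.0, 3.0, 4.0]},
    index=["s3", "s1", "s4", "s2"],
)
labels = pd.DataFrame(
    {"status": [0, 1, 1]},
    index=["s2", "s3", "s5"],
)

# Set-based intersection: the contents are well defined, but the order of
# the resulting list can differ between interpreter runs.
shared_unordered = list(set(expression.index) & set(labels.index))

# Index-based intersection: the order comes from the index contents,
# so it does not depend on Python's string hash seed.
shared_ordered = expression.index.intersection(labels.index)

# Reindexing with the Index gives the same row order on every run.
aligned = expression.reindex(shared_ordered)
```

Both approaches return the same labels; only the ordering guarantee differs. Anything downstream that depends on row order, such as a seeded cross-validation split, would then see the same input each time.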