greenelab / pancancer-evaluation

Evaluating genome-wide prediction of driver mutations using pan-cancer data
BSD 3-Clause "New" or "Revised" License

Fix cross-validation order bug #22

Closed by jjc2718 3 years ago

jjc2718 commented 3 years ago

Recently I noticed that running cross-validation multiple times with the same seed gives different results. This PR fixes that by switching from Python set intersections to Pandas built-in index intersections, which return the shared labels in the same order every time. I've checked that cross-validation results are now repeatable.
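As a minimal sketch of the difference (the sample IDs and variable names below are hypothetical and not taken from this repository), the two approaches can be compared directly. A Python set has no defined iteration order, while `Index.intersection` works on the ordered `Index` objects themselves:

```python
import pandas as pd

# Hypothetical sample IDs, made up for illustration
expression_ids = pd.Index(["TCGA-04", "TCGA-01", "TCGA-03", "TCGA-02"])
mutation_ids = pd.Index(["TCGA-02", "TCGA-03", "TCGA-04", "TCGA-05"])

# Set intersection: the membership is unique, but a set has no defined
# iteration order, so list(unordered) is not a stable ordering to rely on
unordered = set(expression_ids) & set(mutation_ids)

# Index intersection: computed from the ordered Index objects, so repeated
# calls on the same inputs return the labels in the same order
shared = expression_ids.intersection(mutation_ids)

# Another deterministic option if sets are kept: sort before using them
stable_ids = sorted(unordered)

print(list(shared))
print(stable_ids)
```

Both approaches find the same labels. Only the ordering guarantee differs, and that matters as soon as the result is used to reorder a dataframe.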

I'm also planning to add some regression tests to make sure this doesn't happen again, but those will be part of my next (larger) PR.

jjc2718 commented 3 years ago

> Isn't set intersection an operation with a unique output? Is the issue that because you're casting sets to lists later on the order of the resulting index isn't deterministic?

Yeah, I'm pretty sure casting the sets to lists is causing the problem. Even the reindex calls directly after the lines I changed were returning different output each time. I assume Pandas internally casts the sets to lists, then reorders the resulting dataframe using the list order.
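To illustrate why row order matters even with a fixed seed, here is a small sketch. It assumes the split selects rows by position, and the dataframe, ID orderings, and `held_out_ids` helper are all hypothetical rather than code from this repository:

```python
import numpy as np
import pandas as pd

# Hypothetical data; sample IDs and values are made up for illustration
df = pd.DataFrame({"value": [10, 20, 30, 40]},
                  index=["s1", "s2", "s3", "s4"])

# Two orderings of the same shared IDs, standing in for two runs in which
# converting a set to a list happened to produce different orders
order_run_a = ["s1", "s2", "s3", "s4"]
order_run_b = ["s3", "s1", "s4", "s2"]

def held_out_ids(frame, id_order, seed=42, n_test=2):
    """Reindex by id_order, then pick test rows by position with a fixed seed."""
    reordered = frame.reindex(id_order)  # rows follow the order of id_order
    rng = np.random.RandomState(seed)
    positions = rng.permutation(len(reordered))[:n_test]
    return list(reordered.index[positions])

test_a = held_out_ids(df, order_run_a)
test_b = held_out_ids(df, order_run_b)

# Same seed, same row positions, but different samples end up held out
print(test_a, test_b)
```

The seed fixes which positions are selected, not which samples sit at those positions, so any nondeterminism in row order carries through to the folds.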