Schedule generation inconsistency investigation

ahoskins commented 9 years ago

I was just playing around with my schedule this year and noticed that the results I get are dependant on the order of the classes added. Are you aware that this is happening? I remember you mentioning something before how you thought schedules weren't consistent.

rosshamish commented 9 years ago

Yeah, I'm looking into this on an ongoing basis. Let's keep this thread open to document progress / observations about it.

rosshamish commented 9 years ago

Paste from that commit comment you mentioned:

An aside: I've been noticing lately that identical requests often send back slightly different sets of schedules. It has been worrying me, but I haven't been able to identify the reason. However, now I'm curious if the scoring functions have a lot of collisions, like maybe there's only 10 common values for a schedule's score. So, the sorting "grain-size" is really big, and so they just get sorted randomly within their groups?

rosshamish commented 9 years ago

An edited log of my comment in telegram about schedule score collisions:

Yeah look. Theres a very small number of discrete value that scores usually take. Here's a count of each score. The key is the schedule score, the value is the number of candidate schedules with that score

This is for one of the test cases in the test suite. The others are similar.

CRITICAL: scores: {0: 23, -30: 1, -20: 11}

rosshamish commented 9 years ago

Here's a look at the score spread of schedules just before returning from the api. Schedules are sorted according to overall-score, which is an average of all scoring functions, weighted by preference. All scenarios use default preferences: no-marathons=1, day-classes=1, start-early=1.

Interesting things:

Overall scores are all negative. This isn't necessarily bad.
Overall scores seem to fall into a small number of discrete values
Within a single request, there is often a single scoring function which is dominant and affects the overall-score. See start-early in "2 random courses", and day-classes in "1st year engineering Fall Term 2014 including the complementary elective".

Conclusions:

scoring needs to be more granular, and should fall more evenly over a continuous range
it's possible that the inconsistency in which schedules get returned is partly caused by these big "rifts" between groups of schedules - luck determines which of these schedules "make it out" of the schedule generation tornado

2 random courses

testcase1

1st year engineering 2014 Fall Term

testcase2

3rd year CompE 2014 Fall Term

testcase3

2nd year MecE Fall Term 2014

testcase4

1st year engineering Fall Term 2014 with tons of busy time

Returned no schedules. First year engineering doesn't get busy time :p

1st year engineering Fall Term 2014 including the complementary elective

testcase6

rosshamish commented 9 years ago

I rewrote the no-marathons scorer. I tested it by collecting scores from the test suite, and counting both total schedules produced and number of unique scores.

Here's the baseline: the existing no-marathons scorer. In one sample run, it produced about 6% unique scores.

Existing scoring function
217     total scores
13      unique scores
5.99    pct unique scores

Here's the new finer-grain no-marathons scorer. In one sample run, it produced schedules with about 40% unique scores. So, by that measure, it's pretty good!

Finer-grain scoring function
156     total scores
63      unique scores
40.48   pct unique scores

However:

in 4 independent runs, the existing scoring function produced 217, 201, 204, and 199 schedules
in 4 independent runs, the new finer-grain scoring function produced 156, 181, 172, and 179 schedules.

So, it seems pretty clear that the finer-grain function is producing less schedules. I'm not quite sure why that is.

Also, the number of schedules returned seems to be just as variable.

Here's the numbers:

Number of schedules generated
existing        mean=205    stddev=8.10
new fine-grain  mean=172    stddev=11.3

So the new fine-grain method is actually more variable. I want to believe that the existing discrete scoring is causing these inconsistencies, but the numbers don't really agree.

Even if the existing scoring functions aren't causing the schedule generation inconsistencies, I still think it's better in general to score schedules using traits with continuous values. For future reference, here's the rundown on what worked.

define a schedule trait with continuous values
define how to measure the trait
define what's a "decent" value for the trait
report "score" as the difference decent - actual

ahoskins commented 9 years ago

Interesting. Sounds like you make progress with that. Have you tested if it produces the same amount of schedules independent of the order they are queried in? If it doesn't (which I'm guessing not based on the complexity of this) I can send the request to you with all the courses in numeric order. That would be a solid band-aid solution for now that would shield a user from underlying inconsistency.

rosshamish commented 9 years ago

Yeah - those test scenarios all specify the courses in the same order every time, but the results were still inconsistent. So, unfortunately that band-aid fix won't work. I'll keep looking though. I think it might have to do with how the work is being distributed among worker processes.

Edit: or maybe schedules are being folded into similar schedules inconsistently...

ahoskins commented 9 years ago

I know the total number is inconsistent, but I wonder if the top 10 is consistent?

rosshamish commented 9 years ago

Good point. I'll find out.

rosshamish commented 9 years ago

Commit f00ffd5084259ab4e4f3737d7641551aeb5cb2a3 fixed an incorrect behaviour, which may be good for this issue.

Old behaviour: after each round, cut the candidate list to a certain length. New behaviour: after each round, pop candidates off the heap until it's the certain length.

The heap (pqueue) isn't monotonically sorted - it's a heapq. So, just slicing the list was wrong - the best CERTAIN_LENGTH candidates were not being preserved.

Running the test suite now returns these # of schedules per test

Test #	1	2	3	4	6	7	8
Run 1	12	33	4	31	42	4	11
Run 2	12	37	4	26	50	4	10
Run 3	11	39	3	31	40	4	9

Shit, still inconsistent haha.

rosshamish commented 9 years ago

With some other changes:

in candidate_battle_royale, using an empty list to contain the new candidates,
using 4 subprocesses max
using candidate pool of 64 total across all subprocesses
heappushing the candidates as they come back from the subprocesses instead of list - appending them

Test #	1	2	3	4	6	7	8
Run 1	4	34	7	10	42	2	2
Run 2	4	33	7	10	44	2	2
Run 3	4	32	7	9	44	2	4
Run 4	4	34	7	9	42	2	2

rosshamish / classtime