Closed ahoskins closed 8 years ago
Yeah, I'm looking into this on an ongoing basis. Let's keep this thread open to document progress / observations about it.
Paste from that commit comment you mentioned:
An aside: I've been noticing lately that identical requests often send back slightly different sets of schedules. It has been worrying me, but I haven't been able to identify the reason. However, now I'm curious if the scoring functions have a lot of collisions, like maybe there's only 10 common values for a schedule's score. So, the sorting "grain-size" is really big, and so they just get sorted randomly within their groups?
An edited log of my comment in telegram about schedule score collisions:
Yeah look. Theres a very small number of discrete value that scores usually take. Here's a count of each score. The key is the schedule score, the value is the number of candidate schedules with that score
This is for one of the test cases in the test suite. The others are similar.
CRITICAL: scores: {0: 23, -30: 1, -20: 11}
Here's a look at the score spread of schedules just before returning from the api. Schedules are sorted according to overall-score
, which is an average of all scoring functions, weighted by preference. All scenarios use default preferences: no-marathons=1, day-classes=1, start-early=1
.
Interesting things:
overall-score
. See start-early
in "2 random courses", and day-classes
in "1st year engineering Fall Term 2014 including the complementary elective".Conclusions:
Returned no schedules. First year engineering doesn't get busy time :p
I rewrote the no-marathons
scorer. I tested it by collecting scores from the test suite, and counting both total schedules produced and number of unique scores.
Here's the baseline: the existing no-marathons
scorer. In one sample run, it produced about 6% unique scores.
Existing scoring function
217 total scores
13 unique scores
5.99 pct unique scores
Here's the new finer-grain no-marathons
scorer. In one sample run, it produced schedules with about 40% unique scores. So, by that measure, it's pretty good!
Finer-grain scoring function
156 total scores
63 unique scores
40.48 pct unique scores
However:
So, it seems pretty clear that the finer-grain function is producing less schedules. I'm not quite sure why that is.
Also, the number of schedules returned seems to be just as variable.
Here's the numbers:
Number of schedules generated
existing mean=205 stddev=8.10
new fine-grain mean=172 stddev=11.3
So the new fine-grain method is actually more variable. I want to believe that the existing discrete scoring is causing these inconsistencies, but the numbers don't really agree.
Even if the existing scoring functions aren't causing the schedule generation inconsistencies, I still think it's better in general to score schedules using traits with continuous values. For future reference, here's the rundown on what worked.
decent - actual
Interesting. Sounds like you make progress with that. Have you tested if it produces the same amount of schedules independent of the order they are queried in? If it doesn't (which I'm guessing not based on the complexity of this) I can send the request to you with all the courses in numeric order. That would be a solid band-aid solution for now that would shield a user from underlying inconsistency.
Yeah - those test scenarios all specify the courses in the same order every time, but the results were still inconsistent. So, unfortunately that band-aid fix won't work. I'll keep looking though. I think it might have to do with how the work is being distributed among worker processes.
Edit: or maybe schedules are being folded into similar schedules inconsistently...
I know the total number is inconsistent, but I wonder if the top 10 is consistent?
Good point. I'll find out.
Commit f00ffd5084259ab4e4f3737d7641551aeb5cb2a3 fixed an incorrect behaviour, which may be good for this issue.
Old behaviour: after each round, cut the candidate list to a certain length. New behaviour: after each round, pop candidates off the heap until it's the certain length.
The heap (pqueue) isn't monotonically sorted - it's a heapq. So, just slicing the list was wrong - the best CERTAIN_LENGTH candidates were not being preserved.
Running the test suite now returns these # of schedules per test
Test # | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|
Run 1 | 12 | 33 | 4 | 31 | 0 | 42 | 4 | 11 |
Run 2 | 12 | 37 | 4 | 26 | 0 | 50 | 4 | 10 |
Run 3 | 11 | 39 | 3 | 31 | 0 | 40 | 4 | 9 |
Shit, still inconsistent haha.
With some other changes:
Test # | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|
Run 1 | 4 | 34 | 7 | 10 | 0 | 42 | 2 | 2 |
Run 2 | 4 | 33 | 7 | 10 | 0 | 44 | 2 | 2 |
Run 3 | 4 | 32 | 7 | 9 | 0 | 44 | 2 | 4 |
Run 4 | 4 | 34 | 7 | 9 | 0 | 42 | 2 | 2 |
I was just playing around with my schedule this year and noticed that the results I get are dependant on the order of the classes added. Are you aware that this is happening? I remember you mentioning something before how you thought schedules weren't consistent.