open-spaced-repetition / load-balance-simulator


[Other] Let's use the FSRS Simulator to figure out how to implement Load Balancer #1

Closed Expertium closed 2 months ago

Expertium commented 2 months ago

@L-M-Sherlock @jakeprobst @user1823 @brishtibheja Right now, there are two "competing" ideas regarding Load Balancing. Everyone agrees that making the probability of scheduling a card on a specific day inversely proportional to how many due cards that day has is a good idea. However, Jake and I disagree on the details, and it's not clear whose idea is better. Jake proposes N = days in range, where N is the number of eligible days (those with the lowest due counts). In other words, his idea is to allow Load Balancer to schedule a card on any day within the fuzz range, just with different non-zero probabilities. My idea: N = max(floor(days in range / 2), 2). This ensures that cards can only be scheduled on days with a relatively low due count and cannot be scheduled on a day with the highest (or one of the highest) due count. It may not be immediately clear what the difference is, so here's a visualization:

Suppose the fuzz range for a card includes the following days: [23, 24, 25, 26, 27], days in range = 5. Jake's range and fuzz range are the same, the difference is that in Jake's version the probability of a card falling on a particular day isn't the same for every day. My range is narrower, it only includes (in this example) two days with the lowest due count. The probabilities for the other 3 days are 0. This avoids clustering while staying true to the spirit of this feature, which is "do not schedule a card on a day that already has a lot of cards".
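To make the difference concrete, here's a small sketch of which days each idea leaves eligible. The due counts are illustrative, not real data:

```python
import numpy as np

possible_intervals = np.array([23, 24, 25, 26, 27])  # the fuzz range
review_counts = np.array([40, 12, 35, 10, 33])       # hypothetical due counts per day

# Jake's idea: every day in the fuzz range stays eligible
# (each weighted later by the inverse of its due count)
jake_candidates = possible_intervals.tolist()

# My idea: keep only the N least-loaded days; the rest get probability 0
N = max(int(np.floor(len(possible_intervals) / 2)), 2)  # N = 2 here
least_loaded = np.argsort(review_counts)[:N]
restricted_candidates = np.sort(possible_intervals[least_loaded]).tolist()

print(jake_candidates)        # [23, 24, 25, 26, 27]
print(restricted_candidates)  # [24, 26]
```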

In order to test which idea is better, we could ask Jake to run both on his collection, but that would take months. Instead, Jarrett, I would like you to collaborate with Jake to incorporate Load Balancer into your Simulator, so we can test these ideas on a simulation.

jakeprobst commented 2 months ago

Honestly I was suggesting what I suggested cause it was an easier implementation :leaves:

L-M-Sherlock commented 2 months ago

I implemented the simplest load balancer in the simulator. Here are my initial results.

Simulator: https://github.com/open-spaced-repetition/load-balance-simulator/blob/main/notebook.ipynb

Load Balance: enable

(images: simulated review counts per day and true retention per day)

Load Balance: disable

(images: simulated review counts per day and true retention per day)

The real workload is smoother. But the true retention also drops.

jakeprobst commented 2 months ago
-    delta_t = possible_intervals[np.argmin(review_cnts)]
+    inv = [1 if r == 0 else 1/r  for r in review_cnts]
+    delta_t = random.choices(possible_intervals, weights=inv)[0]

(images: weighted_review_per_day and weighted_retention_per_day plots)

doing a weighted random seems to improve the retention but the review count is a bit more erratic (but not as much as with no balancing).
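For reference, the inverse-count weighting in the snippet above can be exercised in isolation. The due counts below are illustrative, and note that `random.choices` accepts relative (non-normalized) weights:

```python
import random

random.seed(42)  # for reproducibility of this illustration

possible_intervals = [23, 24, 25, 26, 27]
review_cnts = [40, 12, 35, 10, 33]  # hypothetical due-card counts per day

# A day with zero due cards gets the maximum weight of 1
inv = [1 if r == 0 else 1 / r for r in review_cnts]
picks = [random.choices(possible_intervals, weights=inv)[0] for _ in range(10_000)]

# The lightly loaded days (24 and 26) dominate, but every day stays possible
counts = {day: picks.count(day) for day in possible_intervals}
print(max(counts, key=counts.get) in (24, 26))  # True
print(all(c > 0 for c in counts.values()))      # True
```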

I'll try implementing this in anki tomorrow and trying it out for a bit. I ended up wasting the day failing to convince anki that a plugin to do simulations was a reasonable thing it should be able to do.

L-M-Sherlock commented 2 months ago

@jakeprobst, thanks for your code! I added it to the notebook.

To make our analyses more accurate, I quantified the volatility and average retention.

Here is the result:

            No Load Balancer  Simple Load Balancer  Weighted Random Load Balancer
Retention   0.886             0.876                 0.885
Volatility  0.2               0.152                 0.166

How I calculate the volatility:

volatility = np.std(np.diff(review_card_per_day) / review_card_per_day[1:], ddof=1)
Expertium commented 2 months ago

@L-M-Sherlock can you test my idea as well? N = max(floor(days in range / 2), 2)

With your code, days in range is possible_intervals. Here's an example of what I expect the code to look like:

possible_intervals = np.asarray([18, 19, 20, 21, 22])
review_counts = np.asarray([29, 25, 26, 23, 24])

N = max(int(np.floor(len(possible_intervals) / 2)), 2)  # N days with the lowest due count
sorter = np.argsort(review_counts)  # sort by review counts
sorted_counts = review_counts[sorter]
sorted_intervals = possible_intervals[sorter]
possible_intervals = sorted_intervals[:N]
inv = [1 if r == 0 else 1/r for r in sorted_counts[:N]]
delta_t = random.choices(possible_intervals, weights=inv)[0]

We sort possible_intervals and review_counts by the number of reviews, and then change possible_intervals so that it's a slice of sorted_intervals, with only N days being selected. And then we calculate inv based only on the selected days. In my example, only intervals 21 and 22 are possible, since they have the lowest number of due cards, so we randomly select between those days.

Expertium commented 2 months ago

Also, I don't think your method of calculating volatility is good. According to your method, [100, 125, 150] has lower volatility than [100, 115, 110]. I suggest replacing np.std with np.mean and calculating the absolute difference: volatility = np.mean(np.abs(np.diff(review_card_per_day)) / review_card_per_day[1:])
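A quick sketch verifying that claim on the example series (this is my verification, not code from the notebook):

```python
import numpy as np

def volatility_std(counts):
    # the std-based metric used in the notebook
    counts = np.asarray(counts, dtype=float)
    return float(np.std(np.diff(counts) / counts[1:], ddof=1))

def volatility_mad(counts):
    # the proposed mean-absolute-change metric
    counts = np.asarray(counts, dtype=float)
    return float(np.mean(np.abs(np.diff(counts)) / counts[1:]))

steady_growth = [100, 125, 150]  # large but consistent day-to-day changes
small_wobble  = [100, 115, 110]  # small, irregular changes

# The std-based metric calls the steadily growing series *less* volatile,
# because perfectly consistent changes have zero spread:
print(volatility_std(steady_growth) < volatility_std(small_wobble))  # True
# The mean-absolute metric ranks it as *more* volatile:
print(volatility_mad(steady_growth) > volatility_mad(small_wobble))  # True
```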

And one more thing: please make a table with 4 scenarios: no Load Balancer, Simple Load Balancer, Weighted Random Load Balancer, and Restricted Weighted Random Load Balancer, the code for the last one is in my comment above.

L-M-Sherlock commented 2 months ago
            No Load Balancer  Simple Load Balancer  Weighted Random Load Balancer  Restricted Weighted Random Load Balancer
Retention   0.884             0.874                 0.884                          0.878
Volatility  0.153             0.094                 0.118                          0.091
user1823 commented 2 months ago

0.091

It seems that there may be an error (or bias) in this value. I can't think of any reason why the Restricted Weighted Random Load Balancer would have a lower volatility than the Simple Load Balancer.

Expertium commented 2 months ago

Might be noise. @L-M-Sherlock try increasing deck_size to 20000, and set review_limits to 9999.

L-M-Sherlock commented 2 months ago

I think we need to run more simulations with different random seeds and use a t-test.

Expertium commented 2 months ago

Then run 20 of each (so that's 20*4=80 runs), record retention and volatility values, and post them here. Alternatively, give me the code (.py, not .ipynb, please), and I'll do it myself
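A minimal sketch of what that seed-level comparison could look like, using Welch's t-statistic. The samples below are made up for illustration (20 seeds per mode), not real simulator output:

```python
import numpy as np

def welch_t(a, b):
    # Welch's two-sample t-statistic (robust to unequal variances)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    se2 = a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b)
    return float((a.mean() - b.mean()) / np.sqrt(se2))

rng = np.random.default_rng(0)
# Made-up volatility values for two modes, one value per random seed
mode_a = rng.normal(0.134, 0.009, size=20)
mode_b = rng.normal(0.081, 0.006, size=20)

print(welch_t(mode_a, mode_b) > 10)  # True: a gap this wide is unambiguous
```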

L-M-Sherlock commented 2 months ago
volatility mean and std:
                                             mean       std
mode                                                       
No Load Balancer                          0.13395  0.009254
Restricted Weighted Random Load Balancer  0.09020  0.007736
Simple Load Balancer                      0.08130  0.005526
Weighted Random Load Balancer             0.10735  0.009281

t-test:
No Load Balancer vs Simple Load Balancer:
t-statistic: 21.85
p-value: 4.10E-23

No Load Balancer vs Weighted Random Load Balancer:
t-statistic: 9.08
p-value: 4.69E-11

No Load Balancer vs Restricted Weighted Random Load Balancer:
t-statistic: 16.22
p-value: 1.14E-18

Simple Load Balancer vs Weighted Random Load Balancer:
t-statistic: -10.79
p-value: 4.00E-13

Simple Load Balancer vs Restricted Weighted Random Load Balancer:
t-statistic: -4.19
p-value: 1.62E-04

Weighted Random Load Balancer vs Restricted Weighted Random Load Balancer:
t-statistic: 6.35
p-value: 1.90E-07

retention mean and std:
                                             mean       std
mode                                                       
No Load Balancer                          0.88610  0.000852
Restricted Weighted Random Load Balancer  0.87950  0.001051
Simple Load Balancer                      0.87565  0.000875
Weighted Random Load Balancer             0.88515  0.000745

t-test:
No Load Balancer vs Simple Load Balancer:
t-statistic: 38.26
p-value: 5.96E-32

No Load Balancer vs Weighted Random Load Balancer:
t-statistic: 3.75
p-value: 5.83E-04

No Load Balancer vs Restricted Weighted Random Load Balancer:
t-statistic: 21.81
p-value: 4.35E-23

Simple Load Balancer vs Weighted Random Load Balancer:
t-statistic: -36.96
p-value: 2.13E-31

Simple Load Balancer vs Restricted Weighted Random Load Balancer:
t-statistic: -12.59
p-value: 3.96E-15

Weighted Random Load Balancer vs Restricted Weighted Random Load Balancer:
t-statistic: 19.61
p-value: 1.79E-21
Expertium commented 2 months ago

You should probably switch to scientific notation if p-values are so small. And you should add average retention values as well, not just average volatility. EDIT: ok, I think that's all the info we need. So here's the ranking based on average volatility: Simple > Restricted > Weighted > None. And based on average retention, None > Weighted > Restricted > Simple. So we have to choose either Restricted or Weighted.

user1823 commented 2 months ago

I like Weighted Random Load Balancer more because it produces almost the same retention as no load balancing (88.6 vs 88.5%) while also significantly decreasing the volatility.

Moreover, it will perform better in breaking up clusters of related cards than the Restricted one.

Expertium commented 2 months ago

I don't like that it can schedule a card on a day with the highest due count of all days. That goes against the spirit of this feature.

Expertium commented 2 months ago

I ran the simulations on my own, but with a tweak to my formula: I used np.ceil instead of np.floor, which increases N by 1, making the restricted load balancer less...well, restrictive. Here are the results:

volatility mean and std:
                                                     mean       std
mode                                                               
No Load Balancer                                  0.13395  0.009254
Restricted Weighted Random Load Balancer (ceil)   0.09440  0.008042
Restricted Weighted Random Load Balancer (floor)  0.09020  0.007736
Simple Load Balancer                              0.08130  0.005526
Weighted Random Load Balancer                     0.10735  0.009281

t-test:
Restricted Weighted Random Load Balancer (ceil) vs No Load Balancer:
t-statistic: -14.43
p-value: 5.33E-17

Restricted Weighted Random Load Balancer (ceil) vs Simple Load Balancer:
t-statistic: 6.00
p-value: 5.62E-07

Restricted Weighted Random Load Balancer (ceil) vs Weighted Random Load Balancer:
t-statistic: -4.72
p-value: 3.21E-05

Restricted Weighted Random Load Balancer (ceil) vs Restricted Weighted Random Load Balancer (floor):
t-statistic: 1.68
p-value: 1.01E-01

No Load Balancer vs Simple Load Balancer:
t-statistic: 21.85
p-value: 4.10E-23

No Load Balancer vs Weighted Random Load Balancer:
t-statistic: 9.08
p-value: 4.69E-11

No Load Balancer vs Restricted Weighted Random Load Balancer (floor):
t-statistic: 16.22
p-value: 1.14E-18

Simple Load Balancer vs Weighted Random Load Balancer:
t-statistic: -10.79
p-value: 4.00E-13

Simple Load Balancer vs Restricted Weighted Random Load Balancer (floor):
t-statistic: -4.19
p-value: 1.62E-04

Weighted Random Load Balancer vs Restricted Weighted Random Load Balancer (floor):
t-statistic: 6.35
p-value: 1.90E-07

retention mean and std:
                                                     mean       std
mode                                                               
No Load Balancer                                  0.88610  0.000852
Restricted Weighted Random Load Balancer (ceil)   0.88140  0.000995
Restricted Weighted Random Load Balancer (floor)  0.87950  0.001051
Simple Load Balancer                              0.87565  0.000875
Weighted Random Load Balancer                     0.88515  0.000745

t-test:
Restricted Weighted Random Load Balancer (ceil) vs No Load Balancer:
t-statistic: -16.05
p-value: 1.63E-18

Restricted Weighted Random Load Balancer (ceil) vs Simple Load Balancer:
t-statistic: 19.41
p-value: 2.56E-21

Restricted Weighted Random Load Balancer (ceil) vs Weighted Random Load Balancer:
t-statistic: -13.49
p-value: 4.52E-16

Restricted Weighted Random Load Balancer (ceil) vs Restricted Weighted Random Load Balancer (floor):
t-statistic: 5.87
p-value: 8.56E-07

No Load Balancer vs Simple Load Balancer:
t-statistic: 38.26
p-value: 5.96E-32

No Load Balancer vs Weighted Random Load Balancer:
t-statistic: 3.75
p-value: 5.83E-04

No Load Balancer vs Restricted Weighted Random Load Balancer (floor):
t-statistic: 21.81
p-value: 4.35E-23

Simple Load Balancer vs Weighted Random Load Balancer:
t-statistic: -36.96
p-value: 2.13E-31

Simple Load Balancer vs Restricted Weighted Random Load Balancer (floor):
t-statistic: -12.59
p-value: 3.96E-15

Ranking based on volatility (from best to worst): Simple > Restricted (floor) > Restricted (ceil) > Weighted > None
Ranking based on retention (from best to worst): None > Weighted > Restricted (ceil) > Restricted (floor) > Simple

You may have noticed that the rankings for volatility and retention are the exact opposites of each other. You can obtain one just by reversing the order of the other one. So it doesn't seem like we can get both desirable properties - same retention but low volatility - simultaneously. I would suggest using Restricted (ceil) simply because it's right in the middle of both rankings. A compromise. I know that choosing something simply because it's right in the middle isn't the best kind of reasoning, but again, based on these results, it seems like we can't have both good properties, so a compromise is necessary.

brishtibheja commented 2 months ago

Are we trying to solve the retention problem or the clustering problem or both at the same time?

Expertium commented 2 months ago

Ideally, we want retention to be unaffected and the load to be very consistent. But we can't have both.

brishtibheja commented 2 months ago

Can we not just schedule newer cards differently, i.e., a bit earlier, to account for this?

Expertium commented 2 months ago

Well, say hello to Double Weighted Random Load Balancer:

    # schedule on days with lower load AND shorter intervals
    inv = [1 if r == 0 else (1 / r) * (1 / delta_t) for r, delta_t in zip(review_cnts, possible_intervals)]
    delta_t = random.choices(possible_intervals, weights=inv)[0]

This makes it so that the probabilities depend not just on the number of due cards, but also on interval lengths. This will systematically drag retention up, since shorter intervals result in higher probabilities.

Results:

volatility mean and std:
                                                     mean       std
mode                                                               
Double Weighted Random Load Balancer              0.10330  0.007901
No Load Balancer                                  0.13395  0.009254
Restricted Weighted Random Load Balancer (ceil)   0.09440  0.008042
Restricted Weighted Random Load Balancer (floor)  0.09020  0.007736
Simple Load Balancer                              0.08130  0.005526
Weighted Random Load Balancer                     0.10735  0.009281

t-test:
Restricted Weighted Random Load Balancer (ceil) vs No Load Balancer:
t-statistic: -14.43
p-value: 5.33E-17

Restricted Weighted Random Load Balancer (ceil) vs Simple Load Balancer:
t-statistic: 6.00
p-value: 5.62E-07

Restricted Weighted Random Load Balancer (ceil) vs Weighted Random Load Balancer:
t-statistic: -4.72
p-value: 3.21E-05

Restricted Weighted Random Load Balancer (ceil) vs Restricted Weighted Random Load Balancer (floor):
t-statistic: 1.68
p-value: 1.01E-01

Restricted Weighted Random Load Balancer (ceil) vs Double Weighted Random Load Balancer:
t-statistic: -3.53
p-value: 1.11E-03

No Load Balancer vs Simple Load Balancer:
t-statistic: 21.85
p-value: 4.10E-23

No Load Balancer vs Weighted Random Load Balancer:
t-statistic: 9.08
p-value: 4.69E-11

No Load Balancer vs Restricted Weighted Random Load Balancer (floor):
t-statistic: 16.22
p-value: 1.14E-18

No Load Balancer vs Double Weighted Random Load Balancer:
t-statistic: 11.26
p-value: 1.13E-13

Simple Load Balancer vs Weighted Random Load Balancer:
t-statistic: -10.79
p-value: 4.00E-13

Simple Load Balancer vs Restricted Weighted Random Load Balancer (floor):
t-statistic: -4.19
p-value: 1.62E-04

Simple Load Balancer vs Double Weighted Random Load Balancer:
t-statistic: -10.20
p-value: 1.94E-12

Weighted Random Load Balancer vs Restricted Weighted Random Load Balancer (floor):
t-statistic: 6.35
p-value: 1.90E-07

Weighted Random Load Balancer vs Double Weighted Random Load Balancer:
t-statistic: 1.49
p-value: 1.46E-01

Restricted Weighted Random Load Balancer (floor) vs Double Weighted Random Load Balancer:
t-statistic: -5.30
p-value: 5.22E-06

retention mean and std:
                                                     mean       std
mode                                                               
Double Weighted Random Load Balancer              0.88880  0.000768
No Load Balancer                                  0.88610  0.000852
Restricted Weighted Random Load Balancer (ceil)   0.88140  0.000995
Restricted Weighted Random Load Balancer (floor)  0.87950  0.001051
Simple Load Balancer                              0.87565  0.000875
Weighted Random Load Balancer                     0.88515  0.000745

t-test:
Restricted Weighted Random Load Balancer (ceil) vs No Load Balancer:
t-statistic: -16.05
p-value: 1.63E-18

Restricted Weighted Random Load Balancer (ceil) vs Simple Load Balancer:
t-statistic: 19.41
p-value: 2.56E-21

Restricted Weighted Random Load Balancer (ceil) vs Weighted Random Load Balancer:
t-statistic: -13.49
p-value: 4.52E-16

Restricted Weighted Random Load Balancer (ceil) vs Restricted Weighted Random Load Balancer (floor):
t-statistic: 5.87
p-value: 8.56E-07

Restricted Weighted Random Load Balancer (ceil) vs Double Weighted Random Load Balancer:
t-statistic: -26.34
p-value: 5.18E-26

No Load Balancer vs Simple Load Balancer:
t-statistic: 38.26
p-value: 5.96E-32

No Load Balancer vs Weighted Random Load Balancer:
t-statistic: 3.75
p-value: 5.83E-04

No Load Balancer vs Restricted Weighted Random Load Balancer (floor):
t-statistic: 21.81
p-value: 4.35E-23

No Load Balancer vs Double Weighted Random Load Balancer:
t-statistic: -10.53
p-value: 8.05E-13

Simple Load Balancer vs Weighted Random Load Balancer:
t-statistic: -36.96
p-value: 2.13E-31

Simple Load Balancer vs Restricted Weighted Random Load Balancer (floor):
t-statistic: -12.59
p-value: 3.96E-15

Simple Load Balancer vs Double Weighted Random Load Balancer:
t-statistic: -50.52
p-value: 1.89E-36

Weighted Random Load Balancer vs Restricted Weighted Random Load Balancer (floor):
t-statistic: 19.61
p-value: 1.79E-21

Weighted Random Load Balancer vs Double Weighted Random Load Balancer:
t-statistic: -15.26
p-value: 8.65E-18

Restricted Weighted Random Load Balancer (floor) vs Double Weighted Random Load Balancer:
t-statistic: -31.95
p-value: 4.59E-29

Ok, not gonna lie, I was NOT expecting this to actually work. Double weighted LB achieves higher retention than no LB, 88.8% vs 88.6%. And the volatility is lower than with no LB. Also, Double Weighted is strictly better than Weighted. @jakeprobst @L-M-Sherlock

user1823 commented 2 months ago

Seems that the Double Weighted LB is one of the best compromises yet. I encourage you to try even more tweaks.

PS: When posting the results, don't include the list of p-values if they are highly significant. Frankly speaking, they are currently just adding to the clutter.

jakeprobst commented 2 months ago

rather than just the interval itself (which is going to bias a bit too hard to earlier days?) why not use the distance from the target interval?

Expertium commented 2 months ago

Ok, here's one final tweak. I raised r to the power of 2, and also to the power of 1.5:

    inv = [1 if r == 0 else (1 / np.power(r, 2)) * (1 / delta_t) for r, delta_t in zip(review_cnts, possible_intervals)]
    delta_t = random.choices(possible_intervals, weights=inv)[0]

The output is getting too long, so I'm not including p-values, except for two.

volatility mean and std:
                                                     mean       std
mode                                                               
Double Weighted Random Load Balancer              0.10330  0.007901
Double Weighted Random Load Balancer (1.5)        0.09225  0.007152
Double Weighted Random Load Balancer (2)          0.09130  0.008430
No Load Balancer                                  0.13395  0.009254
Restricted Weighted Random Load Balancer (ceil)   0.09440  0.008042
Restricted Weighted Random Load Balancer (floor)  0.09020  0.007736
Simple Load Balancer                              0.08130  0.005526
Weighted Random Load Balancer                     0.10735  0.009281

No Load Balancer vs Double Weighted Random Load Balancer (2):
p-value: 9.01E-18

retention mean and std:
                                                     mean       std
mode                                                               
Double Weighted Random Load Balancer              0.88880  0.000768
Double Weighted Random Load Balancer (1.5)        0.88795  0.000759
Double Weighted Random Load Balancer (2)          0.88700  0.000795
No Load Balancer                                  0.88610  0.000852
Restricted Weighted Random Load Balancer (ceil)   0.88140  0.000995
Restricted Weighted Random Load Balancer (floor)  0.87950  0.001051
Simple Load Balancer                              0.87565  0.000875
Weighted Random Load Balancer                     0.88515  0.000745

No Load Balancer vs Double Weighted Random Load Balancer (2):
p-value: 1.37E-03

The retention is still higher with Double Weighted LB with r squared than with no LB, 88.7% vs 88.6%; meanwhile, volatility is definitely lower. In fact, volatility is almost as low as with my restricted method. I think this is enough tweaking. Jake, LMSherlock - your final thoughts, please.

Edit: here are all modes, sorted by retention (higher is better):

Double Weighted Random Load Balancer              0.8888
Double Weighted Random Load Balancer (1.5)        0.8880
Double Weighted Random Load Balancer (2)          0.8870
No Load Balancer                                  0.8861
Weighted Random Load Balancer                     0.8852
Restricted Weighted Random Load Balancer (ceil)   0.8814
Restricted Weighted Random Load Balancer (floor)  0.8800
Simple Load Balancer                              0.8757

And by volatility (lower is better):

Simple Load Balancer                              0.0813
Restricted Weighted Random Load Balancer (floor)  0.0902
Double Weighted Random Load Balancer (2)          0.0913
Double Weighted Random Load Balancer (1.5)        0.0923
Restricted Weighted Random Load Balancer (ceil)   0.0944
Double Weighted Random Load Balancer              0.1033
Weighted Random Load Balancer                     0.1074
No Load Balancer                                  0.1340

Double Weighted Random Load Balancer with r raised to the power of 2 is within top 3 best modes in both lists.

jakeprobst commented 2 months ago

I'd like to see the numbers for

inv = [1 if r == 0 else (1 / np.power(r, 2)) * (1 / abs(delta_t - possible_delta_t)) for r, possible_delta_t in zip(review_cnts, possible_intervals)]

maybe pow(2) it as well for good measure.

double weight load balance does look pretty solid though.

I was also experimenting with larger powers like 10 to sort of accentuate the actual load balance aspect of all this. Unsure how I feel about this currently.

Expertium commented 2 months ago

You need to add an extra condition for when abs(delta_t - possible_delta_t) is 0. Or use a different function altogether. EDIT: I'll try np.exp(-abs(delta_t - possible_delta_t))

jakeprobst commented 2 months ago
not_zero = lambda k: 1 if k == 0 else k
inv = [(1 / np.power(not_zero(r), 2)) * (1 / not_zero(abs(delta_t - possible_delta_t))) for r, possible_delta_t in zip(review_cnts, possible_intervals)]

¯\\\_(ツ)\_/¯

Expertium commented 2 months ago

Alright, I experimented with three more weighting functions: np.exp(-np.power(diff, 1/3)), np.exp(-np.power(diff, 1)), and:

    if diff == 0:
        return 1
    else:
        return 1/diff

diff is abs(delta_t - possible_delta_t)
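For clarity, here are those three distance weightings side by side, as I read them (my restatement, not code from the notebook; diff is the distance between the candidate day and the original delta_t):

```python
import numpy as np

def w_exp_cbrt(diff):
    # np.exp(-np.power(diff, 1/3))
    return float(np.exp(-np.power(diff, 1 / 3)))

def w_exp(diff):
    # np.exp(-np.power(diff, 1))
    return float(np.exp(-np.power(diff, 1)))

def w_inv(diff):
    # 1 at diff == 0, else 1/diff
    return 1.0 if diff == 0 else 1.0 / diff

# All three give weight 1 at diff == 0 and decay as the candidate day
# moves away from the original interval, just at different rates
for w in (w_exp_cbrt, w_exp, w_inv):
    print([round(w(d), 3) for d in (0, 1, 2, 3)])
```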

Here are the results, sorted:

                                                    volatility
Simple Load Balancer                                0.0813
Restricted Weighted Random Load Balancer (floor)    0.0902
Double Weighted Random Load Balancer (2)            0.0913
Double Weighted Random Load Balancer (exp pow 1)    0.0919
Double Weighted Random Load Balancer (1.5)          0.0923
Double Weighted Random Load Balancer (abs)          0.0934
Double Weighted Random Load Balancer (exp pow 1/3)  0.0939
Restricted Weighted Random Load Balancer (ceil)     0.0944
Double Weighted Random Load Balancer                0.1033
Weighted Random Load Balancer                       0.1074
No Load Balancer                                    0.1340
                                                    retention
Double Weighted Random Load Balancer                0.8888
Double Weighted Random Load Balancer (1.5)          0.8880
Double Weighted Random Load Balancer (2)            0.8870
No Load Balancer                                    0.8861
Double Weighted Random Load Balancer (exp pow 1)    0.8860
Double Weighted Random Load Balancer (exp pow 1/3)  0.8856
Weighted Random Load Balancer                       0.8852
Double Weighted Random Load Balancer (abs)          0.8851
Restricted Weighted Random Load Balancer (ceil)     0.8814
Restricted Weighted Random Load Balancer (floor)    0.8795
Simple Load Balancer                                0.8757

To be fair, it's probably better to sort not by retention itself, but by abs(retention - retention_without_LB), but whatever

Double Weighted Random Load Balancer (2) remains in top 3 on both lists. EDIT: here are the average ranks:

Double Weighted Random Load Balancer (2)              3.0
Double Weighted Random Load Balancer (1.5)            3.5
Double Weighted Random Load Balancer (exp pow 1)      4.5
Double Weighted Random Load Balancer                  5.0
Restricted Weighted Random Load Balancer (floor)      6.0
Simple Load Balancer                                  6.0
Double Weighted Random Load Balancer (exp pow 1/3)    6.5
Double Weighted Random Load Balancer (abs)            7.0
No Load Balancer                                      7.5
Restricted Weighted Random Load Balancer (ceil)       8.5
Weighted Random Load Balancer                         8.5

In case you don't understand what this means: if a mode is number one (best) on both lists, its average rank is 1.0. If it's number one on one list and number two (second best) on the other, its average rank is 1.5. If it's number one on one list and number three on the other, its average rank is 2.0. Lower = better.
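As a toy example of this scoring (hypothetical mode names, not the real results):

```python
# Position of each mode on the two lists (1 = best); names are made up
volatility_rank = {"A": 1, "B": 2, "C": 3}
retention_rank  = {"A": 3, "B": 1, "C": 2}

avg_rank = {m: (volatility_rank[m] + retention_rank[m]) / 2 for m in volatility_rank}
ranking = sorted(avg_rank, key=avg_rank.get)
print(ranking)        # ['B', 'A', 'C']
print(avg_rank["B"])  # 1.5: second on one list, first on the other
```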

Double Weighted Random Load Balancer with r raised to the power of 2 wins. inv = [1 if r == 0 else (1 / np.power(r, 2)) * (1 / delta_t) for r, delta_t in zip(review_cnts, possible_intervals)]

Expertium commented 2 months ago

@L-M-Sherlock before making the final decision, it would be nice to also quantify clustering. My method is kinda complicated though:

1. If a card that was previously reviewed on day X has been scheduled on day Y after being reviewed, check if any other cards have also been reviewed on day X before and are now due on day Y. If yes, append the number of such cards to an array called, say, clusters. For example, if 3 cards have been reviewed on day 50 and all 3 are now scheduled on day 55, append 3.
2. Do this for every review of every card.
3. The final score is len(clusters)/number_of_reviews. In other words, it's the ratio of reviews of cards that have been scheduled on the same day two times in a row to the total number of all reviews.

Then we can compare variants of LB on all 3 dimensions: retention, volatility, clustering. EDIT: I PMed you my .py file on Discord.

L-M-Sherlock commented 2 months ago
  • If a card that was previously reviewed on day X has been scheduled on day Y after being reviewed, check if any other cards have also been reviewed on day X before and are now due on day Y. If yes, append the number of such cards to an array called, say, clusters. For example, if 3 cards have been reviewed on day 50 and all 3 are now scheduled on day 55, append 3.

Sorry, I don't understand. What if I have 100 reviews today, and 40 of them from 1 day ago, 30 from 2 days ago, 20 from 3 days ago, and 10 from 4 days ago? What's the final score?

Expertium commented 2 months ago

Uh...100/100. Ok, I see the problem. Maybe you or someone else can come up with a better way to quantify clustering.

L-M-Sherlock commented 2 months ago

What about np.std([40, 30, 20, 10])?

Expertium commented 2 months ago

I think it's better to use len(), in other words, the number of unique values. In this case, len is 4. Actually, use 1/len, so that higher values = more clustering and lower values = less clustering; I think that's more intuitive. 1/4 = a lot of clustering; 1/100 = no clustering, each interval is unique.

This is for one day. The final score would be a weighted average across all days, weighted by the number of reviews on each day. So 0 < final score <= 1.
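Here is a short sketch of this score as I understand it (the per-day grouping and weighting are my reading of the description above, not code from the simulator):

```python
import numpy as np

def day_clustering_score(prev_intervals):
    # prev_intervals: for one day, how many days ago each due card was last reviewed.
    # 1 / (number of unique values); 1.0 means every card shares the same interval.
    return 1 / len(set(prev_intervals))

def overall_clustering(per_day):
    # Weighted average across days, weighted by each day's review count
    scores = [day_clustering_score(day) for day in per_day]
    weights = [len(day) for day in per_day]
    return float(np.average(scores, weights=weights))

print(day_clustering_score([1, 1, 1, 2, 2, 3]))  # 1/3: three unique values
print(overall_clustering([[1, 1], [2, 3]]))      # 0.75
```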

Expertium commented 2 months ago

Since @L-M-Sherlock appears to be busy, I have implemented the idea above myself. However, I soon found out that measuring clustering across the entire 90-day period gives pretty much the same results. I changed it so that it only measures clustering on the first 15 days, but even then the results are pretty much the same for all variants. So either my idea with measuring clustering based on how many intervals are unique is bad, or clustering isn't an issue in the first place.

Here are the results. Lower clustering score = better. Score of 1 means that all intervals are exactly the same. The closer the value is to 0, the more intervals are unique. For example, for [1, 1, 1, 2, 2, 3], the clustering score is 1/3, since there are 3 unique values.

                                                  volatility (mean)
Simple Load Balancer                              0.0847
Double Weighted Random Load Balancer              0.0914
Restricted Weighted Random Load Balancer (floor)  0.0925
Restricted Weighted Random Load Balancer (ceil)   0.0943
Weighted Random Load Balancer                     0.0993
No Load Balancer                                  0.1250

                                                  retention (mean)                           
Double Weighted Random Load Balancer              0.8870
No Load Balancer                                  0.8868
Weighted Random Load Balancer                     0.8861
Restricted Weighted Random Load Balancer (ceil)   0.8814
Restricted Weighted Random Load Balancer (floor)  0.8809
Simple Load Balancer                              0.8775

                                                  clustering (mean)
No Load Balancer                                  0.1481
Simple Load Balancer                              0.1495
Double Weighted Random Load Balancer              0.1498
Restricted Weighted Random Load Balancer (ceil)   0.1500
Weighted Random Load Balancer                     0.1500
Restricted Weighted Random Load Balancer (floor)  0.1501

For clustering, all p-values are >0.05, meaning that all variants perform about the same.

user1823 commented 2 months ago

Maybe we need a different way to estimate the amount of clustering. I used ChatGPT to get some ideas. They seem to be worth looking into.

ChatGPT output:


To get a single value that summarizes how often any two cards are reviewed on the same date, you can use aggregate measures that capture the overall tendency of event co-occurrence. Here are a few approaches to compute such a value:

1. Co-occurrence Ratio

Calculate a single co-occurrence ratio that summarizes the frequency with which any two events occur together relative to their individual frequencies.

Steps:

  1. Create a Co-occurrence Matrix: For each pair of events, count how often they occur on the same date.
  2. Compute Individual Frequencies: Count how often each event occurs.
  3. Calculate the Co-occurrence Ratio: For each pair of events, divide the co-occurrence count by the product of their individual frequencies.

Formula:

$$\text{Co-occurrence Ratio} = \frac{\sum_{i \neq j} \text{Co-occurrence}_{ij}}{\sum_{i} \text{Occurrences}_{i}}$$

where $\text{Co-occurrence}_{ij}$ is the count of times events $i$ and $j$ co-occur, and $\text{Occurrences}_{i}$ is the total count of event $i$ across the time series.

2. Jaccard Similarity Index

The Jaccard similarity index measures the similarity between two sets and can be adapted for events.

Formula:

$$\text{Jaccard Index} = \frac{|A \cap B|}{|A \cup B|}$$

where $A$ and $B$ are the sets of dates on which events occur.

For multiple events, you can generalize this by averaging the pairwise Jaccard indices.
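The pairwise-averaged Jaccard index described above can be sketched in a few lines; the card names and dates here are made up purely for illustration:

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index of two sets of dates."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Hypothetical review dates for three cards
events = {
    'card A': ['08-01', '08-02', '08-05'],
    'card B': ['08-02', '08-05', '08-07'],
    'card C': ['08-01', '08-07'],
}

# Average the Jaccard index over all unordered pairs of cards
pairwise = [jaccard(x, y) for x, y in combinations(events.values(), 2)]
average_jaccard = sum(pairwise) / len(pairwise)
```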

3. Dice Coefficient

The Dice coefficient is another measure of similarity that can be used for co-occurrence.

Formula:

$$\text{Dice Coefficient} = \frac{2 \cdot |A \cap B|}{|A| + |B|}$$

4. Average Pairwise Co-occurrence

Calculate the average number of co-occurrences across all pairs of events.

Steps:

  1. Create a Co-occurrence Matrix: For all pairs of events.
  2. Compute the Average Co-occurrence: Sum the co-occurrences of all pairs and divide by the number of pairs.

Formula:

$$\text{Average Pairwise Co-occurrence} = \frac{\sum_{i \neq j} \text{Co-occurrence}_{ij}}{\text{Number of Pairs}}$$

5. Correlation of Event Presence

Compute the average pairwise correlation between events.

Steps:

  1. Binary Presence Matrix: Create a binary matrix where rows represent dates and columns represent events.
  2. Compute Pairwise Correlations: Calculate the Pearson or Spearman correlation between each pair of event columns.
  3. Average Correlations: Compute the average of these correlations.
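A minimal pandas sketch of these steps, using a hypothetical binary presence matrix:

```python
import numpy as np
import pandas as pd

# Hypothetical binary presence matrix: rows = dates, columns = events
df = pd.DataFrame({
    'Event 1': [1, 1, 0, 1],
    'Event 2': [0, 1, 1, 1],
    'Event 3': [1, 0, 1, 0],
})

# Pairwise Pearson correlations between event columns
corr = df.corr(method='pearson')

# Average the upper-triangle (off-diagonal) entries so each pair counts once
iu = np.triu_indices_from(corr.values, k=1)
average_correlation = corr.values[iu].mean()
```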

Example Using Python

Here’s an example of how to compute the Average Pairwise Co-occurrence:

import numpy as np  # needed for np.triu / np.ones_like below
import pandas as pd

# Sample data
data = {
    'Date': ['2024-08-01', '2024-08-02', '2024-08-03'],
    'Event 1': [1, 1, 0],
    'Event 2': [0, 1, 1],
    'Event 3': [1, 0, 1]
}

df = pd.DataFrame(data)

# Co-occurrence calculation
co_occurrence_matrix = pd.DataFrame(index=['Event 1', 'Event 2', 'Event 3'], columns=['Event 1', 'Event 2', 'Event 3'], data=0)

for event1 in co_occurrence_matrix.index:
    for event2 in co_occurrence_matrix.columns:
        if event1 != event2:
            co_occurrence_matrix.loc[event1, event2] = ((df[event1] == 1) & (df[event2] == 1)).sum()

# Average Pairwise Co-occurrence
num_pairs = len(co_occurrence_matrix.index) * (len(co_occurrence_matrix.index) - 1) / 2
total_co_occurrences = co_occurrence_matrix.where(np.triu(np.ones_like(co_occurrence_matrix, dtype=bool), k=1)).stack().sum()

average_pairwise_co_occurrence = total_co_occurrences / num_pairs

print("Average Pairwise Co-occurrence:", average_pairwise_co_occurrence)

In this script, the co-occurrence matrix counts, for each pair of events, the dates on which both occurred; the average is then taken over the upper triangle of the matrix so that each pair is counted only once.

Choose the method that best fits your data and goals. Each method provides a different perspective on how often events co-occur.

brishtibheja commented 2 months ago

How is retention higher for double weighted? You said this might be some error; is it an error?

Also, about clustering: it seems you're counting the number of unique intervals among the cards due on a particular date, i.e., among all the cards scheduled for today, how many have unique intervals. How about "among all the cards rated today, how many have unique intervals"?

edit: ngl chatgpt can be a boon to creativity.

Expertium commented 2 months ago

How is retention higher for double weighted? You said this might be some error, is it error?

Later, once I have a good metric for clustering, I'll re-run everything with 150 samples instead of 20, to get a definitive answer. For now, I'm thinking about quantifying clustering.

Expertium commented 2 months ago

Pairwise comparisons sound like a huge pain to implement, not to mention the computational complexity. If there are 10 000 reviews, that's (10 000 * 9 999) / 2 = 49 995 000 pairs. Even if I could implement it, it would likely slow the simulation down. But I'll see what I can do.

Expertium commented 2 months ago

Ok, I figured out a way of doing this. I'm not saying I figured out a good way, though.

            matrix = [[-1 for n in range(learn_days)] for id in range(min(learn_days * new_cards_limits, deck_size))]
            # -1 = not introduced yet, 0 = not reviewed, 1 = reviewed

The matrix gets updated after a new day starts + after each review. It contains information about every review of every card on every day. And then, after the simulation is done, there is this monstrosity. It's a for loop inside a for loop inside a for loop:

            n = 0
            co_occurred = 0
            for day in range(learn_days):
                for card1 in range(min(learn_days * new_cards_limits, deck_size)):
                    for card2 in range(card1 + 1, min(learn_days * new_cards_limits, deck_size)):
                        # don't count cards that haven't been introduced yet
                        if (matrix[card1][day] > -1) and (matrix[card2][day] > -1):
                            # 0-0 also counts
                            n += 1
                        if (matrix[card1][day] == 1) and (matrix[card2][day] == 1):
                            co_occurred += 1

This loop takes longer than the simulation itself. This is likely the most inefficient method possible, but oh well. Anyway, the final score is clustering_score = co_occurred / n: the ratio of card pairs that were both reviewed on the same day to all pairs of cards that had both already been introduced, summed over all days. For example, a value of 0.01 means that there is a 1% chance of two cards both being reviewed on the same randomly selected day. Lower = better.
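For what it's worth, the same score can be computed without the pairwise loop: on each day, every pair of already-introduced cards contributes to n and every pair of reviewed cards contributes to co_occurred, so per-day counts are enough. A sketch of this equivalent O(cards × days) version, assuming the same matrix encoding (-1 / 0 / 1) as above:

```python
import numpy as np

def clustering_score(matrix):
    """Per-day counting equivalent of the triple loop.

    matrix[card][day]: -1 = not introduced yet, 0 = not reviewed, 1 = reviewed.
    On each day, C(introduced, 2) pairs count toward n and
    C(reviewed, 2) pairs count toward co_occurred.
    """
    m = np.asarray(matrix)
    introduced = (m > -1).sum(axis=0)  # introduced cards per day
    reviewed = (m == 1).sum(axis=0)    # reviewed cards per day
    n = (introduced * (introduced - 1) // 2).sum()
    co_occurred = (reviewed * (reviewed - 1) // 2).sum()
    return co_occurred / n if n else 0.0
```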

Here are the results:

                                                  volatility (mean)
Simple Load Balancer                              0.0847
Double Weighted Random Load Balancer              0.0914
Restricted Weighted Random Load Balancer (floor)  0.0925
Restricted Weighted Random Load Balancer (ceil)   0.0943
Weighted Random Load Balancer                     0.0993
No Load Balancer                                  0.1250

                                                  retention (mean)
Double Weighted Random Load Balancer              0.8870
No Load Balancer                                  0.8868
Weighted Random Load Balancer                     0.8861
Restricted Weighted Random Load Balancer (ceil)   0.8814
Restricted Weighted Random Load Balancer (floor)  0.8809
Simple Load Balancer                              0.8775

                                                  clustering (mean)                                                
Simple Load Balancer                              0.0106
Restricted Weighted Random Load Balancer (floor)  0.0110
Restricted Weighted Random Load Balancer (ceil)   0.0112
No Load Balancer                                  0.0117
Weighted Random Load Balancer                     0.0118
Double Weighted Random Load Balancer              0.0120

The amount of clustering is lower with the simple LB than with no LB (p-value=4.03E-16). I don't have an explanation for that. And while the difference is statistically significant, it's not practically significant: all the numbers are close to 0.01.

Expertium commented 2 months ago

I ran it again, but replaced `for day in range(learn_days):` with `for day in range(15):` in the loop; in other words, this time clustering is calculated only based on the first 15 days.

                                                  clustering (mean)      
Simple Load Balancer                              0.0847
Restricted Weighted Random Load Balancer (ceil)   0.0853
Restricted Weighted Random Load Balancer (floor)  0.0854
Weighted Random Load Balancer                     0.0878
Double Weighted Random Load Balancer              0.0880
No Load Balancer                                  0.0905

Same results again: somehow, simple LB results in lower clustering than no LB. So both methods - based on the number of unique interval lengths and based on counting co-occurrences - tell us that clustering is about the same across variants, and whether I limit the window to 15 days or not has no effect either.

Expertium commented 2 months ago

I also tried Jaccard Similarity Coefficient.

            jaccard = []
            for card1 in range(min(learn_days * new_cards_limits, deck_size)):
                for card2 in range(card1 + 1, min(learn_days * new_cards_limits, deck_size)):
                    intersection = 0
                    union = 0
                    for day in range(learn_days):
                        # don't count cards that haven't been introduced yet
                        if (matrix[card1][day] > -1) and (matrix[card2][day] > -1):
                            # 0-0 doesn't count, only 0-1, 1-0 and 1-1
                            if matrix[card1][day] + matrix[card2][day] > 0:
                                union += 1
                        if (matrix[card1][day] == 1) and (matrix[card2][day] == 1):
                            intersection += 1
                    # guard against division by zero for pairs with an empty union
                    if union > 0:
                        jaccard.append(intersection / union)

            clustering_score = np.mean(jaccard)

It's similar to counting co-occurrences; the big difference is that, unlike last time, when "no review of card 1 - no review of card 2" counted towards n (the total number of all occurrences), here it doesn't count: "union" means "only A, or only B, or both", but not "neither A nor B". The structure of the loop is also a little different. As you might have guessed, the results are similar: simple LB produces less clustering than no LB, p-value=2.79E-13.

                                                  clustering (mean)   
Simple Load Balancer                              0.0547
Restricted Weighted Random Load Balancer (floor)  0.0553
Restricted Weighted Random Load Balancer (ceil)   0.0557
No Load Balancer                                  0.0566
Weighted Random Load Balancer                     0.0571
Double Weighted Random Load Balancer              0.0573
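As an aside, the pairwise Jaccard scores can also be computed with matrix products instead of the inner day loop, which should be considerably faster. A sketch, assuming the same matrix encoding (-1 / 0 / 1) as before:

```python
import numpy as np

def mean_jaccard(matrix):
    """Average pairwise Jaccard index over all card pairs.

    matrix[card][day]: -1 = not introduced yet, 0 = not reviewed, 1 = reviewed.
    """
    m = np.asarray(matrix)
    R = (m == 1).astype(np.int64)   # reviewed
    I = (m > -1).astype(np.int64)   # introduced
    Z = (m == 0).astype(np.int64)   # introduced but not reviewed
    inter = R @ R.T                 # days on which both cards were reviewed
    union = I @ I.T - Z @ Z.T       # days both introduced, minus days both unreviewed
    iu = np.triu_indices(len(m), k=1)
    inter, union = inter[iu], union[iu]
    mask = union > 0                # skip pairs with an empty union
    return (inter[mask] / union[mask]).mean()
```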

With all 3 methods - unique intervals, co-occurrences, Jaccard - the averages of different variants of LB are very close, and without a statistical significance test, it's not obvious whether the tiny differences can be attributed to noise or not. With the first method, the differences are not statistically significant. With the latter two, most differences are significant, but the results are counter-intuitive.

user1823 commented 2 months ago

It seems that No Load Balance doesn't apply fuzz in this simulator.

if delta_t < 2.5 or not enable_load_balance:
    return delta_t

But this doesn't completely explain the results. For example, Restricted Weighted Load Balancer should logically have more clustering than Weighted Random Load Balancer but the results show the opposite.

Expertium commented 2 months ago

It seems that No Load Balance doesn't apply fuzz in this simulator.

You're right. I'll fix that later.

Expertium commented 2 months ago

    if not enable_load_balance:
        # fuzz
        delta_t = random.choices(possible_intervals, weights=list(np.ones_like(possible_intervals)))[0]

This assigns equal probabilities to each interval. I also renamed "No Load Balancer" to "No Load Balancer (fuzz)", for the sake of clarity. For clustering, I'm using Jaccard, lower = better.

                                                  volatility (mean)                                             
Simple Load Balancer                              0.0847
Double Weighted Random Load Balancer              0.0914
Restricted Weighted Random Load Balancer (floor)  0.0925
Restricted Weighted Random Load Balancer (ceil)   0.0943
Weighted Random Load Balancer                     0.0993
No Load Balancer (fuzz)                           0.1252

                                                  retention (mean)                                               
No Load Balancer (fuzz)                           0.8881
Double Weighted Random Load Balancer              0.8870
Weighted Random Load Balancer                     0.8861
Restricted Weighted Random Load Balancer (ceil)   0.8814
Restricted Weighted Random Load Balancer (floor)  0.8809
Simple Load Balancer                              0.8775

                                                  clustering (mean)                                            
Simple Load Balancer                              0.0547
Restricted Weighted Random Load Balancer (floor)  0.0553
Restricted Weighted Random Load Balancer (ceil)   0.0557
No Load Balancer (fuzz)                           0.0567
Weighted Random Load Balancer                     0.0571
Double Weighted Random Load Balancer              0.0573

The results are almost the same as previously.

Expertium commented 2 months ago

@jakeprobst @L-M-Sherlock sorry for the trouble, but I advise you to read all comments made within the last 2 days. TL;DR: I tried 3 different methods of measuring clustering, and with all 3 methods the difference between different variants of LB is minuscule. Meanwhile, I will run the simulations with 150 samples to accurately measure the impact on retention.

jakeprobst commented 2 months ago

the simple load balancer doing the best with clustering is very surprising, to the point where I feel something is wrong somewhere? I'm not a math guy, so I can't add any meaningful input beyond that it just feels intuitively wrong.

Expertium commented 2 months ago

The problem is that all 3 methods of estimating clustering output very similar average values for every variant of LB, and the differences are very small.

                                                  uniqueness of intervals, 15 days (mean)
No Load Balancer                                  0.1481
Simple Load Balancer                              0.1495
Double Weighted Random Load Balancer              0.1498
Restricted Weighted Random Load Balancer (ceil)   0.1500
Weighted Random Load Balancer                     0.1500
Restricted Weighted Random Load Balancer (floor)  0.1501

                                                  co-occurrences, 90 days (mean)                                                
Simple Load Balancer                              0.0106
Restricted Weighted Random Load Balancer (floor)  0.0110
Restricted Weighted Random Load Balancer (ceil)   0.0112
No Load Balancer                                  0.0117
Weighted Random Load Balancer                     0.0118
Double Weighted Random Load Balancer              0.0120

                                                  Jaccard, 90 days (mean)                                            
Simple Load Balancer                              0.0547
Restricted Weighted Random Load Balancer (floor)  0.0553
Restricted Weighted Random Load Balancer (ceil)   0.0557
No Load Balancer (fuzz)                           0.0567
Weighted Random Load Balancer                     0.0571
Double Weighted Random Load Balancer              0.0573

With the first method, all values are very close to 0.15. With the second method, all values are very close to 0.011. With the third method, all values are very close to 0.055. So either all methods of measuring clustering are bad, or clustering is not an issue and we are worried for no reason.

brishtibheja commented 2 months ago

Can you do this on a real collection? Because it was vaibhav who brought it up first.

Expertium commented 2 months ago

It would take months.

brishtibheja commented 2 months ago

He specifically talked about new cards, so he excluded cards that went to relearning and was talking about a rather small time frame, probably 7 days.

Expertium commented 2 months ago

I don't really see a reason to do so when we have simulations. We just don't know how to measure what we want. Or we're worried about a non-existent problem, which is also possible.