mozilla / taar-lite

A lightweight version of the TAAR service intended for specific deployments with reduced feature visibility.
Mozilla Public License 2.0

MinInstallPrune does not behave the way we intend #47

Open dzeber opened 6 years ago

dzeber commented 6 years ago

The MinInstallPrune treatment is intended to drop rarely-installed add-ons from consideration in the TAAR-lite recommender. I think the way we'd expect it to behave is that add-ons falling below the threshold cannot be surfaced as recommendations.

In other words, edges leading to such rare add-ons would be deleted from the recommendation graph, and the columns corresponding to those add-ons would be deleted from the matrix representation.

However, the current implementation deletes the outer key from the nested dict:

from taar_lite.recommenders.guidguid import GuidGuidCoinstallRecommender
from taar_lite.recommenders.treatments import MinInstallPrune

dict_to_be_pruned = {
    'a': {'b': 10, 'c': 13},
    'b': {'c': 4, 'a': 10},
    'c': {'a': 13, 'b': 4}
}
ranking_dict = {
    "a": 100,
    "b": 100,
    "c": 1
}

pruner = MinInstallPrune()
pruner._set_min_install_threshold(ranking_dict)
# pruner.min_installs == 3.35

pruned_dict = pruner.treat(dict_to_be_pruned, ranking_dict=ranking_dict)
# pruned_dict == {
#     'a': {'b': 10, 'c': 13},
#     'b': {'c': 4, 'a': 10}
# }

This has the opposite effect: no recommendations will be shown for rare add-ons, but they may still show up as recommendations for other add-ons.
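
For comparison, here is a minimal sketch of the "drop the columns" behaviour described above. It is not the taar-lite implementation: `prune_inner_keys` is a hypothetical standalone helper, and the threshold is passed in directly (using the 3.35 value from the example) rather than computed from the ranking dict.

```python
def prune_inner_keys(coinstalls, install_counts, min_installs):
    """Remove rare add-ons from the inner dicts only.

    Hypothetical helper, not part of taar_lite. Rare add-ons keep their
    own row, so recommendations can still be made *for* them, but they
    can no longer be surfaced as a recommendation for anything else.
    """
    return {
        guid: {
            other: weight
            for other, weight in neighbours.items()
            if install_counts.get(other, 0) >= min_installs
        }
        for guid, neighbours in coinstalls.items()
    }


coinstalls = {
    'a': {'b': 10, 'c': 13},
    'b': {'c': 4, 'a': 10},
    'c': {'a': 13, 'b': 4},
}
install_counts = {'a': 100, 'b': 100, 'c': 1}

pruned = prune_inner_keys(coinstalls, install_counts, min_installs=3.35)
# pruned == {
#     'a': {'b': 10},
#     'b': {'a': 10},
#     'c': {'a': 13, 'b': 4}
# }
```

Here `c` is never recommended, but a user who has `c` installed still gets `a` and `b` as recommendations.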

It may be worth reconfirming here what we think the threshold pruning behaviour should be. I can see arguments for multiple approaches.

Another option would be to drop rare add-ons from consideration entirely: no recommendations are surfaced for them, and they are never recommended. However, then we get into an iterative pruning procedure: if the pruning drops all the subkeys for a given outer key, we would then need to prune the outer key, etc.
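
A rough sketch of what that iterative procedure could look like, assuming the same nested-dict representation. `prune_entirely` is a hypothetical name and the threshold is again passed in explicitly; the extra add-on `d` is made up to show the cascade.

```python
def prune_entirely(coinstalls, install_counts, min_installs):
    """Drop rare add-ons from both outer and inner keys, repeating
    until no outer key is left with an empty inner dict.

    Hypothetical helper, not part of taar_lite.
    """
    # Collect every GUID that appears as an outer or inner key.
    all_guids = set(coinstalls)
    for neighbours in coinstalls.values():
        all_guids.update(neighbours)

    dropped = {
        guid for guid in all_guids
        if install_counts.get(guid, 0) < min_installs
    }

    graph = coinstalls
    while True:
        graph = {
            guid: {o: w for o, w in neighbours.items() if o not in dropped}
            for guid, neighbours in graph.items()
            if guid not in dropped
        }
        # Outer keys that lost all of their subkeys must be pruned too,
        # which may in turn empty other rows on the next pass.
        emptied = {guid for guid, neighbours in graph.items() if not neighbours}
        if not emptied:
            return graph
        dropped |= emptied


coinstalls = {
    'a': {'b': 10, 'c': 13},
    'b': {'a': 10, 'c': 4},
    'c': {'a': 13, 'b': 4, 'd': 2},
    'd': {'c': 2},
}
install_counts = {'a': 100, 'b': 100, 'c': 1, 'd': 50}

pruned = prune_entirely(coinstalls, install_counts, min_installs=3.35)
# pruned == {'a': {'b': 10}, 'b': {'a': 10}}
```

Note that `d` is above the threshold but is removed anyway, because its only neighbour was the rare add-on `c`. Each pass either returns or drops at least one more key, so the loop terminates.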

Thoughts @mlopatka @birdsarah? This discussion relates to the one in issue #43.

birdsarah commented 6 years ago

First, MinInstallPrune is only on the dev branch, not in master, so it's not yet in production and there's no rush to fix it.

In current production, pruning happens at the recommend step here: https://github.com/mozilla/taar-lite/blob/master/taar_lite/recommenders/guid_based_recommender.py#L213

Looking at the corresponding code for MinInstallPrune: https://github.com/mozilla/taar-lite/blob/dev/taar_lite/recommenders/treatments.py#L57

I see that the problem is that I did a bad job porting the code. Some simple tests would have picked this up.

My understanding was that the low-install prune would remove the low-installed items from keys and values resulting in a pruned, still-symmetric matrix.
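
A quick way to check the "still-symmetric" property, as a sketch: `is_symmetric` is a hypothetical helper, `outer_only` is the output reported at the top of this issue, and `keys_and_values` is what I would expect after removing `c` from both keys and values.

```python
def is_symmetric(coinstalls):
    """Return True if every edge guid -> other has a matching
    edge other -> guid with the same weight."""
    return all(
        coinstalls.get(other, {}).get(guid) == weight
        for guid, neighbours in coinstalls.items()
        for other, weight in neighbours.items()
    )


# Output of the current MinInstallPrune, as reported above.
outer_only = {
    'a': {'b': 10, 'c': 13},
    'b': {'c': 4, 'a': 10},
}

# Expected result when 'c' is removed from keys and values.
keys_and_values = {
    'a': {'b': 10},
    'b': {'a': 10},
}

outer_only_ok = is_symmetric(outer_only)            # False: a -> c has no c -> a
keys_and_values_ok = is_symmetric(keys_and_values)  # True
```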

The point of the low-install prune is to preserve privacy.

The current code in production removes low-ranked items from recommendations, but still makes recommendations for low-ranked items. Is this privacy safe? If yes, then I can see that this is better, because we're making more recommendations.

birdsarah commented 6 years ago

@dzeber - if you want to do a PR with a test in it, you've already written a test case :D

birdsarah commented 6 years ago

And I should have said good catch, and apologies if this has slowed you down.

dzeber commented 6 years ago

No problem. The reason this came up is that I really like the approach on the dev branch to frame the graph operations as treatments, and I've been going over that code in detail to write documentation. In fact, I'm planning to submit a PR with some minor refactors to the treatment code, and I can include this there.

Like I said, I think there are arguments to be made for different pruning options, and this may benefit from a broader discussion outside of this issue. If the concern is privacy, pruning doesn't offer any real guarantee, although it does reduce the precision of what can be ascertained about user data by querying the TAAR service.

Suppose we have a rare, possibly sensitive add-on A, and the goal is to limit what can be inferred about A's users via knowledge about their coinstalled add-ons. Then I think it would make sense to drop both the keys and values for this add-on.

Of course, quantifying these arguments depends on the exact normalization and any other treatments we may apply, such as random edge modifications.

Personally, I think that pruning rare add-ons can also be viewed as a quality of experience improvement: maybe a really rare add-on is less likely to be a good recommendation just because it's more obscure, or maybe it's a new add-on that could benefit from more user testing, etc.