mozilla / python_mozetl

ETL jobs for Firefox Telemetry
https://mozilla.github.io/python_mozetl/
MIT License

Ensure diversity sampling instead of random (proportionate) sampling #205

Open mlopatka opened 6 years ago

mlopatka commented 6 years ago

https://github.com/mozilla/python_mozetl/blob/32d78c34dbb3c9ff5542f1ebc110f5aeb7fce340/mozetl/taar/taar_similarity.py#L131

The diversity of the donor pool currently rests only on the assumption that the higher-level clusters are substantially diverse. This could be improved by explicitly verifying cross-cluster diversity in the addons space.
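As a rough illustration of what such a verification could look like, here is a minimal sketch that scores cross-cluster diversity as the mean pairwise Jaccard distance between the add-on sets observed in each cluster. The function names and the `cluster_addons` input shape are hypothetical and not taken from `taar_similarity.py`; Jaccard distance is just one possible choice of metric.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b|; two empty sets count as identical."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - float(len(a & b)) / len(union)


def cross_cluster_diversity(cluster_addons):
    """Mean pairwise Jaccard distance between per-cluster add-on sets.

    cluster_addons: dict mapping a cluster id to an iterable of add-on
    GUIDs (hypothetical input shape, assumed for this sketch).
    Returns 0.0 when there are fewer than two clusters.
    """
    pairs = list(combinations(cluster_addons.values(), 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


example = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"d"}}
score = cross_cluster_diversity(example)
```

A score near 1 would suggest the clusters cover largely disjoint add-ons, while a score near 0 would suggest the clusters overlap heavily in the addons space despite being distinct at the higher level.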

mlopatka commented 5 years ago

The same concern comes up again where we specify a proportionate sampling strategy: https://github.com/mozilla/python_mozetl/blob/491fbda515f985f3156ff0c70859624fd4961ea8/mozetl/taar/taar_similarity.py#L168

One solution would be to specify sampling weights that boost the representation of specific (niche) clusters in the final sample, without compromising the non-addon diversity obtained from sampling the "large" clusters.

Even an inverse of the current strategy could be evaluated.
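To make the options concrete, here is a minimal sketch of a tunable allocation, assuming each cluster is weighted by `size ** alpha`. The function `cluster_sample_sizes` and the `alpha` parameter are hypothetical and only meant to compare strategies, not to describe what the linked code does today. With `alpha=1` this is proportionate sampling, `alpha=0` gives every cluster an equal share, and `alpha=-1` is the inverse strategy.

```python
def cluster_sample_sizes(cluster_sizes, total, alpha=1.0):
    """Split `total` samples across clusters with weight size ** alpha.

    cluster_sizes: dict mapping a cluster id to its number of members.
    alpha = 1 -> proportionate, 0 -> equal share, -1 -> inverse.
    Each allocation is rounded and capped at the cluster's size, so the
    allocations may not sum exactly to `total`.
    """
    # Empty clusters cannot be sampled from, so they get no weight.
    weights = {k: float(n) ** alpha for k, n in cluster_sizes.items() if n > 0}
    norm = sum(weights.values())
    return {
        k: min(cluster_sizes[k], int(round(total * w / norm)))
        for k, w in weights.items()
    }


sizes = {"big": 900, "niche": 100}
proportionate = cluster_sample_sizes(sizes, 100, alpha=1.0)
equal = cluster_sample_sizes(sizes, 100, alpha=0.0)
inverse = cluster_sample_sizes(sizes, 100, alpha=-1.0)
```

Intermediate values such as `alpha=0.5` would still favor large clusters but less steeply, which might be a way to lift niche clusters without giving up too much of the diversity that comes from the large ones.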

mlopatka commented 5 years ago

@Dexterp37 can you assign this issue to me please? I have insufficient privileges to grab it :|