Closed Erotemic closed 2 years ago
My preference is to defer entirely to numpy for this. If you can get them to accept your diff, I'd be happy to update seaborn to use it.
I'm going to close this for now but if you do submit this to numpy please link back to this issue; would be great to have this work in seaborn.
I did submit the idea to the numpy mailing list, but moderators haven't approved it yet. Numpy moves pretty slow, which is why I went here first, but I do agree it would be better to have this in numpy itself - if it is a mathematically valid thing to do; if not it might be better here as a heuristic, because it does seem better than the current method of just setting it to 10.
EDIT: It was approved, and their was one comment. I missed it. Just responded to it, hopefully discussion picks up.
For reference (and because there are useful links there), the one reply was:
To me, this feels like it might be a better fit for SciPy or possibly statsmodels (but maybe not since neither have histogram functions anymore).The challenge with weighted estimators is how the weights should be interpreted. Stata covers the most important cases of weights https://www.reed.edu/psychology/stata/gs/tutorials/weights.html. Would these be frequency weights? Stata supports only frequency weights https://www.stata.com/manuals/u11.pdf#u11.1.6weight.
I started a PR to numpy for this feature: https://github.com/numpy/numpy/pull/20655
For univariate histogram data with weights seaborn currently just sets the number of bins to 10 and warns the user that they will need to adjust.
Based on the numpy automatic bin selection code, I wrote something similar that seems to work well for weighted histogram points. I'm not sure if there is some theoretical problem with computing bandwidth estimators for weighted data, which is why numpy also doesn't have it, but the adaption Sturges and the Freedman-Diaconis bandwidth estimators I wrote seem to work well, and in any case might be a better default to use than "10".
The MWE of the proposed code is here:
If there is interest in this, I can make a PR where something like this replaces the default in the
seaborn.distributions.plot_univariate_histogram
code.If there is some theoretical issue I'm unaware of that makes this strategy invalid, please send me a pointer to more info.