jamesotto852 / ggdensity

An R package for interpretable visualizations of bivariate density estimates
https://jamesotto852.github.io/ggdensity/

`boundary_x` and `boundary_y` #2

Open dkahle opened 3 years ago

dkahle commented 3 years ago

It'd be neat to have boundary_x and boundary_y arguments that you could pass into geom_hdr() and geom_hdr_lines() when method = "histogram"; see my examples in the documentation for a reminder of how boundary works for geom_histogram(). In fact, it'd be nice to have them work for all the methods. Have you come across any theory that addresses correcting density estimators for restricted support? The naive way would simply be to set the estimate to 0 outside the support and multiplicatively redistribute the lost mass over the rest of the density (a la the truncated normal distribution), but I'd imagine others have thought about it more.
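To make the naive approach concrete, here is a minimal sketch in base R. It is one-dimensional for brevity and assumes the support is [0, Inf); the variable names are illustrative only and nothing here is part of ggdensity.

```r
set.seed(1)
x <- rexp(500)                 # sample whose true support is [0, Inf)

# Unrestricted Gaussian KDE; part of the estimated mass leaks below 0
fit <- density(x, n = 1024)
dx  <- diff(fit$x[1:2])        # spacing of the evaluation grid

inside      <- fit$x >= 0      # grid points inside the assumed support
mass_inside <- sum(fit$y[inside]) * dx   # Riemann-sum mass on the support

# Set the estimate to 0 outside the support and rescale the remainder
y_trunc <- ifelse(inside, fit$y / mass_inside, 0)

total_mass <- sum(y_trunc) * dx          # equals 1 by construction
```

This only rescales the estimate. It does nothing about the bias near the boundary, which is what the more careful boundary-corrected estimators aim to fix.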

jamesotto852 commented 3 years ago

I agree that implementing boundary_x and boundary_y mimicking boundary from geom_histogram() is a good idea: there are definitely use cases where the end user wants fine control over where the bin breaks occur. However, I do not think these arguments would yield a restricted support. At least, that does not agree with my understanding of the boundary argument of geom_histogram(), which seems to parameterize the location of an arbitrary bin break.
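A small base R sketch of that reading of boundary: the argument pins one bin break, and the remaining breaks follow from the bin width, so the bins still cover all of the data. The helper breaks_from_boundary() is hypothetical and is not how ggplot2 or ggdensity computes breaks internally.

```r
# Hypothetical helper: bin breaks of width `binwidth`, aligned so that
# `boundary` falls on a break, extended to cover the range of `x`
breaks_from_boundary <- function(x, binwidth, boundary = 0) {
  lo <- boundary + floor((min(x) - boundary) / binwidth) * binwidth
  hi <- boundary + ceiling((max(x) - boundary) / binwidth) * binwidth
  seq(lo, hi, by = binwidth)
}

x <- c(0.2, 1.7, 2.4, 3.9)

b0 <- breaks_from_boundary(x, binwidth = 1, boundary = 0)
# breaks at 0, 1, 2, 3, 4

b1 <- breaks_from_boundary(x, binwidth = 1, boundary = 0.5)
# breaks at -0.5, 0.5, 1.5, 2.5, 3.5, 4.5
```

In both cases every observation lands in some bin. With boundary = 0.5 the first bin even extends below 0, so shifting the breaks does not by itself restrict the support.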

Also, I do not see how they would be implemented for anything besides method = "histogram" (and the forthcoming method = "freqpoly"), as the other estimators do not perform any binning except for the discretization involved in the Riemann sum. In that case, the bins are so small that I have to think the location of the breaks is irrelevant.
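As a rough check of that intuition, the sketch below computes a Riemann sum of a standard normal density on a fine grid, then repeats it with the grid shifted by about half a cell. The grid size, range, and the function name riemann_mass() are assumptions made for illustration; this is not the grid ggdensity uses.

```r
# Riemann-sum mass of a standard normal density on a fine grid whose
# breaks are shifted by `offset`
riemann_mass <- function(offset, n = 512) {
  grid <- seq(-5 + offset, 5 + offset, length.out = n)
  sum(dnorm(grid)) * diff(grid[1:2])
}

m_aligned <- riemann_mass(0)      # grid centered on 0
m_shifted <- riemann_mass(0.01)   # grid shifted by roughly half a cell

abs(m_aligned - m_shifted)        # negligible for a smooth density
```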

It's possible that xlim and ylim are already close to what you have in mind, especially if we implemented some kind of expand = FALSE argument. geom_hdr() only uses data/draws within the rectangle defined by rangex and rangey. I didn't think very hard about how I implemented them, and maybe with some changes they could provide a naive way to indicate bounded support. Also, in a way, they already parameterize what boundary_x and boundary_y would; something to think about.

I haven't done a very in-depth search, but I have come across various methods for dealing with density estimation in the context of restricted supports. The R package bde implements several estimators; however, it only deals with 1-dimensional data. In fact, I haven't come across anything that deals with bivariate density estimation with a restricted support (it's certainly possible I just haven't looked hard enough). I imagine we could extend some of the methods implemented in bde (e.g. Müller, Chen), but I haven't read through or understood them yet. If this hasn't been done, I imagine it could be an interesting topic for another paper!