Closed njtierney closed 8 years ago
:+1:
I've also had the complementary issue, where almost all the values in a column are missing, but a few present values are too small to be seen on the plot.
I've tried using the alpha levels to indicate when `all(is.na(x))`, making completely missing rows translucent. I suppose if you wanted to get fancy, you could have 3 alpha levels: entirely present, entirely missing, and in between - but that might get visually confusing.
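A rough sketch of the three-level alpha idea, in case it helps. `column_alpha()` and the alpha values 1, 0.6 and 0.3 are made up for illustration and are not part of visdat; the result could be mapped to the alpha aesthetic of whatever geom draws the columns.

```r
# Hypothetical helper (not a visdat function): one alpha value per column.
column_alpha <- function(df) {
  prop_missing <- colMeans(is.na(df))
  ifelse(prop_missing == 1, 0.3,        # entirely missing: most translucent
         ifelse(prop_missing == 0, 1,   # entirely present: opaque
                0.6))                   # somewhere in between
}

# Toy data: `a` is complete, `b` is entirely missing, `c` is partly missing.
df <- data.frame(a = c(1, 2, 3), b = c(NA, NA, NA), c = c(1, NA, 3))
column_alpha(df)
```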
Glad it's not just me having this problem!
I like the idea of using transparency, but I'm not sure how this scales when you have larger data, such that there are more data than pixels.
A possible solution here is to include a marginal histogram.
Or can we stick in a little strip along the bottom or top of the graphic to indicate whether there is data missing or present?
We also need to make sure that the names/positions of the barplot variables match those in the `vis_dat` plot.
I think I see where you are going with the histogram idea - but could you end up with the same problem, where a lot of missing values in one column end up obscuring the one missing value in another column because they would expand the scale of the histogram too much?
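To put rough numbers on that scale problem: the counts and the 400 pixel panel height below are invented purely for illustration.

```r
# Made-up numbers: one column with 5000 missing values, one with a single NA.
n_missing <- c(mostly_missing = 5000, one_missing = 1)
panel_height_px <- 400  # assumed height of the marginal histogram

# Bar heights when both columns share one linear count axis.
bar_height_px <- n_missing / max(n_missing) * panel_height_px
bar_height_px
```

With these numbers the bar for the single missing value comes out well under one pixel tall, so it would vanish just like the cell in the main plot.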
One other possibility is using `geom_rug()` to mark columns where `any(is.na(x))`.
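Something along these lines is how the rug idea might look. This is only a sketch on a toy data frame, not how visdat builds its plots; `cells` and `flagged` are names invented here, and it assumes ggplot2 is installed.

```r
library(ggplot2)

# Toy data: `a` and `c` contain missing values, `b` does not.
df <- data.frame(a = c(1, NA, 3, 4),
                 b = c("x", "y", "z", "w"),
                 c = c(NA, NA, 1, 2))

# Long format, one row per cell. is.na() on a data frame returns a matrix,
# and as.vector() unrolls it column by column, matching `variable`.
cells <- data.frame(
  row      = rep(seq_len(nrow(df)), times = ncol(df)),
  variable = rep(names(df), each = nrow(df)),
  missing  = as.vector(is.na(df))
)

# Columns with at least one missing value get a rug tick.
has_na  <- vapply(df, function(x) any(is.na(x)), logical(1))
flagged <- data.frame(variable = names(df)[has_na])

p <- ggplot(cells, aes(x = variable, y = row)) +
  geom_raster(aes(fill = missing)) +
  geom_rug(data = flagged, aes(x = variable),
           sides = "t", inherit.aes = FALSE) +  # ticks along the top edge
  scale_y_reverse()
p
```

Because the tick has a fixed length, it stays visible however many rows the data has, which is the point of the suggestion.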
Yeah you are absolutely right, we could run into the same problem.
I was thinking that some sort of bar could be placed above the columns to indicate whether there are any missings in that column; `geom_rug()` could be an interesting way to handle this.
Another option would be to include both the `geom_rug()` and the marginal histogram.
My only concern is that in adding these features the graph will become more "noisy" and hard to explain.
Commit https://github.com/njtierney/visdat/commit/0fe211c147627608b0209321e779008a35cffd91 has provided a partial solution to this by indicating when there is <0.1% missing data. However, this currently only works for `vis_miss` and does not show up in `vis_dat`. That's the next step from here, I think.
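For anyone following along, the labelling idea can be sketched roughly like this. `label_missing()` is a hypothetical helper written for this comment and is not the code from the commit; the rounding to one decimal place is also an assumption.

```r
# Hypothetical helper (not the code from the commit): a percent-missing
# label per column, with a special label for tiny but non-zero amounts.
label_missing <- function(df) {
  pct <- colMeans(is.na(df)) * 100
  ifelse(pct == 0, "0%",
         ifelse(pct < 0.1, "<0.1%",
                paste0(round(pct, 1), "%")))
}

# Toy data: 2000 rows, with 1, 500 and 0 missing values per column.
df <- data.frame(a = c(NA, rep(1, 1999)),
                 b = c(rep(NA, 500), rep(1, 1500)),
                 c = rep(1, 2000))
label_missing(df)
```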
At the moment I am happy with this solution.
Sometimes if there is only one cell missing in a large dataset of a few thousand, you cannot see the missing cell.
So I think a little message for `vis_miss` and `vis_dat` would help, one that just spits out: "There are X number of missing values in dataset"
this could just be
And perhaps if there are ZERO missing values, it could state that "No missing values found".
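A minimal sketch of what such a message could look like. `missing_message()` is a hypothetical name used only here, not an existing visdat function, and the wording simply follows the two messages suggested above.

```r
# Hypothetical helper (not part of visdat): build the summary text.
missing_message <- function(df) {
  n_miss <- sum(is.na(df))
  if (n_miss == 0) {
    "No missing values found"
  } else {
    paste("There are", n_miss, "missing values in dataset")
  }
}

df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
message(missing_message(df))
message(missing_message(data.frame(a = 1:3)))
```

Returning the string and calling `message()` separately keeps the text easy to check and leaves it up to the plotting function whether to print it.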