missing values can be hidden in the presence of large (enough) N

ropensci / visdat

Preliminary Exploratory Visualisation of Data

https://docs.ropensci.org/visdat/


missing values can be hidden in the presence of large (enough) N #18

Closed njtierney closed 8 years ago

njtierney commented 8 years ago

Sometimes if there is only one cell missing in a large dataset of a few thousand, you cannot see the missing cell.

So I think that a little message for vis_miss and vis_dat that just spits out:

There are X missing values in dataset

this could just be

paste("There are", sum(is.na(df)), "missing values in dataset")

And perhaps if there are ZERO missing values, it could state that "No missing values found".
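
A minimal sketch of that message, using base R only. The helper name `missing_message()` is hypothetical and is not part of visdat; the singular/plural handling via `ngettext()` is an addition of this sketch.

```r
# Hypothetical helper (not a visdat function): summarise how many cells
# in a data frame are missing.
missing_message <- function(df) {
  n_miss <- sum(is.na(df))
  if (n_miss == 0) {
    return("No missing values found")
  }
  # ngettext() picks the singular or plural template depending on n_miss
  sprintf(
    ngettext(
      n_miss,
      "There is %d missing value in dataset",
      "There are %d missing values in dataset"
    ),
    n_miss
  )
}

# A single NA among 10,000 cells, which would be hard to spot in a plot
df <- data.frame(x = c(NA, seq_len(4999)), y = seq_len(5000))
message(missing_message(df))
```

Returning the string rather than printing it directly leaves the caller free to use `message()`, a plot subtitle, or a legend label.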

mdlincoln commented 8 years ago

:+1:

I've also had the complementary issue, where almost all the values in a column are missing, but a few present values are too small to be seen on the plot.

I've tried using the alpha levels to indicate when all(is.na(x)), making completely missing rows translucent. I suppose if you wanted to get fancy, you could have 3 alpha levels: entirely present, entirely missing, and in between - but that might get visually confusing.
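
A rough sketch of the three-level idea in base R. The helper name `missingness_level()` and the alpha values are assumptions made for illustration; the example classifies columns, and the same helper could be applied to rows with `apply(df, 1, ...)`.

```r
# Hypothetical helper: classify a vector by how much of it is missing.
missingness_level <- function(x) {
  if (all(is.na(x))) {
    "entirely missing"
  } else if (anyNA(x)) {
    "in between"
  } else {
    "entirely present"
  }
}

# Illustrative alpha values; the exact numbers are arbitrary.
alpha_for_level <- c(
  "entirely present" = 1,
  "in between"       = 0.6,
  "entirely missing" = 0.2
)

df <- data.frame(
  a = c(1, 2, 3),
  b = c(NA, 2, NA),
  c = c(NA_real_, NA_real_, NA_real_)
)

# One level per column, then look up the matching alpha
levels_by_col <- vapply(df, missingness_level, character(1))
alpha_by_col  <- setNames(unname(alpha_for_level[levels_by_col]), names(df))
```

The resulting `alpha_by_col` vector could then be mapped to the alpha aesthetic of whichever plotting layer draws the cells.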

njtierney commented 8 years ago

Glad it's not just me having this problem!

I like the idea of using transparency, but I'm not sure how this scales when you have larger data, such that there are more data than pixels.

njtierney commented 8 years ago

A possible solution here is to include a marginal histogram.

njtierney commented 8 years ago

Or can we stick in a little strip along the bottom or top of the graphic to indicate whether there is data missing or present?

We also need to make sure that the names and positions of the barplot variables match the names and positions of the variables in vis_dat.

mdlincoln commented 8 years ago

I think I see where you are going with the histogram idea - but could you end up with the same problem, where a lot of missing values in one column end up obscuring the one missing value in another column because they would expand the scale of the histogram too much?

One other possibility is using geom_rug() to mark columns where any(is.na(x)).
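
To make the scale concern concrete, here is a small base R sketch with made-up data. Per-column counts differ by three orders of magnitude, while an `any(is.na(x))`-style flag treats both columns the same. The `rug_data` name is hypothetical; it is just one way the flagged columns could be handed to a layer such as `geom_rug()`.

```r
# Made-up data: column a is half missing, column b has a single NA,
# column c is complete.
df <- data.frame(
  a = c(rep(NA, 1000), seq_len(1000)),
  b = c(NA, seq_len(1999)),
  c = seq_len(2000)
)

# Counts, as a marginal histogram or barplot would show them:
# the bar for b would be tiny next to the bar for a.
n_missing <- colSums(is.na(df))

# Flags, as a rug would show them: a and b are marked identically.
has_missing <- vapply(df, anyNA, logical(1))

# Columns to mark, one row per flagged variable
rug_data <- data.frame(variable = names(df)[has_missing])
```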

njtierney commented 8 years ago

Yeah you are absolutely right, we could run into the same problem.

I was thinking that some sort of bar could be placed above the columns to indicate whether there are any missings in that column; geom_rug() could be an interesting way to handle this.

Another option would be to include both the geom_rug() and the marginal histogram.

My only concern is that in adding these features the graph will become more "noisy" and harder to explain.

njtierney commented 8 years ago

Commit https://github.com/njtierney/visdat/commit/0fe211c147627608b0209321e779008a35cffd91 provides a partial solution to this by indicating when there is <0.1% missing data. However, this currently only works for vis_miss, and does not show up in vis_dat. That's the next step from here, I think.
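
For illustration only, a label of that kind could be computed along these lines. This is a sketch and not the code from the commit; the function name `label_pct_missing()` and its `threshold` argument are hypothetical.

```r
# Hypothetical helper: format the percentage of missing cells, collapsing
# very small non-zero percentages into a "<threshold%" label so that they
# are not rounded down to zero.
label_pct_missing <- function(df, threshold = 0.1) {
  pct <- mean(is.na(df)) * 100
  if (pct == 0) {
    "0%"
  } else if (pct < threshold) {
    paste0("<", threshold, "%")
  } else {
    paste0(round(pct, 1), "%")
  }
}

# One NA among 10,000 cells is 0.01% missing
df <- data.frame(x = c(NA, seq_len(4999)), y = seq_len(5000))
label_pct_missing(df)
```

Keeping zero and "below threshold" as separate cases is the point of the sketch: a legend that rounds 0.01% to 0% would hide exactly the cells this issue is about.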

njtierney commented 8 years ago

At the moment I am happy with this solution.