ropensci / auunconf

repository for the Australian rOpenSci unconference 2016!

Exploration and visualisation of missing data #15

Open njtierney opened 8 years ago

njtierney commented 8 years ago

In my PhD research I work with medical data, and large amounts of it are often missing. In my attempts to explore missing data problems and make my life easier, I have done some work on two packages: ggmissing with Di Cook, and mex with Damjan Vukcevic. But as my PhD research continues, I have been finding it hard to dedicate serious time to continuing work on these packages.

I'd like to propose a project on one, or perhaps both of these packages.

A bit more about them:

ggmissing extends ggplot2 so that missing data can be visualised. This would basically involve creating a couple of ggplot2 geom_missing_* functions that could be added as a layer to a plot. For example, geom_missing_point() would add in and colour the missing points. You can see more about it on the GitHub repo, and at these slides.
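To make the geom_missing_point() idea concrete, here is a minimal sketch in Python (not R, and not the ggmissing code). It assumes one possible way of "adding in" missing points: a missing value is replaced by a placeholder just below the observed minimum and the point is flagged so a plotting layer can colour it. The names `shift_missing`, `missing_point_layer` and the `offset` parameter are made up for illustration.

```python
import math


def _is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or math.isnan(value)


def shift_missing(values, offset=0.1):
    """Replace missing values with a placeholder below the observed minimum.

    Returns (plot_values, flags), where flags[i] is True if values[i] was missing.
    The placeholder sits `offset` * range below the minimum (an assumed convention).
    """
    observed = [v for v in values if not _is_missing(v)]
    if not observed:
        raise ValueError("need at least one observed value")
    lo, hi = min(observed), max(observed)
    span = (hi - lo) or 1.0  # avoid a zero offset when all values are equal
    placeholder = lo - offset * span
    flags = [_is_missing(v) for v in values]
    plot_values = [placeholder if m else v for v, m in zip(values, flags)]
    return plot_values, flags


def missing_point_layer(xs, ys, offset=0.1):
    """Build plottable points, keeping rows that a scatterplot would normally drop."""
    px, mx = shift_missing(xs, offset)
    py, my = shift_missing(ys, offset)
    return [
        {"x": x, "y": y, "status": "missing" if (a or b) else "observed"}
        for x, y, a, b in zip(px, py, mx, my)
    ]


if __name__ == "__main__":
    # Made-up values: one point is missing x, another is missing y.
    for point in missing_point_layer([1.0, None, 3.0, 5.0], [10.0, 20.0, float("nan"), 30.0]):
        print(point)
```

The `status` field is what a plotting layer would map to colour, so the missing points sit in the margin of the plot and stay visually distinct from the observed ones.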

mex is a missingness exploration package. It builds on some research that I have done into using decision trees to explore missing data. The original idea of the package was to create a framework, or even a recommended path, for handling missing data. One idea was to break it into exploring, modelling, and confirming.
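As a rough illustration of the decision tree idea, the Python sketch below fits a single split (a "stump") that predicts whether one variable is missing from the value of another. This is a simplified stand-in for a full decision tree and is not the mex implementation. The function names, the Gini criterion, and the example columns `age` and `bp` are all assumptions made for illustration.

```python
def gini(flags):
    """Gini impurity of a list of booleans (True = target is missing)."""
    if not flags:
        return 0.0
    p = sum(flags) / len(flags)
    return 2 * p * (1 - p)


def best_missingness_split(rows, target, predictor):
    """Find the threshold on `predictor` that best separates rows where
    `target` is missing (None) from rows where it is observed.

    Returns (threshold, impurity_reduction), or None if no split is possible.
    """
    pairs = sorted(
        (row[predictor], row[target] is None)
        for row in rows
        if row[predictor] is not None
    )
    flags = [is_missing for _, is_missing in pairs]
    parent = gini(flags)
    best = None
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold fits between two equal values
        left, right = flags[:i], flags[i:]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(flags)
        gain = parent - weighted
        if best is None or gain > best[1]:
            best = ((lo + hi) / 2, gain)
    return best


if __name__ == "__main__":
    # Made-up records: blood pressure (bp) is missing for the older patients.
    rows = [
        {"age": 25, "bp": 120}, {"age": 34, "bp": 118}, {"age": 47, "bp": 130},
        {"age": 62, "bp": None}, {"age": 70, "bp": None}, {"age": 81, "bp": None},
    ]
    print(best_missingness_split(rows, target="bp", predictor="age"))
```

A large impurity reduction suggests that missingness in the target depends on the predictor, which is the kind of structure the exploring step would aim to surface.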

Exploring would include:

Modelling would include:

Confirming might be something like:

I'm very much open to suggestions about how to implement these ideas.

greenLeopard commented 8 years ago

Snap! I have medical data with missing entries too. I'm interested in being able to visualise it and explore clusters of missingness as well as other types of data inconsistencies (e.g. end time before start time). I am hoping to bring a mockup of the kind of datasets that I use at work.

jonocarroll commented 8 years ago

The mice package (Multivariate Imputation by Chained Equations in R) has some good tools for imputation (MCAR/otherwise).

Also have a look at VIM::aggr for producing a neat plot of missing data.

e.g. http://www.r-bloggers.com/imputing-missing-data-with-r-mice-package/
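For anyone who has not seen that kind of plot: the numbers behind an aggregate missingness view are the count of missing values per variable and the count of rows showing each combination of missing variables. A minimal Python sketch of that tabulation follows. It is not VIM's implementation, and `missing_summary` and the example columns are made-up names.

```python
from collections import Counter


def missing_summary(rows, columns):
    """Tabulate missingness per variable and per combination of variables.

    Returns (per_variable, patterns):
      per_variable maps each column to its number of missing (None) values;
      patterns maps a tuple of booleans (True = missing, in `columns` order)
      to the number of rows with that pattern.
    """
    per_variable = {c: sum(row[c] is None for row in rows) for c in columns}
    patterns = Counter(tuple(row[c] is None for c in columns) for row in rows)
    return per_variable, patterns


if __name__ == "__main__":
    # Made-up records for illustration only.
    rows = [
        {"age": 25, "bp": 120, "weight": 70},
        {"age": 34, "bp": None, "weight": 82},
        {"age": None, "bp": None, "weight": 64},
        {"age": 51, "bp": None, "weight": 90},
    ]
    per_variable, patterns = missing_summary(rows, ["age", "bp", "weight"])
    print(per_variable)
    for pattern, count in patterns.most_common():
        print(pattern, count)
```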

njtierney commented 8 years ago

Thanks Jonno,

VIM certainly does have some useful plots. What do you think about incorporating them into ggmissing?

dicook commented 8 years ago

Keep the package simple. Its primary purpose is to make ggplot2 graphics that include the missings in the plot.

cpsievert commented 8 years ago

I'm not very familiar with ggmissing, but I'd like to know more about it!

BTW, here is a nice example of a scatterplot with margins for missing values http://kbroman.org/d3panels/assets/test/scatterplot/

jesse-jesse commented 8 years ago

7 votes from the AuUnconf... :) Might be worth continuing discussions around this.

jesse-jesse commented 8 years ago

Nick created a channel on the AuUnconf Slack account. Anyone interested can join the discussion there too.