hputnam / project_juvenile_geoduck_OA

This repository provides data and scripts to analyze the influence of ocean acidification on juvenile geoduck DNA methylation and phenotype.

Missing data in the count matrix #6

Closed seanb80 closed 7 years ago

seanb80 commented 7 years ago

I've been playing with merging the methylation counts of individual samples into a single file and had a question regarding missing counts.

MACAU requires the count matrix to be completely populated, i.e. no missing data. Do we want to convert missing data to 0s to fulfill this, or is there some imputation strategy we should use instead?

hputnam commented 7 years ago

Lea et al. 2015 (the MACAU paper) 'imputed any missing data using the K-nearest neighbors algorithm in the R package impute'

Hastie T, Tibshirani R, Narasimhan B, Chu G. Impute: imputation for microarray data. R package version 1.42.0. 2015.

We could try this for now. I will read in more detail.
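For intuition, here is a rough Python sketch of what K-nearest-neighbors imputation does. This is a simplified illustration, not the algorithm as implemented in the R impute package: the function name `knn_impute`, the rows-as-loci layout, and the distance definition (root mean squared difference over the samples both loci have data for) are all assumptions made for the example.

```python
import math

def knn_impute(matrix, k=2):
    """Fill None entries with the mean of the k nearest rows (hypothetical helper).

    Rows are loci, columns are samples; None marks a missing value.
    """
    def distance(a, b):
        # Compare two rows only over the columns where both have data.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(row) for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours must have an observed value in column j.
            candidates = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(matrix)
                if m != i and other[j] is not None
            )
            neighbours = [v for d, v in candidates[:k] if d < math.inf]
            if neighbours:  # leave the gap if no usable neighbour exists
                filled[i][j] = sum(neighbours) / len(neighbours)
    return filled

# Toy example: the first locus is missing its third sample.
toy = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 5.0],
    [9.0, 9.0, 9.0],
]
imputed = knn_impute(toy, k=2)
```

In the toy matrix the two closest loci have 3.0 and 5.0 in the missing column, so the gap is filled with their mean. Note that every missing cell is compared against every other row, which hints at why this gets expensive on millions of loci.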

seanb80 commented 7 years ago

I tried the impute package, and even for four samples our data set is apparently too large for it (I get an infinite recursion / recursion depth too deep error regardless of the number of expressions allowed).

I'm trying the mice package info found here.

It's reported to be quite slow, so I'll let it grind away for a while and then report what happens!

seanb80 commented 7 years ago

I stepped back and thought about it for a moment, and I don't know if imputation is a good idea in our case.

I've been tinkering with a 4-sample dataset while the others are running, and the four samples have 29%, 32%, 51%, and 43% missing data, with an overall missingness of 39%.
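For anyone who wants to check numbers like these, a minimal Python sketch of the calculation follows. The data layout (one dict per sample mapping locus to count, with uncovered loci simply absent), the function name `missingness`, and the toy locus names are assumptions for illustration, not the actual files in this repository.

```python
def missingness(samples, all_loci):
    """Return (per-sample missing fraction, overall missing fraction).

    samples:  {sample_name: {locus: count}}; absent loci count as missing.
    all_loci: every locus seen in at least one sample.
    """
    n_loci = len(all_loci)
    missing_counts = {
        name: sum(1 for locus in all_loci if counts.get(locus) is None)
        for name, counts in samples.items()
    }
    per_sample = {name: m / n_loci for name, m in missing_counts.items()}
    # Overall missingness is over every cell of the loci x samples matrix.
    overall = sum(missing_counts.values()) / (n_loci * len(samples))
    return per_sample, overall

# Toy data with made-up locus names.
loci = ["scaf1:10", "scaf1:25", "scaf2:7", "scaf2:90"]
samples = {
    "s1": {"scaf1:10": 12, "scaf1:25": 8, "scaf2:7": 15},
    "s2": {"scaf1:10": 20, "scaf2:90": 11},
}
per_sample, overall = missingness(samples, loci)
```

Because every sample is scored against the same locus list, the overall value is just the mean of the per-sample fractions.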

This may be outside of the range that we can justifiably impute missing data.

Any thoughts?

seanb80 commented 7 years ago

After speaking with Steven, I looked at how many loci have complete coverage across all samples at greater than 10x coverage. That brought the total number of loci to look at from ~2.8 million down to 6781. But they all have data!
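The filter amounts to roughly the following Python sketch. The layout (each sample as a dict of locus to a `(methylated, total)` read-count pair), the name `complete_loci`, and treating "greater than 10x" as a strict `>` are assumptions for the example; it is not the script that produced the 6781 figure.

```python
def complete_loci(samples, min_coverage=10):
    """Loci present in every sample with total coverage above min_coverage.

    samples: {sample_name: {locus: (methylated_reads, total_reads)}}
    """
    tables = list(samples.values())
    # Start from the loci of the first sample and intersect with the rest.
    shared = set(tables[0])
    for table in tables[1:]:
        shared &= set(table)
    # Keep only loci where every sample clears the coverage threshold.
    return sorted(
        locus for locus in shared
        if all(table[locus][1] > min_coverage for table in tables)
    )

# Toy data with made-up locus names.
samples = {
    "s1": {"scaf1:10": (5, 12), "scaf1:25": (3, 30), "scaf2:7": (9, 15)},
    "s2": {"scaf1:10": (7, 20), "scaf1:25": (1, 8)},
}
kept = complete_loci(samples)
```

Here `scaf2:7` is dropped because one sample lacks it, and `scaf1:25` is dropped because one sample only has 8x coverage, so the resulting matrix has no gaps to impute.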

hputnam commented 7 years ago

:thumbsup: I agree, imputing 30-50% of the data does not seem justifiable.