hansenlab / minfi

Devel repository for minfi

Support probe removal better #55

Open kasperdanielhansen opened 8 years ago

kasperdanielhansen commented 8 years ago

From: Maarten van Iterson mviterson@gmail.com

Add a function argument na.rm=FALSE/TRUE to detectionP, which should be passed to colMedians and colMads so that detectionP can handle NAs in the Red and Green intensity matrices of an rgSet. With na.rm=TRUE, some detection P-values will still be NA if the intensities were NA at the probe level, but this is what we want. For example, we use this for some probe-level filtering steps, e.g. on the minimum number of beads required.
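
As a rough illustration of the request above, the sketch below shows how an na.rm argument changes per-sample medians and MADs when an intensity matrix contains NAs. It uses base R stand-ins (the hypothetical helpers col_medians and col_mads) rather than colMedians and colMads themselves, and the toy matrix is invented for the example; it is not the detectionP implementation.

```r
# Toy intensity matrix: rows are probes, columns are samples.
# One value in sample s1 is missing (e.g. masked by a bead-count filter).
intensities <- matrix(c(100, 120, NA, 90,
                        200, 210, 190, 205),
                      nrow = 4,
                      dimnames = list(paste0("probe", 1:4), c("s1", "s2")))

# Hypothetical base R stand-ins for colMedians() / colMads().
col_medians <- function(x, na.rm = FALSE) apply(x, 2, median, na.rm = na.rm)
col_mads    <- function(x, na.rm = FALSE) apply(x, 2, mad, na.rm = na.rm)

# Without na.rm, a single NA makes the whole column summary NA.
med_strict <- col_medians(intensities)

# With na.rm = TRUE, the summaries use the remaining observations.
med_lenient <- col_medians(intensities, na.rm = TRUE)
mad_lenient <- col_mads(intensities, na.rm = TRUE)
```

With na.rm=TRUE the background summaries stay usable for every sample, while any P-value computed from a probe whose own intensity is NA would still come out as NA.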

kasperdanielhansen commented 8 years ago

There have been multiple requests for the ability to remove probes prior to various normalization routines, for example based on detection P-values. Whether this should be done by completely removing rows from the object or by allowing NAs in the object is unclear to me at present. One argument against NAs in the object is that they add (IMO) some fragility: everything then has to be able to deal with NAs, which implies a different number of observations for each CpG. Conclusion: I think I'll make it easier to remove rows, and to remove rows based on detectionP.

kasperdanielhansen commented 8 years ago

We added subsetByLoci(). Still need to check that removal based on detectionP() is easy.
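
A minimal sketch of what row removal based on detection P-values could look like, using plain matrices in place of minfi objects. The 0.01 cutoff, the "fails in any sample" rule, and the toy values are all assumptions made for illustration, not settings taken from minfi.

```r
# Toy detection P-value matrix: rows are loci, columns are samples.
detP <- matrix(c(0.001, 0.20, 0.002,
                 0.003, 0.004, 0.50),
               nrow = 3,
               dimnames = list(c("cg01", "cg02", "cg03"), c("s1", "s2")))

# Toy measurement matrix with the same dimensions and names.
values <- matrix(1:6, nrow = 3, dimnames = dimnames(detP))

# Assumed cutoff; choose whatever is appropriate for the analysis.
threshold <- 0.01

# Keep a locus only if it passes the cutoff in every sample.
keep <- rowSums(detP > threshold) == 0

# Remove whole rows instead of setting individual entries to NA.
filtered <- values[keep, , drop = FALSE]

# Names of the loci that were dropped.
failed_loci <- rownames(detP)[!keep]
```

A vector like failed_loci is the kind of input one would presumably hand to subsetByLoci(); its help page is the place to confirm the exact arguments.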

sdrakulich commented 6 years ago

I'm going to bump this request. Subsetting the samplesheet prior to creation of an RGset seems most straightforward, but I get a failure: "anyDuplicated(!basenames) is not TRUE"

Doesn't make sense since the full samplesheet runs fine...?

Subsetting the RGset just seems like doing more work needlessly. Granted, what do I know, especially seeing as I'm too much of a novice to contribute to the solution myself. Thank you for your time.

kasperdanielhansen commented 6 years ago

I don't understand this report at all. Could you please post how you subset the samplesheet as well as the first couple of lines (output of head(samplesheet))?

sdrakulich commented 6 years ago

I "delete rows" in LibreOffice's Calc, and then resave the file. I don't think the issue is in the "resaving".

The first seven lines remain the header of the samplesheet. Everything below is the standard table you'd have left. Sample name, well, etc. I removed rows below this.

Attempting to read this in produces the duplicate basenames issue...although I don't understand in the slightest as to why/how.
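
One way to investigate an error like this is to look for repeated or missing entries in the Basename column once the sheet has been read into a data frame. The sketch below uses an invented data frame in which two leftover blank rows share a missing Basename. Whether blank rows are really the cause here is only a guess.

```r
# Toy sample sheet after rows were deleted in a spreadsheet program.
# The last two rows mimic leftover blank lines (an assumption for this example).
targets <- data.frame(
  Sample_Name = c("a", "b", "", ""),
  Basename = c("/data/123_R01C01", "/data/123_R02C01", NA, NA),
  stringsAsFactors = FALSE
)

# Flag every row whose Basename occurs more than once (NA counts as a value).
dup <- duplicated(targets$Basename) |
  duplicated(targets$Basename, fromLast = TRUE)
dup_rows <- targets[dup, ]

# Drop rows with a missing Basename, then any remaining exact repeats.
clean <- targets[!is.na(targets$Basename) & !duplicated(targets$Basename), ]
```

If dup_rows is not empty, those are the rows to look at in the edited file.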

Alternatively, I would just remove them from the RGChannelSet object, but I'm running into a lot of trouble subsetting it the way I'd like to (subsetting RGset@colData@rownames by a vector of the IDs I want).
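
For reference, selecting samples by a vector of IDs is usually done with matrix-style column indexing rather than by editing slots directly. The sketch below shows the pattern on a plain matrix. The sample names and wanted_ids are invented, and applying the same indexing to the RGset (e.g. RGset[, colnames(RGset) %in% wanted_ids]) is expected to work but is untested here.

```r
# Toy stand-in for the assay data: rows are probes, columns are samples.
x <- matrix(1:12, nrow = 3,
            dimnames = list(c("cg01", "cg02", "cg03"),
                            c("S1_R01C01", "S1_R02C01",
                              "S1_R03C01", "S1_R04C01")))

# Hypothetical IDs to keep; one of them does not exist in the data.
wanted_ids <- c("S1_R03C01", "S1_R01C01", "not_present")

# Keep the matching columns in the object's original column order.
subset_x <- x[, colnames(x) %in% wanted_ids, drop = FALSE]

# Alternatively, follow the order of wanted_ids and skip IDs that are absent.
idx <- match(wanted_ids, colnames(x))
ordered_x <- x[, idx[!is.na(idx)], drop = FALSE]
```

The %in% form silently ignores IDs that are missing from the object, so comparing length(wanted_ids) with ncol(subset_x) is a cheap sanity check.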

However, everything I find online points towards subsetting after the preprocessing has been done, which seems technically incorrect.

Thank you very much for the help on this, I greatly appreciate it.