hansenlab / minfi

Devel repository for minfi
58 stars 70 forks source link

preprocessNoob using incorrect row indices to Meth and Unmeth matrices #264

Open jonathonl opened 7 months ago

jonathonl commented 7 months ago

https://github.com/hansenlab/minfi/blob/77d33453ca7d438d72fea0f54f9eabde8ab54d97/R/preprocessNoob.R#L386-L392

The preprocessNoob function uses indices from the getProbeType() function to select probes by type using their row index (see code linked above). The problem is that getProbeType() returns a vector based on the manifest which is unlikely to have the same length as the number of probes in the MethylSet. For example, the indices in d2.probes <- which(probe.type == "II") do not map to indices for the same probes in the return matrices from getMeth(MSet) and getUnmeth(MSet), but are in fact used to select the signals for type II probes.

I suspect that fixing this bug will improve the effectiveness of ssNoob in minfi.

jonathonl commented 7 months ago

I created a proof of concept PR. I think a similar issue may be affecting FunNorm as well. I haven't looked at other normalization methods.

kasperdanielhansen commented 7 months ago

This could be a serious bug. However, I am not sure it is. Since I haven't yet had time to investigate, perhaps you could do some leg work.

The code - as currently written - assumes that the output of getProbeType() is a vector of the same length as the MSet and has a 1-1 match between the rows of the MSet etc. You're right in saying that a full manifest object can be much bigger than the size of the MSet. But note how the MSet is an argument to getProbeType? In my recollection, we go to some significant lengths to do the following

  1. Use the annotation to get the right Manifest object (which could be bigger than the MSet).
  2. Do some manipulation to make sure that the returned Manifest object gets subsetted and re-ordered to match 1-1 with the MSet. In my understanding, you're saying that (2) doesn't happen or there is a bug. Which could be true, but I am not certain.

If this mismatch is happening you should be able to see it by taking the default objects in the minfiData package, subset the objects to remove some probes or some loci and then observe that the probeType changes. Right? Could you try to make an example along those lines? That would make it much easier to reason about.

jonathonl commented 7 months ago

You're right in saying that a full manifest object can be much bigger than the size of the MSet.

The issue actually arises when the manifest is smaller than the MSet (not bigger). We observe EPIC v1 idats in the wild that have more probes than what is contained in the B4/B5 EPIC manifest. I have already looked at the minfiData (450k as I remember) and it has the same number of probes as the manifest, so we won't be able to replicate the issue with this test data. But you can see example output from one of our datasets below.

> MSet <- preprocessRaw(rgset)
Loading required package: IlluminaHumanMethylationEPICmanifest
> dim(MSet)
[1] 866091     95
> probe.type <- getProbeType(MSet, withColor = TRUE)
Loading required package: IlluminaHumanMethylationEPICanno.ilm10b4.hg19
> length(probe.type)
[1] 865859

I've done some comparisons of my PR with the latest devel branch. I have taken an RGSet with 866091 probes and made a copy of the RGSet that is subset to the bead addresses found in the B5 manifest (excluding the manufacture change probes). I then ran preprocessNoob() on both RGSets using both the official version and my fixed version of minfi. When comparing the Noob-adjusted betas for the two RGSets, the official version produces a lot of discordance while the differences in the fixed version of minfi are negligible (see attached plots). Excluding a few hundred out-of-band signals from the background estimate shouldn't make much of a difference, so I suspect that the discordance seen in the official version of minfi is because changing the number of rows in the MSet affects which probes get Noob-adjusted.

minfi_noob_plots

kasperdanielhansen commented 7 months ago

Oh, that is interesting. I could imagine we have an assumption that the MSet is always a subset of the Manifest. I could see the potential for a bug here.

The right fix however, is to deal with this inside of getAnnotation to make sure the dimensions match up.

Could you share a couple of samples from your dataset, so any fixes could be checked.

On Thu, Feb 29, 2024 at 1:07 PM Jonathon LeFaive @.***> wrote:

You're right in saying that a full manifest object can be much bigger than the size of the MSet.

The issue actually arises when the manifest is smaller than the MSet (not bigger). We observe EPIC v1 idats in the wild that have more probes than what is contained in the B4/B5 EPIC manifest. I have already looked at the minfiData (450k as I remember) and it has the same number of probes as the manifest, so we won't be able to replicate the issue with this test data. But you can see example output from one of our datasets below.

MSet <- preprocessRaw(rgset) Loading required package: IlluminaHumanMethylationEPICmanifest dim(MSet) [1] 866091 95 probe.type <- getProbeType(MSet, withColor = TRUE) Loading required package: IlluminaHumanMethylationEPICanno.ilm10b4.hg19 length(probe.type) [1] 865859

I've done some comparisons of my PR with the latest devel branch. I have taken an RGSet with 866091 probes and made a copy of the RGSet that is subset to the bead addresses found in the B5 manifest (excluding the manufacture change probes). I then ran preprocessNoob() on both RGSets using both the official version and my fixed version of minfi. When comparing the Noob-adjusted betas for the two RGSets, the official version produces a lot of discordance while the differences in the fixed version of minfi are negligible (see attached plots). Excluding a few hundred out-of-band signals from the background estimate shouldn't make much of a difference, so I suspect that the discordance seen in the official version of minfi is because changing the number of rows in the MSet affects which probes get Noob-adjusted.

minfi_noob_plots.png (view on web) https://github.com/hansenlab/minfi/assets/4069894/337496b6-f143-47ea-a4da-4614d3058e83

— Reply to this email directly, view it on GitHub https://github.com/hansenlab/minfi/issues/264#issuecomment-1971682054, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABF2DH6IDMLVMN5XHZQZ563YV5W47AVCNFSM6AAAAABDZNWLW6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNZRGY4DEMBVGQ . You are receiving this because you commented.Message ID: @.***>

-- Best, Kasper

jonathonl commented 7 months ago

I cannot share samples from our data but I could share the bead address list. I'm assuming you can generate an RGChannelSet object with fake (random values) red/green matrices using these addresses. Would that be useful to you?

kasperdanielhansen commented 7 months ago

Perhaps you could generate such a simulated RGChannelSet or perhaps - instead of simulating - you just have integer values in the channels representing different types of probes. That would be useful.

I am still surprised by this, because the called to preprocessRaw should remove any probes/loci not in the manifest. But on the other hand, your plot is pretty convincing.

jonathonl commented 7 months ago

I have created a test case with a couple samples from GEO.

Another thing I discovered is that when using read.metharray.exp(..., force=T) with idats that have different sizes (i.e., different number of bead addresses), the resulting methyl sets are of the same size as the probe type vector. So this issue can only be reproduced with an rgset made from idats of the same size. The test case shows this as well.

Running Rscript run.sh from the minfi_issue directory produces:

...

[1] "dim(rgset.1):"
[1] 1051815       1
[1] "dim(rgset.2):"
[1] 1051943       1
[1] "dim(rgset.force):"
[1] 1051539       2
Loading required package: IlluminaHumanMethylationEPICmanifest
[1] "dim(mset.1):"
[1] 866091      1
[1] "dim(mset.2):"
[1] 866238      1
[1] "dim(mset.force):"
[1] 865859      2
Loading required package: IlluminaHumanMethylationEPICanno.ilm10b4.hg19
[1] "length(probe.type.1):"
[1] 865859
[1] "length(probe.type.2):"
[1] 865859
[1] "length(probe.type.force):"
[1] 865859

Note: I had to use LZMA to compress this tar archive because the gzip version was too big for Github.

minfi_issue.lzma.tgz

jonathonl commented 7 months ago

Another thing I discovered is that when using read.metharray.exp(..., force=T) with idats that have different sizes (i.e., different number of bead addresses), the resulting methyl sets are of the same size as the probe type vector. So this issue can only be reproduced with an rgset made from idats of the same size.

I just realized that this is because the set of probes that overlap between these two samples are all in the manifest. So if the intersection of probes across all samples are in the manifiest, then we don't have a problem. The issue can still exist if the idats have different sizes as long as the intersection of probes in the idats is larger than the manifest.

> length(intersect(rownames(mset.1), rownames(mset.2)))
[1] 865859
> dim(mset.force)
[1] 865859      2
> length(intersect(rownames(mset.1), getAnnotation(rgset.1)$Name))
[1] 865859
> length(intersect(rownames(mset.2), getAnnotation(rgset.2)$Name))
[1] 865859
kasperdanielhansen commented 6 months ago

I can replicate the issue.

The core of the issue is because of the (in hindsight wrong) to have separate Manifest and Annotation packages. This means that these packages may not be in sync. The issues happens when you use getAnnotation on a MethylSet/RatioSet etc which was created using a Manifest package (usually by a call to preprocessRaw) which results in more loci (CpGs) than is contained in the Annotation package. The diagnosis is simple: the output of getAnnotation has a smaller number of rows compared to the input object (typically a MethylSet).

This affects output from getAnnotation as well as getProbeType, getLocations, getIslandStatus, and getSnpInfo. A short term fix is to not assume that the output of these functions are matched 1-1 with the input object, but to do an intersection. This does not happen in getProbeType but it does happen in mapToGenome. The long term fix is to no longer have separate Manifest and Annotation packages.

I need to do a more careful analysis of the potential impact of this bug, including a clear description (for users) of how this could happen with various types of IDAT files and array versions. It might be nice to have a have a call with the OP with some more details and their observations. For this reason, I am waiting on merging the pull request.

vjcitn commented 2 months ago

Is this still waiting on a call with the OP?