Open jonathonl opened 7 months ago
I created a proof of concept PR. I think a similar issue may be affecting FunNorm as well. I haven't looked at other normalization methods.
This could be a serious bug. However, I am not sure it is. Since I haven't yet had time to investigate, perhaps you could do some leg work.
The code - as currently written - assumes that the output of getProbeType()
is a vector of the same length as the MSet
and has a 1-1 match between the rows of the MSet etc. You're right in saying that a full manifest object can be much bigger than the size of the MSet. But note how the MSet
is an argument to getProbeType
? In my recollection, we go to some significant lengths to do the following
MSet
).MSet
.
In my understanding, you're saying that (2) doesn't happen or there is a bug. Which could be true, but I am not certain.If this mismatch is happening you should be able to see it by taking the default objects in the minfiData
package, subset the objects to remove some probes or some loci and then observe that the probeType changes. Right? Could you try to make an example along those lines? That would make it much easier to reason about.
You're right in saying that a full manifest object can be much bigger than the size of the MSet.
The issue actually arises when the manifest is smaller than the MSet (not bigger). We observe EPIC v1 idats in the wild that have more probes than what is contained in the B4/B5 EPIC manifest. I have already looked at the minfiData (450k as I remember) and it has the same number of probes as the manifest, so we won't be able to replicate the issue with this test data. But you can see example output from one of our datasets below.
> MSet <- preprocessRaw(rgset)
Loading required package: IlluminaHumanMethylationEPICmanifest
> dim(MSet)
[1] 866091 95
> probe.type <- getProbeType(MSet, withColor = TRUE)
Loading required package: IlluminaHumanMethylationEPICanno.ilm10b4.hg19
> length(probe.type)
[1] 865859
I've done some comparisons of my PR with the latest devel branch. I have taken an RGSet with 866091 probes and made a copy of the RGSet that is subset to the bead addresses found in the B5 manifest (excluding the manufacture change probes). I then ran preprocessNoob() on both RGSets using both the official version and my fixed version of minfi. When comparing the Noob-adjusted betas for the two RGSets, the official version produces a lot of discordance while the differences in the fixed version of minfi are negligible (see attached plots). Excluding a few hundred out-of-band signals from the background estimate shouldn't make much of a difference, so I suspect that the discordance seen in the official version of minfi is because changing the number of rows in the MSet affects which probes get Noob-adjusted.
Oh, that is interesting. I could imagine we have an assumption that the MSet is always a subset of the Manifest. I could see the potential for a bug here.
The right fix however, is to deal with this inside of getAnnotation
to
make sure the dimensions match up.
Could you share a couple of samples from your dataset, so any fixes could be checked.
On Thu, Feb 29, 2024 at 1:07 PM Jonathon LeFaive @.***> wrote:
You're right in saying that a full manifest object can be much bigger than the size of the MSet.
The issue actually arises when the manifest is smaller than the MSet (not bigger). We observe EPIC v1 idats in the wild that have more probes than what is contained in the B4/B5 EPIC manifest. I have already looked at the minfiData (450k as I remember) and it has the same number of probes as the manifest, so we won't be able to replicate the issue with this test data. But you can see example output from one of our datasets below.
MSet <- preprocessRaw(rgset) Loading required package: IlluminaHumanMethylationEPICmanifest dim(MSet) [1] 866091 95 probe.type <- getProbeType(MSet, withColor = TRUE) Loading required package: IlluminaHumanMethylationEPICanno.ilm10b4.hg19 length(probe.type) [1] 865859
I've done some comparisons of my PR with the latest devel branch. I have taken an RGSet with 866091 probes and made a copy of the RGSet that is subset to the bead addresses found in the B5 manifest (excluding the manufacture change probes). I then ran preprocessNoob() on both RGSets using both the official version and my fixed version of minfi. When comparing the Noob-adjusted betas for the two RGSets, the official version produces a lot of discordance while the differences in the fixed version of minfi are negligible (see attached plots). Excluding a few hundred out-of-band signals from the background estimate shouldn't make much of a difference, so I suspect that the discordance seen in the official version of minfi is because changing the number of rows in the MSet affects which probes get Noob-adjusted.
minfi_noob_plots.png (view on web) https://github.com/hansenlab/minfi/assets/4069894/337496b6-f143-47ea-a4da-4614d3058e83
— Reply to this email directly, view it on GitHub https://github.com/hansenlab/minfi/issues/264#issuecomment-1971682054, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABF2DH6IDMLVMN5XHZQZ563YV5W47AVCNFSM6AAAAABDZNWLW6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNZRGY4DEMBVGQ . You are receiving this because you commented.Message ID: @.***>
-- Best, Kasper
I cannot share samples from our data but I could share the bead address list. I'm assuming you can generate an RGChannelSet object with fake (random values) red/green matrices using these addresses. Would that be useful to you?
Perhaps you could generate such a simulated RGChannelSet or perhaps - instead of simulating - you just have integer values in the channels representing different types of probes. That would be useful.
I am still surprised by this, because the called to preprocessRaw
should remove any probes/loci not in the manifest. But on the other hand, your plot is pretty convincing.
I have created a test case with a couple samples from GEO.
Another thing I discovered is that when using read.metharray.exp(..., force=T)
with idats that have different sizes (i.e., different number of bead addresses), the resulting methyl sets are of the same size as the probe type vector. So this issue can only be reproduced with an rgset made from idats of the same size. The test case shows this as well.
Running Rscript run.sh
from the minfi_issue
directory produces:
...
[1] "dim(rgset.1):"
[1] 1051815 1
[1] "dim(rgset.2):"
[1] 1051943 1
[1] "dim(rgset.force):"
[1] 1051539 2
Loading required package: IlluminaHumanMethylationEPICmanifest
[1] "dim(mset.1):"
[1] 866091 1
[1] "dim(mset.2):"
[1] 866238 1
[1] "dim(mset.force):"
[1] 865859 2
Loading required package: IlluminaHumanMethylationEPICanno.ilm10b4.hg19
[1] "length(probe.type.1):"
[1] 865859
[1] "length(probe.type.2):"
[1] 865859
[1] "length(probe.type.force):"
[1] 865859
Note: I had to use LZMA to compress this tar archive because the gzip version was too big for Github.
Another thing I discovered is that when using read.metharray.exp(..., force=T) with idats that have different sizes (i.e., different number of bead addresses), the resulting methyl sets are of the same size as the probe type vector. So this issue can only be reproduced with an rgset made from idats of the same size.
I just realized that this is because the set of probes that overlap between these two samples are all in the manifest. So if the intersection of probes across all samples are in the manifiest, then we don't have a problem. The issue can still exist if the idats have different sizes as long as the intersection of probes in the idats is larger than the manifest.
> length(intersect(rownames(mset.1), rownames(mset.2)))
[1] 865859
> dim(mset.force)
[1] 865859 2
> length(intersect(rownames(mset.1), getAnnotation(rgset.1)$Name))
[1] 865859
> length(intersect(rownames(mset.2), getAnnotation(rgset.2)$Name))
[1] 865859
I can replicate the issue.
The core of the issue is because of the (in hindsight wrong) to have separate Manifest and Annotation packages. This means that these packages may not be in sync. The issues happens when you use getAnnotation
on a MethylSet
/RatioSet
etc which was created using a Manifest package (usually by a call to preprocessRaw
) which results in more loci (CpGs) than is contained in the Annotation package. The diagnosis is simple: the output of getAnnotation
has a smaller number of rows compared to the input object (typically a MethylSet
).
This affects output from getAnnotation
as well as getProbeType
, getLocations
, getIslandStatus
, and getSnpInfo
. A short term fix is to not assume that the output of these functions are matched 1-1 with the input object, but to do an intersection. This does not happen in getProbeType
but it does happen in mapToGenome
. The long term fix is to no longer have separate Manifest and Annotation packages.
I need to do a more careful analysis of the potential impact of this bug, including a clear description (for users) of how this could happen with various types of IDAT files and array versions. It might be nice to have a have a call with the OP with some more details and their observations. For this reason, I am waiting on merging the pull request.
Is this still waiting on a call with the OP?
https://github.com/hansenlab/minfi/blob/77d33453ca7d438d72fea0f54f9eabde8ab54d97/R/preprocessNoob.R#L386-L392
The preprocessNoob function uses indices from the
getProbeType()
function to select probes by type using their row index (see code linked above). The problem is thatgetProbeType()
returns a vector based on the manifest which is unlikely to have the same length as the number of probes in the MethylSet. For example, the indices ind2.probes <- which(probe.type == "II")
do not map to indices for the same probes in the return matrices fromgetMeth(MSet)
andgetUnmeth(MSet)
, but are in fact used to select the signals for type II probes.I suspect that fixing this bug will improve the effectiveness of ssNoob in minfi.