mehrnoushmalek / flowDensity

Automated gating

A peak will never be discovered if it is in the first or last two positions of the density function #2

Closed nicbarker closed 6 years ago

nicbarker commented 6 years ago

Hi. If you take a look at the comparison done at /R/helper_functions.R:414, I believe it will always return NULL if the single peak is the first element of the dens array.

For example, given the following data in dens, noting that element [1] is the largest,

Browse[4]> dens$y
  [1] 1.906371e-03 1.624343e-03 1.347483e-03 1.085672e-03 8.487902e-04
  [6] 6.448108e-04 4.740793e-04 3.350340e-04 2.261130e-04 1.452183e-04
 [11] 8.810587e-05 4.999520e-05 2.610589e-05 1.212379e-05 5.599970e-06
 [16] 4.551809e-06 6.996681e-06 1.095196e-05 1.474551e-05 1.794717e-05
 [21] 2.043727e-05 2.209613e-05 2.284806e-05 2.279326e-05 2.207590e-05

Plotted it looks like this:

[screenshot: plot of the dens$y values above, with the maximum at element 1]

The comparison is done as follows:

for (i in 1:(length(d.y)-2*w)) {
    if (d.y[i+w] > d.y[(i+w+1):(i+2*w)] && d.y[i+w] > d.y[i:(i+w-1)] && d.y[i+w] > peak.removal*max(d.y))
}

Substituting the first iteration of the loop (i = 1) and the hardcoded value w = 1:

for (i in 1:(length(d.y)-2*w)) {
    if (d.y[2] > d.y[(3):(3)] && d.y[2] > d.y[1:(1)] && d.y[2] > peak.removal*max(d.y))
}

Note that because of the default value w <- 1, element [1] is never itself a peak candidate, and the comparison d.y[i+w] > d.y[i:(i+w-1)] -> d.y[2] > d.y[1] will prevent element [2] from being selected as the peak, since element [1] is larger.
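To make this concrete, here is a standalone version of the check (with w = 1, and the peak.removal threshold omitted for simplicity) run against a decaying head like the dens$y values above:

```r
# Standalone sketch of the check at helper_functions.R:414, w = 1,
# peak.removal threshold omitted for simplicity.
d.y <- c(1.9e-3, 1.6e-3, 1.3e-3, 1.1e-3, 8.5e-4)  # element 1 is the maximum
w <- 1
peaks.ind <- c()
for (i in 1:(length(d.y) - 2*w)) {
  # the candidate d.y[i+w] must beat both neighbours; element 1 is never
  # a candidate, and element 2 always fails d.y[2] > d.y[1]
  if (all(d.y[i+w] > d.y[(i+w+1):(i+2*w)]) && all(d.y[i+w] > d.y[i:(i+w-1)]))
    peaks.ind <- c(peaks.ind, i+w)
}
is.null(peaks.ind)  # TRUE: no peak is ever recorded
```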

Also, because of the way the loop bound is written, the last element in dens can never be selected as a peak either:

for (i in 1:(length(d.y)-2*w)) {

I'm not sure if there is some reason that the first element is being ignored on purpose, but in my particular case it means that the following error gets thrown when running flowDensity:

Error in if (missing(p2) | is.na(p2)) p2 <- min(dens$x) :
  argument is of length zero
In addition: Warning message:
In is.na(p2) : is.na() applied to non-(list or vector) of type 'NULL'

traceback():

5: .getIntersect(dens, peaks[i], peaks[i + 1])
4: .densityGating(f = f, channel = channels[2], use.percentile = use.percentile[2],
       percentile = percentile[2], use.upper = use.upper[2], upper = upper[2],
       avg = avg[2], all.cuts = all.cuts[2], sd.threshold = sd.threshold[2],
       n.sd = n.sd[1], alpha = alpha[1], debris.gate = debris.gate[2],
       tinypeak.removal = 1/25)
3: .deGate2D(f = obj, channels = channels, position = position,
       ...)
2: flowDensity(obj = samp, channels = c(1, 14), position = c(TRUE,
       FALSE))
1: flowDensity(obj = samp, channels = c(1, 14), position = c(TRUE,
       FALSE))

You can see here that .getIntersect throws an error because peaks is NULL, and because of how R handles comparisons to NULL, the if guard itself fails rather than catching it.

Try this tiny example:

p2 <- NULL
if (missing(p2) | is.na(p2)) print('test')
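For anyone following along, the failure mode is that is.na(NULL) returns logical(0), and if() on a zero-length condition is exactly the "argument is of length zero" error above. A minimal reproduction (dropping missing(), since that only applies to function arguments):

```r
p2 <- NULL
is.na(p2)   # logical(0): a zero-length logical vector, not FALSE

# if() cannot handle a zero-length condition and errors out
err <- tryCatch({
  if (is.na(p2)) print("test")
  "no error"
}, error = function(e) conditionMessage(e))
err   # "argument is of length zero"
```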

If there isn't any particular reason that the first element is being skipped, it can be fixed by replacing the if statement at 414 with the following:

peak.at <- NULL
if (i == 1 && all(d.y[i] > d.y[(i+1):(i+2*w)]) && d.y[i] > peak.removal*max(d.y)) {
    peak.at <- i
} else if (all(d.y[i+w] > d.y[(i+w+1):(i+2*w)]) && all(d.y[i+w] > d.y[i:(i+w-1)]) && d.y[i+w] > peak.removal*max(d.y)) {
    peak.at <- i+w
}

if (!is.null(peak.at)) {
    peaks <- c(peaks, d$x[peak.at])
    peaks.ind <- c(peaks.ind, peak.at)
}

(The vector comparisons are wrapped in all() so the condition is a single logical, and peak.at is recorded so that the i == 1 case stores index 1 rather than i+w.)
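To sanity-check the idea, here's a hypothetical standalone sketch of that patched check (w = 1, thresholds omitted, toy values chosen so the maximum sits at element 1 with a second interior peak):

```r
# Hypothetical standalone sketch of the patched check; toy data, w = 1,
# peak.removal threshold omitted.
d.y <- c(5, 3, 1, 2, 4, 2, 1)
w <- 1
peaks.ind <- c()
for (i in 1:(length(d.y) - 2*w)) {
  peak.at <- NULL
  if (i == 1 && all(d.y[i] > d.y[(i+1):(i+2*w)])) {
    peak.at <- i          # element 1 can now be recorded as a peak
  } else if (all(d.y[i+w] > d.y[(i+w+1):(i+2*w)]) && all(d.y[i+w] > d.y[i:(i+w-1)])) {
    peak.at <- i + w
  }
  peaks.ind <- c(peaks.ind, peak.at)
}
peaks.ind  # 1 5: both the first-element peak and the interior peak are found
```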

Furthermore, what is the purpose of the w parameter? It seems like it may have been intended to protect against lots of small peaks being returned in the case of having some jitter across the density distribution, but the implementation is very strange - the two ranges that it compares:

d.y[i+w] > d.y[(i+w+1):(i+2*w)]
# Value where w = 10 and i = 1
d.y[11] > d.y[(12):(21)]

and

d.y[i+w] > d.y[i:(i+w-1)]
# Value where w = 10 and i = 1
d.y[11] > d.y[1:10]

clip off all the values at indices <= w and > length - w (i.e. you lose w points off each end of dens). Was this intended to be implemented this way?
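The reachable candidate indices can be computed directly (here for the 25-point dens$y above and the hardcoded w = 1):

```r
L <- 25; w <- 1                    # 25 density points, hardcoded window
candidates <- (1:(L - 2*w)) + w    # every index d.y[i+w] can take
range(candidates)                  # 2 24: indices 1 and 25 can never be selected
```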

mehrnoushmalek commented 6 years ago

It really depends on how many points you have. When you have one peak, you should provide either sd.threshold, upper, or percentile. The default won't work if the first point is the peak. Again, I will look into it when I get back.

nicbarker commented 6 years ago

No problem, take your time.

mehrnoushmalek commented 6 years ago

I've made some changes; could you check to see if it works for you? The purpose of w is for my own exploration of different options; as you can see, w = 1 and it can't be changed.

nicbarker commented 6 years ago

Hi, Just to make sure, this is the commit you're referring to: https://github.com/mehrnoushmalek/flowDensity/commit/175b84fc89a870ba5594cbd282430269eb79d428

While this may solve the problem when there's a single peak at the first position, there are still going to be edge cases if there are multiple peaks. For example, consider this very rudimentary density example:

position 0 1 2 3 4 5 6 7
density 5 4 1 1 1 4 1 1

In this case, the getPeaks function will discover the peak at position 5 (the second 4), but not the peak at position 0. flowDensity will then incorrectly label this as a single-peak distribution, when in reality there are two peaks (one at position 0, the other at position 5).
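Running the same sliding-window check over that toy density shows it (thresholds omitted):

```r
# The toy density above (positions 0-7), run through the sliding-window check.
d.y <- c(5, 4, 1, 1, 1, 4, 1, 1)
w <- 1
peaks.ind <- c()
for (i in 1:(length(d.y) - 2*w)) {
  if (all(d.y[i+w] > d.y[(i+w+1):(i+2*w)]) && all(d.y[i+w] > d.y[i:(i+w-1)]))
    peaks.ind <- c(peaks.ind, i+w)
}
peaks.ind  # 6, i.e. position 5 only; the larger peak at position 0 is missed
```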

The changes you made also mean that if there is a distribution that legitimately has no peaks in it:

position 0 1 2 3 4 5 6 7
density 1 1 1 1 1 1 1 1

The max of density will be returned, which is 1 in this case.
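That case can also be checked with the same window logic (thresholds omitted):

```r
# The flat "distribution" above: strict > never holds on ties,
# so the window check finds nothing, and a fallback to the maximum
# just returns the first element.
d.y <- rep(1, 8)
w <- 1
peaks.ind <- c()
for (i in 1:(length(d.y) - 2*w)) {
  if (all(d.y[i+w] > d.y[(i+w+1):(i+2*w)]) && all(d.y[i+w] > d.y[i:(i+w-1)]))
    peaks.ind <- c(peaks.ind, i+w)
}
is.null(peaks.ind)   # TRUE: no peak detected
which.max(d.y)       # 1
```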

mehrnoushmalek commented 6 years ago

I wouldn't consider position 0 to be a peak anyway, based on your first example, because that's not a peak (I've never had this in flow data), but with the current changes it would be considered a peak. Also, could you just send me the vector or your flowFrame? That makes things easier compared to me trying to replicate your data from these examples.

mehrnoushmalek commented 6 years ago

If you look at this example, you'll see that the second peak is recorded, but the first point wouldn't be considered a peak.

[screenshot: example density plot with the getPeaks result]

getPeaks(dens.obj)
$Peaks
[1] -1.038334

$P.ind
[1] 18

$P.h
[1] 0.00185

nicbarker commented 6 years ago

To give you a more concrete example of where this is happening, have a look at this case.

[screenshot: flowJo biexp histogram with a narrow peak close to zero]

This is a flowJo biexp histogram of ~300,000 events using the Live / Dead Near IR marker.

Because the R density() function returns 512 density estimates by default:

> length(dens$y)
[1] 512

Even though it's obvious that there's a peak when you graph the data in flowJo, when you reduce the density function to only 512 points of resolution, the peak between -60 and about 140 gets compressed into position [1] of the density function: given 512 equally spaced points over a range of ~262k, each point covers roughly 500 units, which is much wider than the peak itself.
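For reference, the 512-point default comes from density() itself, and its n argument can raise the grid resolution (synthetic data here, just mimicking a narrow near-zero population plus a broad far one):

```r
set.seed(1)
# Hypothetical mixture mimicking the data above: a narrow population
# near zero plus a broad population far up the scale.
x <- c(rnorm(2000, mean = 40, sd = 30),
       rnorm(8000, mean = 150000, sd = 40000))
d <- density(x)
length(d$y)                    # 512: the default grid size
d.hi <- density(x, n = 4096)   # a finer grid via the n argument
length(d.hi$y)                 # 4096
```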

[screenshot: the same data as a 512-point density plot; the near-zero peak is compressed into the first element]

So even though there is a real peak here that humans can see, it gets crushed into the first element of the probability array and, as a result, is ignored completely. This causes flowDensity to place the cutoff in the wrong spot: it guesses after the second peak rather than between the first and second peaks.

My apologies if I'm not quite understanding how flowDensity is supposed to work - I'll see if I can upload some data today for you to have a look yourself. It's also possible that I'm not aware of some options I could specify to give the density function better resolution closer to zero on the x axis.

Have you encountered problems with the resolution of the density function before? i.e. having one or several peaks very close to zero that users are visualising using the flowjo biexp scale?

nicbarker commented 6 years ago

Hi, I've been doing some more experimentation on our end, and I realised that I can solve this problem by doing a biexponential transform (it's built into flowJo) on the data before running it through flowDensity. This will spread the data out across 1024 bins and give me the resolution close to zero that I need in this case. Thanks for all the help - feel free to close the issue if you think the root cause isn't that important.
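In case it helps anyone else, the effect can be sketched in plain R with an arcsinh transform (a common simple stand-in for biexp/logicle scaling; the cofactor here is an arbitrary illustration, not flowJo's parameterisation):

```r
# arcsinh is roughly linear near zero and logarithmic for large values,
# so it stretches the near-zero region the same way a biexp scale does.
biexp_like <- function(x, cofactor = 150) asinh(x / cofactor)  # cofactor is made up

raw <- c(-60, 140, 262000)   # near-zero peak edges and the top of the range above
transformed <- biexp_like(raw)
# share of the axis occupied by the [-60, 140] region, before and after:
(raw[2] - raw[1]) / (raw[3] - raw[1])                                  # ~0.0008
(transformed[2] - transformed[1]) / (transformed[3] - transformed[1])  # ~0.14
```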

mehrnoushmalek commented 6 years ago

Hello Nic,

Sorry for not replying - I was about to tell you that it's probably an issue with the transformation. When you open it in flowJo, there's the custom T scale that you can change until you get the right scaling for the peaks. Also, in R you can try flowJoTrans() or estimateLogicle on your raw data. I will work on the getPeaks function for those cases anyway, but please let me know if you run into any other issues.

Cheers,