Center the input matrix around the motif

danrlu commented 6 years ago

Hi,

Thank you for the tutorial and package, it's extremely helpful!! I was troubleshooting with my dataset and came across this issue. (I'm very sorry for moving the post around, thought it might be different from the previously reported issue.)

Here's an example with the original code:

library(CENTIPEDE.tutorial)

region=parse_region("2L:10-20")
region_len <- abs(region$end - region$start) + 1

item = list(
  strand = as.factor(c("+","+","-","-")),  
  pos = as.integer(c(15,20,2,12)),
  qwidth = as.integer(c(10,10,10,10))
)

################### copied from functions.R line 152-173 ###############

# take care of negative reads starting at end position
neg_shift <- item$qwidth * as.numeric(item$strand == "-")
item$pos <- item$pos + neg_shift

idx <- item$pos >= region$start & item$pos <= region$end
if (sum(idx) == 0) {
  return(rep(0, 2 * region_len))
}
strand <- item$strand[idx]
position <- item$pos[idx]

# Create a row that represents the flanking region surrounding a site,
# each column is a nucleotide position relative to the center of the motif
# match.
# The values in this matrix are the number of read start sites that occur
# at that position.
# Since we have two strands, the first half of the matrix represents the
# positive strand and the second have represents the negative strand.
is.neg <- as.numeric(strand == "-")
j <- 1 + position - min(position) + (region_len * is.neg)

as.numeric(table(factor(j, levels = seq(1, 2 * region_len))))

I wonder whether this is what is intended. This does get rid of the peak in the left edge of the cut probability plot (where position==min(position) and on "+" strand will put a 1 in the matrix).

j <- 1 + position - region$start + (region_len * is.neg)
as.numeric(table(factor(j, levels = seq(1, 2 * region_len))))

Thanks, Dan

slowkow commented 6 years ago

Thanks for opening this issue! When time allows, I will take a closer look at this.

It's definitely possible that my code has bugs. I'd be very grateful if you can write a minimal example that demonstrates any problems.

slowkow commented 6 years ago

Thanks again! I looked more closely right now, and I think you're correct about the calculations.

After making your change, I noticed that my vignette fails to build. I guess there's another problem now with my read counts, because the fitCentipede() fails to complete:

Quitting from lines 350-362 (centipede-tutorial.Rmd)
Error in quantile.default(x, c(TrimP, 1 - TrimP)) :
  missing values and NaN's not allowed if 'na.rm' is FALSE
Calls: <Anonymous> ... clipExtremes -> quantile -> quantile -> quantile.default

Execution halted

When time allows, I'll revisit this again and see if I can fix it.

Maybe I just have a low number of read counts in my matrix? I don't know.

danrlu commented 6 years ago

Hi Kamil,

Thank you for looking into this!

Sadly I get the same error message really frequently. Some motif/datasets work, some don't. As you suspected, larger amount of reads and more FIMO locations seemed to help (for example, loosening the p-value cutoff from FIMO results). But it's quite challenging to deal with this error as the data is the limiting factor here. I had to try out other packages in the end :(

Your package is super helpful regardless, otherwise I would have spent a week to get the data format right for CENTIPEDE, only to find out that it only works half of the time...

Thanks again!! Dan

slowkow commented 6 years ago

Thanks for letting me know that you're also running into this problem!

Also, I'm glad my code could serve some purpose. I originally wrote it when I was struggling to figure out how to get the data in the right format, and I thought maybe someone might save some time if I share it publicly. Too bad I didn't even get it right! Thanks again for fixing my mistake! Contributions like this are the best part of writing open source software 😄

slowkow / CENTIPEDE.tutorial

Center the input matrix around the motif #9