slowkow / CENTIPEDE.tutorial

:bug: How to use CENTIPEDE to determine if a transcription factor is bound.
https://slowkow.github.io/CENTIPEDE.tutorial
25 stars 13 forks source link

object 'sequence.name' not found #10

Closed amirshams84 closed 6 years ago

amirshams84 commented 6 years ago

I did try your tutorial with the meme-suite version4.12, I received this error while I wanted to create centipede data

cen <- centipede_data(
  bam_file   = "./mait_plus_rep1.bam",
  fimo_file  = "./tr.sites.gz",
  pvalue     = 1e-4,
  flank_size = 1000
)
Error in order(sequence.name, start, stop, -score) : 
  object 'sequence.name' not found
amirshams84 commented 6 years ago
# motif_id  motif_alt_id    sequence_name   start   stop    strand  score   p-value q-value matched_sequence
MA0019.1    Ddit3::Cebpa    chr6    79234475    79234486    +   11.0444 8.01e-05        CATTGCAATGCC
MA0019.1    Ddit3::Cebpa    chr8    89901788    89901799    -   11.4778 5.97e-05        TGATGCAATAGG
MA0019.1    Ddit3::Cebpa    chr6    80106972    80106983    +   12.3111 3.26e-05        GGCTGCAATCCT
MA0019.1    Ddit3::Cebpa    chr8    38938952    38938963    +   10.8778 8.66e-05        gagtgcaatggt
MA0019.1    Ddit3::Cebpa    chr8    38938960    38938971    +   12.3111 3.26e-05        tggtgcaatctc
MA0019.1    Ddit3::Cebpa    chr3    107522958   107522969   -   11.4444 6.11e-05        ACATGCAACCCG
MA0019.1    Ddit3::Cebpa    chr7    87934132    87934143    -   10.8    9.18e-05        CGGTGCACTCGC
MA0019.1    Ddit3::Cebpa    chr4    129093221   129093232   +   10.7    9.84e-05        CGGTGCAATCTG
MA0019.1    Ddit3::Cebpa    chr3    120742836   120742847   -   14.3333 5.04e-06        AAATGCAATGCC
slowkow commented 6 years ago

Thanks for opening this issue! I appreciate it. I probably have a bug in my code.

You might be able to solve this problem by running each line from the centipede_data() function one line at a time.

I would be very grateful if you can explain what went wrong! If you're willing to make a pull request, I would be happy to review and accept it.


Here's the code you should try running:

https://github.com/slowkow/CENTIPEDE.tutorial/blob/master/R/functions.R#L86-#L132

  # Read the FIMO output file.
  sites <- read_fimo(fimo_file = fimo_file, pvalue = pvalue, ...)

  # Upstream flank, the center of PWM match, and downstream flank.
  sites$start <- sites$start - flank_size
  sites$stop <- sites$stop + flank_size

  # Index the BAM file if necessary.
  bam_index_file <- sprintf("%s.bai", bam_file)
  if (!file.exists(bam_index_file)) {
    message("Indexing the BAM file... this may take several minutes.")
    indexBam(bam_file, overwrite = FALSE)
  }

  # Extract reads that overlap the PWM sites.
  bam_list <- Rsamtools::scanBam(
    file = bam_file,
    param = Rsamtools::ScanBamParam(
      which = GRanges(
        seqnames = sites$sequence.name,
        ranges = IRanges(
          start = sites$start,
          end = sites$stop
        )
      ),
      what = c("strand", "pos", "qwidth")
    )
  )

  if (length(bam_list) == 0) {
    stop(sprintf("No reads fall in sites from '%s'\n", fimo_file))
  }

  # Convert the list of "chr:start-end" regions to a dataframe.
  regions <- lapply(names(bam_list), parse_region)
  regions <- data.frame(
    sequence.name = unlist(sapply(regions, function(x) x["chrom"])),
    start = as.numeric(unlist(sapply(regions, function(x) x["start"]))),
    stop = as.numeric(unlist(sapply(regions, function(x) x["end"])))
  )
  regions$index <- 1:nrow(regions)

  # Grab score, p.value, and q.value, but drop regions with no reads.
  regions <- merge(regions, sites)

  # Sort the regions by coordinate.
  regions <- regions[with(regions, order(sequence.name, start, stop)),]
  bam_list <- bam_list[regions$index]
amirshams84 commented 6 years ago

Hi Kamil thanks for a quick reply from what I understand, it seems FIMO in MEME-SUITES4.12 added another column to the result which may cause an issue in the downstream analysis I want to read my fimo file with read_fimo and the same error appeared

I really appreciate any help regards Amir

On Fri, Mar 16, 2018 at 2:31 PM, Kamil Slowikowski <notifications@github.com

wrote:

Thanks for opening this issue! I appreciate it. I probably have a bug in my code.

You might be able to solve this problem by running each line from the centipede_data() function one line at a time.

I would be very grateful if you can explain what went wrong! If you're willing to make a pull request, I would be happy to review and accept it.

Here's the code you should try running:

https://github.com/slowkow/CENTIPEDE.tutorial/blob/ master/R/functions.R#L86-#L132

Read the FIMO output file.

sites <- read_fimo(fimo_file = fimo_file, pvalue = pvalue, ...)

Upstream flank, the center of PWM match, and downstream flank.

sites$start <- sites$start - flank_size sites$stop <- sites$stop + flank_size

Index the BAM file if necessary.

bam_index_file <- sprintf("%s.bai", bam_file) if (!file.exists(bam_index_file)) { message("Indexing the BAM file... this may take several minutes.") indexBam(bam_file, overwrite = FALSE) }

Extract reads that overlap the PWM sites.

bam_list <- Rsamtools::scanBam( file = bam_file, param = Rsamtools::ScanBamParam( which = GRanges( seqnames = sites$sequence.name, ranges = IRanges( start = sites$start, end = sites$stop ) ), what = c("strand", "pos", "qwidth") ) )

if (length(bam_list) == 0) { stop(sprintf("No reads fall in sites from '%s'\n", fimo_file)) }

Convert the list of "chr:start-end" regions to a dataframe.

regions <- lapply(names(bam_list), parse_region) regions <- data.frame( sequence.name = unlist(sapply(regions, function(x) x["chrom"])), start = as.numeric(unlist(sapply(regions, function(x) x["start"]))), stop = as.numeric(unlist(sapply(regions, function(x) x["end"]))) ) regions$index <- 1:nrow(regions)

Grab score, p.value, and q.value, but drop regions with no reads.

regions <- merge(regions, sites)

Sort the regions by coordinate.

regions <- regions[with(regions, order(sequence.name, start, stop)),] bam_list <- bam_list[regions$index]

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/slowkow/CENTIPEDE.tutorial/issues/10#issuecomment-373805277, or mute the thread https://github.com/notifications/unsubscribe-auth/AMtB2NjvVCkrpW89bqtgdZOT7yHJBOnzks5tfATugaJpZM4SuOp5 .

slowkow commented 6 years ago

Could I ask you to please try reinstalling with

devtools::install_github("slowkow/CENTIPEDE.tutorial")

and try again?

amirshams84 commented 6 years ago

Hi Kamil runx1 <- centipede_data(bam_file = "../jfarber-20160119-P1/spsingh-20171019-D2/VA2.7+CCR6+CCR2+/hg38/align/rep1/AB2955-L1_R1.trim.PE2SE.bam", fimo_file = "../Centepede/runx1.txt.gz", pvalue = 1e-4,flank_size = 100) Parsed with column specification: cols( #pattern name = col_character(), sequence name = col_character(), start = col_integer(), stop = col_integer(), strand = col_character(), score = col_double(), p-value = col_double(), q-value = col_character(), matched sequence = col_character() ) Error in if (sum(idx) == 0) { : missing value where TRUE/FALSE needed

slowkow commented 6 years ago

Could I ask you to try running the code line by line? This might be a quick way to find the source of the error.

Try starting at line 92: https://github.com/slowkow/CENTIPEDE.tutorial/blob/master/R/functions.R#L92

slowkow commented 6 years ago

Feel free to re-open this issue if you need more help.