jmonlong / PopSV

Population-based detection of structural variation from High-Throughput Sequencing.
http://jmonlong.github.io/PopSV/

Error when calling function qc.samples #1

Closed (mkolovic closed this issue 8 years ago)

mkolovic commented 9 years ago

I am attempting to run a local workflow mirroring run-PopSV-local.R from the scripts folder. I am using a small set of BAM files to learn the package and make sure I can run everything. When running:

samp.qc.o = qc.samples(files.df, bins.df, ref.samples, outfile.prefix=bc.all.f)

I receive the error:

Error in if (any(med == 0)) med[which(med == 0)] = 1 : 
  missing value where TRUE/FALSE needed

Some additional information that I think may be helpful:

mkolovic commented 8 years ago

I ran testthat.R and the output is the following:

$ sudo Rscript testthat.R 
Loading required package: methods
Warning message:
replacing previous import by ‘data.table::shift’ when loading ‘PopSV’ 
testthat results ================================================================
OK: 167 SKIPPED: 0 FAILED: 0

jmonlong commented 8 years ago

From your error message and your description, the most likely cause is an error in the previous step (bin.bam/correct.GC). You can check that those results are correct by running the optional block in run-PopSV-local.R that uses quick.count and qc.samples.cluster (right before the qc.samples step). You could also manually check one of the *-bc-gcCor.tsv.bgz files in the folder of one sample.

Incorrect sample names could also produce this error. Can you check that files.df[c(1,5,7,8), 1] does give you sample names?
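
As a rough sketch, those two checks could look something like this (the files.df column names are assumed from the default run-PopSV-local.R setup):

## Sketch of the checks above; the bc.gc.gz column name is an assumption
files.df[c(1,5,7,8), 1]                ## should print sample names
bc.check = read.table(files.df$bc.gc.gz[1], header=TRUE, as.is=TRUE, sep="\t")
head(bc.check)                         ## expect chr/start/end plus corrected counts
summary(bc.check[, 4])                 ## mostly-zero counts here would point at bin.bam/correct.GC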

I tried running a toy example with only 4 reference samples, as you did, and didn't get any error.

Some other comments:

Let me know how it goes; I will add more informative error messages once the issue is found. Thanks in advance for the feedback.

mkolovic commented 8 years ago

Summary

Thanks.

Full Debug

I have run everything before the optional commands; the environment looks like this:

$ ls
bams.tsv
bins.Rdata
GL3610_S9_L001_R1_001
GL3610_S9_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL3610_S9_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL3724_S20_L001_R1_001
GL3724_S20_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL3724_S20_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL3767_S21_L001_R1_001
GL3767_S21_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL3767_S21_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL3893_S1_L001_R1_001
GL3893_S1_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL3893_S1_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL4033_S10_L001_R1_001
GL4033_S10_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL4033_S10_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL4062_S22_L001_R1_001
GL4062_S22_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL4062_S22_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL4102_S11_L001_R1_001
GL4102_S11_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL4102_S11_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL4202_S12_L001_R1_001
GL4202_S12_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL4202_S12_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL4220_S19_L001_R1_001
GL4220_S19_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL4220_S19_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL4279_S12_L001_R1_001
GL4279_S12_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL4279_S12_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL4645_S13_L001_R1_001
GL4645_S13_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL4645_S13_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL4654_S23_L001_R1_001
GL4654_S23_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL4654_S23_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL4817_S14_L001_R1_001
GL4817_S14_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL4817_S14_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL5054_S24_L001_R1_001
GL5054_S24_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL5054_S24_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL5536_S15_L001_R1_001
GL5536_S15_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL5536_S15_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL5606_S16_L001_R1_001
GL5606_S16_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL5606_S16_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL5718_S17_L001_R1_001
GL5718_S17_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL5718_S17_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL5735_S18_L001_R1_001
GL5735_S18_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL5735_S18_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL5743_S15_L001_R1_001
GL5743_S15_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL5743_S15_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL5944_S16_L001_R1_001
GL5944_S16_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL5944_S16_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
> ls()
[1] "bam.files"   "bc.o"        "bins.df"     "bin.size"    "files.df"   
[6] "ref.samples"
> head(bins.df, n=3)
  chr start  end GCcontent
1  22     1 1000         0
2  22  1001 2000         0
3  22  2001 3000         0
> head(subset(bins.df, GCcontent!=0), n=3)
      chr    start      end GCcontent
16051  22 16050001 16051000     0.431
16052  22 16051001 16052000     0.416
16053  22 16052001 16053000     0.422
> nrow(bins.df)
[1] 51305
> nrow(subset(bins.df, GCcontent!=0))
[1] 34898
> head(files.df, n=3)
                  sample
1  GL3610_S9_L001_R1_001
2 GL3724_S20_L001_R1_001
3 GL3767_S21_L001_R1_001
                                                                                                                        bam
1  /home/ubuntu/r_work/popsv_testing/bams/GL3610_S9_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
2 /home/ubuntu/r_work/popsv_testing/bams/GL3724_S20_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
3 /home/ubuntu/r_work/popsv_testing/bams/GL3767_S21_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
   group
1 normal
2      ?
3      ?
                                                                                           bc
1   /home/ubuntu/r_work/popsv_testing/bams/GL3610_S9_L001_R1_001/GL3610_S9_L001_R1_001-bc.tsv
2 /home/ubuntu/r_work/popsv_testing/bams/GL3724_S20_L001_R1_001/GL3724_S20_L001_R1_001-bc.tsv
3 /home/ubuntu/r_work/popsv_testing/bams/GL3767_S21_L001_R1_001/GL3767_S21_L001_R1_001-bc.tsv
                                                                                            bc.gz
1   /home/ubuntu/r_work/popsv_testing/bams/GL3610_S9_L001_R1_001/GL3610_S9_L001_R1_001-bc.tsv.bgz
2 /home/ubuntu/r_work/popsv_testing/bams/GL3724_S20_L001_R1_001/GL3724_S20_L001_R1_001-bc.tsv.bgz
3 /home/ubuntu/r_work/popsv_testing/bams/GL3767_S21_L001_R1_001/GL3767_S21_L001_R1_001-bc.tsv.bgz
                                                                                              bc.gc
1   /home/ubuntu/r_work/popsv_testing/bams/GL3610_S9_L001_R1_001/GL3610_S9_L001_R1_001-bc-gcCor.tsv
2 /home/ubuntu/r_work/popsv_testing/bams/GL3724_S20_L001_R1_001/GL3724_S20_L001_R1_001-bc-gcCor.tsv
3 /home/ubuntu/r_work/popsv_testing/bams/GL3767_S21_L001_R1_001/GL3767_S21_L001_R1_001-bc-gcCor.tsv
                                                                                               bc.gc.gz
1   /home/ubuntu/r_work/popsv_testing/bams/GL3610_S9_L001_R1_001/GL3610_S9_L001_R1_001-bc-gcCor.tsv.bgz
2 /home/ubuntu/r_work/popsv_testing/bams/GL3724_S20_L001_R1_001/GL3724_S20_L001_R1_001-bc-gcCor.tsv.bgz
3 /home/ubuntu/r_work/popsv_testing/bams/GL3767_S21_L001_R1_001/GL3767_S21_L001_R1_001-bc-gcCor.tsv.bgz
                                                                                              bc.gc.norm
1   /home/ubuntu/r_work/popsv_testing/bams/GL3610_S9_L001_R1_001/GL3610_S9_L001_R1_001-bc-gcCor-norm.tsv
2 /home/ubuntu/r_work/popsv_testing/bams/GL3724_S20_L001_R1_001/GL3724_S20_L001_R1_001-bc-gcCor-norm.tsv
3 /home/ubuntu/r_work/popsv_testing/bams/GL3767_S21_L001_R1_001/GL3767_S21_L001_R1_001-bc-gcCor-norm.tsv
                                                                                               bc.gc.norm.gz
1   /home/ubuntu/r_work/popsv_testing/bams/GL3610_S9_L001_R1_001/GL3610_S9_L001_R1_001-bc-gcCor-norm.tsv.bgz
2 /home/ubuntu/r_work/popsv_testing/bams/GL3724_S20_L001_R1_001/GL3724_S20_L001_R1_001-bc-gcCor-norm.tsv.bgz
3 /home/ubuntu/r_work/popsv_testing/bams/GL3767_S21_L001_R1_001/GL3767_S21_L001_R1_001-bc-gcCor-norm.tsv.bgz
                                                                                           z
1   /home/ubuntu/r_work/popsv_testing/bams/GL3610_S9_L001_R1_001/GL3610_S9_L001_R1_001-z.tsv
2 /home/ubuntu/r_work/popsv_testing/bams/GL3724_S20_L001_R1_001/GL3724_S20_L001_R1_001-z.tsv
3 /home/ubuntu/r_work/popsv_testing/bams/GL3767_S21_L001_R1_001/GL3767_S21_L001_R1_001-z.tsv
                                                                                            z.gz
1   /home/ubuntu/r_work/popsv_testing/bams/GL3610_S9_L001_R1_001/GL3610_S9_L001_R1_001-z.tsv.bgz
2 /home/ubuntu/r_work/popsv_testing/bams/GL3724_S20_L001_R1_001/GL3724_S20_L001_R1_001-z.tsv.bgz
3 /home/ubuntu/r_work/popsv_testing/bams/GL3767_S21_L001_R1_001/GL3767_S21_L001_R1_001-z.tsv.bgz
                                                                                           fc
1   /home/ubuntu/r_work/popsv_testing/bams/GL3610_S9_L001_R1_001/GL3610_S9_L001_R1_001-fc.tsv
2 /home/ubuntu/r_work/popsv_testing/bams/GL3724_S20_L001_R1_001/GL3724_S20_L001_R1_001-fc.tsv
3 /home/ubuntu/r_work/popsv_testing/bams/GL3767_S21_L001_R1_001/GL3767_S21_L001_R1_001-fc.tsv
                                                                                            fc.gz
1   /home/ubuntu/r_work/popsv_testing/bams/GL3610_S9_L001_R1_001/GL3610_S9_L001_R1_001-fc.tsv.bgz
2 /home/ubuntu/r_work/popsv_testing/bams/GL3724_S20_L001_R1_001/GL3724_S20_L001_R1_001-fc.tsv.bgz
3 /home/ubuntu/r_work/popsv_testing/bams/GL3767_S21_L001_R1_001/GL3767_S21_L001_R1_001-fc.tsv.bgz
> ref.samples
[1] "GL3610_S9_L001_R1_001"  "GL3893_S1_L001_R1_001"  "GL4062_S22_L001_R1_001"
[4] "GL4102_S11_L001_R1_001"
> head(bc.o, n=3)
[1] "/home/ubuntu/r_work/popsv_testing/bams/GL3610_S9_L001_R1_001/GL3610_S9_L001_R1_001-bc-gcCor.tsv.bgz"  
[2] "/home/ubuntu/r_work/popsv_testing/bams/GL3724_S20_L001_R1_001/GL3724_S20_L001_R1_001-bc-gcCor.tsv.bgz"
[3] "/home/ubuntu/r_work/popsv_testing/bams/GL3767_S21_L001_R1_001/GL3767_S21_L001_R1_001-bc-gcCor.tsv.bgz"
$ bgzip -cd GL3610_S9_L001_R1_001-bc.tsv.gz | wc -l
51306
$ bgzip -cd GL3610_S9_L001_R1_001-bc.tsv.gz | awk '$4!=0 { print }' | wc -l
1176
$ bgzip -cd GL3610_S9_L001_R1_001-bc.tsv.gz | awk '$4!=0 { print }' | less
22      42347001        42348000        2
22      42372001        42373000        2
22      42374001        42375000        2
22      42433001        42434000        6
22      42478001        42479000        2
22      42522001        42523000        900
22      42523001        42524000        1761
22      42524001        42525000        2019
22      42525001        42526000        1296
22      42526001        42527000        1313
22      42527001        42528000        196
22      42535001        42536000        174
22      42536001        42537000        1216
22      42537001        42538000        1385
22      42538001        42539000        1297
22      42539001        42540000        187
22      42540001        42541000        1068
22      42545001        42546000        180
22      42546001        42547000        1150
22      42547001        42548000        827
22      42548001        42549000        442
22      42549001        42550000        332
22      42550001        42551000        641
22      42551001        42552000        357
22      42561001        42562000        2
22      42597001        42598000        2
$ bgzip -cd GL3610_S9_L001_R1_001-bc-gcCor.tsv.gz | wc -l
51306
$ bgzip -cd GL3610_S9_L001_R1_001-bc-gcCor.tsv.gz | awk '$4!=0 { print }' | wc -l
1082
$ bgzip -cd GL3610_S9_L001_R1_001-bc-gcCor.tsv.gz | awk '$4!=0 { print }' | less
22      42347001        42348000        0.58
22      42372001        42373000        0.34
22      42374001        42375000        1.21
22      42478001        42479000        3.53
22      42522001        42523000        156.78
22      42523001        42524000        289.63
22      42524001        42525000        330.36
22      42525001        42526000        209.38
22      42526001        42527000        257.87
22      42527001        42528000        54.38
22      42535001        42536000        187.3
22      42536001        42537000        200
22      42537001        42538000        231.55
22      42538001        42539000        205.73
22      42539001        42540000        29.68
22      42540001        42541000        195.13
22      42545001        42546000        86.17
22      42546001        42547000        184.27
22      42547001        42548000        144.06
22      42548001        42549000        70.31
22      42549001        42550000        70.24
22      42550001        42551000        174.5
22      42551001        42552000        124.16
22      42561001        42562000        0.7
22      42597001        42598000        9.32
> bc.all.f = "bc-gcCor-all.tsv"
> pdf("sampQC.pdf")
> samp.qc.o = qc.samples(files.df, bins.df, outfile.prefix=bc.all.f, ref.samples=ref.samples)
Error in if (any(med == 0)) med[which(med == 0)] = 1 : 
  missing value where TRUE/FALSE needed
> traceback()
2: medvar.norm.internal(bc.df[, ref.samples])
1: qc.samples(files.df, bins.df, outfile.prefix = bc.all.f, ref.samples = ref.samples)
> dev.off()
null device 
          1 
> pdf("sampQC.pdf")
> debug(qc.samples)
> samp.qc.o = qc.samples(files.df, bins.df, outfile.prefix=bc.all.f, ref.samples=ref.samples)
**(outputs the function body and enters debug mode)**
Browse[2]> n
debug: if (!all(c("chr", "start", "end") %in% colnames(bin.df))) {
    stop("Missing column in 'bin.df'. 'chr', 'start' and 'end' are required.")
}
Browse[2]> n
debug: if (!all(c("sample", col.bc) %in% colnames(files.df))) {
    stop("Missing column in 'files.df'. 'sample', '", col.bc, 
        "' are required.")
}
Browse[2]> n
debug: if (is.null(ref.samples)) {
    ref.samples = as.character(files.df$sample)
}
Browse[2]> n
debug: if (is.null(nb.ref.samples)) {
    nb.ref.samples = length(ref.samples)
}
Browse[2]> n
debug: nb.ref.samples = length(ref.samples)
Browse[2]> n
debug: read.bc.samples <- function(df, files.df, nb.cores = 1, med.med = NULL, 
    med.samp = NULL, file = NULL, append.f = FALSE, sub.bc = NULL) {
    bc.df = createEmptyDF(c("character", rep("integer", 2), rep("numeric", 
        nrow(files.df))), nrow(df))
    colnames(bc.df) = c("chr", "start", "end", as.character(files.df$sample))
    bc.df$chr = df$chr
    bc.df$start = df$start
    bc.df$end = df$end
    if (nb.cores > 1) {
        bc.l = parallel::mclapply(as.character(files.df[, col.bc]), 
            function(fi) {
                read.bedix(fi, df)[, 4]
            }, mc.cores = nb.cores)
    }
    else {
        bc.l = lapply(as.character(files.df[, col.bc]), function(fi) {
            read.bedix(fi, df)[, 4]
        })
    }
    if (is.null(med.med) & is.null(med.samp)) {
        med.samp = lapply(bc.l, median, na.rm = TRUE)
        med.med = median(unlist(med.samp), na.rm = TRUE)
    }
    for (samp.i in 1:nrow(files.df)) {
        bc.df[, as.character(files.df$sample[samp.i])] = round(bc.l[[samp.i]] * 
            med.med/med.samp[[samp.i]], 2)
    }
    bc.df = bc.df[order(bc.df$chr, bc.df$start, bc.df$end), ]
    if (!is.null(file)) {
        write.table(bc.df, file = file, quote = FALSE, row.names = FALSE, 
            sep = "\t", append = append.f, col.names = !append.f)
    }
    if (!is.null(sub.bc)) {
        bc.df = bc.df[sample(1:nrow(bc.df), min(c(nrow(bc.df), 
            sub.bc))), ]
    }
    return(bc.df)
}
Browse[2]> n
debug: center.pt <- function(m) {
    dm = as.matrix(dist(m))
    diag(dm) = NA
    rownames(m)[which.min(apply(dm, 2, mean, na.rm = TRUE))]
}
Browse[2]> debug(read.bc.samples)
Browse[2]> n
debug: files.df = files.df[which(files.df$sample %in% ref.samples), 
    ]
Browse[2]> ls()
 [1] "appendIndex.outfile" "bin.df"              "center.pt"          
 [4] "chunk.size"          "col.bc"              "files.df"           
 [7] "nb.cores"            "nb.ref.samples"      "outfile.prefix"     
[10] "plot"                "read.bc.samples"     "ref.samples"        
Browse[2]> n
debug: pc.all.df = bc.rand = NULL
Browse[2]> n
debug: if (length(ref.samples) > nb.ref.samples) {
    sparse.pts <- function(m, nb.pts) {
        dm = as.matrix(dist(m))
        diag(dm) = NA
        pts.jj = 1:nrow(dm)
        pts.ii = which.max(apply(dm, 1, max, na.rm = TRUE))
        pts.jj = pts.jj[-pts.ii]
        while (length(pts.ii) < nb.pts) {
            m.ii = which.max(apply(dm[pts.ii, pts.jj, drop = FALSE], 
                2, min, na.rm = TRUE))
            pts.ii = c(pts.ii, pts.jj[m.ii])
            pts.jj = pts.jj[-m.ii]
        }
        m[pts.ii, ]
    }
    bc.rand = read.bc.samples(bin.df[sample.int(nrow(bin.df), 
        min(c(nrow(bin.df)/2, 1000))), ], files.df, med.med = 1, 
        med.samp = rep(list(1), nrow(files.df)))
    bc.mv = medvar.norm.internal(bc.rand[, ref.samples])
    pc.all = prcomp(t(na.exclude(bc.mv)))
    pc.all.df = data.frame(pc.all$x[, 1:3])
    pc.all.df$sample = ref.samples
    sp.o = sparse.pts(pc.all$x[, 1:2], nb.ref.samples)
    pc.all.df$reference = pc.all.df$sample %in% rownames(sp.o)
    if (plot) {
        PC1 = PC2 = reference = NULL
        print(ggplot2::ggplot(pc.all.df, ggplot2::aes(x = PC1, 
            y = PC2, size = reference)) + ggplot2::geom_point(alpha = 0.8) + 
            ggplot2::theme_bw() + ggplot2::scale_size_manual(values = 2:3))
    }
    ref.samples = rownames(sp.o)
    bc.rand = bc.rand[, c("chr", "start", "end", ref.samples)]
}
Browse[2]> n
debug: files.df = files.df[which(files.df$sample %in% ref.samples), 
    ]
Browse[2]> n
debug: bin.df = bin.df[order(bin.df$chr, bin.df$start, bin.df$end), 
    ]
Browse[2]> n
debug: if (nrow(bin.df) < 1.3 * chunk.size) {
    bc.df = read.bc.samples(bin.df, files.df, nb.cores)
    write.table(bc.df, file = outfile.prefix, quote = FALSE, 
        row.names = FALSE, sep = "\t")
} else {
    nb.chunks = ceiling(nrow(bin.df)/chunk.size)
    bin.df$chunk = rep(1:nb.chunks, each = chunk.size)[1:nrow(bin.df)]
    analyze.chunk <- function(df) {
        ch.nb = as.numeric(df$chunk[1])
        df.o = read.bedix(as.character(files.df[1, col.bc]), 
            df)
        read.bc.samples(df.o, files.df, nb.cores, med.med, med.samp, 
            file = outfile.prefix, append.f = ch.nb > 1, sub.bc = chunk.size/nb.chunks)
    }
    if (is.null(bc.rand)) {
        bc.rand = read.bc.samples(bin.df[sample.int(nrow(bin.df), 
            min(c(nrow(bin.df)/2, 1000))), ], files.df, med.med = 1, 
            med.samp = rep(list(1), nrow(files.df)))
    }
    med.samp = lapply(as.character(files.df$sample), function(samp.i) median(bc.rand[, 
        samp.i], na.rm = TRUE))
    med.med = median(unlist(med.samp), na.rm = TRUE)
    bc.res = lapply(unique(bin.df$chunk), function(chunk.i) {
        analyze.chunk(bin.df[which(bin.df$chunk == chunk.i), 
            ])
    })
    bc.df = plyr::ldply(bc.res, identity)
}
Browse[2]> n
debug: bc.df = read.bc.samples(bin.df, files.df, nb.cores)
Browse[2]> n
debugging in: read.bc.samples(bin.df, files.df, nb.cores)
debug: {
    bc.df = createEmptyDF(c("character", rep("integer", 2), rep("numeric", 
        nrow(files.df))), nrow(df))
    colnames(bc.df) = c("chr", "start", "end", as.character(files.df$sample))
    bc.df$chr = df$chr
    bc.df$start = df$start
    bc.df$end = df$end
    if (nb.cores > 1) {
        bc.l = parallel::mclapply(as.character(files.df[, col.bc]), 
            function(fi) {
                read.bedix(fi, df)[, 4]
            }, mc.cores = nb.cores)
    }
    else {
        bc.l = lapply(as.character(files.df[, col.bc]), function(fi) {
            read.bedix(fi, df)[, 4]
        })
    }
    if (is.null(med.med) & is.null(med.samp)) {
        med.samp = lapply(bc.l, median, na.rm = TRUE)
        med.med = median(unlist(med.samp), na.rm = TRUE)
    }
    for (samp.i in 1:nrow(files.df)) {
        bc.df[, as.character(files.df$sample[samp.i])] = round(bc.l[[samp.i]] * 
            med.med/med.samp[[samp.i]], 2)
    }
    bc.df = bc.df[order(bc.df$chr, bc.df$start, bc.df$end), ]
    if (!is.null(file)) {
        write.table(bc.df, file = file, quote = FALSE, row.names = FALSE, 
            sep = "\t", append = append.f, col.names = !append.f)
    }
    if (!is.null(sub.bc)) {
        bc.df = bc.df[sample(1:nrow(bc.df), min(c(nrow(bc.df), 
            sub.bc))), ]
    }
    return(bc.df)
}
Browse[3]> n
debug: bc.df = createEmptyDF(c("character", rep("integer", 2), rep("numeric", 
    nrow(files.df))), nrow(df))
Browse[3]> n
debug: colnames(bc.df) = c("chr", "start", "end", as.character(files.df$sample))
Browse[3]> head(bc.df, n=2)
    V1 V2 V3 V4 V5 V6 V7
1 <NA> NA NA NA NA NA NA
2 <NA> NA NA NA NA NA NA
Browse[3]> n
debug: bc.df$chr = df$chr
Browse[3]> head(bc.df, n=2)
   chr start end GL3610_S9_L001_R1_001 GL3893_S1_L001_R1_001
1 <NA>    NA  NA                    NA                    NA
2 <NA>    NA  NA                    NA                    NA
  GL4062_S22_L001_R1_001 GL4102_S11_L001_R1_001
1                     NA                     NA
2                     NA                     NA
Browse[3]> n
debug: bc.df$start = df$start
Browse[3]> n
debug: bc.df$end = df$end
Browse[3]> n
debug: if (nb.cores > 1) {
    bc.l = parallel::mclapply(as.character(files.df[, col.bc]), 
        function(fi) {
            read.bedix(fi, df)[, 4]
        }, mc.cores = nb.cores)
} else {
    bc.l = lapply(as.character(files.df[, col.bc]), function(fi) {
        read.bedix(fi, df)[, 4]
    })
}
Browse[3]> head(bc.df, n=2)
  chr start  end GL3610_S9_L001_R1_001 GL3893_S1_L001_R1_001
1  22     1 1000                    NA                    NA
2  22  1001 2000                    NA                    NA
  GL4062_S22_L001_R1_001 GL4102_S11_L001_R1_001
1                     NA                     NA
2                     NA                     NA
Browse[3]> n
debug: bc.l = lapply(as.character(files.df[, col.bc]), function(fi) {
    read.bedix(fi, df)[, 4]
})
Browse[3]> n
debug: if (is.null(med.med) & is.null(med.samp)) {
    med.samp = lapply(bc.l, median, na.rm = TRUE)
    med.med = median(unlist(med.samp), na.rm = TRUE)
}
Browse[3]> length(bc.l[[1]])
[1] 51305
Browse[3]> length(subset(bc.l[[1]], bc.l[[1]]!=0))
[1] 1081
Browse[3]> length(bc.l[[2]])
[1] 51305
Browse[3]> length(subset(bc.l[[2]], bc.l[[2]]!=0))
[1] 1525
Browse[3]> n
debug: med.samp = lapply(bc.l, median, na.rm = TRUE)
Browse[3]> n
debug: med.med = median(unlist(med.samp), na.rm = TRUE)
Browse[3]> med.samp
[[1]]
[1] 0

[[2]]
[1] 0

[[3]]
[1] 0

[[4]]
[1] 0

Browse[3]> n
debug: for (samp.i in 1:nrow(files.df)) {
    bc.df[, as.character(files.df$sample[samp.i])] = round(bc.l[[samp.i]] * 
        med.med/med.samp[[samp.i]], 2)
}
Browse[3]> med.med
[1] 0
Browse[3]> med.med/med.samp[[1]]
[1] NaN
Browse[3]> head(bc.df)
  chr start  end GL3610_S9_L001_R1_001 GL3893_S1_L001_R1_001
1  22     1 1000                    NA                    NA
2  22  1001 2000                    NA                    NA
3  22  2001 3000                    NA                    NA
4  22  3001 4000                    NA                    NA
5  22  4001 5000                    NA                    NA
6  22  5001 6000                    NA                    NA
  GL4062_S22_L001_R1_001 GL4102_S11_L001_R1_001
1                     NA                     NA
2                     NA                     NA
3                     NA                     NA
4                     NA                     NA
5                     NA                     NA
6                     NA                     NA
Browse[3]> n
debug: bc.df[, as.character(files.df$sample[samp.i])] = round(bc.l[[samp.i]] * 
    med.med/med.samp[[samp.i]], 2)
Browse[3]> n
debug: bc.df[, as.character(files.df$sample[samp.i])] = round(bc.l[[samp.i]] * 
    med.med/med.samp[[samp.i]], 2)
Browse[3]> n
debug: bc.df[, as.character(files.df$sample[samp.i])] = round(bc.l[[samp.i]] * 
    med.med/med.samp[[samp.i]], 2)
Browse[3]> n
debug: bc.df[, as.character(files.df$sample[samp.i])] = round(bc.l[[samp.i]] * 
    med.med/med.samp[[samp.i]], 2)
Browse[3]> n
debug: bc.df = bc.df[order(bc.df$chr, bc.df$start, bc.df$end), ]
Browse[3]> n
debug: if (!is.null(file)) {
    write.table(bc.df, file = file, quote = FALSE, row.names = FALSE, 
        sep = "\t", append = append.f, col.names = !append.f)
}
Browse[3]> head(bc.df)
  chr start  end GL3610_S9_L001_R1_001 GL3893_S1_L001_R1_001
1  22     1 1000                   NaN                   NaN
2  22  1001 2000                   NaN                   NaN
3  22  2001 3000                   NaN                   NaN
4  22  3001 4000                   NaN                   NaN
5  22  4001 5000                   NaN                   NaN
6  22  5001 6000                   NaN                   NaN
  GL4062_S22_L001_R1_001 GL4102_S11_L001_R1_001
1                    NaN                    NaN
2                    NaN                    NaN
3                    NaN                    NaN
4                    NaN                    NaN
5                    NaN                    NaN
6                    NaN                    NaN
Browse[3]> n
debug: if (!is.null(sub.bc)) {
    bc.df = bc.df[sample(1:nrow(bc.df), min(c(nrow(bc.df), sub.bc))), 
        ]
}
Browse[3]> n
debug: return(bc.df)
Browse[3]> n
exiting from: read.bc.samples(bin.df, files.df, nb.cores)
debug: write.table(bc.df, file = outfile.prefix, quote = FALSE, row.names = FALSE, 
    sep = "\t")
Browse[2]> n
debug: if (appendIndex.outfile) {
    final.file = paste(outfile.prefix, ".bgz", sep = "")
    Rsamtools::bgzip(outfile.prefix, dest = final.file, overwrite = TRUE)
    file.remove(outfile.prefix)
    Rsamtools::indexTabix(final.file, format = "bed")
    outfile.prefix = final.file
}
Browse[2]> n
debug: final.file = paste(outfile.prefix, ".bgz", sep = "")
Browse[2]> n
debug: Rsamtools::bgzip(outfile.prefix, dest = final.file, overwrite = TRUE)
Browse[2]> n
debug: file.remove(outfile.prefix)
Browse[2]> n
debug: Rsamtools::indexTabix(final.file, format = "bed")
Browse[2]> n
debug: outfile.prefix = final.file
Browse[2]> n
debug: bc.mv = medvar.norm.internal(bc.df[, ref.samples])
Browse[2]> debug(medvar.norm.internal)
Browse[2]> n
debugging in: medvar.norm.internal(bc.df[, ref.samples])
debug: {
    med = apply(bc, 2, median, na.rm = TRUE)
    if (any(med == 0)) 
        med[which(med == 0)] = 1
    med.c = mean(med)
    bc = t(t(bc) * med.c/med)
    bc = bc - med.c
    md = apply(bc, 2, function(x) median(abs(x), na.rm = TRUE))
    md.c = median(abs(bc), na.rm = TRUE)
    bc = t(t(bc) * md.c/md)
    return(bc + med.c)
}
Browse[3]> ls()
[1] "bc"
Browse[3]> head(bc)
  GL3610_S9_L001_R1_001 GL3893_S1_L001_R1_001 GL4062_S22_L001_R1_001
1                   NaN                   NaN                    NaN
2                   NaN                   NaN                    NaN
3                   NaN                   NaN                    NaN
4                   NaN                   NaN                    NaN
5                   NaN                   NaN                    NaN
6                   NaN                   NaN                    NaN
  GL4102_S11_L001_R1_001
1                    NaN
2                    NaN
3                    NaN
4                    NaN
5                    NaN
6                    NaN
Browse[3]> n
debug: med = apply(bc, 2, median, na.rm = TRUE)
Browse[3]> n
debug: if (any(med == 0)) med[which(med == 0)] = 1
Browse[3]> head(med)
 GL3610_S9_L001_R1_001  GL3893_S1_L001_R1_001 GL4062_S22_L001_R1_001 
                    NA                     NA                     NA 
GL4102_S11_L001_R1_001 
                    NA 
Browse[3]> any(med == 0)
[1] NA
Browse[3]> n
Error in if (any(med == 0)) med[which(med == 0)] = 1 : 
  missing value where TRUE/FALSE needed

jmonlong commented 8 years ago

Thanks for this extensive investigation. It makes sense now.

So I imagine you work with targeted sequencing, or extremely low coverage, to have reads in only ~2% of the chromosome? In that case it is semi-deliberate that it breaks, because we assume at this stage (reference sample definition) that you are working with covered regions. Side note: if it is very-low-coverage whole-genome sequencing, you might want to use larger bins to get enough signal.
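
To make the failure mode concrete, here is a minimal sketch with synthetic counts (not PopSV code): when more than half of the bins are empty, the per-sample median count is 0, the normalization becomes 0/0, and the later any(med == 0) test sees NA.

## Synthetic illustration of the failure in the debug trace above (not PopSV code)
set.seed(1)
bc = c(rep(0, 600), rpois(400, 50))   ## >50% of bins empty, as with targeted data
med.samp = median(bc)                 ## 0
med.med = med.samp                    ## 0
norm.bc = bc * med.med / med.samp     ## 0/0 -> NaN for every bin
med = median(norm.bc, na.rm = TRUE)   ## every value is NaN, so the median is NA
any(med == 0)                         ## NA, hence "missing value where TRUE/FALSE needed"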

To analyze large-scale exome sequencing, I usually filter out non-covered bins before counting all the samples:

  1. Run the bin counts on all bins for a few samples.
  2. Find which bins are covered.
  3. Count all the samples on these bins only.

In practice it boils down to reducing the bins.df object. Here is an example (in BatchJobs style):

## Get counts for all bins for some samples
library(BatchJobs)
rand.samps = sample(files.df$sample, 10)
getBCquick.reg <- makeRegistry(id="getBCquick")
getBCquick.f <- function(bins.f, files.df, rand.samps){
  load(bins.f)
  files.df = subset(files.df, sample %in% rand.samps)
  library(PopSV)
  quick.count(files.df, bins.df, nb.cores=6)
}
batchMap(getBCquick.reg, getBCquick.f, "bins.RData", more.args=list(files.df=files.df, rand.samps=rand.samps))
submitJobs(getBCquick.reg, 1, resources=list(walltime="6:0:0", nodes="1", cores="6",queue="sw"), wait=function(retries) 100, max.retries=10)
showStatus(getBCquick.reg)

## Remove non-covered bins
bc.sub = loadResult(getBCquick.reg,1)
med.bc = apply(bc.sub[,-(1:3)],1, median, na.rm=TRUE)
summary(med.bc)
bc.sub = bc.sub[which(med.bc>10), 1:3]
bins.df = merge(bins.df, bc.sub)
save(bins.df, file="bins.RData")

Then the normal analysis can be performed using this new bins.df.

Otherwise, if you have already counted the reads in all the bins of all your samples, you can just use a reduced bins.df from the qc.samples step onward. To find the covered bins, you could try something like this:

## Get bin count from 10 samples
bc.sub = lapply(sample(files.df$bc.gz, 10), function(filename){
  read.table(filename, as.is=TRUE, header=TRUE, sep="\t")
})
bc.sub = cbind(bc.sub[[1]][,1:3], do.call(cbind, lapply(bc.sub, function(bc.s)bc.s$bc)))
## Keep only the covered bins
med.bc = apply(bc.sub[,-(1:3)],1, median, na.rm=TRUE)
bc.sub = bc.sub[which(med.bc>50), 1:3] ## Or whatever threshold you think is best
bins.df = merge(bins.df, bc.sub)
save(bins.df, file="bins.RData")

And then you would continue normally with:

samp.qc.o = qc.samples(files.df, bins.df, ref.samples, outfile.prefix=bc.all.f)
...

Thanks,

mkolovic commented 8 years ago

Hello, thank you for the suggestions. We are doing targeted sequencing, so I will implement a method to easily restrict bins.df to our regions of interest. For now, I have been able to get past the error above using your suggestions.
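
One possible way to do that restriction, as a sketch using GenomicRanges (targets.bed is a hypothetical capture BED file, not something from this thread):

## Sketch: keep only bins overlapping the capture targets (targets.bed is hypothetical)
library(GenomicRanges)
targets = read.table("targets.bed", sep="\t", as.is=TRUE)[, 1:3]
colnames(targets) = c("chr", "start", "end")
targets$start = targets$start + 1      ## BED starts are 0-based; GRanges are 1-based
bins.gr = makeGRangesFromDataFrame(bins.df)
targets.gr = makeGRangesFromDataFrame(targets)
bins.df = bins.df[overlapsAny(bins.gr, targets.gr), ]
save(bins.df, file="bins.RData")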