Closed mkolovic closed 8 years ago
I ran testthat.R
and the output is the following:
$ sudo Rscript testthat.R
Loading required package: methods
Warning message:
replacing previous import by ‘data.table::shift’ when loading ‘PopSV’
testthat results ================================================================
OK: 167 SKIPPED: 0 FAILED: 0
From your error message and your description, the most likely problem might be an error in the previous step (bin.bam / correct.GC). You can check that the results are correct by running the optional block in run-PopSV-local.R that uses quick.count and qc.samples.cluster (right before the qc.samples step). You could also check manually one of the *-bc-gcCor.tsv.bgz files in the folder of one sample.
Incorrect sample names could also produce this error; can you check that files.df[c(1,5,7,8), 1] does give you sample names?
I tried running a toy example with only 4 reference samples as you did, and didn't get any error.
Some other comments:
- The forPaper branch is the one to use. It's the most up-to-date and will become the release version in a week or so.
- You don't need group information in files.df etc. By default all samples will be used as reference, or you can manually define the ref.samples as you tried.
Let me know how it goes; I will add more informative error messages when the issue is found. Thanks in advance for the feedback.
The error occurs in qc.samples when read.bc.samples is called. When bc.l is created, only a small fraction of the entries are non-zero, so when median is applied to these entries to calculate med.samp and med.med, both end up being zero. med.med and med.samp are then used to compute the bc.df sample entries, but med.med/med.samp results in NaN due to division by zero. bc.df is then passed to medvar.norm.internal, which takes a median over the NaN values, so med ends up full of NA. The check if (any(med == 0)) then receives NA where a boolean is expected, resulting in an error. Thanks.
I have run everything before the optional commands, the environment looks like:
$ ls
bams.tsv
bins.Rdata
GL3610_S9_L001_R1_001
GL3610_S9_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL3610_S9_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL3724_S20_L001_R1_001
GL3724_S20_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL3724_S20_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL3767_S21_L001_R1_001
GL3767_S21_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL3767_S21_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL3893_S1_L001_R1_001
GL3893_S1_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL3893_S1_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL4033_S10_L001_R1_001
GL4033_S10_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL4033_S10_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL4062_S22_L001_R1_001
GL4062_S22_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL4062_S22_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL4102_S11_L001_R1_001
GL4102_S11_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL4102_S11_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL4202_S12_L001_R1_001
GL4202_S12_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL4202_S12_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL4220_S19_L001_R1_001
GL4220_S19_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL4220_S19_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL4279_S12_L001_R1_001
GL4279_S12_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL4279_S12_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL4645_S13_L001_R1_001
GL4645_S13_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL4645_S13_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL4654_S23_L001_R1_001
GL4654_S23_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL4654_S23_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL4817_S14_L001_R1_001
GL4817_S14_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL4817_S14_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL5054_S24_L001_R1_001
GL5054_S24_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL5054_S24_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL5536_S15_L001_R1_001
GL5536_S15_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL5536_S15_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL5606_S16_L001_R1_001
GL5606_S16_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL5606_S16_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL5718_S17_L001_R1_001
GL5718_S17_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL5718_S17_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL5735_S18_L001_R1_001
GL5735_S18_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL5735_S18_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL5743_S15_L001_R1_001
GL5743_S15_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL5743_S15_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
GL5944_S16_L001_R1_001
GL5944_S16_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
GL5944_S16_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam.bai
> ls()
[1] "bam.files" "bc.o" "bins.df" "bin.size" "files.df"
[6] "ref.samples"
> head(bins.df, n=3)
chr start end GCcontent
1 22 1 1000 0
2 22 1001 2000 0
3 22 2001 3000 0
> head(subset(bins.df, GCcontent!=0), n=3)
chr start end GCcontent
16051 22 16050001 16051000 0.431
16052 22 16051001 16052000 0.416
16053 22 16052001 16053000 0.422
> nrow(bins.df)
[1] 51305
> nrow(subset(bins.df, GCcontent!=0))
[1] 34898
> head(files.df, n=3)
sample
1 GL3610_S9_L001_R1_001
2 GL3724_S20_L001_R1_001
3 GL3767_S21_L001_R1_001
bam
1 /home/ubuntu/r_work/popsv_testing/bams/GL3610_S9_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
2 /home/ubuntu/r_work/popsv_testing/bams/GL3724_S20_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
3 /home/ubuntu/r_work/popsv_testing/bams/GL3767_S21_L001_R1_001_(paired)_(Reads)_-_locally_realigned_without_duplicates.bam
group
1 normal
2 ?
3 ?
bc
1 /home/ubuntu/r_work/popsv_testing/bams/GL3610_S9_L001_R1_001/GL3610_S9_L001_R1_001-bc.tsv
2 /home/ubuntu/r_work/popsv_testing/bams/GL3724_S20_L001_R1_001/GL3724_S20_L001_R1_001-bc.tsv
3 /home/ubuntu/r_work/popsv_testing/bams/GL3767_S21_L001_R1_001/GL3767_S21_L001_R1_001-bc.tsv
bc.gz
1 /home/ubuntu/r_work/popsv_testing/bams/GL3610_S9_L001_R1_001/GL3610_S9_L001_R1_001-bc.tsv.bgz
2 /home/ubuntu/r_work/popsv_testing/bams/GL3724_S20_L001_R1_001/GL3724_S20_L001_R1_001-bc.tsv.bgz
3 /home/ubuntu/r_work/popsv_testing/bams/GL3767_S21_L001_R1_001/GL3767_S21_L001_R1_001-bc.tsv.bgz
bc.gc
1 /home/ubuntu/r_work/popsv_testing/bams/GL3610_S9_L001_R1_001/GL3610_S9_L001_R1_001-bc-gcCor.tsv
2 /home/ubuntu/r_work/popsv_testing/bams/GL3724_S20_L001_R1_001/GL3724_S20_L001_R1_001-bc-gcCor.tsv
3 /home/ubuntu/r_work/popsv_testing/bams/GL3767_S21_L001_R1_001/GL3767_S21_L001_R1_001-bc-gcCor.tsv
bc.gc.gz
1 /home/ubuntu/r_work/popsv_testing/bams/GL3610_S9_L001_R1_001/GL3610_S9_L001_R1_001-bc-gcCor.tsv.bgz
2 /home/ubuntu/r_work/popsv_testing/bams/GL3724_S20_L001_R1_001/GL3724_S20_L001_R1_001-bc-gcCor.tsv.bgz
3 /home/ubuntu/r_work/popsv_testing/bams/GL3767_S21_L001_R1_001/GL3767_S21_L001_R1_001-bc-gcCor.tsv.bgz
bc.gc.norm
1 /home/ubuntu/r_work/popsv_testing/bams/GL3610_S9_L001_R1_001/GL3610_S9_L001_R1_001-bc-gcCor-norm.tsv
2 /home/ubuntu/r_work/popsv_testing/bams/GL3724_S20_L001_R1_001/GL3724_S20_L001_R1_001-bc-gcCor-norm.tsv
3 /home/ubuntu/r_work/popsv_testing/bams/GL3767_S21_L001_R1_001/GL3767_S21_L001_R1_001-bc-gcCor-norm.tsv
bc.gc.norm.gz
1 /home/ubuntu/r_work/popsv_testing/bams/GL3610_S9_L001_R1_001/GL3610_S9_L001_R1_001-bc-gcCor-norm.tsv.bgz
2 /home/ubuntu/r_work/popsv_testing/bams/GL3724_S20_L001_R1_001/GL3724_S20_L001_R1_001-bc-gcCor-norm.tsv.bgz
3 /home/ubuntu/r_work/popsv_testing/bams/GL3767_S21_L001_R1_001/GL3767_S21_L001_R1_001-bc-gcCor-norm.tsv.bgz
z
1 /home/ubuntu/r_work/popsv_testing/bams/GL3610_S9_L001_R1_001/GL3610_S9_L001_R1_001-z.tsv
2 /home/ubuntu/r_work/popsv_testing/bams/GL3724_S20_L001_R1_001/GL3724_S20_L001_R1_001-z.tsv
3 /home/ubuntu/r_work/popsv_testing/bams/GL3767_S21_L001_R1_001/GL3767_S21_L001_R1_001-z.tsv
z.gz
1 /home/ubuntu/r_work/popsv_testing/bams/GL3610_S9_L001_R1_001/GL3610_S9_L001_R1_001-z.tsv.bgz
2 /home/ubuntu/r_work/popsv_testing/bams/GL3724_S20_L001_R1_001/GL3724_S20_L001_R1_001-z.tsv.bgz
3 /home/ubuntu/r_work/popsv_testing/bams/GL3767_S21_L001_R1_001/GL3767_S21_L001_R1_001-z.tsv.bgz
fc
1 /home/ubuntu/r_work/popsv_testing/bams/GL3610_S9_L001_R1_001/GL3610_S9_L001_R1_001-fc.tsv
2 /home/ubuntu/r_work/popsv_testing/bams/GL3724_S20_L001_R1_001/GL3724_S20_L001_R1_001-fc.tsv
3 /home/ubuntu/r_work/popsv_testing/bams/GL3767_S21_L001_R1_001/GL3767_S21_L001_R1_001-fc.tsv
fc.gz
1 /home/ubuntu/r_work/popsv_testing/bams/GL3610_S9_L001_R1_001/GL3610_S9_L001_R1_001-fc.tsv.bgz
2 /home/ubuntu/r_work/popsv_testing/bams/GL3724_S20_L001_R1_001/GL3724_S20_L001_R1_001-fc.tsv.bgz
3 /home/ubuntu/r_work/popsv_testing/bams/GL3767_S21_L001_R1_001/GL3767_S21_L001_R1_001-fc.tsv.bgz
> ref.samples
[1] "GL3610_S9_L001_R1_001" "GL3893_S1_L001_R1_001" "GL4062_S22_L001_R1_001"
[4] "GL4102_S11_L001_R1_001"
> head(bc.o, n=3)
[1] "/home/ubuntu/r_work/popsv_testing/bams/GL3610_S9_L001_R1_001/GL3610_S9_L001_R1_001-bc-gcCor.tsv.bgz"
[2] "/home/ubuntu/r_work/popsv_testing/bams/GL3724_S20_L001_R1_001/GL3724_S20_L001_R1_001-bc-gcCor.tsv.bgz"
[3] "/home/ubuntu/r_work/popsv_testing/bams/GL3767_S21_L001_R1_001/GL3767_S21_L001_R1_001-bc-gcCor.tsv.bgz"
After bin.bam and correct.GC (note that bgzip refuses to unzip unless the extension is .gz), only a fraction of chr22 has non-zero bin counts:
$ bgzip -cd GL3610_S9_L001_R1_001-bc.tsv.gz | wc -l
51306
$ bgzip -cd GL3610_S9_L001_R1_001-bc.tsv.gz | awk '$4!=0 { print }' | wc -l
1176
$ bgzip -cd GL3610_S9_L001_R1_001-bc.tsv.gz | awk '$4!=0 { print }' | less
22 42347001 42348000 2
22 42372001 42373000 2
22 42374001 42375000 2
22 42433001 42434000 6
22 42478001 42479000 2
22 42522001 42523000 900
22 42523001 42524000 1761
22 42524001 42525000 2019
22 42525001 42526000 1296
22 42526001 42527000 1313
22 42527001 42528000 196
22 42535001 42536000 174
22 42536001 42537000 1216
22 42537001 42538000 1385
22 42538001 42539000 1297
22 42539001 42540000 187
22 42540001 42541000 1068
22 42545001 42546000 180
22 42546001 42547000 1150
22 42547001 42548000 827
22 42548001 42549000 442
22 42549001 42550000 332
22 42550001 42551000 641
22 42551001 42552000 357
22 42561001 42562000 2
22 42597001 42598000 2
$ bgzip -cd GL3610_S9_L001_R1_001-bc-gcCor.tsv.gz | wc -l
51306
$ bgzip -cd GL3610_S9_L001_R1_001-bc-gcCor.tsv.gz | awk '$4!=0 { print }' | wc -l
1082
$ bgzip -cd GL3610_S9_L001_R1_001-bc-gcCor.tsv.gz | awk '$4!=0 { print }' | less
22 42347001 42348000 0.58
22 42372001 42373000 0.34
22 42374001 42375000 1.21
22 42478001 42479000 3.53
22 42522001 42523000 156.78
22 42523001 42524000 289.63
22 42524001 42525000 330.36
22 42525001 42526000 209.38
22 42526001 42527000 257.87
22 42527001 42528000 54.38
22 42535001 42536000 187.3
22 42536001 42537000 200
22 42537001 42538000 231.55
22 42538001 42539000 205.73
22 42539001 42540000 29.68
22 42540001 42541000 195.13
22 42545001 42546000 86.17
22 42546001 42547000 184.27
22 42547001 42548000 144.06
22 42548001 42549000 70.31
22 42549001 42550000 70.24
22 42550001 42551000 174.5
22 42551001 42552000 124.16
22 42561001 42562000 0.7
22 42597001 42598000 9.32
> bc.all.f = "bc-gcCor-all.tsv"
> pdf("sampQC.pdf")
> samp.qc.o = qc.samples(files.df, bins.df, outfile.prefix=bc.all.f, ref.samples=ref.samples)
Error in if (any(med == 0)) med[which(med == 0)] = 1 :
missing value where TRUE/FALSE needed
> traceback()
2: medvar.norm.internal(bc.df[, ref.samples])
1: qc.samples(files.df, bins.df, outfile.prefix = bc.all.f, ref.samples = ref.samples)
> dev.off()
null device
1
Since qc.samples calls medvar.norm.internal, we will take a closer look with debug():
> pdf("sampQC.pdf")
> debug(qc.samples)
> samp.qc.o = qc.samples(files.df, bins.df, outfile.prefix=bc.all.f, ref.samples=ref.samples)
** outputs function body and enters debug mode **
Browse[2]> n
debug: if (!all(c("chr", "start", "end") %in% colnames(bin.df))) {
stop("Missing column in 'bin.df'. 'chr', 'start' and 'end' are required.")
}
Browse[2]> n
debug: if (!all(c("sample", col.bc) %in% colnames(files.df))) {
stop("Missing column in 'files.df'. 'sample', '", col.bc,
"' are required.")
}
Browse[2]> n
debug: if (is.null(ref.samples)) {
ref.samples = as.character(files.df$sample)
}
Browse[2]> n
debug: if (is.null(nb.ref.samples)) {
nb.ref.samples = length(ref.samples)
}
Browse[2]> n
debug: nb.ref.samples = length(ref.samples)
Browse[2]> n
debug: read.bc.samples <- function(df, files.df, nb.cores = 1, med.med = NULL,
med.samp = NULL, file = NULL, append.f = FALSE, sub.bc = NULL) {
bc.df = createEmptyDF(c("character", rep("integer", 2), rep("numeric",
nrow(files.df))), nrow(df))
colnames(bc.df) = c("chr", "start", "end", as.character(files.df$sample))
bc.df$chr = df$chr
bc.df$start = df$start
bc.df$end = df$end
if (nb.cores > 1) {
bc.l = parallel::mclapply(as.character(files.df[, col.bc]),
function(fi) {
read.bedix(fi, df)[, 4]
}, mc.cores = nb.cores)
}
else {
bc.l = lapply(as.character(files.df[, col.bc]), function(fi) {
read.bedix(fi, df)[, 4]
})
}
if (is.null(med.med) & is.null(med.samp)) {
med.samp = lapply(bc.l, median, na.rm = TRUE)
med.med = median(unlist(med.samp), na.rm = TRUE)
}
for (samp.i in 1:nrow(files.df)) {
bc.df[, as.character(files.df$sample[samp.i])] = round(bc.l[[samp.i]] *
med.med/med.samp[[samp.i]], 2)
}
bc.df = bc.df[order(bc.df$chr, bc.df$start, bc.df$end), ]
if (!is.null(file)) {
write.table(bc.df, file = file, quote = FALSE, row.names = FALSE,
sep = "\t", append = append.f, col.names = !append.f)
}
if (!is.null(sub.bc)) {
bc.df = bc.df[sample(1:nrow(bc.df), min(c(nrow(bc.df),
sub.bc))), ]
}
return(bc.df)
}
Browse[2]> n
debug: center.pt <- function(m) {
dm = as.matrix(dist(m))
diag(dm) = NA
rownames(m)[which.min(apply(dm, 2, mean, na.rm = TRUE))]
}
Browse[2]> debug(read.bc.samples)
Browse[2]> n
debug: files.df = files.df[which(files.df$sample %in% ref.samples),
]
Browse[2]> ls()
[1] "appendIndex.outfile" "bin.df" "center.pt"
[4] "chunk.size" "col.bc" "files.df"
[7] "nb.cores" "nb.ref.samples" "outfile.prefix"
[10] "plot" "read.bc.samples" "ref.samples"
Browse[2]> n
debug: pc.all.df = bc.rand = NULL
Browse[2]> n
debug: if (length(ref.samples) > nb.ref.samples) {
sparse.pts <- function(m, nb.pts) {
dm = as.matrix(dist(m))
diag(dm) = NA
pts.jj = 1:nrow(dm)
pts.ii = which.max(apply(dm, 1, max, na.rm = TRUE))
pts.jj = pts.jj[-pts.ii]
while (length(pts.ii) < nb.pts) {
m.ii = which.max(apply(dm[pts.ii, pts.jj, drop = FALSE],
2, min, na.rm = TRUE))
pts.ii = c(pts.ii, pts.jj[m.ii])
pts.jj = pts.jj[-m.ii]
}
m[pts.ii, ]
}
bc.rand = read.bc.samples(bin.df[sample.int(nrow(bin.df),
min(c(nrow(bin.df)/2, 1000))), ], files.df, med.med = 1,
med.samp = rep(list(1), nrow(files.df)))
bc.mv = medvar.norm.internal(bc.rand[, ref.samples])
pc.all = prcomp(t(na.exclude(bc.mv)))
pc.all.df = data.frame(pc.all$x[, 1:3])
pc.all.df$sample = ref.samples
sp.o = sparse.pts(pc.all$x[, 1:2], nb.ref.samples)
pc.all.df$reference = pc.all.df$sample %in% rownames(sp.o)
if (plot) {
PC1 = PC2 = reference = NULL
print(ggplot2::ggplot(pc.all.df, ggplot2::aes(x = PC1,
y = PC2, size = reference)) + ggplot2::geom_point(alpha = 0.8) +
ggplot2::theme_bw() + ggplot2::scale_size_manual(values = 2:3))
}
ref.samples = rownames(sp.o)
bc.rand = bc.rand[, c("chr", "start", "end", ref.samples)]
}
Browse[2]> n
debug: files.df = files.df[which(files.df$sample %in% ref.samples),
]
Browse[2]> n
debug: bin.df = bin.df[order(bin.df$chr, bin.df$start, bin.df$end),
]
Browse[2]> n
debug: if (nrow(bin.df) < 1.3 * chunk.size) {
bc.df = read.bc.samples(bin.df, files.df, nb.cores)
write.table(bc.df, file = outfile.prefix, quote = FALSE,
row.names = FALSE, sep = "\t")
} else {
nb.chunks = ceiling(nrow(bin.df)/chunk.size)
bin.df$chunk = rep(1:nb.chunks, each = chunk.size)[1:nrow(bin.df)]
analyze.chunk <- function(df) {
ch.nb = as.numeric(df$chunk[1])
df.o = read.bedix(as.character(files.df[1, col.bc]),
df)
read.bc.samples(df.o, files.df, nb.cores, med.med, med.samp,
file = outfile.prefix, append.f = ch.nb > 1, sub.bc = chunk.size/nb.chunks)
}
if (is.null(bc.rand)) {
bc.rand = read.bc.samples(bin.df[sample.int(nrow(bin.df),
min(c(nrow(bin.df)/2, 1000))), ], files.df, med.med = 1,
med.samp = rep(list(1), nrow(files.df)))
}
med.samp = lapply(as.character(files.df$sample), function(samp.i) median(bc.rand[,
samp.i], na.rm = TRUE))
med.med = median(unlist(med.samp), na.rm = TRUE)
bc.res = lapply(unique(bin.df$chunk), function(chunk.i) {
analyze.chunk(bin.df[which(bin.df$chunk == chunk.i),
])
})
bc.df = plyr::ldply(bc.res, identity)
}
Browse[2]> n
debug: bc.df = read.bc.samples(bin.df, files.df, nb.cores)
Browse[2]> n
debugging in: read.bc.samples(bin.df, files.df, nb.cores)
debug: {
bc.df = createEmptyDF(c("character", rep("integer", 2), rep("numeric",
nrow(files.df))), nrow(df))
colnames(bc.df) = c("chr", "start", "end", as.character(files.df$sample))
bc.df$chr = df$chr
bc.df$start = df$start
bc.df$end = df$end
if (nb.cores > 1) {
bc.l = parallel::mclapply(as.character(files.df[, col.bc]),
function(fi) {
read.bedix(fi, df)[, 4]
}, mc.cores = nb.cores)
}
else {
bc.l = lapply(as.character(files.df[, col.bc]), function(fi) {
read.bedix(fi, df)[, 4]
})
}
if (is.null(med.med) & is.null(med.samp)) {
med.samp = lapply(bc.l, median, na.rm = TRUE)
med.med = median(unlist(med.samp), na.rm = TRUE)
}
for (samp.i in 1:nrow(files.df)) {
bc.df[, as.character(files.df$sample[samp.i])] = round(bc.l[[samp.i]] *
med.med/med.samp[[samp.i]], 2)
}
bc.df = bc.df[order(bc.df$chr, bc.df$start, bc.df$end), ]
if (!is.null(file)) {
write.table(bc.df, file = file, quote = FALSE, row.names = FALSE,
sep = "\t", append = append.f, col.names = !append.f)
}
if (!is.null(sub.bc)) {
bc.df = bc.df[sample(1:nrow(bc.df), min(c(nrow(bc.df),
sub.bc))), ]
}
return(bc.df)
}
Browse[3]> n
debug: bc.df = createEmptyDF(c("character", rep("integer", 2), rep("numeric",
nrow(files.df))), nrow(df))
Browse[3]> n
debug: colnames(bc.df) = c("chr", "start", "end", as.character(files.df$sample))
Browse[3]> head(bc.df, n=2)
V1 V2 V3 V4 V5 V6 V7
1 <NA> NA NA NA NA NA NA
2 <NA> NA NA NA NA NA NA
Browse[3]> n
debug: bc.df$chr = df$chr
Browse[3]> head(bc.df, n=2)
chr start end GL3610_S9_L001_R1_001 GL3893_S1_L001_R1_001
1 <NA> NA NA NA NA
2 <NA> NA NA NA NA
GL4062_S22_L001_R1_001 GL4102_S11_L001_R1_001
1 NA NA
2 NA NA
Browse[3]> n
debug: bc.df$start = df$start
Browse[3]> n
debug: bc.df$end = df$end
Browse[3]> n
debug: if (nb.cores > 1) {
bc.l = parallel::mclapply(as.character(files.df[, col.bc]),
function(fi) {
read.bedix(fi, df)[, 4]
}, mc.cores = nb.cores)
} else {
bc.l = lapply(as.character(files.df[, col.bc]), function(fi) {
read.bedix(fi, df)[, 4]
})
}
Browse[3]> head(bc.df, n=2)
chr start end GL3610_S9_L001_R1_001 GL3893_S1_L001_R1_001
1 22 1 1000 NA NA
2 22 1001 2000 NA NA
GL4062_S22_L001_R1_001 GL4102_S11_L001_R1_001
1 NA NA
2 NA NA
Browse[3]> n
debug: bc.l = lapply(as.character(files.df[, col.bc]), function(fi) {
read.bedix(fi, df)[, 4]
})
Browse[3]> n
debug: if (is.null(med.med) & is.null(med.samp)) {
med.samp = lapply(bc.l, median, na.rm = TRUE)
med.med = median(unlist(med.samp), na.rm = TRUE)
}
Browse[3]> length(bc.l[[1]])
[1] 51305
Browse[3]> length(subset(bc.l[[1]], bc.l[[1]]!=0))
[1] 1081
Browse[3]> length(bc.l[[2]])
[1] 51305
Browse[3]> length(subset(bc.l[[2]], bc.l[[2]]!=0))
[1] 1525
Browse[3]> n
debug: med.samp = lapply(bc.l, median, na.rm = TRUE)
Browse[3]> n
debug: med.med = median(unlist(med.samp), na.rm = TRUE)
Browse[3]> med.samp
[[1]]
[1] 0
[[2]]
[1] 0
[[3]]
[1] 0
[[4]]
[1] 0
Browse[3]> n
debug: for (samp.i in 1:nrow(files.df)) {
bc.df[, as.character(files.df$sample[samp.i])] = round(bc.l[[samp.i]] *
med.med/med.samp[[samp.i]], 2)
}
Browse[3]> med.med
[1] 0
Browse[3]> med.med/med.samp[[1]]
[1] NaN
Browse[3]> head(bc.df)
chr start end GL3610_S9_L001_R1_001 GL3893_S1_L001_R1_001
1 22 1 1000 NA NA
2 22 1001 2000 NA NA
3 22 2001 3000 NA NA
4 22 3001 4000 NA NA
5 22 4001 5000 NA NA
6 22 5001 6000 NA NA
GL4062_S22_L001_R1_001 GL4102_S11_L001_R1_001
1 NA NA
2 NA NA
3 NA NA
4 NA NA
5 NA NA
6 NA NA
Browse[3]> n
debug: bc.df[, as.character(files.df$sample[samp.i])] = round(bc.l[[samp.i]] *
med.med/med.samp[[samp.i]], 2)
Browse[3]> n
debug: bc.df[, as.character(files.df$sample[samp.i])] = round(bc.l[[samp.i]] *
med.med/med.samp[[samp.i]], 2)
Browse[3]> n
debug: bc.df[, as.character(files.df$sample[samp.i])] = round(bc.l[[samp.i]] *
med.med/med.samp[[samp.i]], 2)
Browse[3]> n
debug: bc.df[, as.character(files.df$sample[samp.i])] = round(bc.l[[samp.i]] *
med.med/med.samp[[samp.i]], 2)
Browse[3]> n
debug: bc.df = bc.df[order(bc.df$chr, bc.df$start, bc.df$end), ]
Browse[3]> n
debug: if (!is.null(file)) {
write.table(bc.df, file = file, quote = FALSE, row.names = FALSE,
sep = "\t", append = append.f, col.names = !append.f)
}
Browse[3]> head(bc.df)
chr start end GL3610_S9_L001_R1_001 GL3893_S1_L001_R1_001
1 22 1 1000 NaN NaN
2 22 1001 2000 NaN NaN
3 22 2001 3000 NaN NaN
4 22 3001 4000 NaN NaN
5 22 4001 5000 NaN NaN
6 22 5001 6000 NaN NaN
GL4062_S22_L001_R1_001 GL4102_S11_L001_R1_001
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 NaN NaN
Browse[3]> n
debug: if (!is.null(sub.bc)) {
bc.df = bc.df[sample(1:nrow(bc.df), min(c(nrow(bc.df), sub.bc))),
]
}
Browse[3]> n
debug: return(bc.df)
Browse[3]> n
exiting from: read.bc.samples(bin.df, files.df, nb.cores)
debug: write.table(bc.df, file = outfile.prefix, quote = FALSE, row.names = FALSE,
sep = "\t")
Browse[2]> n
debug: if (appendIndex.outfile) {
final.file = paste(outfile.prefix, ".bgz", sep = "")
Rsamtools::bgzip(outfile.prefix, dest = final.file, overwrite = TRUE)
file.remove(outfile.prefix)
Rsamtools::indexTabix(final.file, format = "bed")
outfile.prefix = final.file
}
Browse[2]> n
debug: final.file = paste(outfile.prefix, ".bgz", sep = "")
Browse[2]> n
debug: Rsamtools::bgzip(outfile.prefix, dest = final.file, overwrite = TRUE)
Browse[2]> n
debug: file.remove(outfile.prefix)
Browse[2]> n
debug: Rsamtools::indexTabix(final.file, format = "bed")
Browse[2]> n
debug: outfile.prefix = final.file
Browse[2]> n
debug: bc.mv = medvar.norm.internal(bc.df[, ref.samples])
Browse[2]> debug(medvar.norm.internal)
Browse[2]> n
debugging in: medvar.norm.internal(bc.df[, ref.samples])
debug: {
med = apply(bc, 2, median, na.rm = TRUE)
if (any(med == 0))
med[which(med == 0)] = 1
med.c = mean(med)
bc = t(t(bc) * med.c/med)
bc = bc - med.c
md = apply(bc, 2, function(x) median(abs(x), na.rm = TRUE))
md.c = median(abs(bc), na.rm = TRUE)
bc = t(t(bc) * md.c/md)
return(bc + med.c)
}
Browse[3]> ls()
[1] "bc"
Browse[3]> head(bc)
GL3610_S9_L001_R1_001 GL3893_S1_L001_R1_001 GL4062_S22_L001_R1_001
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
GL4102_S11_L001_R1_001
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
Browse[3]> n
debug: med = apply(bc, 2, median, na.rm = TRUE)
Browse[3]> n
debug: if (any(med == 0)) med[which(med == 0)] = 1
Browse[3]> head(med)
GL3610_S9_L001_R1_001 GL3893_S1_L001_R1_001 GL4062_S22_L001_R1_001
NA NA NA
GL4102_S11_L001_R1_001
NA
Browse[3]> any(med == 0)
[1] NA
Browse[3]> n
Error in if (any(med == 0)) med[which(med == 0)] = 1 :
missing value where TRUE/FALSE needed
Thanks for this extensive investigation. It makes sense now.
So I imagine you work with targeted sequencing, or extremely low coverage, to have reads in only 0.02% of the chromosome? In this case it is semi-deliberate that it breaks, because we assume that at this stage (reference sample definition) you work with covered regions. Side note: if it's very low-coverage whole-genome sequencing, you might want to use larger bins to get enough signal.
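On the larger-bins point: building a bins.df with bigger bins is mostly a matter of generating wider chr/start/end intervals. A minimal sketch in plain R (make.bins is a hypothetical helper, not a PopSV function; the chromosome length and 5 kb bin size are example values, and GC content would still need to be computed afterwards):

```r
# Hypothetical helper, not part of PopSV: build a bins.df-like data.frame
# of fixed-size bins for one chromosome. Larger bins accumulate more reads
# per bin, which helps with very low-coverage data.
make.bins <- function(chr, chr.len, bin.size = 5000) {
  starts <- seq(1, chr.len, by = bin.size)
  data.frame(chr = chr,
             start = starts,
             end = pmin(starts + bin.size - 1, chr.len),
             stringsAsFactors = FALSE)
}

bins.df <- make.bins("22", 51304566, bin.size = 5000)  # hg19 chr22 length
```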
To analyze large-scale exome sequencing, I usually filter non-covered bins before counting all the samples. In practice it boils down to reducing the bins.df object. Here is an example (in BatchJobs style):
## Get counts for all bins for some samples
rand.samps = sample(files.df$sample, 10)
getBCquick.reg <- makeRegistry(id="getBCquick")
getBCquick.f <- function(bins.f, files.df, rand.samps){
load(bins.f)
files.df = subset(files.df, sample %in% rand.samps)
library(PopSV)
quick.count(files.df, bins.df, nb.cores=6)
}
batchMap(getBCquick.reg, getBCquick.f, "bins.RData", more.args=list(files.df=files.df, rand.samps=rand.samps))
submitJobs(getBCquick.reg, 1, resources=list(walltime="6:0:0", nodes="1", cores="6", queue="sw"), wait=function(retries) 100, max.retries=10)
showStatus(getBCquick.reg)
## Remove non-covered bins
bc.sub = loadResult(getBCquick.reg,1)
med.bc = apply(bc.sub[,-(1:3)],1, median, na.rm=TRUE)
summary(med.bc)
bc.sub = bc.sub[which(med.bc>10), 1:3]
bins.df = merge(bins.df, bc.sub)
save(bins.df, file="bins.RData")
Then the normal analysis can be performed using this new bins.df.
Otherwise, if you already counted the reads in all the bins of all your samples, you can just use a reduced bins.df from the qc.samples step onward. To find the covered bins you could try something like this:
## Get bin count from 10 samples
bc.sub = lapply(sample(files.df$bc.gz, 10), function(filename){
read.table(filename, as.is=TRUE, header=TRUE, sep="\t")
})
bc.sub = cbind(bc.sub[[1]][,1:3], do.call(cbind, lapply(bc.sub, function(bc.s)bc.s$bc)))
## Find non-covered bins
med.bc = apply(bc.sub[,-(1:3)],1, median, na.rm=TRUE)
bc.sub = bc.sub[which(med.bc>50), 1:3] ## Or whatever threshold you think is best
bins.df = merge(bins.df, bc.sub)
save(bins.df, file="bins.RData")
And then you would continue normally with:
samp.qc.o = qc.samples(files.df, bins.df, ref.samples, outfile.prefix=bc.all.f)
...
Thanks,
Hello, thank you for the suggestions. We are doing targeted sequencing, so I will implement a method to easily restrict bins.df to our regions of interest. For now I have been able to get past the error above via your methods.
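In case it helps others, restricting bins.df to target regions can be done with a simple overlap filter. A minimal sketch in plain R (restrict.bins is a hypothetical helper, not a PopSV function; any amount of overlap keeps a bin):

```r
# Hypothetical helper, not part of PopSV: keep only the bins of bins.df that
# overlap a set of target regions (e.g. read from a capture BED file).
restrict.bins <- function(bins.df, targets) {
  keep <- sapply(seq_len(nrow(bins.df)), function(ii) {
    any(targets$chr == bins.df$chr[ii] &
        targets$start <= bins.df$end[ii] &
        targets$end >= bins.df$start[ii])
  })
  bins.df[keep, ]
}

# Toy example: two 1 kb bins, one target region overlapping the second bin
bins.df <- data.frame(chr = "22", start = c(1, 1001), end = c(1000, 2000))
targets <- data.frame(chr = "22", start = 1500, end = 1600)
restrict.bins(bins.df, targets)   # keeps only the 1001-2000 bin
```

For many bins and targets, GenomicRanges::findOverlaps would scale much better, but the idea is the same.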
I am attempting to run a local workflow mirroring run-PopSV-local.R from the scripts folder. I am using a small sample of BAM files to learn the package and make sure I can run everything. When running:
I receive the error:
Some additional information that I think may be helpful:
- I am using the forPaper branch.
- bams.tsv did not include a group column to indicate which samples are reference samples, so neither does files.df.
- ref.samples = files.df[c(1,5,7,8), 1] was used to grab the reference sample names, i.e. they are hard-coded for now; this will change for future workflows.