I've beein using qc_read_collection(), on many "*_fastqc.zip" files, and noticed that this function suffers from dplyr issue #5358 when binding data.frames.
Here is a reprex leading to the error in lapply(res, dplyr::bind_rows, .id = "sample") inside qc_read_collection():
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
# create example data.frames to be bound using dplyr::bind_rows()
dn <- data.frame(Length = 150, Count = 2)
ds <- data.frame(Length = c("150-155"), Count = 4)
de <- data.frame(array(NA, dim = c(0,0)))
res <- list(module = list(Sample1=dn, Sample2=ds, Sample3=de))
str(res)
#> List of 1
#> $ module:List of 3
#> ..$ Sample1:'data.frame': 1 obs. of 2 variables:
#> .. ..$ Length: num 150
#> .. ..$ Count : num 2
#> ..$ Sample2:'data.frame': 1 obs. of 2 variables:
#> .. ..$ Length: chr "150-155"
#> .. ..$ Count : num 4
#> ..$ Sample3:'data.frame': 0 obs. of 0 variables
# reproduce the error
res <- lapply(res, dplyr::bind_rows, .id = "sample")
#> Error: Can't combine `Sample1$Length` <double> and `Sample2$Length` <character>.
The error above will occur when calling qc_read_collection(files, sample_names, modules = "all") on a collection of "*_fastqc.zip" files, if there is a sample in files that has a different class for any variable in the data.frame to be bound.
In my case, this happened mostly with the modules $sequence_length_distribution (variable "Length") or $kmer_content (variable "Max Obs/Exp Position").
Here is a possible fix I came up with:
# convert <double> to <character> if a column should be <character>
res <- lapply(res, function(x) {
# tibble with classes for each non-emtpy data.frame column
dcl <- dplyr::bind_rows(lapply(x, function(y) {
if (nrow(y) > 0) sapply(y, class)
}))
# define classes to assign
cl <- apply(dcl, 2, function(z) ifelse(any(z=="character"),"character",z[1]))
# assign classes
lapply(x, function(w) {
if (nrow(w) > 0) {for (i in names(w)) {class(w[,i]) <- cl[i]} ; w}
})
})
# reproduce the fix
res <- lapply(res, dplyr::bind_rows, .id = "sample")
str(res)
#> List of 1
#> $ module:'data.frame': 2 obs. of 3 variables:
#> ..$ sample: chr [1:2] "Sample1" "Sample2"
#> ..$ Length: chr [1:2] "150" "150-155"
#> ..$ Count : num [1:2] 2 4
Perhaps a patch for qc_read_collection() similar to the one below (enclosed by ##<##<##) could be useful generally, given that dplyr is not going to fix this because it is a "deliberate design decision" (see #5358)?
qc_read_collection <- function(files, sample_names, modules = "all", verbose=T)
{
module_data <- lapply(files, qc_read, modules = modules,
verbose = verbose)
if (missing(sample_names) || length(sample_names) != length(files)) {
sample_names <- lapply(module_data, function(x) unique(x$summary))
sample_names <- unlist(sample_names)
}
names(module_data) <- sample_names
module_names <- unique(unlist(lapply(module_data, names)))
res <- list()
for (i in seq_along(module_names)) {
res[[i]] <- lapply(module_data, function(x) as.data.frame(x[[module_names[i]]]))
}
names(res) <- module_names
##<##<## begin patch
res <- lapply(res, function(x) {
dcl <- dplyr::bind_rows(lapply(x, function(y) {
if (nrow(y) > 0) sapply(y, class)
}))
cl <- apply(dcl, 2, function(z) ifelse(any(z=="character"),"character",z[1]))
lapply(x, function(w) {
if (nrow(w) > 0) {for (i in names(w)) {class(w[,i]) <- cl[i]} ; w}
})
})
##<##<## end patch
res <- lapply(res, dplyr::bind_rows, .id = "sample")
res <- structure(res, class = c("list", "qc_read_collection"))
res
}
Perhaps you'd like to look into this yourself, and maybe come up with an easier and prettier solution? :)
Dear Alboukadel,
Many thanks for this and other handy R packages!
I've beein using
qc_read_collection()
, on many"*_fastqc.zip"
files, and noticed that this function suffers from dplyr issue #5358 when binding data.frames.Here is a reprex leading to the error in
lapply(res, dplyr::bind_rows, .id = "sample")
insideqc_read_collection()
:The error above will occur when calling
qc_read_collection(files, sample_names, modules = "all")
on a collection of"*_fastqc.zip"
files, if there is a sample infiles
that has a different class for any variable in thedata.frame
to be bound.In my case, this happened mostly with the modules
$sequence_length_distribution
(variable "Length") or$kmer_content
(variable "Max Obs/Exp Position").Here is a possible fix I came up with:
Created on 2021-12-21 by the reprex package (v2.0.1)
Perhaps a patch for
qc_read_collection()
similar to the one below (enclosed by ##<##<##) could be useful generally, given thatdplyr
is not going to fix this because it is a "deliberate design decision" (see #5358)?Perhaps you'd like to look into this yourself, and maybe come up with an easier and prettier solution? :)
Cheers, Simon