kassambara / fastqcr

fastqcr: Quality Control of Sequencing Data
http://www.sthda.com/english/rpkgs/fastqcr
68 stars 23 forks source link

qc_read_collection() error: Can't combine <double> and <character> #22

Open scrameri opened 2 years ago

scrameri commented 2 years ago

Dear Alboukadel,

Many thanks for this and other handy R packages!

I've beein using qc_read_collection(), on many "*_fastqc.zip" files, and noticed that this function suffers from dplyr issue #5358 when binding data.frames.

Here is a reprex leading to the error in lapply(res, dplyr::bind_rows, .id = "sample") inside qc_read_collection():

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# create example data.frames to be bound using dplyr::bind_rows()
dn <- data.frame(Length = 150, Count = 2)
ds <- data.frame(Length = c("150-155"), Count = 4)
de <- data.frame(array(NA, dim = c(0,0)))
res <- list(module = list(Sample1=dn, Sample2=ds, Sample3=de))

str(res)
#> List of 1
#>  $ module:List of 3
#>   ..$ Sample1:'data.frame':  1 obs. of  2 variables:
#>   .. ..$ Length: num 150
#>   .. ..$ Count : num 2
#>   ..$ Sample2:'data.frame':  1 obs. of  2 variables:
#>   .. ..$ Length: chr "150-155"
#>   .. ..$ Count : num 4
#>   ..$ Sample3:'data.frame':  0 obs. of  0 variables

# reproduce the error
res <- lapply(res, dplyr::bind_rows, .id = "sample")
#> Error: Can't combine `Sample1$Length` <double> and `Sample2$Length` <character>.

The error above will occur when calling qc_read_collection(files, sample_names, modules = "all") on a collection of "*_fastqc.zip" files, if there is a sample in files that has a different class for any variable in the data.frame to be bound.

In my case, this happened mostly with the modules $sequence_length_distribution (variable "Length") or $kmer_content (variable "Max Obs/Exp Position").

Here is a possible fix I came up with:

# convert <double> to <character> if a column should be <character>
res <- lapply(res, function(x) {
  # tibble with classes for each non-emtpy data.frame column
  dcl <- dplyr::bind_rows(lapply(x, function(y) {
    if (nrow(y) > 0) sapply(y, class)
  }))
  # define classes to assign
  cl <- apply(dcl, 2, function(z) ifelse(any(z=="character"),"character",z[1]))
  # assign classes
  lapply(x, function(w) {
    if (nrow(w) > 0) {for (i in names(w)) {class(w[,i]) <- cl[i]} ; w}
  })
})

# reproduce the fix
res <- lapply(res, dplyr::bind_rows, .id = "sample")
str(res)
#> List of 1
#>  $ module:'data.frame':  2 obs. of  3 variables:
#>   ..$ sample: chr [1:2] "Sample1" "Sample2"
#>   ..$ Length: chr [1:2] "150" "150-155"
#>   ..$ Count : num [1:2] 2 4

Created on 2021-12-21 by the reprex package (v2.0.1)

Perhaps a patch for qc_read_collection() similar to the one below (enclosed by ##<##<##) could be useful generally, given that dplyr is not going to fix this because it is a "deliberate design decision" (see #5358)?

qc_read_collection <- function(files, sample_names, modules = "all", verbose=T) 
{
  module_data <- lapply(files, qc_read, modules = modules, 
                        verbose = verbose)
  if (missing(sample_names) || length(sample_names) != length(files)) {
    sample_names <- lapply(module_data, function(x) unique(x$summary))
    sample_names <- unlist(sample_names)
  }
  names(module_data) <- sample_names
  module_names <- unique(unlist(lapply(module_data, names)))
  res <- list()
  for (i in seq_along(module_names)) {
    res[[i]] <- lapply(module_data, function(x) as.data.frame(x[[module_names[i]]]))
  }
  names(res) <- module_names

  ##<##<## begin patch
  res <- lapply(res, function(x) {
    dcl <- dplyr::bind_rows(lapply(x, function(y) {
      if (nrow(y) > 0) sapply(y, class)
      }))
    cl <- apply(dcl, 2, function(z) ifelse(any(z=="character"),"character",z[1]))
    lapply(x, function(w) {
      if (nrow(w) > 0) {for (i in names(w)) {class(w[,i]) <- cl[i]} ; w}
    })
  })
  ##<##<## end patch

  res <- lapply(res, dplyr::bind_rows, .id = "sample")
  res <- structure(res, class = c("list", "qc_read_collection"))
  res
}

Perhaps you'd like to look into this yourself, and maybe come up with an easier and prettier solution? :)

Cheers, Simon