hoelzer-lab / rnaflow

A simple RNA-Seq differential gene expression pipeline using Nextflow
GNU General Public License v3.0
91 stars 19 forks source link

Error executing process > 'expression_reference_based:deseq2 (1)' #187

Closed Gavin098 closed 2 years ago

Gavin098 commented 2 years ago

Hello @hoelzer I have executed a test run which has been doing great. However, when running on the machine,I have some problems. Please see the following for the entire result shown in the shell.

N E X T F L O W  ~  version 22.02.0-edge
Launching `../test_rnaflow/rnaflow/main.nf` [ridiculous_mestorf] - revision: 4a512a6ad6

Profile: local,conda

Current User: ganjunwei
Nextflow-version: 22.02.0.edge
Starting time: 2022-04-28T22:00:28.172799104+08:00
Workdir location:
  /home/ganjunwei/test_rnaflow4/work
Launchdir location:
  /home/ganjunwei/test_rnaflow4
Permanent cache directory:
  ../test_rnaflow/nextflow-autodownload-databases
Conda cache directory:
  ../test_rnaflow/conda
Configuration files:
  [/home/ganjunwei/test_rnaflow/rnaflow/nextflow.config]
Cmd line:
  nextflow run ../test_rnaflow/rnaflow --reads input.csv --genome fastas.csv --annotation gtfs.csv --mode paired --cores 30 --condaCacheDir ../test_rnaflow/conda --permanentCacheDir ../test_rnaflow/nextflow-autodownload-databases --skip_sortmerna -profile local,conda -resume

CPUs to use: 30, maximal CPUs to use: 144

WARNING: not a stable execution. Please use -r for full reproducibility.

WARNING: Output folder already exists. Results might be overwritten! You can adjust the output folder via [--output]

WARNING: Parameter --mode is deprecated, read mode will automatically be detected from the sample file.

R N A F L O W : R N A - S E Q  A S S E M B L Y  &  D I F F E R E N T I A L  G E N E  E X P R E S S I O N  A N A L Y S I S
= = = = = = =   = = = = = = =  = = = = = = = =  =  = = = = = = = = = = = =  = = = =  = = = = = = = = = =  = = = = = = = =
Output path:                    results
Strandedness                    unstranded
Read mode:                      paired-end
TPM threshold:                  1 
Comparisons:                    all
Nanopore mode:                  false

executor >  local (43)
[53/4ede07] process > concat_genome                                          [100%] 1 of 1, cached: 1 ✔
[9b/efd8fb] process > concat_annotation                                      [100%] 1 of 1, cached: 1 ✔
[fd/72ba79] process > preprocess_illumina:fastqcPre (SD_REP3)                [100%] 6 of 6 ✔
executor >  local (43)
[53/4ede07] process > concat_genome                                          [100%] 1 of 1, cached: 1 ✔
[9b/efd8fb] process > concat_annotation                                      [100%] 1 of 1, cached: 1 ✔
[fd/72ba79] process > preprocess_illumina:fastqcPre (SD_REP3)                [100%] 6 of 6 ✔
[cc/abaab3] process > preprocess_illumina:fastp (RD_REP2)                    [100%] 6 of 6 ✔
[cd/820442] process > preprocess_illumina:fastqcPost (RD_REP2)               [100%] 6 of 6 ✔
[49/fd1a66] process > preprocess_illumina:hisat2index                        [100%] 1 of 1 ✔
[1e/b249c8] process > preprocess_illumina:hisat2 (RD_REP2)                   [100%] 6 of 6 ✔
[d4/a399d1] process > preprocess_illumina:index_bam (RD_REP2)                [100%] 6 of 6 ✔
[da/1eaaf1] process > expression_reference_based:featurecounts (RD_REP2)     [100%] 6 of 6 ✔
[91/8b2041] process > expression_reference_based:format_annotation_gene_rows [100%] 1 of 1 ✔
[5b/c8e1bd] process > expression_reference_based:format_annotation           [100%] 1 of 1 ✔
[7a/82ff39] process > expression_reference_based:tpm_filter                  [100%] 1 of 1 ✔
[b0/b89fa6] process > expression_reference_based:deseq2 (1)                  [100%] 2 of 2, failed: 2, retries: 1 ✘
[01/d12333] process > expression_reference_based:multiqc_sample_names (1)    [100%] 1 of 1, cached: 1 ✔
[-        ] process > expression_reference_based:multiqc (1)                 -
[af/8dffac] NOTE: Process `expression_reference_based:deseq2 (1)` terminated with an error exit status (1) -- Execution is retried (1)
Error executing process > 'expression_reference_based:deseq2 (1)'

Caused by:
  Process `expression_reference_based:deseq2 (1)` terminated with an error exit status (1)                                                                                                                                                                                                                                                                        
Command executed:                                                                                                                                                                 
  R CMD BATCH --no-save --no-restore '--args c(".") c("RD_REP1.counts.filtered.formated.tsv","RD_REP2.counts.filtered.formated.tsv","RD_REP3.counts.filtered.formated.tsv","SD_REP1.counts.filtered.formated.tsv","SD_REP2.counts.filtered.formated.tsv","SD_REP3.counts.filtered.formated.tsv") c("mock","mock","mock","treated","treated","treated") c("RD_REP1","RD_REP2","RD_REP3","SD_REP1","SD_REP2","SD_REP3") c("mock","treated") c("mock:treated") c("annotation.id2details") c("annotation.gene.gtf") c("A","B","C","A","B","C") c("") c("regionReport_DESeq2Exploration_custom.Rmd") c(30) c("ensembl_gene_id")' deseq2.R

Command exit status:
  1

Command output:
  (empty)

Work dir:
  /home/ganjunwei/test_rnaflow4/work/b0/b89fa6a1e6b0b1761e1a4d9d75e20c

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

Could you please let me know what's the problem and how to fix it.

Thank you very much.

Gavin

fischer-hub commented 2 years ago

Hi @Gavin098, thanks for reporting this issue!

Could you provide the exact command that you used when running into this error? Also there is a log file of the DESeq2 process in either results/07-DifferentialExpression/DESeq2/deseq2.Rout or in your work directory in /home/ganjunwei/test_rnaflow4/work/b0/b89fa6a1e6b0b1761e1a4d9d75e20c/deseq2.Rout if you can't find it in the results directory. It would be very helpful if you could attach that log file to track down the source of the crash!

Thanks in advance!

Gavin098 commented 2 years ago

Hi @Gavin098, thanks for reporting this issue!

Could you provide the exact command that you used when running into this error? Also there is a log file of the DESeq2 process in either results/07-DifferentialExpression/DESeq2/deseq2.Rout or in your work directory in /home/ganjunwei/test_rnaflow4/work/b0/b89fa6a1e6b0b1761e1a4d9d75e20c/deseq2.Rout if you can't find it in the results directory. It would be very helpful if you could attach that log file to track down the source of the crash!

Thanks in advance!

Ok no problem, thanks for your feedback. My command is as following:

$ nextflow run ../test_rnaflow/rnaflow --reads input.csv --genome fastas.csv --annotation gtfs.csv --mode paired --cores 30 --condaCacheDir ../test_rnaflow/conda --permanentCacheDir ../test_rnaflow/nextflow-autodownload-databases --skip_sortmerna -profile local,conda -resume

I opened the deseq2.Rout and found that it was R that had the problem.

R version 3.5.3 (2019-03-11) -- "Great Truth"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library("DESeq2")
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colMeans,
    colnames, colSums, dirname, do.call, duplicated, eval, evalq,
    Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply,
    lengths, Map, mapply, match, mget, order, paste, pmax, pmax.int,
    pmin, pmin.int, Position, rank, rbind, Reduce, rowMeans, rownames,
    rowSums, sapply, setdiff, sort, table, tapply, union, unique,
    unsplit, which, which.max, which.min

Attaching package: ‘S4Vectors’

The following object is masked from ‘package:base’:

    expand.grid

Loading required package: IRanges
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: DelayedArray
Loading required package: matrixStats

Attaching package: ‘matrixStats’

The following objects are masked from ‘package:Biobase’:

    anyMissing, rowMedians

Loading required package: BiocParallel

Attaching package: ‘DelayedArray’

The following objects are masked from ‘package:matrixStats’:

    colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges

The following objects are masked from ‘package:base’:

    aperm, apply

> library("RColorBrewer")
> library("gplots")

Attaching package: ‘gplots’

The following object is masked from ‘package:IRanges’:

    space

The following object is masked from ‘package:S4Vectors’:

    space

The following object is masked from ‘package:stats’:

    lowess

> library("ggplot2")
> library("ReportingTools")
Error in library("ReportingTools") : 
  there is no package called ‘ReportingTools’
Execution halted
fischer-hub commented 2 years ago

Thanks, that is already very helpful @Gavin098 !

So it seems like the R package Reportingtools was not installed successfully to the DESeq2 conda environment and now it is not available in the script. You could try to delete the ../test_rnaflow/conda/deseq2* environment folder manually and see if Reportingtools will be correctly installed the next time. Sometimes conda environments can get corrupted like this during the installation. Another option would be to use the singularity or docker profile of the pipeline instead of the conda profile. This has the advantage that they run every process in its own docker container which have all the necessary packages already pre installed. Howevever, in case you don't have docker or singularity installed on your machine, the install requires root or sudo privileges.

Let me know if this worked out!

Gavin098 commented 2 years ago

Thank you for the guidance! Based on your suggestion, I rerun the programnextflow run rnaflow --reads input.csv --genome fastas.csv --annotation gtfs.csv --cores 40 -profile local,conda --output result2.But a new problem arose and I opened the deseq2.Rout I don't understand how to set the -r parameter, and I'd appreciate it if you could tell me

library("DESeq2")
library("RColorBrewer")
library("gplots")
library("ggplot2")
library("ReportingTools")
library("pheatmap")
library("biomaRt")
library("svglite")
library("piano")
library("apeglm")
library("EnhancedVolcano")
library("regionReport")
library("stringr")
library("WebGestaltR")
library("snowfall")
library("openxlsx")

#####################################################################################
## FUNCTIONS
#####################################################################################
build.project.structure <- function(out.dir) {
  dir.create(file.path(out.dir), showWarnings = FALSE)
  # Build necessary project structure
  dir.create(file.path(out.dir, '/results'), showWarnings = FALSE)
  dir.create(file.path(out.dir, '/reports'), showWarnings = FALSE)
  dir.create(file.path(out.dir, '/input'), showWarnings = FALSE)
  for (plot.type in c('volcano', 'PCA', 'heatmaps', 'MA', 'sample2sample')) {
    dir.create(file.path(out.dir, paste0('/plots/', plot.type)), showWarnings = FALSE, recursive = TRUE)
  }
}

write.table.to.file <- function(as.data.frame.object, output.path, output.name, id2name, row.names=TRUE, col.names=TRUE) {
  output.file.basename <- paste0(output.path, "/", output.name)
  write.table(as.data.frame.object, file=paste0(output.file.basename, ".csv"), sep = ",", row.names=row.names, col.names=col.names)
  if( is.na(col.names) ){
    write.xlsx(as.data.frame.object, file=paste0(output.file.basename, ".xlsx"), row.names=row.names, col.names=TRUE, asTable=TRUE)
  } else {
    write.xlsx(as.data.frame.object, file=paste0(output.file.basename, ".xlsx"), row.names=row.names, col.names=col.names, asTable=TRUE)
  }

  if ( !missing(id2name)) {
    output.file.basename.extended <- paste0(output.path, "/", output.name, "_extended")
    ## add real gene names and biotypes to the csv files
    system(paste("./improve_deseq_table.rb", paste0(output.file.basename.extended, ".csv" ), paste0(output.file.basename, ".csv"), id2name, sep=" "), wait=TRUE)
    write.xlsx(read.csv(paste0(output.file.basename.extended, ".csv" )), paste0(output.file.basename.extended, ".xlsx" ), asTable=TRUE)
  }
}

plot.sample2sample <- function(out.dir, col.labels, trsf_data, trsf_type, colors) {
  ## get sample-to-sample distances
  sampleDists <- dist(t(assay(trsf_data)))
  sampleDistMatrix <- as.matrix(sampleDists)
  ## add names
  rownames(sampleDistMatrix) <- with(colData(trsf_data), col.labels)
  colnames(sampleDistMatrix) <- with(colData(trsf_data), col.labels)

  pdf(paste(out.dir, paste0("sample2sample_", trsf_type, ".pdf"), sep="/"))
  pheatmap(sampleDistMatrix, clustering_distance_rows = sampleDists, clustering_distance_cols = sampleDists, color = colors)
  dev.off()
  svg(paste(out.dir, paste0("sample2sample_", trsf_type, ".svg"), sep="/"))
  pheatmap(sampleDistMatrix, clustering_distance_rows = sampleDists, clustering_distance_cols = sampleDists, color = colors)
  dev.off()

  # sample2sample heatmap with color key and histogram
  # hc <- hclust(sampleDists)
  # heatmap.2(sampleDistMatrix, Rowv=as.dendrogram(hc), symm=TRUE, trace="none", col = colors, margin=c(13, 13))
}

plot.ma <- function(output.dir, deseq2.res, alpha) {
  pdf(paste(output.dir, paste0("MA_alpha", alpha, ".pdf"), sep="/"))
  plotMA(deseq2.res, alpha = alpha, main = paste('MA plot with alpha =', alpha))
  dev.off()
  svg(paste(output.dir, paste0("MA_alpha", alpha, ".svg"), sep="/"))
  plotMA(deseq2.res, alpha = alpha, main = paste('MA plot with alpha =', alpha))
  dev.off()
}

reportingTools.html <- function(out.dir, dds, deseq2.result, pvalueCutoff, condition1, condition2, annotation_genes, make.plots=TRUE) {
  # Exporting results to HTML and CSV
  if (pvalueCutoff == 1.1){
    shortName <- 'RNAseq_analysis_with_DESeq2_full'
    title <- paste0('RNA-seq analysis of differential expression using DESeq2, no P value cutoff')
  } else {
    shortName <- paste0('RNAseq_analysis_with_DESeq2_p', pvalueCutoff)
    title <- paste0('RNA-seq analysis of differential expression using DESeq2, P value cutoff ', pvalueCutoff)
  }
  if (make.plots == FALSE) {
    dir.create(file.path(paste0(out.sub, '/reports/figures', shortName)), showWarnings = FALSE)
    for ( id in rownames(deseq2.result[ !is.na(deseq2.result$padj) & deseq2.result$padj < pvalueCutoff, ]) ) {
      system(paste0('cp ', out.dir, '/reports/figuresRNAseq_analysis_with_DESeq2_full/boxplot.', id, '.pdf ', out.sub, '/reports/figures', shortName))
      system(paste0('cp ', out.dir, '/reports/figuresRNAseq_analysis_with_DESeq2_full/mini.', id, '.png ', out.sub, '/reports/figures', shortName))
    }
  }
  des2Report <- HTMLReport(shortName=shortName, title=title, basePath=out.dir, reportDirectory="reports/")
  publish(dds, des2Report, pvalueCutoff=pvalueCutoff, annotation.db=NULL, factor=colData(dds)$condition, reportDir=out.dir, n=length(row.names(deseq2.result)), contrast=c("condition",condition1,condition2), make.plots=make.plots)
  finish(des2Report)
  system(paste('./refactor_reportingtools_table.rb', paste0(out.dir, '/reports/', shortName,'.html'), annotation_genes, 'add_plots', pvalueCutoff, sep=" "))
}

plot.pca <- function(out.dir, col.labels, trsf_data, trsf_type, ntop) {
  # calculate the variance for each gene
  rv <- rowVars(assay(trsf_data))
  # select the ntop genes by variance
  select <- order(rv, decreasing=TRUE)[seq_len(min(ntop, length(rv)))]
  # Extract the data 
  X <- t(assay(trsf_data)[select,]) # Transpose this as our read count matrix as R has dimensions as columns and not as rows (thanks, R!!!)

  # Using R's internal function for improved speed (and accuracy as they use SDV)
  # Caution: R will not consider all eigenvectors (there are thousands of genes)
  # Theoretically, we need to calculate ALL of them (we then obtain PC1, PC2, ... PCm with m dimensions = genes)
  # But R will truncate it to PC1, PC2, ... PCn with n data points (if n < m), which is fast.
  # Don't let this confuse you
  pca <- prcomp(X, center = TRUE, scale = FALSE) # default: center = TRUE, scale = FALSE

  # the contribution to the total variance for each component
  percentVar <- pca$sdev^2 / sum( pca$sdev^2 )

  intgroup <- c("condition")
  if (!all(intgroup %in% names(colData(trsf_data)))) {
    stop("the argument 'intgroup' should specify columns of colData(dds)")
  }
  intgroup.df <- as.data.frame(colData(trsf_data)[, intgroup, drop=FALSE])

  # add the intgroup factors together to create a new grouping factor
  group <- if (length(intgroup) > 1) {
    factor(apply( intgroup.df, 1, paste, collapse=":"))
  } else {
    colData(trsf_data)[[intgroup]]
  }

  d <- data.frame(PC1=pca$x[,1], PC2=pca$x[,2], group=group, intgroup.df, name=col.labels)

  ggplot(data=d, aes_string(x="PC1", y="PC2", colour="condition")) +
    geom_point(size=3) + 
    xlab(paste0("PC1: ",round(percentVar[1] * 100),"% variance")) +
    ylab(paste0("PC2: ",round(percentVar[2] * 100),"% variance")) +
    ggtitle(paste("PC1 vs PC2: top ", ntop, " variable genes")) +
    coord_fixed() +
    ggsave(paste(out.dir, paste0("PCA_simple_", trsf_type, "_top", ntop, ".pdf"), sep="/"))
    ggsave(paste(out.dir, paste0("PCA_simple_", trsf_type, "_top", ntop, ".svg"), sep="/"))
}

plot.heatmap.most_var <- function(out.dir, dds, trsf_data, trsf_type, ntop, samples.info=df.samples.info, genes.info=df.gene.anno) {
  select <- order(rowVars(counts(dds,normalized=TRUE)),decreasing=TRUE)
  select <- select[1:min(ntop, length(select))][1:min(ntop, length(select))]
  selected.ids <- row.names(trsf_data[select,])
  if ( length(selected.ids) > 1 ) {
    file <- paste(out.dir, paste0("heatmap_count_matrix_", trsf_type, "_mostVar", ntop, "_row-scaled.pdf"), sep="/")
    pheatmap(assay(trsf_data)[select,], cluster_cols = FALSE, cluster_rows = TRUE,
          scale = "row", border_color = NA,
          labels_row = as.character(genes.info[selected.ids,]$gene_type),
          annotation_col=samples.info[ , !(colnames(samples.info) == 'columns'), drop=FALSE],
          labels_col = as.character(samples.info[colnames(trsf_data),]$columns),
          height = 12, width = 8, file = file)
  } else {
    print('SKIPPING: plot.heatmap.most_var. Only one feature to plot.')
  }
}

plot.heatmap.top_counts <- function(out.dir, dds, trsf_data, trsf_type, ntop, samples.info=df.samples.info, genes.info=df.gene.anno) {
  select <- order(rowMeans(counts(dds,normalized=TRUE)),decreasing=TRUE)
  select <- select[1:min(ntop, length(select))]
  selected.ids <- row.names(counts(dds,normalized=TRUE)[select,])
  if ( length(selected.ids) > 1 ) {
    file <- paste(out.dir, paste0("heatmap_count_matrix_", trsf_type, "_top", ntop, "Counts_row-scaled.pdf"), sep="/")
    pheatmap(assay(trsf_data)[select,], cluster_cols = FALSE, cluster_rows = TRUE,
            scale = "row", border_color = NA,
            labels_row = as.character(genes.info[selected.ids,]$gene_type),
            annotation_col=samples.info[ , !(colnames(samples.info) == 'columns'), drop=FALSE],
            labels_col = as.character(samples.info[colnames(trsf_data),]$columns),
            height = 12, width = 8, file = file)
  } else {
    print('SKIPPING: plot.heatmap.top_counts. Only one feature to plot.')
  }
}

plot.heatmap.top_fc <- function(out.dir, resFold, trsf_data, trsf_type, ntop, pcutoff='', samples.info=df.samples.info, genes.info=df.gene.anno) {
  selected.ids <- row.names(resFold[order(resFold$log2FoldChange, decreasing=TRUE), ])
  selected.ids <- selected.ids[1:min(ntop, length(selected.ids))]
  if ( length(selected.ids) > 1 ) {
    file <- paste(out.dir, paste0("heatmap_count_matrix_", trsf_type, "_top", ntop, "log2FC", pcutoff, "_row-scaled.pdf"), sep="/")
    pheatmap(assay(trsf_data)[selected.ids,], cluster_cols = FALSE, cluster_rows = TRUE, 
            scale = "row", border_color = NA, 
            labels_row = as.character(genes.info[selected.ids,]$gene_type),
            annotation_col=samples.info[ , !(colnames(samples.info) == 'columns'), drop=FALSE],
            labels_col = as.character(samples.info[colnames(trsf_data),]$columns),
            height = 12, width = 8, file = file)
  } else {
    print('SKIPPING: plot.heatmap.top_fc. Only one feature to plot.')
  }
}

piano <- function(out.dir, resFold, mapGO, cpus) {
  mapGO <- mapGO[mapGO[,2]!="",]
  write.table.to.file(mapGO, out.dir, "ENSG_GOterm", row.names = FALSE)

  myGsc <- loadGSC(mapGO)

  myPval <- resFold$padj
  names(myPval) <- rownames(resFold)
  myFC <- resFold$log2FoldChange
  names(myFC) <- rownames(resFold)

  if (cpus >= 10) {
    piano_cpus = 10
  } else {
    piano_cpus = 1
  }

  gene.set.min <- 20
  gene.set.max <- 'inf' # 9999999999999
  gsaRes1 <- runGSA(myFC, geneSetStat="maxmean", gsc=myGsc,
                  gsSizeLim=c(gene.set.min,gene.set.max), ncpus=piano_cpus)
  gsaRes2 <- runGSA(myFC, geneSetStat="gsea", gsc=myGsc,
                  gsSizeLim=c(gene.set.min,gene.set.max), ncpus=piano_cpus)
  gsaRes3 <- runGSA(myFC, geneSetStat="fgsea", gsc=myGsc,
                  gsSizeLim=c(gene.set.min,gene.set.max), ncpus=piano_cpus)
  gsaRes4 <- runGSA(myFC, geneSetStat="page", gsc=myGsc,
                  gsSizeLim=c(gene.set.min,gene.set.max), ncpus=piano_cpus)
  gsaRes5 <- runGSA(myPval, myFC, geneSetStat="fisher", gsc=myGsc,
                  gsSizeLim=c(gene.set.min,gene.set.max), ncpus=piano_cpus)
  gsaRes6 <- runGSA(myPval, myFC, geneSetStat="stouffer", gsc=myGsc,
                  gsSizeLim=c(gene.set.min,gene.set.max), ncpus=piano_cpus)
  gsaRes7 <- runGSA(myPval, myFC, geneSetStat="reporter", gsc=myGsc,
                  gsSizeLim=c(gene.set.min,gene.set.max), ncpus=piano_cpus)
  gsaRes8 <- runGSA(myPval, myFC, geneSetStat="tailStrength", gsc=myGsc,
                  gsSizeLim=c(gene.set.min,gene.set.max), ncpus=piano_cpus)

  resList <- list(gsaRes1,gsaRes2,gsaRes3,gsaRes4,gsaRes5,gsaRes6,gsaRes7,gsaRes8)
  names(resList) <- c("maxmean", "gsea", "fgsea", "page", "fisher", "stouffer", "reporter", "tailStrength")

  try.piano <- try( {
    pdf(paste(out.dir,"/consensus_heatmap.pdf",sep=""), width = 10, height = 10)
    ch <- consensusHeatmap(resList,cutoff=10,method="mean",colorkey=FALSE,cellnote="consensusScore",ncharLabel = 120) ## medianPvalue or consensusScore or nGenes
    dev.off()
    svg(paste(out.dir,"/consensus_heatmap.svg",sep=""), width = 10, height = 10)
    ch <- consensusHeatmap(resList,cutoff=10,method="mean",colorkey=FALSE,cellnote="consensusScore",ncharLabel = 120) ## medianPvalue or consensusScore
    dev.off()

    downregulated_paths <- as.data.frame(ch$pMat[,1][ch$pMat[,1] < 0.05])
    upregulated_paths <- as.data.frame(ch$pMat[,5][ch$pMat[,5] < 0.05])

    write.table.to.file(downregulated_paths, out.dir, "paths_sigdown", col.names=FALSE)
    write.table.to.file(upregulated_paths, out.dir, "paths_sigup", col.names=FALSE)
    }
  )
  if (class(try.piano) == "try-error") {
    print('SKIPPING: piano consensusHeatmap.')
  }

  # for (i in 1:length(resList)){
  #   svg(paste(out.dir, paste0(names(resList)[i], '.svg'), sep='/'), width = 10, height = 10)
  #   networkPlot(resList[[i]], class="non")
  #   dev.off()
  # }
}

##################### TODO
# plot.ma.go <- function(out, deseq2.res, ma.size, results.gene, go.terms, trsf_data, trsf_type) {
#   ## We can also make an MA-plot for the results table in which we raised
#   ## the log2 fold change threshold (Figure below). We can label individual
#   ## points on the MA-plot as well. Here we use the with R function to plot
#   ## a circle and text for a selected row of the results object. Within the
#   ## with function, only the baseMean and log2FoldChange values for the
#   ## selected rows of res are used.
#   ##-----------------------------
#   for (go.term.ma in go.terms) {
#     #go.term.ma <- "GO:0009615"
#     pdf(paste(out,"statistics/ma_",trsf_type,"_", gsub(":", "", go.term.ma), ".pdf",sep=""))
#     plotMA(deseq2.res, main=paste("DESeq2, ", go.term.ma, sep=''), ylim=ma.size)
#     results.gene.GO.ma <- grep(go.term.ma, results.gene$go_id, fixed=TRUE)  ### e.g. GO:0002376, immune system process in mice
#     trsf_data.go.ma <- rownames(assay(trsf_data)[results.gene[results.gene.GO.ma,]$ensembl_gene_id,]) # get the ensembl ids corresponding to this go term
#     for (gene in trsf_data.go.ma) {
#       index = which(ensembl.ids == gene)
#       gene.name <- toString(gene.ids[index])
#       with(deseq2.res[gene, ], {
#         if (gene %in% rownames(resFold05)) {
#           points(baseMean, log2FoldChange, col="dodgerblue", cex=0.8, lwd=2, bg="dodgerblue")
#           text(baseMean, log2FoldChange, gene.name, pos=2, col="dodgerblue")
#         }
#       })
#     }
#     dev.off()
#   }
# }
#####################################################################################
## END FUNCTIONS
#####################################################################################

#####################################################################################
## MAIN 
#####################################################################################

##########################################
## Preparation
##########################################

#####################
## Parse arguments

args <- commandArgs(TRUE) # Read the arguments passed from the command line and assigns them to a vector of characters

## Parse the arguments (in characters) and evaluate them
project_dir <- eval( parse(text=args[1]) )[1] 
samples <- eval( parse(text=args[2]) )
conditions <- eval( parse(text=args[3]) )
col.labels <- eval( parse(text=args[4]) )
levels <- eval( parse(text=args[5]) )
comparisons <- eval( parse(text=args[6]) )
id2name <- eval( parse(text=args[7]) )[1]
annotation_genes <- eval( parse(text=args[8]) )[1]
sources <- eval( parse(text=args[9]) )
species <- eval( parse(text=args[10]) )
regionReport_config  <- eval( parse(text=args[11]) )[1]
regionReport_config <- normalizePath(regionReport_config) # regionReport needs the absolute path
cpus <- eval( parse(text=args[12]) )
id_type <- eval( parse(text=args[13]) )
#go.terms <- c()
#go.terms <- eval( parse(text=args[12]) ) # c("GO:004563","GO:0011231",...)

#####################
## Read in ensembl ids, gene names and biotypes from a tab seperated table
df.gene.anno <- as.data.frame( read.table(id2name, header=FALSE, sep="\t") )
rownames(df.gene.anno) <- df.gene.anno$V1
df.gene.anno$V1 <- NULL
colnames(df.gene.anno) <- c('gene_symbol', 'biotype')
df.gene.anno$gene_type <- paste(df.gene.anno$gene_symbol, str_replace(df.gene.anno$biotype, '_', ' '), sep=', ')

#####################
## Build project structure
out <- paste(project_dir,'/',sep='') # deseq2 dir is created by nextflow in the results dir ()
dir.create(file.path(out), showWarnings = FALSE)
dir.create(file.path(out, 'plots'), showWarnings = FALSE)
dir.create(file.path(out, 'plots/PCA'), showWarnings = FALSE, recursive = TRUE)
dir.create(file.path(out, 'plots/heatmaps'), showWarnings = FALSE, recursive = TRUE)
dir.create(file.path(out, 'data/input'), showWarnings = FALSE, recursive = TRUE)
dir.create(file.path(out, 'data/counts'), showWarnings = FALSE, recursive = TRUE)

##########################################
## DESeq2 stuff
##########################################

#####################
## Create input object
if (length(sources) > 0) {
    sampleTable <- data.frame(sampleName = samples, fileName = samples, condition = conditions, type = col.labels, sources = sources, design = paste(sources, conditions, sep = ':'))
    ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, design= ~ sources + condition)
} else {
    sampleTable <- data.frame(sampleName = samples, fileName = samples, condition = conditions, type = col.labels, design = conditions)
    ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, design= ~ condition)
}
## adjust DESeq2 object to avoid comparision in alphabetical order(!!!1!1):
## order factor() in every level according to input files
ddsHTSeq$condition <- factor(ddsHTSeq$condition, levels=levels)
ddsHTSeq$type <- factor(ddsHTSeq$type, levels=col.labels)

#####################
## Create DESeqDataSet object by running DESeq2 on the input object
dds <- DESeq(ddsHTSeq)

##########################################
## Write input
##########################################

## raw input
if (length(sources) > 0) {
  df.samples.info <- data.frame(samples = samples, columns = col.labels, conditions = conditions, sources = sources)
} else {
  df.samples.info <- data.frame(samples = samples, columns = col.labels, conditions = conditions)
}
rownames(df.samples.info) <- df.samples.info$samples
df.samples.info$samples <- NULL
write.table.to.file(df.samples.info, paste0(out, "/data/input"), "input", col.names=NA)

## DESeq2 input
input.summary <- paste(out, "/data/input/", "DESeq2_input_summary.txt", sep="/") 
cat("Count input object:\n", file=input.summary, append=TRUE)
sink(input.summary, append=TRUE)
print(ddsHTSeq)
sink()
cat("\n\nCondition of count input object:\n", file=input.summary, append=TRUE)
sink(input.summary, append=TRUE)
print(ddsHTSeq$condition)
sink()
cat("\n\nDESeqDataSet object:\n", file=input.summary, append=TRUE)
sink(input.summary, append=TRUE)
print(dds)
sink()

##########################################
## Normalization and transformation
##########################################

#####################
## normalize counts
norm.counts <- counts(dds, normalized=T)

#####################
## write normalized counts and size factors
write.table.to.file(as.data.frame(norm.counts), paste0(out, "/data/counts"), "normalized_counts", col.names=NA)
write.table.to.file(as.data.frame(dds$sizeFactor), paste0(out, "/data/counts"), "sizeFactors", col.names=NA)

#####################
## transform counts
rld <- rlog(dds, blind=FALSE) # regularized log transformation
try.vst <- try(
  vsd <- vst(dds, blind=FALSE) # variance stabilizing transformation (VST)
)
if (class(try.vst) == "try-error") {
  vsd <- varianceStabilizingTransformation(dds, blind=FALSE)
}
ntd <- normTransform(dds) # log2(n + 1) transformation

#####################
## collect transformed counts for easy iterating
transformed.counts = vector(mode="list", length=3)
names(transformed.counts) = c("vsd", "rld", "ntd")
transformed.counts[[1]] <- vsd; transformed.counts[[2]] <- rld; transformed.counts[[3]] <- ntd

#####################
## write transformed counts 
for (i in 1:length(transformed.counts)) {
  write.table.to.file(as.data.frame(assay(transformed.counts[[i]])), paste0(out, "/data/counts"), paste0("transformed_counts_", names(transformed.counts)[[i]]), col.names=NA)
}

##########################################
## Visualisation
##########################################

#####################
## PCA
for (i in 1:length(transformed.counts)) { 
  for (ntop in c(500, 100, 50)){
    plot.pca(paste(out, "plots/PCA/", sep="/"), col.labels, transformed.counts[[i]], names(transformed.counts)[[i]], ntop)
  }
}

#####################
## Heatmaps on counts
for (i in 1:length(transformed.counts)) { 
  for (ntop in c(50, 100)){
    plot.heatmap.top_counts(paste(out, "plots/heatmaps/", sep="/"), dds, transformed.counts[[i]], names(transformed.counts)[[i]], ntop)
  }
}

#####################
## Heatmaps on counts, most variable transformed genes
for (i in 1:length(transformed.counts)) { 
  for (ntop in c(50, 100)){
    plot.heatmap.most_var(paste(out, "plots/heatmaps/", sep="/"), dds, transformed.counts[[i]], names(transformed.counts)[[i]], ntop)
  }
}

##########################################
## BiomaRt object
##########################################
try.biomart <- try(
  if (species == 'mmu'){
    biomart.ensembl <- useMart('ensembl', dataset='mmusculus_gene_ensembl')
  } else if (species == 'hsa') {
    biomart.ensembl <- useMart('ensembl', dataset='hsapiens_gene_ensembl')
  } else if (species == 'mau') {
    biomart.ensembl <- useMart('ensembl', dataset='mauratus_gene_ensembl')
  } else {
    biomart.ensembl <- NA
    print('SKIPPING: BiomaRt. Species not accasible with BiomaRt.')
  }
)
if (class(try.biomart) == "try-error") {
  biomart.ensembl <- NA
  print('SKIPPING: BiomaRt. BiomaRt is not accessible.')
}

#####################################################################################
## PERFORM PAIRWISE COMPARISONS
#####################################################################################
for (comparison in comparisons) {

  l1 <- strsplit(comparison, ':')[[1]][1]
  l2 <- strsplit(comparison, ':')[[1]][2]

  out.sub <- paste(out, l1, '_vs_', l2, '/', sep='')
  build.project.structure(out.sub)

  name <- paste("deseq2_",l1,"_",l2,sep="")

  ##########################################
  ## Adjust data, count data, levles and valiables to current pairwise comparison
  ##########################################

  dds.sub <- dds[ , dds$condition %in% c(l1, l2) ]
  dds.sub$condition <- droplevels(dds.sub$condition)
  dds.sub$type <- droplevels(dds.sub$type)

  ## transformed count data
  rld.sub <- rld[ , rld$condition %in% c(l1, l2) ]
  rld.sub$condition <- droplevels(rld.sub$condition)
  rld.sub$type <- droplevels(rld.sub$type)

  vsd.sub <- vsd[ , vsd$condition %in% c(l1, l2) ]
  vsd.sub$condition <- droplevels(vsd.sub$condition)
  vsd.sub$type <- droplevels(vsd.sub$type)

  ntd.sub <- ntd[ , ntd$condition %in% c(l1, l2) ]
  ntd.sub$condition <- droplevels(ntd.sub$condition)
  ntd.sub$type <- droplevels(ntd.sub$type)

  transformed.counts.sub = vector(mode="list", length=3)
  names(transformed.counts.sub) = c("vsd", "rld", "ntd")
  transformed.counts.sub[[1]] <- vsd.sub; transformed.counts.sub[[2]] <- rld.sub; transformed.counts.sub[[3]] <- ntd.sub

  ## adjust variabels
  conditions.sub <- c()
  col.labels.sub <- c()
  samples.sub <- c()
  levels.sub <- c(l1, l2)
  for (pos in which(conditions == l1)) {
    conditions.sub <- c(conditions.sub, conditions[pos])
    col.labels.sub <- c(col.labels.sub, col.labels[pos])
    samples.sub <- c(samples.sub, samples[pos])
  }
  for (pos in which(conditions == l2)) {
    conditions.sub <- c(conditions.sub, conditions[pos])
    col.labels.sub <- c(col.labels.sub, col.labels[pos])
    samples.sub <- c(samples.sub, samples[pos])
  }

  ##########################################
  ## Perform the pairwise comparison 
  ## code inspiration: https://github.com/acidgenomics/DESeqAnalysis/blob/master/R/apeglmResults-methods.R#L95
  ##########################################

  factor <- "condition"
  numerator <- l1
  denominator <- l2
  group <- colData(dds)[[factor]]
  group <- relevel(x = group, ref = denominator)
  colData(dds)[[factor]] <- group
  dds <- DESeq(dds) # nbinomWaldTest() via DESeq(), but the dispersion does not have to be estimated again | was not done before
  resultsNames <- resultsNames(dds)
  coef <- match(
    x = paste(factor, numerator, "vs", denominator, sep = "_"),
    table = resultsNames )
  deseq2.res <- lfcShrink(
        dds = dds,
        type = "apeglm",
        coef = coef
  )

  ##########################################
  ## Order and filter output
  ##########################################

  ## ordered by smallest adjusted p value
  resOrdered <- deseq2.res[order(deseq2.res$padj), ]

  ## filter NA values in fc and padj
  resNA = deseq2.res[ !is.na(deseq2.res$log2FoldChange) , ]
  resNA = resNA[ !is.na(resNA$padj) , ]

  ## filter 0 baseMean
  resBaseMean = resNA[ resNA$baseMean > 0.0 , ]

  ## resFold is now sorted by abs(foldchange) and all NA entries are removed as well as all zero baseMean values
  resFold <<-resBaseMean[rev(order(abs(resBaseMean$log2FoldChange))),]

  ## filter for specific adjusted P value
  resFold05 <<- resFold[ resFold$padj < 0.05 , ]
  resFold01 <<- resFold[ resFold$padj < 0.01 , ]

  ##########################################
  ## Write input and output
  ##########################################

  #####################
  ## input
  df.samples.info.sub <- df.samples.info[samples.sub,]
  write.table.to.file(df.samples.info.sub,  paste0(out.sub, "/input"), "input", col.names=NA)

  #####################
  ## DESeq2 results
  out.sub.output.dir <- paste0(out.sub, "/results/")

  write.table.to.file(as.data.frame(resOrdered), out.sub.output.dir, paste(name, "full", sep="_"), id2name, col.names=NA)
  write.table.to.file(as.data.frame(resFold), out.sub.output.dir, paste(name, "filtered_NA", sep="_"), id2name, col.names=NA)
  write.table.to.file(as.data.frame(resFold05), out.sub.output.dir, paste(name, "filtered_padj_0.05", sep="_"), id2name, col.names=NA)
  write.table.to.file(as.data.frame(resFold01), out.sub.output.dir, paste(name, "filtered_padj_0.01", sep="_"), id2name, col.names=NA)

  #####################
  ## DESeq2 results summary
  summary <- paste(out.sub.output.dir,"summary.txt",sep="/")
  sink(summary)
  summary(deseq2.res)
  sink()
  cat("#deseq2.res$padj < 0.1:\nFALSE\tTRUE\n", file=summary, append=TRUE)
  cat(table(deseq2.res$padj < 0.1), file=summary, append=TRUE)
  cat("\n\n", file=summary, append=TRUE)
  cat("#deseq2.res$padj < 0.05:\nFALSE\tTRUE\n", file=summary, append=TRUE)
  cat(table(deseq2.res$padj < 0.05), file=summary, append=TRUE)
  cat("\n\n", file=summary, append=TRUE)
  cat("#deseq2.res$padj < 0.01:\nFALSE\tTRUE\n", file=summary, append=TRUE)
  cat(table(deseq2.res$padj < 0.01), file=summary, append=TRUE)
  cat("\n", file=summary, append=TRUE)

  ##########################################
  ## Plots
  ##########################################

  #####################
  ## Volcano plot
  deseq2.res.anno <- merge(as.data.frame(deseq2.res), df.gene.anno, by=0)
  rownames(deseq2.res.anno) <- deseq2.res.anno$Row.names
  volcano = EnhancedVolcano(deseq2.res.anno, lab = deseq2.res.anno$gene_symbol, x = 'log2FoldChange', y = 'padj', 
    legendLabels = c('NS', expression(Log[2]~FC), "adj. p-value", expression(adj.~p-value~and~log[2]~FC)))
  volcano + 
    ggsave(paste(out.sub,"/plots/volcano/volcano.svg", sep='/')) +
    ggsave(paste(out.sub,"/plots/volcano/volcano.pdf", sep='/'))

  #####################
  ## MA plots
  plot.ma(paste(out.sub, "/plots/MA/", sep="/"), deseq2.res, metadata(deseq2.res)$alpha)
  plot.ma(paste(out.sub, "/plots/MA/", sep="/"), deseq2.res, metadata(deseq2.res)$alpha / 2)

  #####################
  ## Heatmaps of sample2sample distances
  for (i in 1:length(transformed.counts.sub)) {
    plot.sample2sample(paste(out.sub, "/plots/sample2sample/", sep="/"), col.labels.sub, 
      transformed.counts.sub[[i]], names(transformed.counts.sub)[[i]], colorRampPalette( rev(brewer.pal(9, "Blues")) )(255))
  }

  #####################
  ## Heatmaps on counts, most variable transformed genes
  for (i in 1:length(transformed.counts.sub)) { 
    for (ntop in c(50, 100)){
      plot.heatmap.most_var(paste(out.sub, "plots/heatmaps/", sep="/"), dds.sub, transformed.counts.sub[[i]], names(transformed.counts.sub)[[i]], ntop)
    }
  }

  #####################
  ## Heatmaps on counts, top count genes
  for (i in 1:length(transformed.counts.sub)) { 
    for (ntop in c(50, 100)){
      plot.heatmap.top_counts(paste(out.sub, "plots/heatmaps/", sep="/"), dds.sub, transformed.counts.sub[[i]], names(transformed.counts.sub)[[i]], ntop)
    }
  }

  #####################
  ## Heatmaps on counts, top FC genes
  for (i in 1:length(transformed.counts.sub)) {
    for (ntop in c(50, 100)){
      plot.heatmap.top_fc(paste(out.sub, "plots/heatmaps/", sep="/"), resFold, transformed.counts.sub[[i]], names(transformed.counts.sub)[[i]], ntop)
      plot.heatmap.top_fc(paste(out.sub, "plots/heatmaps/", sep="/"), resFold05, transformed.counts.sub[[i]], names(transformed.counts.sub)[[i]], ntop, 'pcutoff0-05')
      plot.heatmap.top_fc(paste(out.sub, "plots/heatmaps/", sep="/"), resFold01, transformed.counts.sub[[i]], names(transformed.counts.sub)[[i]], ntop, 'pcutoff0-01')
    }
  }

  #####################
  ## PCA
  for (i in 1:length(transformed.counts.sub)) {
    for (ntop in c(500, 100, 50)) {
      plot.pca(paste(out.sub, "/plots/PCA/", sep="/"), col.labels.sub, transformed.counts.sub[[i]], names(transformed.counts.sub)[[i]], ntop)
    }
  }

  ##########################################
  ## Further analysis
  ##########################################

  #####################
  ## MA plots with genes colored by GO terms
  # go.terms <- eval( c("GO:004563","GO:0011231") )
  # if ( ! is.na(biomart.ensembl)  && length(go.terms) > 0) {
  #   results.gene <- getBM(attributes = c("ensembl_gene_id", "external_gene_name", "go_id","name_1006"), filters = "ensembl_gene_id", values = rownames(deseq2.res), mart=biomart.ensembl)
  #   #   plot.ma.go(out.sub, deseq2.res, ma.size, results.gene, go.terms, transformed.counts.sub[[i]], names(transformed.counts.sub)[[i]])
  # }

  #####################
  ## Piano
  if ( ! is.na(biomart.ensembl) ) {
    dir.create(file.path(out.sub, '/downstream_analysis/piano'), showWarnings = FALSE, recursive = TRUE)
    if (any(grepl(id_type, listAttributes(biomart.ensembl)$name, fixed=TRUE))){
      results.gene <- getBM(attributes =  c(id_type, "name_1006"), filters = id_type, values = rownames(resFold05), mart=biomart.ensembl)
      if ( length(row.names(results.gene)) > 0 ) {
        try.piano <- try( 
          piano(paste(out.sub, 'downstream_analysis', 'piano', sep='/'), resFold05, results.gene, cpus)
        )
        if (class(try.piano) == "try-error") {
          print ('SKIPPING: Piano. Some error occurred.')
        }
      } else {
        print(paste('SKIPPING: Piano. No matching feature IDs with type', id_type, 'found.'))
      }
    } else {
      print(paste('SKIPPING: Piano. Feature ID type', id_type, 'not supported by biomaRt.'))
    }
  }

  #####################
  ## Webgestalt
  if ( species == 'hsa' ){
    organism <- "hsapiens"
  } else if (species == 'mmu') {
    organism <- "mmusculus"
  } else {
    organism <- NA
  }
  if (! is.na(organism)) {
    dir.create(file.path(out.sub, '/downstream_analysis/WebGestalt'), showWarnings = FALSE, recursive = TRUE)
    interestGene <- as.data.frame(resFold05)[, 'log2FoldChange', drop=FALSE]
    interestGene$id <- rownames(interestGene)
    rownames(interestGene) <- NULL
    colnames(interestGene) <- NULL
    interestGene <- interestGene[c(2,1)]
    webgestalt.out.dir <- paste(out.sub, "downstream_analysis", "WebGestalt", sep='/')
    if (any(grepl(id_type, listIdType(), fixed=TRUE))) {
      try.webgestalt <- try(
        for (enrDB in c("geneontology_Biological_Process_noRedundant", "pathway_KEGG")){
          enrichResult <- WebGestaltR(enrichMethod="GSEA", organism=organism, enrichDatabase=enrDB, interestGene=interestGene, interestGeneType=id_type, collapseMethod="mean", minNum=10, maxNum=500, fdrMethod="BH", sigMethod="fdr", fdrThr=0.01, topThr=10, perNum=1000, isOutput=TRUE, outputDirectory=webgestalt.out.dir, projectName=paste0(l1, '_vs_', l2))
        }
      )
      if (class(try.webgestalt) == "try-error") {
        print('SKIPPING: WebGestaltR. The number of annotated IDs for all functional categories are not from 10 to 500 for the GSEA enrichment method.')
      }
    } else {
      print(paste('SKIPPING: WebGestaltR. Feature ID', id_type, 'not supported.'))
    }
  }

  ##########################################
  ## Reports
  ##########################################

  #####################
  ## regionReport report
  ## needs knitr version 1.29. Some bug seems not to be fixed in 1.30 Anaconda version
  ## set output
  report.project.name <- paste(l1, "vs", l2, sep=" ")
  report.dir <- paste(out.sub, "reports", sep="/")
  report.output <- paste0('DESeq2_results_exploration')

  ## create html
  report_html <- DESeq2Report(dds, project = report.project.name,
    intgroup = c('condition', 'type'), res = deseq2.res, template = regionReport_config,
    outdir = report.dir, output = report.output, theme = theme_bw())

  ## and also pfd
  try( report_pdf <- DESeq2Report(dds, project = report.project.name,
    intgroup = c('condition', 'type'), res = deseq2.res, template = regionReport_config,
    outdir = report.dir, output = report.output, theme = theme_bw(),
    output_format = 'pdf_document', device = 'pdf') )

  #####################
  ## ReportingTools
  reportingTools.html(out.sub, dds, deseq2.res, 1.1, l1, l2, annotation_genes)
  if (length(rownames(resFold05)) > 0) {
    reportingTools.html(out.sub, dds, deseq2.res, 0.05, l1, l2, annotation_genes, make.plots=FALSE)
  }
  if (length(rownames(resFold01)) > 0) {
    reportingTools.html(out.sub, dds, deseq2.res, 0.01, l1, l2, annotation_genes, make.plots=FALSE)
  }

}
#####################################################################################
## END PAIRWISE COMPARISONS
#####################################################################################
fischer-hub commented 2 years ago

Hi @Gavin098, great! It looks like DESeq2 is running normally now. What error do you see on the terminal when the pipeline crashes?

In the project directory (where you can also find the main.nf and README.md files) there should be a file called .nextflow.log. If you could attach that file we could see what caused the crash. The deseq2.Rout file is only the log file of the deseq2 process of the pipeline. But this does not seem to be the issue anymore.

The -r option allows users to define a release version to run. You can omit the parameter to run the last committed version of the pipeline (might be unstable) or define a stable release version, for example: -r 1.3.0 (latest release).

Have a nice day!

Gavin098 commented 2 years ago

ok, thanks for your feedback 1: I opened the project directory,but did not find .nextflow.log

image

2: When I add -r 1.3.0 the following is displayed, what's the problem?

N E X T F L O W  ~  version 21.10.6
Revision option cannot be used running a script

3: Please see the following for the entire result shown in the shell.

N E X T F L O W  ~  version 21.10.6
Launching `rnaflow/main.nf` [pedantic_torricelli] - revision: 4a512a6ad6

Profile: local,conda

Current User: ganjunwei
Nextflow-version: 21.10.6
Starting time: 2022-05-11T22:02:07.146+08:00
Workdir location:
  /home/ganjunwei/test_rnaflow/work
Launchdir location:
  /home/ganjunwei/test_rnaflow
Permanent cache directory:
  nextflow-autodownload-databases
Conda cache directory:
  conda
Configuration files:
  [/home/ganjunwei/test_rnaflow/rnaflow/nextflow.config]
Cmd line:
  nextflow run rnaflow --reads input.csv --genome fastas.csv --annotation gtfs.csv --cores 40 -profile local,conda --output result2 -resume

CPUs to use: 40, maximal CPUs to use: 144

WARNING: not a stable execution. Please use -r for full reproducibility.

WARNING: Output folder already exists. Results might be overwritten! You can adjust the output folder via [--output]

R N A F L O W : R N A - S E Q  A S S E M B L Y  &  D I F F E R E N T I A L  G E N E  E X P R E S S I O N  A N A L Y S I S
= = = = = = =   = = = = = = =  = = = = = = = =  =  = = = = = = = = = = = =  = = = =  = = = = = = = = = =  = = = = = = = =
Output path:                    result2
Strandedness                    unstranded  
Read mode:                      paired-end  
TPM threshold:                  1
Comparisons:                    all
Nanopore mode:                  false

executor >  local (3)
[30/aafeca] process > concat_genome                                          [100%] 1 of 1, cached: 1 ✔
[be/2234df] process > concat_annotation                                      [100%] 1 of 1, cached: 1 ✔
executor >  local (3)
[30/aafeca] process > concat_genome                                          [100%] 1 of 1, cached: 1 ✔
[be/2234df] process > concat_annotation                                      [100%] 1 of 1, cached: 1 ✔
[skipped  ] process > download_sortmerna:sortmernaGet                        [100%] 1 of 1, stored: 1 ✔
[5c/52bf9e] process > preprocess_illumina:fastqcPre (RD_REP1)                [100%] 6 of 6, cached: 6 ✔
[e0/62dcc5] process > preprocess_illumina:fastp (SD_REP3)                    [100%] 6 of 6, cached: 6 ✔
[62/33b98a] process > preprocess_illumina:fastqcPost (SD_REP3)               [100%] 6 of 6, cached: 6 ✔
[bf/06c4a6] process > preprocess_illumina:extract_tar_bz2                    [100%] 1 of 1, cached: 1 ✔
[60/600481] process > preprocess_illumina:sortmerna (RD_REP2)                [100%] 6 of 6, cached: 6 ✔
[2a/5f9e26] process > preprocess_illumina:hisat2index                        [100%] 1 of 1, cached: 1 ✔
[cf/9ffe3d] process > preprocess_illumina:hisat2 (RD_REP3)                   [100%] 6 of 6, cached: 6 ✔
[a3/08cd2f] process > preprocess_illumina:index_bam (SD_REP3)                [100%] 6 of 6, cached: 6 ✔
[c0/33ff71] process > expression_reference_based:featurecounts (RD_REP1)     [100%] 6 of 6, cached: 6 ✔
[9a/03a557] process > expression_reference_based:format_annotation_gene_rows [100%] 1 of 1, cached: 1 ✔
[64/741f66] process > expression_reference_based:format_annotation           [100%] 1 of 1, cached: 1 ✔
[9a/0eea5e] process > expression_reference_based:tpm_filter                  [100%] 1 of 1, cached: 1 ✔
[c2/73965f] process > expression_reference_based:deseq2 (1)                  [100%] 2 of 2, failed: 2, retries: 1 ✘
[3b/68e361] process > expression_reference_based:multiqc_sample_names (1)    [100%] 1 of 1, cached: 1 ✔
[e4/f8053c] process > expression_reference_based:multiqc (1)                 [100%] 1 of 1 ✔
[2c/dfdc74] NOTE: Process `expression_reference_based:deseq2 (1)` terminated with an error exit status (1) -- Execution is retried (1)
Error executing process > 'expression_reference_based:deseq2 (1)'

Caused by:
  Process `expression_reference_based:deseq2 (1)` terminated with an error exit status (1)
                                                                                                                                                                                 Command executed:                                                                                                                                                                                                                                                                                                                                                   R CMD BATCH --no-save --no-restore '--args c(".") c("RD_REP1.counts.filtered.formated.tsv","RD_REP2.counts.filtered.formated.tsv","RD_REP3.counts.filtered.formated.tsv","SD_REP1.counts.filtered.formated.tsv","SD_REP2.counts.filtered.formated.tsv","SD_REP3.counts.filtered.formated.tsv") c("mock","mock","mock","treated","treated","treated") c("RD_REP1","RD_REP2","RD_REP3","SD_REP1","SD_REP2","SD_REP3") c("mock","treated") c("mock:treated") c("annotation.id2details") c("annotation.gene.gtf") c("A","B","C","A","B","C") c("") c("regionReport_DESeq2Exploration_custom.Rmd") c(40) c("ensembl_gene_id")' deseq2.R

Command exit status:
  1

Command output:
  (empty)

Work dir:
  /home/ganjunwei/test_rnaflow/work/c2/73965f896be4e3a6a580d917cd4cc5

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
fischer-hub commented 2 years ago

Hi @Gavin098 !

  1. That is really weird, the log files should usually be in the project directory, but the output would be similar to the one on the terminal so don't worry too much about it I would say.

  2. Yes the release parameter -r is used when one pulls the pipeline via nextflow and not manually like so:

    nextflow pull hoelzer-lab/rnaflow # maybe run 2x if the next command does not work
    nextflow run -r 1.3.0 hoelzer-lab/rnaflow -profile test,conda,local

    Like that you can have different versions of the same pipeline on your machine and select which version to run. But it seems you are running the main.nf script directly so this doesn't apply to your case. Sorry for the confusion.

  3. Okay thanks! That is quite helpful, so the pipeline indeed still crashes at the deseq2 process. Also, I've just noticed that the deseq2.Rout you posted earlier contains only the code from the deseq2.R script.


library("DESeq2")
library("RColorBrewer")
library("gplots")
library("ggplot2")
library("ReportingTools")
library("pheatmap")
library("biomaRt")
library("svglite")
library("piano")
library("apeglm")
library("EnhancedVolcano")
library("regionReport")
library("stringr")
library("WebGestaltR")
library("snowfall")
library("openxlsx")

#####################################################################################
## FUNCTIONS
#####################################################################################
build.project.structure <- function(out.dir) {
  dir.create(file.path(out.dir), showWarnings = FALSE)

Is it possible that this was a mistake? Because if the pipeline still crashes in the deseq2 process the real deseq2.Rout would tell us again why it's crashing. Like last time when we figured out a package was not available in the script. So it would be helpful if you could attach the real output of the deseq2.Rout log file so we can see whats going on.

Thanks in advance!

Gavin098 commented 2 years ago

Sorry I didn't provide full information before. It seems that there is still a problem with R, and I tried to update the software in the conda environment to run it again, but still did not solve the problem nextflow run rnaflow --reads input.csv --genome fastas.csv --annotation gtfs.csv --cores 40 --permanentCacheDir ../test_rnaflow/nextflow-autodownload-databases -profile local,conda

R N A F L O W : R N A - S E Q  A S S E M B L Y  &  D I F F E R E N T I A L  G E N E  E X P R E S S I O N  A N A L Y S I S
= = = = = = =   = = = = = = =  = = = = = = = =  =  = = = = = = = = = = = =  = = = =  = = = = = = = = = =  = = = = = = = =
Output path:                    results
Strandedness                    unstranded
Read mode:                      paired-end
TPM threshold:                  1
Comparisons:                    all 
Nanopore mode:                  false

executor >  local (2)
[fa/01559f] process > concat_genome                                          [100%] 1 of 1, cached: 1 ✔
[43/afa732] process > concat_annotation                                      [100%] 1 of 1, cached: 1 ✔
executor >  local (2)
[fa/01559f] process > concat_genome                                          [100%] 1 of 1, cached: 1 ✔
[43/afa732] process > concat_annotation                                      [100%] 1 of 1, cached: 1 ✔
[skipped  ] process > download_sortmerna:sortmernaGet                        [100%] 1 of 1, stored: 1 ✔
[cc/fd8d31] process > preprocess_illumina:fastqcPre (SD_REP2)                [100%] 6 of 6, cached: 6 ✔
[db/17532c] process > preprocess_illumina:fastp (SD_REP1)                    [100%] 6 of 6, cached: 6 ✔
[58/4a77ce] process > preprocess_illumina:fastqcPost (SD_REP1)               [100%] 6 of 6, cached: 6 ✔
[3b/69ab09] process > preprocess_illumina:extract_tar_bz2                    [100%] 1 of 1, cached: 1 ✔
[d9/e39b1c] process > preprocess_illumina:sortmerna (SD_REP1)                [100%] 6 of 6, cached: 6 ✔
[fe/7ae755] process > preprocess_illumina:hisat2index                        [100%] 1 of 1, cached: 1 ✔
[13/090786] process > preprocess_illumina:hisat2 (SD_REP1)                   [100%] 6 of 6, cached: 6 ✔
[83/ee5519] process > preprocess_illumina:index_bam (SD_REP1)                [100%] 6 of 6, cached: 6 ✔
[12/9bd3ac] process > expression_reference_based:featurecounts (RD_REP2)     [100%] 6 of 6, cached: 6 ✔
[b3/5a7990] process > expression_reference_based:format_annotation_gene_rows [100%] 1 of 1, cached: 1 ✔
[2e/e0bf9f] process > expression_reference_based:format_annotation           [100%] 1 of 1, cached: 1 ✔
[b7/c84cda] process > expression_reference_based:tpm_filter                  [100%] 1 of 1, cached: 1 ✔
[c9/d64fbe] process > expression_reference_based:deseq2 (1)                  [100%] 2 of 2, failed: 2, retries: 1 ✘
[12/d5f4c9] process > expression_reference_based:multiqc_sample_names (1)    [100%] 1 of 1, cached: 1 ✔
[d6/91686b] process > expression_reference_based:multiqc (1)                 [100%] 1 of 1, cached: 1 ✔
Error executing process > 'expression_reference_based:deseq2 (1)'

Caused by:
  Process `expression_reference_based:deseq2 (1)` terminated with an error exit status (1)

Command executed:  R CMD BATCH --no-save --no-restore '--args c(".") c("RD_REP1.counts.filtered.formated.tsv","RD_REP2.counts.filtered.formated.tsv","RD_REP3.counts.filtered.formated.tsv","SD_REP1.counts.filtered.formated.tsv","SD_REP2.counts.filtered.formated.tsv","SD_REP3.counts.filtered.formated.tsv") c("mock","mock","mock","treated","treated","treated") c("RD_REP1","RD_REP2","RD_REP3","SD_REP1","SD_REP2","SD_REP3") c("mock","treated") c("mock:treated") c("annotation.id2details") c("annotation.gene.gtf") c("A","B","C","A","B","C") c("") c("regionReport_DESeq2Exploration_custom.Rmd") c(40) c("ensembl_gene_id")' deseq2.R

Command exit status:
  1

Command output:
  (empty)

Work dir:
  /home/ganjunwei/test_rnaflow6/work/c9/d64fbe214fd121640f0b2eb36349a3

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

deseq2.Rout

R version 4.0.1 (2020-06-06) -- "See Things Now"
Copyright (C) 2020 The R Foundation for Statistical Computing
Platform: x86_64-conda_cos6-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library("DESeq2")
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    union, unique, unsplit, which, which.max, which.min

Attaching package: ‘S4Vectors’

The following object is masked from ‘package:base’:

    expand.grid

Loading required package: IRanges
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: DelayedArray
Loading required package: matrixStats

Attaching package: ‘matrixStats’

The following objects are masked from ‘package:Biobase’:

    anyMissing, rowMedians

Attaching package: ‘DelayedArray’

The following objects are masked from ‘package:matrixStats’:

    colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges

The following objects are masked from ‘package:base’:

    aperm, apply, rowsum

Warning message:
package ‘matrixStats’ was built under R version 4.0.2 
> library("RColorBrewer")
> library("gplots")

Attaching package: ‘gplots’

The following object is masked from ‘package:IRanges’:

    space

The following object is masked from ‘package:S4Vectors’:

    space

The following object is masked from ‘package:stats’:

    lowess

Warning message:
package ‘gplots’ was built under R version 4.0.2 
> library("ggplot2")
> library("ReportingTools")
Loading required package: knitr

Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

> library("pheatmap")
> library("biomaRt")
> library("svglite")
Warning message:
package 'svglite' was built under R version 4.0.2 
> library("piano")
Registered S3 method overwritten by 'sets':
  method        from   
  print.element ggplot2
> library("apeglm")
> library("EnhancedVolcano")
Loading required package: ggrepel
> library("regionReport")
> library("stringr")
> library("WebGestaltR")
******************************************

*                                        *

*          Welcome to WebGestaltR !      *

*                                        *

******************************************

Warning message:
package 'WebGestaltR' was built under R version 4.0.2 
> library("snowfall")
Loading required package: snow

Attaching package: 'snow'

The following objects are masked from 'package:BiocGenerics':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, clusterSplit, parApply, parCapply,
    parLapply, parRapply, parSapply

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, clusterSplit, makeCluster, parApply,
    parCapply, parLapply, parRapply, parSapply, splitIndices,
    stopCluster

> library("openxlsx")
Warning message:
package 'openxlsx' was built under R version 4.0.2 
> 
> 
> #####################################################################################
> ## FUNCTIONS
> #####################################################################################
> build.project.structure <- function(out.dir) {
+   dir.create(file.path(out.dir), showWarnings = FALSE)
+   # Build necessary project structure
+   dir.create(file.path(out.dir, '/results'), showWarnings = FALSE)
+   dir.create(file.path(out.dir, '/reports'), showWarnings = FALSE)
+   dir.create(file.path(out.dir, '/input'), showWarnings = FALSE)
+   for (plot.type in c('volcano', 'PCA', 'heatmaps', 'MA', 'sample2sample')) {
+     dir.create(file.path(out.dir, paste0('/plots/', plot.type)), showWarnings = FALSE, recursive = TRUE)
+   }
+ }
> 
> write.table.to.file <- function(as.data.frame.object, output.path, output.name, id2name, row.names=TRUE, col.names=TRUE) {
+   output.file.basename <- paste0(output.path, "/", output.name)
+   write.table(as.data.frame.object, file=paste0(output.file.basename, ".csv"), sep = ",", row.names=row.names, col.names=col.names)
+   if( is.na(col.names) ){
+     write.xlsx(as.data.frame.object, file=paste0(output.file.basename, ".xlsx"), row.names=row.names, col.names=TRUE, asTable=TRUE)
+   } else {
+     write.xlsx(as.data.frame.object, file=paste0(output.file.basename, ".xlsx"), row.names=row.names, col.names=col.names, asTable=TRUE)
+   }
+ 
+   if ( !missing(id2name)) {
+     output.file.basename.extended <- paste0(output.path, "/", output.name, "_extended")
+     ## add real gene names and biotypes to the csv files
+     system(paste("./improve_deseq_table.rb", paste0(output.file.basename.extended, ".csv" ), paste0(output.file.basename, ".csv"), id2name, sep=" "), wait=TRUE)
+     write.xlsx(read.csv(paste0(output.file.basename.extended, ".csv" )), paste0(output.file.basename.extended, ".xlsx" ), asTable=TRUE)
+   }
+ }
> 
> plot.sample2sample <- function(out.dir, col.labels, trsf_data, trsf_type, colors) {
+   ## get sample-to-sample distances
+   sampleDists <- dist(t(assay(trsf_data)))
+   sampleDistMatrix <- as.matrix(sampleDists)
+   ## add names
+   rownames(sampleDistMatrix) <- with(colData(trsf_data), col.labels)
+   colnames(sampleDistMatrix) <- with(colData(trsf_data), col.labels)
+ 
+ 
+   pdf(paste(out.dir, paste0("sample2sample_", trsf_type, ".pdf"), sep="/"))
+   pheatmap(sampleDistMatrix, clustering_distance_rows = sampleDists, clustering_distance_cols = sampleDists, color = colors)
+   dev.off()
+   svg(paste(out.dir, paste0("sample2sample_", trsf_type, ".svg"), sep="/"))
+   pheatmap(sampleDistMatrix, clustering_distance_rows = sampleDists, clustering_distance_cols = sampleDists, color = colors)
+   dev.off()
+ 
+   # sample2sample heatmap with color key and histogram
+   # hc <- hclust(sampleDists)
+   # heatmap.2(sampleDistMatrix, Rowv=as.dendrogram(hc), symm=TRUE, trace="none", col = colors, margin=c(13, 13))
+ }
> 
> plot.ma <- function(output.dir, deseq2.res, alpha) {
+   pdf(paste(output.dir, paste0("MA_alpha", alpha, ".pdf"), sep="/"))
+   plotMA(deseq2.res, alpha = alpha, main = paste('MA plot with alpha =', alpha))
+   dev.off()
+   svg(paste(output.dir, paste0("MA_alpha", alpha, ".svg"), sep="/"))
+   plotMA(deseq2.res, alpha = alpha, main = paste('MA plot with alpha =', alpha))
+   dev.off()
+ }
> 
> reportingTools.html <- function(out.dir, dds, deseq2.result, pvalueCutoff, condition1, condition2, annotation_genes, make.plots=TRUE) {
+   # Exporting results to HTML and CSV
+   if (pvalueCutoff == 1.1){
+     shortName <- 'RNAseq_analysis_with_DESeq2_full'
+     title <- paste0('RNA-seq analysis of differential expression using DESeq2, no P value cutoff')
+   } else {
+     shortName <- paste0('RNAseq_analysis_with_DESeq2_p', pvalueCutoff)
+     title <- paste0('RNA-seq analysis of differential expression using DESeq2, P value cutoff ', pvalueCutoff)
+   }
+   if (make.plots == FALSE) {
+     dir.create(file.path(paste0(out.sub, '/reports/figures', shortName)), showWarnings = FALSE)
+     for ( id in rownames(deseq2.result[ !is.na(deseq2.result$padj) & deseq2.result$padj < pvalueCutoff, ]) ) {
+       system(paste0('cp ', out.dir, '/reports/figuresRNAseq_analysis_with_DESeq2_full/boxplot.', id, '.pdf ', out.sub, '/reports/figures', shortName))
+       system(paste0('cp ', out.dir, '/reports/figuresRNAseq_analysis_with_DESeq2_full/mini.', id, '.png ', out.sub, '/reports/figures', shortName))
+     }
+   }
+   des2Report <- HTMLReport(shortName=shortName, title=title, basePath=out.dir, reportDirectory="reports/")
+   publish(dds, des2Report, pvalueCutoff=pvalueCutoff, annotation.db=NULL, factor=colData(dds)$condition, reportDir=out.dir, n=length(row.names(deseq2.result)), contrast=c("condition",condition1,condition2), make.plots=make.plots)
+   finish(des2Report)
+   system(paste('./refactor_reportingtools_table.rb', paste0(out.dir, '/reports/', shortName,'.html'), annotation_genes, 'add_plots', pvalueCutoff, sep=" "))
+ }
> 
> plot.pca <- function(out.dir, col.labels, trsf_data, trsf_type, ntop) {
+   # calculate the variance for each gene
+   rv <- rowVars(assay(trsf_data))
+   # select the ntop genes by variance
+   select <- order(rv, decreasing=TRUE)[seq_len(min(ntop, length(rv)))]
+   # Extract the data 
+   X <- t(assay(trsf_data)[select,]) # Transpose this as our read count matrix as R has dimensions as columns and not as rows (thanks, R!!!)
+   
+   # Using R's internal function for improved speed (and accuracy as they use SDV)
+   # Caution: R will not consider all eigenvectors (there are thousands of genes)
+   # Theoretically, we need to calculate ALL of them (we then obtain PC1, PC2, ... PCm with m dimensions = genes)
+   # But R will truncate it to PC1, PC2, ... PCn with n data points (if n < m), which is fast.
+   # Don't let this confuse you
+   pca <- prcomp(X, center = TRUE, scale = FALSE) # default: center = TRUE, scale = FALSE
+   
+   # the contribution to the total variance for each component
+   percentVar <- pca$sdev^2 / sum( pca$sdev^2 )
+   
+   intgroup <- c("condition")
+   if (!all(intgroup %in% names(colData(trsf_data)))) {
+     stop("the argument 'intgroup' should specify columns of colData(dds)")
+   }
+   intgroup.df <- as.data.frame(colData(trsf_data)[, intgroup, drop=FALSE])
+ 
+   # add the intgroup factors together to create a new grouping factor
+   group <- if (length(intgroup) > 1) {
+     factor(apply( intgroup.df, 1, paste, collapse=":"))
+   } else {
+     colData(trsf_data)[[intgroup]]
+   }
+ 
+   d <- data.frame(PC1=pca$x[,1], PC2=pca$x[,2], group=group, intgroup.df, name=col.labels)
+ 
+   ggplot(data=d, aes_string(x="PC1", y="PC2", colour="condition")) +
+     geom_point(size=3) + 
+     xlab(paste0("PC1: ",round(percentVar[1] * 100),"% variance")) +
+     ylab(paste0("PC2: ",round(percentVar[2] * 100),"% variance")) +
+     ggtitle(paste("PC1 vs PC2: top ", ntop, " variable genes")) +
+     coord_fixed() +
+     ggsave(paste(out.dir, paste0("PCA_simple_", trsf_type, "_top", ntop, ".pdf"), sep="/"))
+     ggsave(paste(out.dir, paste0("PCA_simple_", trsf_type, "_top", ntop, ".svg"), sep="/"))
+ }
> 
> plot.heatmap.most_var <- function(out.dir, dds, trsf_data, trsf_type, ntop, samples.info=df.samples.info, genes.info=df.gene.anno) {
+   select <- order(rowVars(counts(dds,normalized=TRUE)),decreasing=TRUE)
+   select <- select[1:min(ntop, length(select))][1:min(ntop, length(select))]
+   selected.ids <- row.names(trsf_data[select,])
+   if ( length(selected.ids) > 1 ) {
+     file <- paste(out.dir, paste0("heatmap_count_matrix_", trsf_type, "_mostVar", ntop, "_row-scaled.pdf"), sep="/")
+     pheatmap(assay(trsf_data)[select,], cluster_cols = FALSE, cluster_rows = TRUE,
+           scale = "row", border_color = NA,
+           labels_row = as.character(genes.info[selected.ids,]$gene_type),
+           annotation_col=samples.info[ , !(colnames(samples.info) == 'columns'), drop=FALSE],
+           labels_col = as.character(samples.info[colnames(trsf_data),]$columns),
+           height = 12, width = 8, file = file)
+   } else {
+     print('SKIPPING: plot.heatmap.most_var. Only one feature to plot.')
+   }
+ }
> 
> plot.heatmap.top_counts <- function(out.dir, dds, trsf_data, trsf_type, ntop, samples.info=df.samples.info, genes.info=df.gene.anno) {
+   select <- order(rowMeans(counts(dds,normalized=TRUE)),decreasing=TRUE)
+   select <- select[1:min(ntop, length(select))]
+   selected.ids <- row.names(counts(dds,normalized=TRUE)[select,])
+   if ( length(selected.ids) > 1 ) {
+     file <- paste(out.dir, paste0("heatmap_count_matrix_", trsf_type, "_top", ntop, "Counts_row-scaled.pdf"), sep="/")
+     pheatmap(assay(trsf_data)[select,], cluster_cols = FALSE, cluster_rows = TRUE,
+             scale = "row", border_color = NA,
+             labels_row = as.character(genes.info[selected.ids,]$gene_type),
+             annotation_col=samples.info[ , !(colnames(samples.info) == 'columns'), drop=FALSE],
+             labels_col = as.character(samples.info[colnames(trsf_data),]$columns),
+             height = 12, width = 8, file = file)
+   } else {
+     print('SKIPPING: plot.heatmap.top_counts. Only one feature to plot.')
+   }
+ }
> 
> plot.heatmap.top_fc <- function(out.dir, resFold, trsf_data, trsf_type, ntop, pcutoff='', samples.info=df.samples.info, genes.info=df.gene.anno) {
+   selected.ids <- row.names(resFold[order(resFold$log2FoldChange, decreasing=TRUE), ])
+   selected.ids <- selected.ids[1:min(ntop, length(selected.ids))]
+   if ( length(selected.ids) > 1 ) {
+     file <- paste(out.dir, paste0("heatmap_count_matrix_", trsf_type, "_top", ntop, "log2FC", pcutoff, "_row-scaled.pdf"), sep="/")
+     pheatmap(assay(trsf_data)[selected.ids,], cluster_cols = FALSE, cluster_rows = TRUE, 
+             scale = "row", border_color = NA, 
+             labels_row = as.character(genes.info[selected.ids,]$gene_type),
+             annotation_col=samples.info[ , !(colnames(samples.info) == 'columns'), drop=FALSE],
+             labels_col = as.character(samples.info[colnames(trsf_data),]$columns),
+             height = 12, width = 8, file = file)
+   } else {
+     print('SKIPPING: plot.heatmap.top_fc. Only one feature to plot.')
+   }
+ }
> 
> piano <- function(out.dir, resFold, mapGO, cpus) {
+   mapGO <- mapGO[mapGO[,2]!="",]
+   write.table.to.file(mapGO, out.dir, "ENSG_GOterm", row.names = FALSE)
+ 
+   myGsc <- loadGSC(mapGO)
+ 
+   myPval <- resFold$padj
+   names(myPval) <- rownames(resFold)
+   myFC <- resFold$log2FoldChange
+   names(myFC) <- rownames(resFold)
+   
+   if (cpus >= 10) {
+     piano_cpus = 10
+   } else {
+     piano_cpus = 1
+   }
+ 
+   gene.set.min <- 20
+   gene.set.max <- 'inf' # 9999999999999
+   gsaRes1 <- runGSA(myFC, geneSetStat="maxmean", gsc=myGsc,
+                   gsSizeLim=c(gene.set.min,gene.set.max), ncpus=piano_cpus)
+   gsaRes2 <- runGSA(myFC, geneSetStat="gsea", gsc=myGsc,
+                   gsSizeLim=c(gene.set.min,gene.set.max), ncpus=piano_cpus)
+   gsaRes3 <- runGSA(myFC, geneSetStat="fgsea", gsc=myGsc,
+                   gsSizeLim=c(gene.set.min,gene.set.max), ncpus=piano_cpus)
+   gsaRes4 <- runGSA(myFC, geneSetStat="page", gsc=myGsc,
+                   gsSizeLim=c(gene.set.min,gene.set.max), ncpus=piano_cpus)
+   gsaRes5 <- runGSA(myPval, myFC, geneSetStat="fisher", gsc=myGsc,
+                   gsSizeLim=c(gene.set.min,gene.set.max), ncpus=piano_cpus)
+   gsaRes6 <- runGSA(myPval, myFC, geneSetStat="stouffer", gsc=myGsc,
+                   gsSizeLim=c(gene.set.min,gene.set.max), ncpus=piano_cpus)
+   gsaRes7 <- runGSA(myPval, myFC, geneSetStat="reporter", gsc=myGsc,
+                   gsSizeLim=c(gene.set.min,gene.set.max), ncpus=piano_cpus)
+   gsaRes8 <- runGSA(myPval, myFC, geneSetStat="tailStrength", gsc=myGsc,
+                   gsSizeLim=c(gene.set.min,gene.set.max), ncpus=piano_cpus)
+ 
+   resList <- list(gsaRes1,gsaRes2,gsaRes3,gsaRes4,gsaRes5,gsaRes6,gsaRes7,gsaRes8)
+   names(resList) <- c("maxmean", "gsea", "fgsea", "page", "fisher", "stouffer", "reporter", "tailStrength")
+ 
+   try.piano <- try( {
+     pdf(paste(out.dir,"/consensus_heatmap.pdf",sep=""), width = 10, height = 10)
+     ch <- consensusHeatmap(resList,cutoff=10,method="mean",colorkey=FALSE,cellnote="consensusScore",ncharLabel = 120) ## medianPvalue or consensusScore or nGenes
+     dev.off()
+     svg(paste(out.dir,"/consensus_heatmap.svg",sep=""), width = 10, height = 10)
+     ch <- consensusHeatmap(resList,cutoff=10,method="mean",colorkey=FALSE,cellnote="consensusScore",ncharLabel = 120) ## medianPvalue or consensusScore
+     dev.off()
+ 
+     downregulated_paths <- as.data.frame(ch$pMat[,1][ch$pMat[,1] < 0.05])
+     upregulated_paths <- as.data.frame(ch$pMat[,5][ch$pMat[,5] < 0.05])
+ 
+     write.table.to.file(downregulated_paths, out.dir, "paths_sigdown", col.names=FALSE)
+     write.table.to.file(upregulated_paths, out.dir, "paths_sigup", col.names=FALSE)
+     }
+   )
+   if (class(try.piano) == "try-error") {
+     print('SKIPPING: piano consensusHeatmap.')
+   }
+ 
+   # for (i in 1:length(resList)){
+   #   svg(paste(out.dir, paste0(names(resList)[i], '.svg'), sep='/'), width = 10, height = 10)
+   #   networkPlot(resList[[i]], class="non")
+   #   dev.off()
+   # }
+ }
> 
> ##################### TODO
> # plot.ma.go <- function(out, deseq2.res, ma.size, results.gene, go.terms, trsf_data, trsf_type) {
> #   ## We can also make an MA-plot for the results table in which we raised
> #   ## the log2 fold change threshold (Figure below). We can label individual
> #   ## points on the MA-plot as well. Here we use the with R function to plot
> #   ## a circle and text for a selected row of the results object. Within the
> #   ## with function, only the baseMean and log2FoldChange values for the
> #   ## selected rows of res are used.
> #   ##-----------------------------
> #   for (go.term.ma in go.terms) {
> #     #go.term.ma <- "GO:0009615"
> #     pdf(paste(out,"statistics/ma_",trsf_type,"_", gsub(":", "", go.term.ma), ".pdf",sep=""))
> #     plotMA(deseq2.res, main=paste("DESeq2, ", go.term.ma, sep=''), ylim=ma.size)
> #     results.gene.GO.ma <- grep(go.term.ma, results.gene$go_id, fixed=TRUE)  ### e.g. GO:0002376, immune system process in mice
> #     trsf_data.go.ma <- rownames(assay(trsf_data)[results.gene[results.gene.GO.ma,]$ensembl_gene_id,]) # get the ensembl ids corresponding to this go term
> #     for (gene in trsf_data.go.ma) {
> #       index = which(ensembl.ids == gene)
> #       gene.name <- toString(gene.ids[index])
> #       with(deseq2.res[gene, ], {
> #         if (gene %in% rownames(resFold05)) {
> #           points(baseMean, log2FoldChange, col="dodgerblue", cex=0.8, lwd=2, bg="dodgerblue")
> #           text(baseMean, log2FoldChange, gene.name, pos=2, col="dodgerblue")
> #         }
> #       })
> #     }
> #     dev.off()
> #   }
> # }
> #####################################################################################
> ## END FUNCTIONS
> #####################################################################################
> 
> 
> #####################################################################################
> ## MAIN 
> #####################################################################################
> 
> ##########################################
> ## Preparation
> ##########################################
> 
> #####################
> ## Parse arguments
> 
> args <- commandArgs(TRUE) # Read the arguments passed from the command line and assigns them to a vector of characters
> 
> ## Parse the arguments (in characters) and evaluate them
> project_dir <- eval( parse(text=args[1]) )[1] 
> samples <- eval( parse(text=args[2]) )
> conditions <- eval( parse(text=args[3]) )
> col.labels <- eval( parse(text=args[4]) )
> levels <- eval( parse(text=args[5]) )
> comparisons <- eval( parse(text=args[6]) )
> id2name <- eval( parse(text=args[7]) )[1]
> annotation_genes <- eval( parse(text=args[8]) )[1]
> sources <- eval( parse(text=args[9]) )
> species <- eval( parse(text=args[10]) )
> regionReport_config  <- eval( parse(text=args[11]) )[1]
> regionReport_config <- normalizePath(regionReport_config) # regionReport needs the absolute path
> cpus <- eval( parse(text=args[12]) )
> id_type <- eval( parse(text=args[13]) )
> #go.terms <- c()
> #go.terms <- eval( parse(text=args[12]) ) # c("GO:004563","GO:0011231",...)
> 
> #####################
> ## Read in ensembl ids, gene names and biotypes from a tab seperated table
> df.gene.anno <- as.data.frame( read.table(id2name, header=FALSE, sep="\t") )
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': no lines available in input
Calls: as.data.frame -> read.table
Execution halted
fischer-hub commented 2 years ago

Hey @Gavin098 ,

According to the new error message from your deseq2.Rout log file, there seems to be an issue with your annotation file(s). The id2name dataframe is supposed to contain gene IDs, gene names and biotypes extracted from the annotation files you provided via the --annotation flag. However, the dataframe seems to be empty, resulting in no lines available in input.

## Read in ensembl ids, gene names and biotypes from a tab seperated table
df.gene.anno <- as.data.frame( read.table(id2name, header=FALSE, sep="\t") )
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': no lines available in input

I would suggest you have a look at the deseq2 process work directory again (/home/ganjunwei/test_rnaflow6/work/c9/d64fbe214fd121640f0b2eb36349a3) and see if the file ending in .id2details is empty. If its not empty I attached an example file of how it should look like (structure wise) so you can check if there is something off in your *.id2details file: example.id2details.txt

If the file is empty, please check if you provided the correct paths to the annotation files in the --annotation gtfs.csv. If these paths are correct finally please also check that your annotation files are in gtf format and not empty.

Please feel free to reach out again if none of the above works! (Also if it works of course)

Cheers

Gavin098 commented 2 years ago

thanks for your feedback! I tried two different annotation files and the problem still arises. but annotation.id2detailsandannotation.gene.gtf are still empty. I don't understand whether I set these two files correctly or if there is a error with my annotation files. fastas.csv: /home/public/moso_genome_V3/contig/genome.fa gtfs.csv: /home/public/moso_genome_V3/contig/Moso_V3.BestGeneModels.gtf Moso_V3.BestGeneModels.gtf:

tig00000007_np1212121212        BestGeneModels  start_codon     35762   35764   .       +       0       gene_id "MB01Gene00001"; transcript_id "MB01Gene00001.t1"; gene_name "MB01Gene00001";
tig00000007_np1212121212        BestGeneModels  exon    35762   35788   .       +       .       gene_id "MB01Gene00001"; transcript_id "MB01Gene00001.t1"; gene_name "MB01Gene00001";
tig00000007_np1212121212        BestGeneModels  CDS     35762   35788   0.37    +       0       gene_id "MB01Gene00001"; transcript_id "MB01Gene00001.t1"; gene_name "MB01Gene00001";
tig00000007_np1212121212        BestGeneModels  exon    36357   36440   .       +       .       gene_id "MB01Gene00001"; transcript_id "MB01Gene00001.t1"; gene_name "MB01Gene00001";
tig00000007_np1212121212        BestGeneModels  CDS     36357   36440   0.84    +       0       gene_id "MB01Gene00001"; transcript_id "MB01Gene00001.t1"; gene_name "MB01Gene00001";
tig00000007_np1212121212        BestGeneModels  exon    38043   38180   .       +       .       gene_id "MB01Gene00001"; transcript_id "MB01Gene00001.t1"; gene_name "MB01Gene00001";
tig00000007_np1212121212        BestGeneModels  CDS     38043   38180   0.96    +       0       gene_id "MB01Gene00001"; transcript_id "MB01Gene00001.t1"; gene_name "MB01Gene00001";
tig00000007_np1212121212        BestGeneModels  exon    38274   38396   .       +       .       gene_id "MB01Gene00001"; transcript_id "MB01Gene00001.t1"; gene_name "MB01Gene00001";
tig00000007_np1212121212        BestGeneModels  CDS     38274   38396   1       +       0       gene_id "MB01Gene00001"; transcript_id "MB01Gene00001.t1"; gene_name "MB01Gene00001";
tig00000007_np1212121212        BestGeneModels  exon    38489   38587   .       +       .       gene_id "MB01Gene00001"; transcript_id "MB01Gene00001.t1"; gene_name "MB01Gene00001";

Error in deseq2.Rout:

Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': no lines available in input
Calls: as.data.frame -> read.table
Execution halted
fischer-hub commented 2 years ago

Hi @Gavin098 ,

The fasta.csv and gtfs.csv seem to be correct to me if the paths are correct. Also the .gtf format seems to be ok. However, I found a bug in the code thanks to your issue! Your .gtf file seems to contain no features of type gene (feature type is listed in column 3 of the .gtf file). However, the pipeline was looking for features of type gene in the .gtf file. Usually this can be changed vie the flag --featurecounts_additional_params -t gene -g gene_id, where -t can specify the type(s) of features to count and filter for. While the default value here is to look for features of type exon anyway, due to the bug in the code this was overwritten and would later filter for feature type gene instead.

Long story short, I just pushed a fix to a new branch and I would suggest you try to run again with the new fix. Just pull the branch like this:

git clone -b fix_anno_feature_type --single-branch https://github.com/hoelzer-lab/rnaflow.git

and run your command as usual. Also if you are familiar with what feature types you want to consider in your analysis you can now change them with the above mentioned --featurecounts_additional_params -t gene -g gene_id flag. If not, just keep the old command ;).

I hope this fixes the issue, if not please tell me about it,

have a nice weekend!

Gavin098 commented 2 years ago

thanks for your help. I have succeeded in getting the results. One last question, I want to ask about the meaning of the file name: mostVar, Counts_row-scaled, log2FC_row-scaled, log2FCpcutoff0-01, log2FCpcutoff0-05

image

Have a nice weekend!

hoelzer commented 2 years ago

Hey @Gavin098 glad it finally worked! And thx @fischer-hub for the support!

The different file names correspond to different filter and normalization options for plotting expression data on a heatmap.

It's difficult to provide a single or only a few default plottings for the broad variety of RNA-Seq experiments. Thus we decided to plot a couple of different parameter combinations to look at. However, it's totally possible that it makes sense for you to take the count matrix and do some additional plots focusing on your biological question and setup.

Gavin098 commented 2 years ago

Well, thank you for your time