nf-core / circrna

circRNA quantification, differential expression analysis and miRNA target prediction of RNA-Seq data
https://nf-co.re/circrna
MIT License
43 stars 21 forks source link

When running on the sponging branch, there is an error in the annotation part. #170

Closed xfk274280 closed 1 week ago

xfk274280 commented 3 weeks ago

Description of the bug

image Here are my annotation file and bed file. Are the formats correct?

Execution cancelled -- Finishing pending tasks before exit
ERROR ~ Error executing process > 'NFCORE_CIRCRNA:CIRCRNA:BSJ_DETECTION:ANNOTATE_PER_SAMPLE:ANNOTATE (C23)'

Caused by:
  Process `NFCORE_CIRCRNA:CIRCRNA:BSJ_DETECTION:ANNOTATE_PER_SAMPLE:ANNOTATE (C23)` terminated with an error exit status (1)

Command executed [~/workspace/nf_core/nf-core-circrna_sponging/sponging/./workflows/circrna/../../subworkflows/local/./../../modules/local/annotation/templates/annotation.py]:

  #!/usr/bin/env python

  import pandas as pd
  import numpy as np
  import platform
  import csv

  def format_yaml_like(data: dict, indent: int = 0) -> str:
      """Formats a dictionary to a YAML-like string.

      Args:
          data (dict): The dictionary to format.
          indent (int): The current indentation level.

      Returns:
          str: A string formatted as YAML.
      """
      yaml_str = ""
      for key, value in data.items():
          spaces = "  " * indent
          if isinstance(value, dict):
              yaml_str += f"{spaces}{key}:\n{format_yaml_like(value, indent + 1)}"
          else:
              yaml_str += f"{spaces}{key}: {value}\n"
      return yaml_str

  columns = {
      0: 'chr',
      1: 'start',
      2: 'end',
      3: 'name',
      4: 'score',
      5: 'strand',
      9: 'tx_start',
      10: 'tx_end',
      14: 'attributes'
  }

  attributes = ['gene_id', 'gene_name', 'transcript_id']

  exon_boundary = int("0")

  try:
      df = pd.read_csv("C23.intersect_gtf.bed", sep="\t", header=None, usecols=columns.keys())
  except pd.errors.EmptyDataError:
      raise ValueError("Intersection between circRNAs and GTF file is empty.")
  df = df.rename(columns=columns)

  # Extract circRNAs without match
  mask = df['tx_start'] == -1
  df_intergenic = df[mask]
  df = df[~mask]
  df_intergenic['type'] = 'intergenic-circRNA'
  df_intergenic['gene_id'] = 'intergenic_' + df_intergenic['name']
  df_intergenic['gene_name'] = 'intergenic_' + df_intergenic['name']
  df_intergenic['transcript_id'] = 'intergenic_' + df_intergenic['name']

  # Convert attributes to a dictionary
  df['attributes'] = df['attributes'].apply(lambda row: dict([[value.strip(r'"') for value in entry.strip().split(' ', 1)] for entry in row.split(';') if entry]))
  # Make sure all attributes are present
  df_incomplete = df['attributes'].apply(lambda row: ", ".join([key for key in attributes if key not in row]))
  df_incomplete = df_incomplete[df_incomplete != ""]
  if len(df_incomplete) > 0:
      counts = df_incomplete.value_counts()
      counts.name = 'count'
      counts.index.name = 'missing'
      raise ValueError(f"The following attributes are missing in the intersection file:\n\n{counts.to_frame()}")
  # Keep only the attributes we want
  df['attributes'] = df['attributes'].apply(lambda row: {key: row[key] for key in attributes if key in row})
  # Convert attributes to columns
  df = pd.concat([df.drop(['attributes'], axis=1), df['attributes'].apply(pd.Series)], axis=1)

  df['any_outside'] = (df['start'] < df['tx_start'] - exon_boundary) | (df['end'] > df['tx_end'] + exon_boundary)
  # Perfect is inverse of any_outside
  df['perfect'] = ~df['any_outside']
  # Drop any_outside
  df = df.drop(['any_outside', 'tx_start', 'tx_end'], axis=1)

  df = df.groupby(['chr', 'start', 'end', 'strand']).aggregate({
      'name': lambda x: x.iloc[0],
      'score': lambda x: x.iloc[0],
      'gene_id': lambda x: list(x),
      'gene_name': lambda x: list(x),
      'transcript_id': lambda x: list(x),
      'perfect': lambda x: list(x)
  })

  def filter_perfect(row, col):
      if any(row['perfect']):
          matching_values = [value for value, perfectness in zip(row[col], row['perfect']) if perfectness]
      else:
          matching_values = row[col]
      valid_values = set([value for value in matching_values if type(value) == str])
      return ",".join(valid_values) if valid_values else "NaN"

  def determine_type(row):
      if row["no_transcript"]:
          return "ciRNA"
      if any(row['perfect']):
          return "circRNA"
      else:
          return 'EI-circRNA'

  df['no_transcript'] = df['transcript_id'].apply(lambda x: all([type(value) != str and np.isnan(value) for value in x]))
  df['type'] = df.apply(lambda row: determine_type(row), axis=1)
  df['gene_id'] = df.apply(lambda row: filter_perfect(row, 'gene_id'), axis=1)
  df['gene_name'] = df.apply(lambda row: filter_perfect(row, 'gene_name'), axis=1)
  df['transcript_id'] = df.apply(lambda row: filter_perfect(row, 'transcript_id'), axis=1)
  # Drop perfect
  df = df.drop(['perfect'], axis=1)

  df = df.reset_index()
  df_intergenic = df_intergenic.reset_index()
  bed_order = ['chr', 'start', 'end', 'name', 'score', 'strand', 'type', 'gene_id', 'gene_name', 'transcript_id']
  df = df[bed_order]
  df_intergenic = df_intergenic[bed_order]

  df = pd.concat([df, df_intergenic], axis=0)

  db_intersections = "C23-circBase.intersect_database.bed".split()
  has_db = len(db_intersections) > 0

  if has_db:
      db_colnames = ['chr', 'start', 'end', 'name', 'score', 'strand', 'db_chr', 'db_start', 'db_end', 'db_name', 'db_score', 'db_strand']
      db_usecols = ['chr', 'start', 'end', 'name', 'score', 'strand', 'db_name']
      df_databases = pd.concat([pd.read_csv(db_path, sep="\t", names=db_colnames, usecols=db_usecols) for db_path in db_intersections])

      # Group by chr, start, end, name, score, strand, and aggregate the db_name to list
      df_databases = df_databases.groupby(['chr', 'start', 'end', 'name', 'score', 'strand']).aggregate({
          'db_name': lambda x: ",".join([val for val in x if val != '.'])
      })

      df_databases['db_name'] = df_databases['db_name'].apply(lambda x: x if x else '.')

      df = df.merge(df_databases, how='left', on=['chr', 'start', 'end', 'name', 'score', 'strand'])
  else:
      df['db_name'] = "."

  # Sort by chr, start, end
  df = df.sort_values(['chr', 'start', 'end'])

  df.to_csv("C23.annotated.bed", sep='\t', index=False, header=False)

  # Convert to GTF
  df['source'] = 'circRNA'
  df['frame'] = '.'
  df['attributes'] = 'gene_id "' + df['gene_id'] + '"; gene_name "' + df['gene_name'] + '"; transcript_id "circ_' + df['name'] + '"; db_ids "' + df['db_name'] + '";'

  gtf_order = ['chr', 'source', 'type', 'start', 'end', 'score', 'strand', 'frame', 'attributes']
  df = df[gtf_order]

  df.to_csv("C23.annotated.gtf", sep='\t', index=False, header=False, quoting=csv.QUOTE_NONE)

  # Versions

  versions = {
      "NFCORE_CIRCRNA:CIRCRNA:BSJ_DETECTION:ANNOTATE_PER_SAMPLE:ANNOTATE": {
          "python": platform.python_version(),
          "pandas": pd.__version__,
          "numpy": np.__version__
      }
  }

  with open("versions.yml", "w") as f:
      f.write(format_yaml_like(versions))

Command exit status:
  1

Command output:
  (empty)

Command error:
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 44 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 45 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 46 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 47 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 48 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 49 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 50 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 51 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 52 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 53 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 54 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 55 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 56 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 57 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 58 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 59 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 60 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 61 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 62 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  OpenBLAS blas_thread_init: pthread_create failed for thread 63 of 64: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 1802648 max
  .command.run: fork: retry: Resource temporarily unavailable
  .command.run: fork: retry: Resource temporarily unavailable
  .command.run: fork: retry: Resource temporarily unavailable
  .command.sh:45: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
    df = pd.read_csv("C23.intersect_gtf.bed", sep="\t", header=None, usecols=columns.keys())
  .command.run: fork: retry: Resource temporarily unavailable
  .command.run: fork: Resource temporarily unavailable
  .command.run: line 158:    37 Segmentation fault      /usr/bin/env python .command.sh
  .command.run: line 158: kill: (39) - No such process

Command used and terminal output

nextflow run nf-core-circrna_sponging/sponging --input samplecircrna.bk.csv --phenotype samplecircrna_condition.bk.csv --outdir result240812 -profile singularity --fasta Genome/GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa --gtf Genome/GRCh38/Homo_sapiens.GRCh38.101.chr.gtf --skip_trimming --limitSjdbInsertNsj 10000000 -resume --max_cpus 31 --max_time '960.h' --annotation circrna_ann.csv

Relevant files

No response

System information

No response

nictru commented 3 weeks ago

Hey, you seem to have a problem similar to this. I think this is more a problem with your server than with the pipeline.

Also, is there a reason why you use the sponging? It is not ready for productive usage yet.

xfk274280 commented 2 weeks ago

I've heard that the 'sponging' branch can speed up the 'miRNA_prediction' step, so I tried it out.Based on the previous tips, I re-ran the program for the 'annotation' step and it seems to have succeeded, but other errors occurred.

Execution cancelled -- Finishing pending tasks before exit
-[nf-core/circrna] Pipeline completed with errors-
ERROR ~ Error executing process > 'NFCORE_CIRCRNA:CIRCRNA:STATISTICAL_TESTS:CIRCTEST_CIRCTEST'

Caused by:
  Process `NFCORE_CIRCRNA:CIRCRNA:STATISTICAL_TESTS:CIRCTEST_CIRCTEST` terminated with an error exit status (1)

Command executed [~/nf_core/nf-core-circrna_sponging/sponging/./workflows/circrna/../../subworkflows/local/../../modules/local/circtest/circtest/templates/circtest.R]:

  #!/usr/bin/env Rscript

  require(aod)
  require(plyr)
  require(ggplot2)

  ## CircTest functions
  ## Package: CircTest (https://github.com/dieterich-lab/CircTest)
  ## Version: 0.1.1
  ## Author(s): Jun Cheng, Tobias Jakobi
  ## License: GPL

  ## SUMMMARY

  #'@title  Summarizes data
  ## Gives count, mean, standard deviation, standard error of the mean, and confidence interval (default 95%).
  ##   data: a data frame.
  ##   measurevar: the name of a column that contains the variable to be summariezed
  ##   groupvars: a vector containing names of columns that contain grouping variables
  ##   na.rm: a boolean that indicates whether to ignore NA's
  ##   conf.interval: the percent range of the confidence interval (default is 95%)
  #'
  summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE,
                          conf.interval=.95, .drop=TRUE) {

      require(plyr)

      # New version of length which can handle NA's: if na.rm==T, don't count them
      length2 <- function (x, na.rm=FALSE) {
          if (na.rm) sum(!is.na(x))
          else       length(x)
      }

      # This does the summary. For each group's data frame, return a vector with
      # N, mean, and sd
      datac <- ddply(data, groupvars, .drop=.drop,
                      .fun = function(xx, col) {
                          c(N    = length2(xx[[col]], na.rm=na.rm),
                          mean = mean   (xx[[col]], na.rm=na.rm),
                          sd   = sd     (xx[[col]], na.rm=na.rm)
                      )
                      },
                      measurevar
      )

      # Rename the "mean" column
      datac <- rename(datac, c("mean" = measurevar))

      datac$se <- datac$sd / sqrt(datac$N)  # Calculate standard error of the mean

      # Confidence interval multiplier for standard error
      # Calculate t-statistic for confidence interval:
      # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
      ciMult <- qt(conf.interval/2 + .5, datac$N-1)
      datac$ci <- datac$se * ciMult

      return(datac)
  }

  ## RATIO PLOT

  Circ.ratioplot <- function(Circ,Linear,CircCoordinates = None,plotrow='1',size=24,ncol=2,groupindicator1=NULL,groupindicator2=NULL,x='Conditions',y='circRNA/(circRNA+Linear)',lab_legend='groupindicator1', circle_description = c(1:3), gene_column = None, y_axis_range = 1, colour_mode = "colour"){

      if( !is.null(groupindicator1) & length(groupindicator1) != ncol(Circ)-length(circle_description) ){
          stop("If provided, the length of groupindicator1 should be equal to the number of samples.")
      }
      if( !is.null(groupindicator2) & length(groupindicator2) != ncol(Circ)-length(circle_description) ){
          stop("If provided, the length of groupindicator2 should be equal to the number of samples.")
      }
      if(is.null(groupindicator1)){
          stop("At least one grouping should be provided through groupindicator1.")
      }
      if(!is.null(groupindicator2)){
          twolevel <- TRUE
      }else{
          twolevel <- FALSE
      }

      rownames.circ <- rownames(Circ)
      Circ <- data.frame(lapply(Circ, as.character), stringsAsFactors=FALSE)
      rownames(Circ) <- rownames.circ

      rownames.linear <- rownames(Linear)
      Linear <- data.frame(lapply(Linear, as.character), stringsAsFactors=FALSE)
      rownames(Linear) <- rownames.linear

      if(!missing(CircCoordinates)){
          rownames.circCoordinates <- rownames(CircCoordinates)
          CircCoordinates <- data.frame(lapply(CircCoordinates, as.character), stringsAsFactors=FALSE)
          rownames(CircCoordinates) <- rownames.circCoordinates
      }else{
          CircCoordinates <- data.frame(Circ[,circle_description])
          rownames(CircCoordinates) <- rownames.circ
          rownames.circCoordinates <- rownames(CircCoordinates)
          CircCoordinates <- data.frame(lapply(CircCoordinates, as.character), stringsAsFactors=FALSE)
          rownames(CircCoordinates) <- rownames.circCoordinates
      }

      groupindicator1 <- factor(groupindicator1,levels=unique(groupindicator1))
      groupindicator2 <- factor(groupindicator2,levels=unique(groupindicator2))

      # Get gene name, if no annotation, output NULL
      if (is.character(plotrow)){
          if ( ! plotrow %in% rownames(CircCoordinates) ){
              stop("Specified 'plotrow' not found.")
          }
      }else{
          if ( is.numeric(plotrow) ){
              if ( ! plotrow %in% 1:nrow(CircCoordinates) ){
                  stop("Specified 'plotrow' not found.")
              }
          }else{
              stop("Specified plotrow should be ONE rowname or ONE rownumber.")
          }
      }
      # Choose your own column containing the gene name using gene_column. The genename will be displayed in the plot title if available
      if (missing(gene_column)){
          genename = NULL
      }else{
          genename <- as.character(CircCoordinates[plotrow,gene_column])
          if (genename == '.'){
              genename = NULL
          }
      }
      if(twolevel){
          plotdat <- summarySE( data.frame(Ratio=as.numeric(Circ[plotrow,-circle_description])/(as.numeric(Linear[plotrow,-circle_description])+as.numeric(Circ[plotrow,-circle_description])),
                                                                          groupindicator1,
                                                                          groupindicator2),
                                                  measurevar='Ratio',groupvars=c('groupindicator1','groupindicator2') )
      }else{
          plotdat <- summarySE( data.frame(Ratio=as.numeric(Circ[plotrow,-circle_description])/(as.numeric(Linear[plotrow,-circle_description])+as.numeric(Circ[plotrow,-circle_description])),
                                                                          groupindicator1),
                                                                          measurevar='Ratio',groupvars=c('groupindicator1') )
      }
  # construct plot
      Q <- ggplot(plotdat, aes(x=groupindicator1, y=Ratio)) +
              geom_boxplot() + theme_classic() +
              theme(axis.text.x = element_blank())+
              theme(axis.text.y = element_text(size=size+4))+
              theme(axis.ticks = element_line(colour = 'black', size = 1)) +
              theme(axis.ticks.x = element_blank())+
              theme(legend.title=element_blank()) +
              theme(text=element_text(size=size+4))+
              theme(legend.text=element_text(size=size)) +
              theme(plot.title = element_text(size=size)) +
              theme(axis.text.y = element_text(margin=margin(5,5,10,5,"pt")))+
              ggtitle(paste("Annotation: ", genename, "\nChr ", toString(Circ[plotrow,circle_description]),sep="")) +
              ylab("circRNA/(circRNA + Linear RNA)") +
              xlab("Sample") +
              geom_errorbar(aes(ymin=Ratio, ymax=Ratio+se), width=.2 , size=2) +
              geom_bar(stat="identity",aes(fill=groupindicator1), color = "black", size=2)

      if (colour_mode == "bw"){
              Q <- Q + scale_fill_grey(start = 0.0, end = 1)
      } else {
              Q <- Q + scale_fill_discrete(name=lab_legend)
      }

              Q <- Q +
              theme(legend.position="bottom") +
              theme(axis.ticks.length = unit(0.5, "cm")) +
              theme(panel.background = element_blank(),
                  panel.grid.major = element_blank(),
                  panel.grid.minor = element_blank(),
                  axis.line = element_line(colour = "black"),
                  panel.border = element_rect(colour = "black", fill=NA, size=3)) +
                  guides(fill=guide_legend(
                                  keywidth=0.3,
                                  keyheight=0.3,
                                  default.unit="inch")
              ) + scale_y_continuous(expand=c(0,0), limits= c(0, y_axis_range))

      if(twolevel){
          Q <- Q + facet_wrap( ~ groupindicator2,ncol=ncol )
      }

      print(Q)
  }

  ## LINEPLOT

  Circ.lineplot <- function(Circ,Linear,CircCoordinates = None,plotrow='1',size=18,ncol=2,groupindicator1=NULL,groupindicator2=NULL,x='Conditions',y='Counts', circle_description = c(1:3), gene_column = None){

      require(ggplot2)

      if( !is.null(groupindicator1) & length(groupindicator1) != ncol(Circ)-length(circle_description) ){
          stop("If provided, the length of groupindicator1 should be equal to the number of samples.")
      }
      if( !is.null(groupindicator2) & length(groupindicator2) != ncol(Circ)-length(circle_description) ){
          stop("If provided, the length of groupindicator2 should be equal to the number of samples.")
      }
      if(is.null(groupindicator1)){
          stop("At least one grouping should be provided through groupindicator1.")
      }
      if(!is.null(groupindicator2)){
          twolevel <- TRUE
      }else{
          twolevel <- FALSE
      }

      rownames.circ <- rownames(Circ)
      Circ <- data.frame(lapply(Circ, as.character), stringsAsFactors=FALSE)
      rownames(Circ) <- rownames.circ

      rownames.linear <- rownames(Linear)
      Linear <- data.frame(lapply(Linear, as.character), stringsAsFactors=FALSE)
      rownames(Linear) <- rownames.linear

      # if CircCoordinates are available, use them, otherwise get more information from the Circ table, as indicated by the circle_description columns.
      if(!missing(CircCoordinates)){
          rownames.circCoordinates <- rownames(CircCoordinates)
          CircCoordinates <- data.frame(lapply(CircCoordinates, as.character), stringsAsFactors=FALSE)
          rownames(CircCoordinates) <- rownames.circCoordinates
      }else{
          CircCoordinates <- data.frame(Circ[,circle_description])
          rownames(CircCoordinates) <- rownames.circ
          rownames.circCoordinates <- rownames(CircCoordinates)
          CircCoordinates <- data.frame(lapply(CircCoordinates, as.character), stringsAsFactors=FALSE)
          rownames(CircCoordinates) <- rownames.circCoordinates
      }

      groupindicator1 <- factor(groupindicator1,levels=unique(groupindicator1))
      groupindicator2 <- factor(groupindicator2,levels=unique(groupindicator2))

      # Get gene name, if no annotation, output NULL
      if (is.character(plotrow)){
          if ( ! plotrow %in% rownames(CircCoordinates) ){
              stop("Specified 'plotrow' not found.")
          }
      }else{
          if ( is.numeric(plotrow) ){
              if ( ! plotrow %in% 1:nrow(CircCoordinates) ){
                  stop("Specified 'plotrow' not found.")
              }
          }else{
              stop("Specified plotrow should be ONE rowname or ONE rownumber.")
          }
      }
      # Choose your own column containing the gene name using gene_column. The genename will be displayed in the plot title if available
      if (missing(gene_column)){
          genename = NULL
      }else{
          genename <- as.character(CircCoordinates[plotrow,gene_column])
          if (genename == '.'){
              genename = NULL
          }
      }

      plot.func <- function(row=plotrow){
          if(twolevel){
              plotdat <- summarySE(data.frame(Counts=c(as.numeric(Circ[row,-circle_description]),as.numeric(Linear[row,-circle_description])),
                                                                              groupindicator1,
                                                                              groupindicator2,
                                                                              Type=c(rep('circRNA',ncol(Circ)-length(circle_description)),rep('linear RNA',ncol(Circ)-length(circle_description)))
              ), measurevar='Counts',groupvars=c('Type','groupindicator1','groupindicator2') )
          }else{
              plotdat <- summarySE(data.frame(Counts=c(as.numeric(Circ[row,-circle_description]),as.numeric(Linear[row,-circle_description])),
                                                                              groupindicator1,
                                                                              Type=c(rep('circRNA',ncol(Circ)-length(circle_description)),rep('linear RNA',ncol(Circ)-length(circle_description)))
              ), measurevar='Counts',groupvars=c('Type','groupindicator1') )
          }

          Q=ggplot(plotdat, aes(x=groupindicator1, y=Counts, group=Type,colour=Type)) +
              theme(text=element_text(size=size))+
              theme_bw()+
              labs( list(title=paste(toString(Circ[row,circle_description]),genename,sep=" "),x=x,y=y) ) +
              ggtitle(paste(toString(Circ[row,circle_description]),genename,sep=" "))+
              geom_errorbar(aes(ymin=Counts-se, ymax=Counts+se), width=.1, position=position_dodge(.1) ) +
              xlab("Condition") +
              geom_line(position=position_dodge(.1)) +
              geom_point(position=position_dodge(.1))
          if (twolevel){
              Q = Q + facet_wrap( ~ groupindicator2,ncol=ceiling(sqrt(length(levels(groupindicator2)))) )
          }

          print(Q)
      }

      return(plot.func(row=plotrow))
  }

  ## FILTER

  Circ.filter <- function(circ=circ,linear=linear,Nreplicates=3,filter.sample=4,filter.count=5,percentage=1, circle_description=c(1:3)){
      del_row=c()
      for ( i in 1:nrow(circ) ){
          if ( sum(circ[i,-circle_description]>=filter.count)<filter.sample | circ_max_perc(circ[i,-circle_description], linear[i,-circle_description],Nreplicates=Nreplicates)<percentage)
              del_row = c(del_row,i)
      }

      if (length(del_row) > 0){
          new_dat=circ[-del_row,]
          return(new_dat)
      } else {
          return(circ)
      }
  }

  circ_max_perc <- function(circ=circ,linear=linear,Nreplicates=3){
      # convert to vector
      circ = as.numeric(circ)
      linear = as.numeric(linear)
      if( length(circ) != length(linear) ){
          stop ('Number of samples in circRNA is not equal to Hostgene.')
      }
      Ngroups = length(circ)/Nreplicates
      # calculate percentage
      circ_sum = unname(tapply(circ, (seq_along(1:length(circ))-1) %/% Nreplicates, sum ))
      linear_sum = unname(tapply(linear, (seq_along(1:length(linear))-1) %/% Nreplicates, sum ))
      perc = max(circ_sum / (circ_sum+linear_sum),na.rm=T)
      return(perc)
  }

  ## CIRC TEST

  Circ.test <- function(Circ, Linear, CircCoordinates=None, group, alpha=0.05, plotsig=T, circle_description = c(1:3)){

          # Requre packge
          require(aod)

          # check whether the input matrix are correct
          if ( nrow(Circ)!=nrow(Linear) | ncol(Circ) != ncol(Linear)){
                  stop('Circ data and Linear data are not matched, dimention different.')
          }

          # A vector for pvalue and directions indicator
          p.val <- c()
          direction <- c()

          # groups
          if ( length(group) != ncol(Circ)-length(circle_description) ){
                  print(length(group))
                  print(ncol(Circ)-length(circle_description))
                  stop("length of 'group' must be equal to the number of samples of 'Circ' and 'Linear'. ")
          }
          group <- factor(group)
          counter <- 0

          ## test
          # construct test matrix for each circRNA

          tmp_df = Circ[,FALSE]

          for (j in seq(1,length(unique(group)))){
                  tmp_df[paste("group_",j,"_ratio_mean",sep="")] <- NA
          }

          for ( i in rownames(Circ) ){
                  counter <- counter+1

                  # total read counts vector
                  tot <- round( as.numeric(Linear[i,-circle_description]) + as.numeric(Circ[i,-circle_description]) )

                  # circRNA read counts
                  circ <- as.numeric(Circ[i,-circle_description])

                  # if there is 0 in the total count vector, the model will fail. So permute 0 to 1
                  if ( 0 %in% tot ){
                      tot[tot==0]=1
                  }

                  if (counter %% 1000 == 0){
                          message(paste(counter, "candidates processed"))
                  }

                  tmp_rations <- data.frame(Ratio=as.numeric(Circ[i,-circle_description])/(as.numeric(Linear[i,-circle_description])+as.numeric(Circ[i,-circle_description])),
                  group=group)
                  for (rep_group in seq(1,max(as.numeric(levels(group))),1)){
                          tmp_df[i, paste("group_",rep_group,"_ratio_mean",sep="")] <- mean(na.omit(unlist(tmp_rations[tmp_rations$group==rep_group,1])))
                  }

                  # Constract data frame
                  testdat = data.frame(tot,circ,group)

                  ## do test
                  # Null model
                  fitNull <- betabin(cbind(circ,tot-circ) ~ 1, ~ 1, data=testdat)
                  # Alternative model
                  fitAlt <- betabin(cbind(circ,tot-circ) ~ group, ~ 1, data=testdat)
                  # test models
                  a <- anova(fitNull,fitAlt)
                  p.value <- a@anova.table[,11][2]
                  p.val <- c( p.val, p.value )
          }
          message(paste(counter, "candidates processed in total"))

          Circ$direction <- direction
          p.adj <- p.adjust(p.val,n=sum(!is.na(p.val)),'BH')
          # select significant ones
          sig_dat <- Circ[p.adj<=alpha    & !is.na(p.adj),]
          sig_ratios <- tmp_df[p.adj<=alpha    & !is.na(p.adj),]
          sig_p <- p.adj[p.adj<=alpha    & !is.na(p.adj)]

          # sort by p-val
          sig_dat <- sig_dat[order(sig_p),]
          sig_ratios <- sig_ratios[order(sig_p),]
          sig_p <- sort(sig_p)

          # A summary table
          if (missing(CircCoordinates)){
                  summary_table <- data.frame(sig_dat[,circle_description],sig_p,sig_dat[,circle_description])

                  rownames(summary_table) <- rownames(sig_dat)
                  names(summary_table) <- c(names(sig_dat)[circle_description],"sig_p",names(sig_ratios)[circle_description])
          } else {
                  summary_table <- cbind(CircCoordinates[rownames(sig_dat),],sig_p,sig_ratios)
                  colnames(summary_table) <- c(colnames(CircCoordinates),"sig_p",colnames(sig_ratios))
          }

          message(paste(nrow(summary_table), "candidates passed the specified thresholds"))

          # return all objects in a list
          return(list(summary_table=summary_table,
                              sig.dat=sig_dat,
                              p.val=p.val,
                              p.adj=p.adj,
                              sig_p=sig_p,
                              ratios=sig_ratios
                          )
                  )
  }

  ## MAIN

  circs = read.table("tx_counts_circs.tsv", header=T, sep="\t")
  genes = read.table("tx_counts_genes.tsv", header=T, sep="\t")
  pheno = read.csv  ("samplecircrna_condition.bk.csv"  , header=T, row.names = "sample")

  circs <- Circ.filter(circ = circs, linear = genes, filter.sample = 2, filter.count = 5, percentage = 0.00001)
  genes <- genes[rownames(circs),]

  description <- c(1)
  pheno <- pheno[colnames(circs[,-description]),,drop=FALSE]

  test <- Circ.test(circs, genes, group=as.numeric(as.factor(pheno$condition)), circle_description = description)
  write.table(test$summary_table, "tx_counts_summary.txt", row.names=F)

  pdf("circ_linear_ratio_plots.pdf", width = 8, height = 10)
  for (i in rownames(test$summary_table)) {
      Circ.ratioplot(circs, genes, plotrow=i, groupindicator1=pheno$condition,
          lab_legend = 'condition', circle_description = description )
  }
  dev.off()

  pdf("circ_linear_line_plots.pdf", width = 8, height = 10)
  for (i in rownames(test$summary_table)) {
      Circ.lineplot(circs, genes, plotrow=i, groupindicator1=pheno$condition,
          circle_description = description )
  }
  dev.off()

  ################################################
  ################################################
  ## VERSIONS FILE                              ##
  ################################################
  ################################################

  r.version <- strsplit(version[['version.string']], ' ')[[1]][3]

  writeLines(
      c(
          '"NFCORE_CIRCRNA:CIRCRNA:STATISTICAL_TESTS:CIRCTEST_CIRCTEST":',
          paste('    r-base:', r.version),
          paste('    aod:', packageVersion('aod')),
          paste('    plyr:', packageVersion('plyr')),
          paste('    ggplot2:', packageVersion('ggplot2'))
      ),
  'versions.yml')

  ################################################
  ################################################
  ################################################
  ################################################

Command exit status:
  1

Command output:
  (empty)

Command error:
  Loading required package: aod
  Loading required package: plyr
  Loading required package: ggplot2
  Warning message:
  In max(circ_sum/(circ_sum + linear_sum), na.rm = T) :
    no non-missing arguments to max; returning -Inf
  756 candidates processed in total
  756 candidates passed the specified thresholds
  There were 50 or more warnings (use warnings() to see the first 50)
  Error: Must request at least one colour from a hue palette.
  In addition: There were 33 warnings (use warnings() to see them)
  Execution halted
xfk274280 commented 2 weeks ago

image

If I only need the circular RNA matrix and want to perform differential analysis with DESeq2, can I directly use the 'quantification/combined/circular.tsv' file? Also, how can I convert the IDs, such as 'circ_9:127887483-127896332:-', to the IDs in the annotation CircBase database?

nictru commented 2 weeks ago

Hey

I've heard that the 'sponging' branch can speed up the 'miRNA_prediction' step, so I tried it out

This was true until #166 was merged. Since then, this functionality is also available on the dev branch.

If I only need the circular RNA matrix and want to perform differential analysis with DESeq2, can I directly use the 'quantification/combined/circular.tsv' file?

DESeq2 is not made for differential expression on transcript level, but I know it is frequently used in the circRNA space. This files contain the output of psirc-quant, which is an adaption of kallisto for quantifying circRNAs. If you want to, you can use them with DESeq. The experiments.merged.rds contains a summarizedExperiment that can be used with fishpond/swish, which is made for differential expression on transcript level.

Also, how can I convert the IDs, such as 'circ_9:127887483-127896332:-', to the IDs in the annotation CircBase database?

Check out this section of the docs. The *.annotated.bed and the *.annotated.gtf should contain the annotations based on the files you provided via the --annotation parameter.

xfk274280 commented 2 weeks ago

Thank you for your reply.

Hey

I've heard that the 'sponging' branch can speed up the 'miRNA_prediction' step, so I tried it out

This was true until #166 was merged. Since then, this functionality is also available on the dev branch.

If I only need the circular RNA matrix and want to perform differential analysis with DESeq2, can I directly use the 'quantification/combined/circular.tsv' file?

DESeq2 is not made for differential expression on transcript level, but I know it is frequently used in the circRNA space. This files contain the output of psirc-quant, which is an adaption of kallisto for quantifying circRNAs. If you want to, you can use them with DESeq. The experiments.merged.rds contains a summarizedExperiment that can be used with fishpond/swish, which is made for differential expression on transcript level.

Also, how can I convert the IDs, such as 'circ_9:127887483-127896332:-', to the IDs in the annotation CircBase database?

Check out this section of the docs. The *.annotated.bed and the *.annotated.gtf should contain the annotations based on the files you provided via the --annotation parameter.