Closed xfk274280 closed 1 week ago
Hey, you seem to have a problem similar to this. I think this is more a problem with your server than with the pipeline.
Also, is there a reason why you use the sponging? It is not ready for productive usage yet.
I've heard that the 'sponging' branch can speed up the 'miRNA_prediction' step, so I tried it out.Based on the previous tips, I re-ran the program for the 'annotation' step and it seems to have succeeded, but other errors occurred.
Execution cancelled -- Finishing pending tasks before exit
-[nf-core/circrna] Pipeline completed with errors-
ERROR ~ Error executing process > 'NFCORE_CIRCRNA:CIRCRNA:STATISTICAL_TESTS:CIRCTEST_CIRCTEST'
Caused by:
Process `NFCORE_CIRCRNA:CIRCRNA:STATISTICAL_TESTS:CIRCTEST_CIRCTEST` terminated with an error exit status (1)
Command executed [~/nf_core/nf-core-circrna_sponging/sponging/./workflows/circrna/../../subworkflows/local/../../modules/local/circtest/circtest/templates/circtest.R]:
#!/usr/bin/env Rscript
require(aod)
require(plyr)
require(ggplot2)
## CircTest functions
## Package: CircTest (https://github.com/dieterich-lab/CircTest)
## Version: 0.1.1
## Author(s): Jun Cheng, Tobias Jakobi
## License: GPL
## SUMMMARY
#'@title Summarizes data
## Gives count, mean, standard deviation, standard error of the mean, and confidence interval (default 95%).
## data: a data frame.
## measurevar: the name of a column that contains the variable to be summariezed
## groupvars: a vector containing names of columns that contain grouping variables
## na.rm: a boolean that indicates whether to ignore NA's
## conf.interval: the percent range of the confidence interval (default is 95%)
#'
summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE,
conf.interval=.95, .drop=TRUE) {
require(plyr)
# New version of length which can handle NA's: if na.rm==T, don't count them
length2 <- function (x, na.rm=FALSE) {
if (na.rm) sum(!is.na(x))
else length(x)
}
# This does the summary. For each group's data frame, return a vector with
# N, mean, and sd
datac <- ddply(data, groupvars, .drop=.drop,
.fun = function(xx, col) {
c(N = length2(xx[[col]], na.rm=na.rm),
mean = mean (xx[[col]], na.rm=na.rm),
sd = sd (xx[[col]], na.rm=na.rm)
)
},
measurevar
)
# Rename the "mean" column
datac <- rename(datac, c("mean" = measurevar))
datac$se <- datac$sd / sqrt(datac$N) # Calculate standard error of the mean
# Confidence interval multiplier for standard error
# Calculate t-statistic for confidence interval:
# e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
ciMult <- qt(conf.interval/2 + .5, datac$N-1)
datac$ci <- datac$se * ciMult
return(datac)
}
## RATIO PLOT
Circ.ratioplot <- function(Circ,Linear,CircCoordinates = None,plotrow='1',size=24,ncol=2,groupindicator1=NULL,groupindicator2=NULL,x='Conditions',y='circRNA/(circRNA+Linear)',lab_legend='groupindicator1', circle_description = c(1:3), gene_column = None, y_axis_range = 1, colour_mode = "colour"){
if( !is.null(groupindicator1) & length(groupindicator1) != ncol(Circ)-length(circle_description) ){
stop("If provided, the length of groupindicator1 should be equal to the number of samples.")
}
if( !is.null(groupindicator2) & length(groupindicator2) != ncol(Circ)-length(circle_description) ){
stop("If provided, the length of groupindicator2 should be equal to the number of samples.")
}
if(is.null(groupindicator1)){
stop("At least one grouping should be provided through groupindicator1.")
}
if(!is.null(groupindicator2)){
twolevel <- TRUE
}else{
twolevel <- FALSE
}
rownames.circ <- rownames(Circ)
Circ <- data.frame(lapply(Circ, as.character), stringsAsFactors=FALSE)
rownames(Circ) <- rownames.circ
rownames.linear <- rownames(Linear)
Linear <- data.frame(lapply(Linear, as.character), stringsAsFactors=FALSE)
rownames(Linear) <- rownames.linear
if(!missing(CircCoordinates)){
rownames.circCoordinates <- rownames(CircCoordinates)
CircCoordinates <- data.frame(lapply(CircCoordinates, as.character), stringsAsFactors=FALSE)
rownames(CircCoordinates) <- rownames.circCoordinates
}else{
CircCoordinates <- data.frame(Circ[,circle_description])
rownames(CircCoordinates) <- rownames.circ
rownames.circCoordinates <- rownames(CircCoordinates)
CircCoordinates <- data.frame(lapply(CircCoordinates, as.character), stringsAsFactors=FALSE)
rownames(CircCoordinates) <- rownames.circCoordinates
}
groupindicator1 <- factor(groupindicator1,levels=unique(groupindicator1))
groupindicator2 <- factor(groupindicator2,levels=unique(groupindicator2))
# Get gene name, if no annotation, output NULL
if (is.character(plotrow)){
if ( ! plotrow %in% rownames(CircCoordinates) ){
stop("Specified 'plotrow' not found.")
}
}else{
if ( is.numeric(plotrow) ){
if ( ! plotrow %in% 1:nrow(CircCoordinates) ){
stop("Specified 'plotrow' not found.")
}
}else{
stop("Specified plotrow should be ONE rowname or ONE rownumber.")
}
}
# Choose your own column containing the gene name using gene_column. The genename will be displayed in the plot title if available
if (missing(gene_column)){
genename = NULL
}else{
genename <- as.character(CircCoordinates[plotrow,gene_column])
if (genename == '.'){
genename = NULL
}
}
if(twolevel){
plotdat <- summarySE( data.frame(Ratio=as.numeric(Circ[plotrow,-circle_description])/(as.numeric(Linear[plotrow,-circle_description])+as.numeric(Circ[plotrow,-circle_description])),
groupindicator1,
groupindicator2),
measurevar='Ratio',groupvars=c('groupindicator1','groupindicator2') )
}else{
plotdat <- summarySE( data.frame(Ratio=as.numeric(Circ[plotrow,-circle_description])/(as.numeric(Linear[plotrow,-circle_description])+as.numeric(Circ[plotrow,-circle_description])),
groupindicator1),
measurevar='Ratio',groupvars=c('groupindicator1') )
}
# construct plot
Q <- ggplot(plotdat, aes(x=groupindicator1, y=Ratio)) +
geom_boxplot() + theme_classic() +
theme(axis.text.x = element_blank())+
theme(axis.text.y = element_text(size=size+4))+
theme(axis.ticks = element_line(colour = 'black', size = 1)) +
theme(axis.ticks.x = element_blank())+
theme(legend.title=element_blank()) +
theme(text=element_text(size=size+4))+
theme(legend.text=element_text(size=size)) +
theme(plot.title = element_text(size=size)) +
theme(axis.text.y = element_text(margin=margin(5,5,10,5,"pt")))+
ggtitle(paste("Annotation: ", genename, "\nChr ", toString(Circ[plotrow,circle_description]),sep="")) +
ylab("circRNA/(circRNA + Linear RNA)") +
xlab("Sample") +
geom_errorbar(aes(ymin=Ratio, ymax=Ratio+se), width=.2 , size=2) +
geom_bar(stat="identity",aes(fill=groupindicator1), color = "black", size=2)
if (colour_mode == "bw"){
Q <- Q + scale_fill_grey(start = 0.0, end = 1)
} else {
Q <- Q + scale_fill_discrete(name=lab_legend)
}
Q <- Q +
theme(legend.position="bottom") +
theme(axis.ticks.length = unit(0.5, "cm")) +
theme(panel.background = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.line = element_line(colour = "black"),
panel.border = element_rect(colour = "black", fill=NA, size=3)) +
guides(fill=guide_legend(
keywidth=0.3,
keyheight=0.3,
default.unit="inch")
) + scale_y_continuous(expand=c(0,0), limits= c(0, y_axis_range))
if(twolevel){
Q <- Q + facet_wrap( ~ groupindicator2,ncol=ncol )
}
print(Q)
}
## LINEPLOT
Circ.lineplot <- function(Circ,Linear,CircCoordinates = None,plotrow='1',size=18,ncol=2,groupindicator1=NULL,groupindicator2=NULL,x='Conditions',y='Counts', circle_description = c(1:3), gene_column = None){
require(ggplot2)
if( !is.null(groupindicator1) & length(groupindicator1) != ncol(Circ)-length(circle_description) ){
stop("If provided, the length of groupindicator1 should be equal to the number of samples.")
}
if( !is.null(groupindicator2) & length(groupindicator2) != ncol(Circ)-length(circle_description) ){
stop("If provided, the length of groupindicator2 should be equal to the number of samples.")
}
if(is.null(groupindicator1)){
stop("At least one grouping should be provided through groupindicator1.")
}
if(!is.null(groupindicator2)){
twolevel <- TRUE
}else{
twolevel <- FALSE
}
rownames.circ <- rownames(Circ)
Circ <- data.frame(lapply(Circ, as.character), stringsAsFactors=FALSE)
rownames(Circ) <- rownames.circ
rownames.linear <- rownames(Linear)
Linear <- data.frame(lapply(Linear, as.character), stringsAsFactors=FALSE)
rownames(Linear) <- rownames.linear
# if CircCoordinates are available, use them, otherwise get more information from the Circ table, as indicated by the circle_description columns.
if(!missing(CircCoordinates)){
rownames.circCoordinates <- rownames(CircCoordinates)
CircCoordinates <- data.frame(lapply(CircCoordinates, as.character), stringsAsFactors=FALSE)
rownames(CircCoordinates) <- rownames.circCoordinates
}else{
CircCoordinates <- data.frame(Circ[,circle_description])
rownames(CircCoordinates) <- rownames.circ
rownames.circCoordinates <- rownames(CircCoordinates)
CircCoordinates <- data.frame(lapply(CircCoordinates, as.character), stringsAsFactors=FALSE)
rownames(CircCoordinates) <- rownames.circCoordinates
}
groupindicator1 <- factor(groupindicator1,levels=unique(groupindicator1))
groupindicator2 <- factor(groupindicator2,levels=unique(groupindicator2))
# Get gene name, if no annotation, output NULL
if (is.character(plotrow)){
if ( ! plotrow %in% rownames(CircCoordinates) ){
stop("Specified 'plotrow' not found.")
}
}else{
if ( is.numeric(plotrow) ){
if ( ! plotrow %in% 1:nrow(CircCoordinates) ){
stop("Specified 'plotrow' not found.")
}
}else{
stop("Specified plotrow should be ONE rowname or ONE rownumber.")
}
}
# Choose your own column containing the gene name using gene_column. The genename will be displayed in the plot title if available
if (missing(gene_column)){
genename = NULL
}else{
genename <- as.character(CircCoordinates[plotrow,gene_column])
if (genename == '.'){
genename = NULL
}
}
plot.func <- function(row=plotrow){
if(twolevel){
plotdat <- summarySE(data.frame(Counts=c(as.numeric(Circ[row,-circle_description]),as.numeric(Linear[row,-circle_description])),
groupindicator1,
groupindicator2,
Type=c(rep('circRNA',ncol(Circ)-length(circle_description)),rep('linear RNA',ncol(Circ)-length(circle_description)))
), measurevar='Counts',groupvars=c('Type','groupindicator1','groupindicator2') )
}else{
plotdat <- summarySE(data.frame(Counts=c(as.numeric(Circ[row,-circle_description]),as.numeric(Linear[row,-circle_description])),
groupindicator1,
Type=c(rep('circRNA',ncol(Circ)-length(circle_description)),rep('linear RNA',ncol(Circ)-length(circle_description)))
), measurevar='Counts',groupvars=c('Type','groupindicator1') )
}
Q=ggplot(plotdat, aes(x=groupindicator1, y=Counts, group=Type,colour=Type)) +
theme(text=element_text(size=size))+
theme_bw()+
labs( list(title=paste(toString(Circ[row,circle_description]),genename,sep=" "),x=x,y=y) ) +
ggtitle(paste(toString(Circ[row,circle_description]),genename,sep=" "))+
geom_errorbar(aes(ymin=Counts-se, ymax=Counts+se), width=.1, position=position_dodge(.1) ) +
xlab("Condition") +
geom_line(position=position_dodge(.1)) +
geom_point(position=position_dodge(.1))
if (twolevel){
Q = Q + facet_wrap( ~ groupindicator2,ncol=ceiling(sqrt(length(levels(groupindicator2)))) )
}
print(Q)
}
return(plot.func(row=plotrow))
}
## FILTER
Circ.filter <- function(circ=circ,linear=linear,Nreplicates=3,filter.sample=4,filter.count=5,percentage=1, circle_description=c(1:3)){
del_row=c()
for ( i in 1:nrow(circ) ){
if ( sum(circ[i,-circle_description]>=filter.count)<filter.sample | circ_max_perc(circ[i,-circle_description], linear[i,-circle_description],Nreplicates=Nreplicates)<percentage)
del_row = c(del_row,i)
}
if (length(del_row) > 0){
new_dat=circ[-del_row,]
return(new_dat)
} else {
return(circ)
}
}
circ_max_perc <- function(circ=circ,linear=linear,Nreplicates=3){
# convert to vector
circ = as.numeric(circ)
linear = as.numeric(linear)
if( length(circ) != length(linear) ){
stop ('Number of samples in circRNA is not equal to Hostgene.')
}
Ngroups = length(circ)/Nreplicates
# calculate percentage
circ_sum = unname(tapply(circ, (seq_along(1:length(circ))-1) %/% Nreplicates, sum ))
linear_sum = unname(tapply(linear, (seq_along(1:length(linear))-1) %/% Nreplicates, sum ))
perc = max(circ_sum / (circ_sum+linear_sum),na.rm=T)
return(perc)
}
## CIRC TEST
Circ.test <- function(Circ, Linear, CircCoordinates=None, group, alpha=0.05, plotsig=T, circle_description = c(1:3)){
# Requre packge
require(aod)
# check whether the input matrix are correct
if ( nrow(Circ)!=nrow(Linear) | ncol(Circ) != ncol(Linear)){
stop('Circ data and Linear data are not matched, dimention different.')
}
# A vector for pvalue and directions indicator
p.val <- c()
direction <- c()
# groups
if ( length(group) != ncol(Circ)-length(circle_description) ){
print(length(group))
print(ncol(Circ)-length(circle_description))
stop("length of 'group' must be equal to the number of samples of 'Circ' and 'Linear'. ")
}
group <- factor(group)
counter <- 0
## test
# construct test matrix for each circRNA
tmp_df = Circ[,FALSE]
for (j in seq(1,length(unique(group)))){
tmp_df[paste("group_",j,"_ratio_mean",sep="")] <- NA
}
for ( i in rownames(Circ) ){
counter <- counter+1
# total read counts vector
tot <- round( as.numeric(Linear[i,-circle_description]) + as.numeric(Circ[i,-circle_description]) )
# circRNA read counts
circ <- as.numeric(Circ[i,-circle_description])
# if there is 0 in the total count vector, the model will fail. So permute 0 to 1
if ( 0 %in% tot ){
tot[tot==0]=1
}
if (counter %% 1000 == 0){
message(paste(counter, "candidates processed"))
}
tmp_rations <- data.frame(Ratio=as.numeric(Circ[i,-circle_description])/(as.numeric(Linear[i,-circle_description])+as.numeric(Circ[i,-circle_description])),
group=group)
for (rep_group in seq(1,max(as.numeric(levels(group))),1)){
tmp_df[i, paste("group_",rep_group,"_ratio_mean",sep="")] <- mean(na.omit(unlist(tmp_rations[tmp_rations$group==rep_group,1])))
}
# Constract data frame
testdat = data.frame(tot,circ,group)
## do test
# Null model
fitNull <- betabin(cbind(circ,tot-circ) ~ 1, ~ 1, data=testdat)
# Alternative model
fitAlt <- betabin(cbind(circ,tot-circ) ~ group, ~ 1, data=testdat)
# test models
a <- anova(fitNull,fitAlt)
p.value <- a@anova.table[,11][2]
p.val <- c( p.val, p.value )
}
message(paste(counter, "candidates processed in total"))
Circ$direction <- direction
p.adj <- p.adjust(p.val,n=sum(!is.na(p.val)),'BH')
# select significant ones
sig_dat <- Circ[p.adj<=alpha & !is.na(p.adj),]
sig_ratios <- tmp_df[p.adj<=alpha & !is.na(p.adj),]
sig_p <- p.adj[p.adj<=alpha & !is.na(p.adj)]
# sort by p-val
sig_dat <- sig_dat[order(sig_p),]
sig_ratios <- sig_ratios[order(sig_p),]
sig_p <- sort(sig_p)
# A summary table
if (missing(CircCoordinates)){
summary_table <- data.frame(sig_dat[,circle_description],sig_p,sig_dat[,circle_description])
rownames(summary_table) <- rownames(sig_dat)
names(summary_table) <- c(names(sig_dat)[circle_description],"sig_p",names(sig_ratios)[circle_description])
} else {
summary_table <- cbind(CircCoordinates[rownames(sig_dat),],sig_p,sig_ratios)
colnames(summary_table) <- c(colnames(CircCoordinates),"sig_p",colnames(sig_ratios))
}
message(paste(nrow(summary_table), "candidates passed the specified thresholds"))
# return all objects in a list
return(list(summary_table=summary_table,
sig.dat=sig_dat,
p.val=p.val,
p.adj=p.adj,
sig_p=sig_p,
ratios=sig_ratios
)
)
}
## MAIN
circs = read.table("tx_counts_circs.tsv", header=T, sep="\t")
genes = read.table("tx_counts_genes.tsv", header=T, sep="\t")
pheno = read.csv ("samplecircrna_condition.bk.csv" , header=T, row.names = "sample")
circs <- Circ.filter(circ = circs, linear = genes, filter.sample = 2, filter.count = 5, percentage = 0.00001)
genes <- genes[rownames(circs),]
description <- c(1)
pheno <- pheno[colnames(circs[,-description]),,drop=FALSE]
test <- Circ.test(circs, genes, group=as.numeric(as.factor(pheno$condition)), circle_description = description)
write.table(test$summary_table, "tx_counts_summary.txt", row.names=F)
pdf("circ_linear_ratio_plots.pdf", width = 8, height = 10)
for (i in rownames(test$summary_table)) {
Circ.ratioplot(circs, genes, plotrow=i, groupindicator1=pheno$condition,
lab_legend = 'condition', circle_description = description )
}
dev.off()
pdf("circ_linear_line_plots.pdf", width = 8, height = 10)
for (i in rownames(test$summary_table)) {
Circ.lineplot(circs, genes, plotrow=i, groupindicator1=pheno$condition,
circle_description = description )
}
dev.off()
################################################
################################################
## VERSIONS FILE ##
################################################
################################################
r.version <- strsplit(version[['version.string']], ' ')[[1]][3]
writeLines(
c(
'"NFCORE_CIRCRNA:CIRCRNA:STATISTICAL_TESTS:CIRCTEST_CIRCTEST":',
paste(' r-base:', r.version),
paste(' aod:', packageVersion('aod')),
paste(' plyr:', packageVersion('plyr')),
paste(' ggplot2:', packageVersion('ggplot2'))
),
'versions.yml')
################################################
################################################
################################################
################################################
Command exit status:
1
Command output:
(empty)
Command error:
Loading required package: aod
Loading required package: plyr
Loading required package: ggplot2
Warning message:
In max(circ_sum/(circ_sum + linear_sum), na.rm = T) :
no non-missing arguments to max; returning -Inf
756 candidates processed in total
756 candidates passed the specified thresholds
There were 50 or more warnings (use warnings() to see the first 50)
Error: Must request at least one colour from a hue palette.
In addition: There were 33 warnings (use warnings() to see them)
Execution halted
If I only need the circular RNA matrix and want to perform differential analysis with DESeq2, can I directly use the 'quantification/combined/circular.tsv' file? Also, how can I convert the IDs, such as 'circ_9:127887483-127896332:-', to the IDs in the annotation CircBase database?
Hey
I've heard that the 'sponging' branch can speed up the 'miRNA_prediction' step, so I tried it out
This was true until #166 was merged. Since then, this functionality is also available on the dev
branch.
If I only need the circular RNA matrix and want to perform differential analysis with DESeq2, can I directly use the 'quantification/combined/circular.tsv' file?
DESeq2 is not made for differential expression on transcript level, but I know it is frequently used in the circRNA space. This files contain the output of psirc-quant, which is an adaption of kallisto for quantifying circRNAs. If you want to, you can use them with DESeq. The experiments.merged.rds contains a summarizedExperiment that can be used with fishpond/swish, which is made for differential expression on transcript level.
Also, how can I convert the IDs, such as 'circ_9:127887483-127896332:-', to the IDs in the annotation CircBase database?
Check out this section of the docs. The *.annotated.bed
and the *.annotated.gtf
should contain the annotations based on the files you provided via the --annotation
parameter.
Thank you for your reply.
Hey
I've heard that the 'sponging' branch can speed up the 'miRNA_prediction' step, so I tried it out
This was true until #166 was merged. Since then, this functionality is also available on the
dev
branch.If I only need the circular RNA matrix and want to perform differential analysis with DESeq2, can I directly use the 'quantification/combined/circular.tsv' file?
DESeq2 is not made for differential expression on transcript level, but I know it is frequently used in the circRNA space. This files contain the output of psirc-quant, which is an adaption of kallisto for quantifying circRNAs. If you want to, you can use them with DESeq. The experiments.merged.rds contains a summarizedExperiment that can be used with fishpond/swish, which is made for differential expression on transcript level.
Also, how can I convert the IDs, such as 'circ_9:127887483-127896332:-', to the IDs in the annotation CircBase database?
Check out this section of the docs. The
*.annotated.bed
and the*.annotated.gtf
should contain the annotations based on the files you provided via the--annotation
parameter.
Description of the bug
Here are my annotation file and bed file. Are the formats correct?
Command used and terminal output
Relevant files
No response
System information
No response