nf-core / sarek

Analysis pipeline to detect germline or somatic variants (pre-processing, variant calling and annotation) from WGS / targeted sequencing
https://nf-co.re/sarek
MIT License
384 stars 400 forks source link

`ASSESS_SIGNIFICANCE` fails with Rscript error #1239

Closed bounlu closed 1 month ago

bounlu commented 11 months ago

Description of the bug

ASSESS_SIGNIFICANCE step keeps failing at Rscript with the below error:

ERROR ~ Error executing process > 'NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_SOMATIC_ALL:BAM_VARIANT_CALLING_SOMATIC_CONTROLFREEC:ASSESS_SIGNIFICANCE (AF60_72U_vs_AF00_69U)'

Caused by:
  Process `NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_SOMATIC_ALL:BAM_VARIANT_CALLING_SOMATIC_CONTROLFREEC:ASSESS_SIGNIFICANCE (AF60_72U_vs_AF00_69U)` terminated with an error exit status (1)

Command executed:

  cat $(which assess_significance.R) | R --slave --args AF60_72U_vs_AF00_69U.tumor.mpileup.gz_CNVs AF60_72U_vs_AF00_69U.tumor.mpileup.gz_normal_CNVs AF60_72U_vs_AF00_69U.tumor.mpileup.gz_normal_ratio.txt AF60_72U_vs_AF00_69U.tumor.mpileup.gz_ratio.txt

  mv *.p.value.txt AF60_72U_vs_AF00_69U.p.value.txt

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_SOMATIC_ALL:BAM_VARIANT_CALLING_SOMATIC_CONTROLFREEC:ASSESS_SIGNIFICANCE":
      controlfreec: $(echo $(freec -version 2>&1) | sed 's/^.*Control-FREEC  //; s/:.*$//' | sed -e "s/Control-FREEC v//g" )
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  Loading required package: GenomicRanges
  Loading required package: stats4
  Loading required package: BiocGenerics
  Loading required package: parallel

  Attaching package: ‘BiocGenerics’

  The following objects are masked from ‘package:parallel’:

      clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
      clusterExport, clusterMap, parApply, parCapply, parLapply,
      parLapplyLB, parRapply, parSapply, parSapplyLB

  The following objects are masked from ‘package:stats’:

      IQR, mad, sd, var, xtabs

  The following objects are masked from ‘package:base’:

      anyDuplicated, append, as.data.frame, basename, cbind, colnames,
      dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
      grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
      order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
      rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
      union, unique, unsplit, which.max, which.min

  Loading required package: S4Vectors

  Attaching package: ‘S4Vectors’

  The following object is masked from ‘package:base’:

      expand.grid

  Loading required package: IRanges
  Loading required package: GenomeInfoDb
  Error in `$<-.data.frame`(`*tmp*`, Ratio, value = logical(0)) : 
    replacement has 0 rows, data has 701
  Calls: $<- -> $<-.data.frame
  Execution halted

Work dir:
  /nextflow/sarek/work/bc/7f398053d6401db4e9aa9a25eca18c

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details

Command used and terminal output

nextflow run nf-core/sarek \
-latest \
-profile docker \
--step mapping \
--tools freebayes,mpileup,mutect2,strelka,manta,tiddit,ascat,cnvkit,controlfreec,msisensorpro \
--input 'samplesheet_sarek.csv' \
--outdir '/nextflow/sarek/results/' \
-work-dir '/nextflow/sarek/work/' \
-c 'custom.config' \
-r master \
-resume

Relevant files

No response

System information

N E X T F L O W ~ version 23.08.1-edge local Docker Linux nf-core/sarek master v3.3.0

FriederikeHanssen commented 11 months ago

urgh controlfreec. I just updated the containers. You could check if the new version fixes it by adding this to a custom config:

process {
     withName: ASSESS_SIGNIFICANCE {
        container = 'biocontainers/control-freec:11.6b--hdbdd923_0'
       }
}
nschcolnicov commented 11 months ago

Hi! I Just got this same error, it doesn't appear to be related to the new container update, since I was able to reproduce with both the old and new container. @bounlu could you please share the nextflow.log file?

Looking at the .command.sh in my run, I saw that the script is being called with 4 arguments instead of 2, as the CONTROLFREEC_ASSESSSIGNIFICANCE process expects:

.command.sh:

cat $(which assess_significance.R) | R --slave --args dataset1_vs_control01.tumor.mpileup.gz_CNVs dataset1_vs_control01.tumor.mpileup.gz_normal_CNVs dataset1_vs_control01.tumor.mpileup.gz_normal_ratio.txt dataset1_vs_control01.tumor.mpileup.gz_ratio.txt

mv *.p.value.txt dataset1_vs_control01.p.value.txt

cat <<-END_VERSIONS > versions.yml
"NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_SOMATIC_ALL:BAM_VARIANT_CALLING_SOMATIC_CONTROLFREEC:ASSESS_SIGNIFICANCE":
    controlfreec: $(echo $(freec -version 2>&1) | sed 's/^.*Control-FREEC  //; s/:.*$//' | sed -e "s/Control-FREEC v//g" )
END_VERSIONS

So instead of running the assess_significance.R script on the cnv and ratio file, it is being run on the cnv and cnv normal files. And the assess_significance.R script fails in the line 13: ratio$Ratio[which(ratio$Ratio==-1)]=NA Because the ratio dataframe doesn't contain any columns labeled "Ratio"

In my case I'm getting two files out of the FREEC_SOMATIC module associated to the same meta dict for both CNV and RATIO. Looking at the way that the input tuple is generated, in the BAM_VARIANT_CALLING_SOMATIC_CONTROLFREEC subworkflow, in line 30. In my run this is what the FREEC_SOMATIC.out.CNV and the FREEC_SOMATIC.out.ratio contained respectively:

FREEC_SOMATIC.out.CNV.view()
[[id:dataset1_vs_control01, normal_id:control01, patient:test3, sex:XX, tumor_id:dataset1, variantcaller:samtools], [tmpdir/dataset1_vs_control01.tumor.mpileup.gz_CNVs, tmpdir/dataset1_vs_control01.tumor.mpileup.gz_normal_CNVs]]

FREEC_SOMATIC.out.ratio.view()
[[id:dataset1_vs_control01, normal_id:control01, patient:test3, sex:XX, tumor_id:dataset1, variantcaller:samtools], [tmpdir/dataset1_vs_control01.tumor.mpileup.gz_normal_ratio.txt, tmpdir/dataset1_vs_control01.tumor.mpileup.gz_ratio.txt]]

FREEC_SOMATIC.out.CNV.join(FREEC_SOMATIC.out.ratio, failOnDuplicate: true, failOnMismatch: true).view()
[[id:dataset1_vs_control01, normal_id:control01, patient:test3, sex:XX, tumor_id:dataset1, variantcaller:samtools], [/scratch/nextflow/work/schcolni/7f/3b1037262fd54a07b5a24d5df8d559/dataset1_vs_control01.tumor.mpileup.gz_CNVs, /scratch/nextflow/work/schcolni/7f/3b1037262fd54a07b5a24d5df8d559/dataset1_vs_control01.tumor.mpileup.gz_normal_CNVs], [/scratch/nextflow/work/schcolni/7f/3b1037262fd54a07b5a24d5df8d559/dataset1_vs_control01.tumor.mpileup.gz_normal_ratio.txt, /scratch/nextflow/work/schcolni/7f/3b1037262fd54a07b5a24d5df8d559/dataset1_vs_control01.tumor.mpileup.gz_ratio.txt]]

I'm not sure what is the expected behavior of this module. @FriederikeHanssen, should it run on both the normal and non-normal files, or if it should only run on the non-normal files?

FriederikeHanssen commented 11 months ago

Controlfreec should run on tumor-only and paired data.

bounlu commented 11 months ago

@FriederikeHanssen Thanks for the updated container, I will also test with that and let you know.

@nschcolnicov My .command.sh also has 4 paramaters, and my data is paired.

nschcolnicov commented 11 months ago

Hi @bounlu @FriederikeHanssen I kept researching this issue, and found this: When running the somatic tumor pair test dataset provided by sarek using this command line: nextflow run ../../main.nf -profile bi,aws,test_cache,tools_somatic --tools strelka,controlfreec --outdir ./results the "*gz_normal_CNVs" and the "_normal_ratio.txt" files are not generated, we only find the "gz_CNVs" and the "*gz_ratio.txt". But I'm not sure if this depends on the dataset.

I'm still not sure whether if we should run the assess_significance module on the two file pairs of cnv and ratio, those with the "normal" label and those without the "normal" label; or, if we should only run it on the file pairs that don't have the "normal" label. So I implemented a solution for the first case, since we can just ignore the results for the second pair if they are not relevant.

I implemented the following solution in my local version of the pipeline: Edited the bam_variant_calling_somatic_controlfreec subworkflow main file By replacing the lines between line 27 and line 30, inclusive, with these lines:

    assess_significance_input = Channel.empty()

    FREEC_SOMATIC(controlfreec_input, fasta, fasta_fai, [], dbsnp, dbsnp_tbi, chr_files, mappability, intervals_bed, [])

    //Filter the files that come out of freec somatic as ASSESS_SIGNIFICANCE only takes one cnv and one ratio file, there should be one normal and one non normal pair?
    //Creates empty channel if file is missing
    cnv_files = FREEC_SOMATIC.out.CNV
    .multiMap{ meta, cnv ->
        def meta_clone_tumor = meta.clone()
        def meta_clone_normal = meta.clone()
        meta_clone_tumor.id = meta.id + "_tumor" //updating meta id so that the p.value file is named differently 
        meta_clone_normal.id = meta.id + "_normal" //updating meta id so that the p.value file is named differently 

        def tumor_file = cnv instanceof List ? cnv.find { it.toString().endsWith("gz_CNVs") } : cnv //only find if its a list, else it returns only the filename without the path
        def normal_file = cnv instanceof List ? cnv.find { it.toString().endsWith("gz_normal_CNVs") } : null //only find if its a list, else it returns only the filename without the path

        normal: normal_file == null ? [] : [meta_clone_normal,normal_file] //only fill channel if file was found, else leave it empty
        tumor: tumor_file == null ? [] : [meta_clone_tumor,tumor_file] //only fill channel if file was found, else leave it empty
    }

    ratio_files = FREEC_SOMATIC.out.ratio
    .multiMap{ meta, ratio ->
        def meta_clone_tumor = meta.clone()
        def meta_clone_normal = meta.clone()
        meta_clone_tumor.id = meta.id + "_tumor" //updating meta id so that the p.value file is named differently 
        meta_clone_normal.id = meta.id + "_normal" //updating meta id so that the p.value file is named differently 

        def tumor_file = ratio instanceof List ? ratio.find { it.toString().endsWith("gz_ratio.txt") } : ratio //same here as cnv
        def normal_file = ratio instanceof List ? ratio.find { it.toString().endsWith("gz_normal_ratio.txt") } : null //same here as cnv

        normal: normal_file == null ? [] : [meta_clone_normal,normal_file] //same here as ratio
        tumor: tumor_file == null ? [] : [meta_clone_tumor,tumor_file] //same here as ratio
    }

    //Join the pairs
    normal_files = cnv_files.normal.join(ratio_files.normal, failOnDuplicate: true, failOnMismatch: true)
    tumor_files = cnv_files.tumor.join(ratio_files.tumor, failOnDuplicate: true, failOnMismatch: true)

    //Mix all the pairs into input channel
    assess_significance_input = assess_significance_input.mix(tumor_files)
    assess_significance_input = assess_significance_input.mix(normal_files)

    ASSESS_SIGNIFICANCE(assess_significance_input)

This is the full version of the bam_variant_calling_somatic_controlfreec subworkflow main file, you can just replace your with this: bam_variant_calling_somatic_controlfreec_main.txt

This is how the channel "assess_significance_input" looks like, so the module runs first on one pair, and then on the other:

[[id:dataset1_vs_control01, normal_id:control01, patient:test3, sex:XX, tumor_id:dataset1, variantcaller:samtools], tmpdir/dataset1_vs_control01.tumor.mpileup.gz_normal_CNVs, tmpdir/dataset1_vs_control01.tumor.mpileup.gz_normal_ratio.txt]
[[id:dataset1_vs_control01, normal_id:control01, patient:test3, sex:XX, tumor_id:dataset1, variantcaller:samtools], tmpdir/dataset1_vs_control01.tumor.mpileup.gz_CNVs, tmpdir/dataset1_vs_control01.tumor.mpileup.gz_ratio.txt]

I tested this implementation on runs that generate the "normal" file pairs, and runs that don't generate the "normal" files, and they both ran succesfully, the only different in output is that the runs that generate "normal" cnv and ratio files will include the p.value.txt file for them too. This is what the controlfreec folder in my results/variant_calling directory looked like:

controlfreec/
├── dataset1_vs_control01
│   ├── config.txt
│   ├── dataset1_vs_control01_BAF.png
│   ├── dataset1_vs_control01.bed
│   ├── dataset1_vs_control01.circos.txt
│   ├── dataset1_vs_control01.normal.mpileup.gz_BAF.txt
│   ├── dataset1_vs_control01.normal.mpileup.gz_control.cpn
│   ├── dataset1_vs_control01.p.value.txt
│   ├── dataset1_vs_control01_ratio.log2.png
│   ├── dataset1_vs_control01_ratio.png
│   ├── dataset1_vs_control01.tumor.mpileup.gz_BAF.txt
│   ├── dataset1_vs_control01.tumor.mpileup.gz_CNVs
│   ├── dataset1_vs_control01.tumor.mpileup.gz_info.txt
│   ├── dataset1_vs_control01.tumor.mpileup.gz_normal_CNVs
│   ├── dataset1_vs_control01.tumor.mpileup.gz_normal_ratio.BedGraph
│   ├── dataset1_vs_control01.tumor.mpileup.gz_normal_ratio.txt
│   ├── dataset1_vs_control01.tumor.mpileup.gz_ratio.BedGraph
│   ├── dataset1_vs_control01.tumor.mpileup.gz_ratio.txt
│   └── dataset1_vs_control01.tumor.mpileup.gz_sample.cpn
├── dataset1_vs_control01_normal
│   └── dataset1_vs_control01_normal.p.value.txt
└── dataset1_vs_control01_tumor
    └── dataset1_vs_control01_tumor.p.value.txt
bounlu commented 11 months ago

I confirm that I still get the same error with the updated container.

bounlu commented 11 months ago

In the new version 3.3.2, I still get error from CONTROLFREEC, this time from MAKEGRAPH process:


Execution cancelled -- Finishing pending tasks before exit
ERROR ~ Error executing process > 'NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_SOMATIC_ALL:BAM_VARIANT_CALLING_SOMATIC_CONTROLFREEC:MAKEGRAPH (XXX_7D_vs_YYY_6D)'

Caused by:
  Process `NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_SOMATIC_ALL:BAM_VARIANT_CALLING_SOMATIC_CONTROLFREEC:MAKEGRAPH (XXX_7D_vs_YYY_6D)` terminated with an error exit status (1)

Command executed:

  cat $(which makeGraph.R) | R --slave --args 2 XXX_7D_vs_YYY_6D.tumor.mpileup.gz_normal_ratio.txt XXX_7D_vs_YYY_6D.tumor.mpileup.gz_ratio.txt XXX_7D_vs_YYY_6D.normal.mpileup.gz_BTT.txt XXX_7D_vs_YYY_6D.tumor.mpileup.gz_BTT.txt

  mv *_BAF.txt.png XXX_7D_vs_YYY_6D_BAF.png
  mv *_ratio.txt.log2.png XXX_7D_vs_YYY_6D_ratio.log2.png
  mv *_ratio.txt.png XXX_7D_vs_YYY_6D_ratio.png

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_SOMATIC_ALL:BAM_VARIANT_CALLING_SOMATIC_CONTROLFREEC:MAKEGRAPH":
      controlfreec: $(echo $(freec -version 2>&1) | sed 's/^.*Control-FREEC  //; s/:.*$//' | sed -e "s/Control-FREEC v//g" )
  END_VERSIONS

Command exit status:
  1

Command output:
  null device 
            1 
  null device 
            1 

Command error:
  Warning message:
  In type.convert.default(args[4]) :
    'as.is' should be specified by the caller; using TRUE
  There were 50 or more warnings (use warnings() to see the first 50)
  null device 
            1 
  null device 
            1 
  Error in xy.coords(x, y, xlabel, ylabel, log) : 
    'x' and 'y' lengths differ
  Calls: plot -> plot.default -> xy.coords
  Execution halted

Work dir:
  /nextflow/sarek/2205/work/3f/795232eTT93c4eb7b46d3f67d55b05

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for detYYls
Join mismatch for the following entries: key=[patient:TT, sample:TTT_6D, sex:XY, status:0, id:TTT_6D, data_type:cram, variantcaller:bcftools, file_name:TTT_6D.bcftools.vcf] values=

Join mismatch for the following entries: 
- key=[patient:YY, sample:YYY_6D, sex:XX, status:0, id:YYY_6D, data_type:cram, variantcaller:bcftools, file_name:YYY_6D.bcftools.vcf] values= 
- key=[patient:TT, sample:TTT_6D, sex:XY, status:0, id:TTT_6D, data_type:cram, variantcaller:bcftools, file_name:TTT_6D.bcftools.vcf] values=

-[nf-core/sarek] Pipeline completed with errors-
WARN: Killing running tasks (17)

ERROR ~ Cannot invoke "org.iq80.leveldb.impl.Version.retYYn()" because "this.version" is null
Teezi commented 11 months ago

I am using the new version too (3.3.2), and have a similar problem as @bounlu, please find below

[91/c9321b] NOTE: Process NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_SOMATIC_ALL:BAM_VARIANT_CALLING_SOMATIC_CONTROLFREEC:ASSESS_SIGNIFICANCE (AUD220512001_tumor_vs_AUD210903021_normal) terminated with an error exit status (1) -- Execution is retried (1) ERROR ~ Error executing process > 'NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_SOMATIC_ALL:BAM_VARIANT_CALLING_SOMATIC_CONTROLFREEC:MAKEGRAPH (AUD220512001_tumor_vs_AUD210903021_normal)'

Caused by: Process NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_SOMATIC_ALL:BAM_VARIANT_CALLING_SOMATIC_CONTROLFREEC:MAKEGRAPH (AUD220512001_tumor_vs_AUD210903021_normal) terminated with an error exit status (1)

Command executed:

cat $(which makeGraph.R) | R --slave --args 2 AUD220512001_tumor_vs_AUD210903021_normal.tumor.mpileup.gz_normal_ratio.txt AUD220512001_tumor_vs_AUD210903021_normal.tumor.mpileup.gz_ratio.txt AUD220512001_tumor_vs_AUD210903021_normal.normal.mpileup.gz_BAF.txt AUD220512001_tumor_vs_AUD210903021_normal.tumor.mpileup.gz_BAF.txt

mv _BAF.txt.png AUD220512001_tumor_vs_AUD210903021_normal_BAF.png mv _ratio.txt.log2.png AUD220512001_tumor_vs_AUD210903021_normal_ratio.log2.png mv *_ratio.txt.png AUD220512001_tumor_vs_AUD210903021_normal_ratio.png

cat <<-END_VERSIONS > versions.yml "NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_SOMATIC_ALL:BAM_VARIANT_CALLING_SOMATIC_CONTROLFREEC:MAKEGRAPH": controlfreec: $(echo $(freec -version 2>&1) | sed 's/^.Control-FREEC //; s/:.$//' | sed -e "s/Control-FREEC v//g" ) END_VERSIONS

Command exit status: 1

Command output: null device 1 null device 1

Command error: Warning message: In type.convert.default(args[4]) : 'as.is' should be specified by the caller; using TRUE There were 50 or more warnings (use warnings() to see the first 50) Error in xy.coords(x, y, xlabel, ylabel, log) : 'x' and 'y' lengths differ Calls: plot -> plot.default -> xy.coords Execution halted

Work dir: /fs04/bv48/workspace-nobackup/maria/sarek/jGCT/2_variantCalling/work/48/fd8e1287759ace6bd5f2685657448d

Tip: view the complete command output by changing to the process work dir and entering the command cat .command.out

-- Check '.nextflow.log' file for details

FriederikeHanssen commented 8 months ago

Hey! Sorry for the delay. Finally had some time to dig into the ASSESS_SIGNIFICANCE error. I think there are two things:

  1. We should only use tumor_CNVS and tumor_ratio.txt, so will fix that.

and

  1. there seems to be an issue with the R script itself. When doing 1. manually I ran into the issue described here: https://github.com/BoevaLab/FREEC/issues/127, updating the script as proposed seems to work.
nschcolnicov commented 6 months ago

@bounlu @Teezi This PR fixed the issue for me, could you please check if you are still getting the error? Else we can close this issue

bounlu commented 1 month ago

PR fixed the issue, closed.