nf-core / circrna

circRNA quantification, differential expression analysis and miRNA target prediction of RNA-Seq data
https://nf-co.re/circrna
MIT License

STAR, CIRIquant, DCC errors during pipeline run #89

Closed dmgie closed 4 months ago

dmgie commented 8 months ago

Description of the bug

Hiya, thank you for the work on the pipeline! When I run the pipeline on my own (paired-end) data, it fails and exits at several steps. With the test profile (i.e. nextflow run nf-core/circrna -c ./hpc.config -profile test,singularity -r dev -ansi-log false -resume) it works fine and the pipeline completes.

The first issue concerns STAR. With the genome: GRCh37 parameter, as I understand it, the pipeline obtains the necessary files/indices from iGenomes. When it reaches the mapping step prior to DCC, it fails due to a genome/STAR version incompatibility (STAR output below). The image used for this step seems to contain STAR version 2.7.10a, whereas the error message asks for a genome generated with 2.7.4a, so the image might need to be downgraded to an older STAR version? [*1]

Alternatively, I saw that I can provide my own fasta/gtf (and the required species) parameters, so I tried that with the files from Ensembl (https://grch37.ensembl.org/Homo_sapiens/Info/Index). This got further, but DCC's execution fails with ValueError: invalid literal for int() with base 10: '4"' (more details below). From what I have found so far, the GTF doesn't get parsed correctly by the Circ_nonCirc_Exon_Match.py functions of DCC/circtools. Installing circtools myself and running circtools detect/DCC with the same files seems to work fine.

I ran into another error when adding ciriquant as a tool: it errored out with CIRIquant.utils.PipelineError: Empty hisat2 bam generated, please re-run CIRIquant with -v and check the fastq and hisat2-index. Re-running this via bash .command.run results in the same error. Launching the Singularity image myself and running the commands directly, on the other hand, i.e.

singularity exec --no-home --pid -B <path_to_folder>/nf-core-testing <path_to_folder>/nf-core-testing/tmp/depot.galaxyproject.org-singularity-ciriquant-1.1.2--pyhdfd78af_2.img bash <path_to_folder>/nf-core-testing/work/7b/c6590863cfa52ce00059592e7f0d89/.command.sh

works fine and runs.

I have copied the errors to the box below. The command that produced the errors is: nextflow run nf-core/circrna -c ./hpc.config -params-file ./params.yaml -profile singularity -r dev -ansi-log false -resume. Do let me know if there is anything I can help with.

On a side note: a comment in the targetscan_format.sh script says Subset mature.fa according to the species provided by user to '--genome', but from briefly looking around I wasn't able to find where this happens in the pipeline.

[*1] I tried a custom image with a downgraded STAR version, but still get the same error:

  EXITING because of FATAL ERROR: Genome version: 20201 is INCOMPATIBLE with running STAR version: 2.7.4a
  SOLUTION: please re-generate genome from scratch with running version of STAR, or with version: 2.7.4a

Command used and terminal output

STAR

    EXITING because of FATAL ERROR: Genome version: 20201 is INCOMPATIBLE with running STAR version: 2.7.10a
    SOLUTION: please re-generate genome from scratch with running version of STAR, or with version: 2.7.4a

or the full output

    ERROR ~ Error executing process > 'NFCORE_CIRCRNA:CIRCRNA:CIRCRNA_DISCOVERY:DCC_MATE2_1ST_PASS (sample1)'

    Caused by:
      Process `NFCORE_CIRCRNA:CIRCRNA:CIRCRNA_DISCOVERY:DCC_MATE2_1ST_PASS (sample1)` terminated with an error exit status (105)

    Command executed:

      STAR \
          --genomeDir STARIndex \
          --readFilesIn input1/sample1_2_val_2.fq.gz  \
          --runThreadN 12 \
          --outFileNamePrefix sample1_mate2. \
          --outSAMtype BAM Unsorted \
           \
          --outSAMattrRGline 'ID:sample1_mate2'  'SM:sample1_mate2'  \
          --chimOutType Junctions WithinBAM --outSAMunmapped Within --outFilterType BySJout --outReadsUnmapped None --readFilesCommand zcat --alignSJDBoverhangMin 10 --chimJunctionOverhangMin 10 --chimSegmentMin 10

      if [ -f sample1_mate2.Unmapped.out.mate1 ]; then
          mv sample1_mate2.Unmapped.out.mate1 sample1_mate2.unmapped_1.fastq
          gzip sample1_mate2.unmapped_1.fastq
      fi
      if [ -f sample1_mate2.Unmapped.out.mate2 ]; then
          mv sample1_mate2.Unmapped.out.mate2 sample1_mate2.unmapped_2.fastq
          gzip sample1_mate2.unmapped_2.fastq
      fi

      cat <<-END_VERSIONS > versions.yml
      "NFCORE_CIRCRNA:CIRCRNA:CIRCRNA_DISCOVERY:DCC_MATE2_1ST_PASS":
          star: $(STAR --version | sed -e "s/STAR_//g")
          samtools: $(echo $(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*$//')
          gawk: $(echo $(gawk --version 2>&1) | sed 's/^.*GNU Awk //; s/, .*$//')
      END_VERSIONS

    Command exit status:
      105

    Command output:
            STAR --genomeDir STARIndex --readFilesIn input1/sample1_2_val_2.fq.gz --runThreadN 12 --outFileNamePrefix sample1_mate2. --outSAMtype BAM Unsorted --outSAMattrRGline ID:sample1_mate2 SM:sample1_mate2 --chimOutType Junctions WithinBAM --outSAMunmapped Within --outFilterType BySJout --outReadsUnmapped None --readFilesCommand zcat --alignSJDBoverhangMin 10 --chimJunctionOverhangMin 10 --chimSegmentMin 10
            STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
      Jan 18 13:24:45 ..... started STAR run
      Jan 18 13:24:45 ..... loading genome

    Command error:
      INFO:    Environment variable SINGULARITYENV_TMPDIR is set, but APPTAINERENV_TMPDIR is preferred
      INFO:    Environment variable SINGULARITYENV_NXF_TASK_WORKDIR is set, but APPTAINERENV_NXF_TASK_WORKDIR is preferred
      INFO:    Environment variable SINGULARITYENV_NXF_DEBUG is set, but APPTAINERENV_NXF_DEBUG is preferred

      EXITING because of FATAL ERROR: Genome version: 20201 is INCOMPATIBLE with running STAR version: 2.7.10a
      SOLUTION: please re-generate genome from scratch with running version of STAR, or with version: 2.7.4a

      Jan 18 13:24:45 ...... FATAL ERROR, exiting

CIRIquant


    ERROR ~ Error executing process > 'NFCORE_CIRCRNA:CIRCRNA:CIRCRNA_DISCOVERY:CIRIQUANT (sample1)'

    Caused by:
      Process `NFCORE_CIRCRNA:CIRCRNA:CIRCRNA_DISCOVERY:CIRIQUANT (sample1)` terminated with an error exit status (1)

    Command executed:

      CIRIquant \
          -t 36 \
          -1 sample1_1_val_1.fq.gz \
          -2 sample1_2_val_2.fq.gz \
          --config travis.yml \
          --no-gene \
          -o sample1 \
          -p sample1

      cat <<-END_VERSIONS > versions.yml
      "NFCORE_CIRCRNA:CIRCRNA:CIRCRNA_DISCOVERY:CIRIQUANT":
          bwa: $(echo $(bwa 2>&1) | sed 's/^.*Version: //; s/Contact:.*$//')
          ciriquant : $(echo $(CIRIquant --version 2>&1) | sed 's/CIRIquant //g' )
          samtools: $(echo $(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*$//')
          stringtie: $(stringtie --version 2>&1)
          hisat2: 2.1.0
      END_VERSIONS

    Command exit status:
      1

    Command output:
      (empty)

    Command error:
      INFO:    Environment variable SINGULARITYENV_TMPDIR is set, but APPTAINERENV_TMPDIR is preferred
      INFO:    Environment variable SINGULARITYENV_NXF_TASK_WORKDIR is set, but APPTAINERENV_NXF_TASK_WORKDIR is preferred
      INFO:    Environment variable SINGULARITYENV_NXF_DEBUG is set, but APPTAINERENV_NXF_DEBUG is preferred
      [Thu 2024-01-18 13:38:50] [INFO ] Input reads: sample1_1_val_1.fq.gz,sample1_2_val_2.fq.gz
      [Thu 2024-01-18 13:38:50] [INFO ] Library type: unstranded
      [Thu 2024-01-18 13:38:50] [INFO ] Output directory: sample1, Output prefix: sample1
      [Thu 2024-01-18 13:38:50] [INFO ] Config: ciriquant Loaded
      [Thu 2024-01-18 13:38:50] [INFO ] 256 CPU cores availble, using 36
      [Thu 2024-01-18 13:38:50] [INFO ] Align RNA-seq reads to reference genome ..
      Traceback (most recent call last):
        File "/usr/local/bin/CIRIquant", line 10, in <module>
          sys.exit(main())
        File "/usr/local/lib/python2.7/site-packages/CIRIquant/main.py", line 155, in main
          hisat_bam = pipeline.align_genome(log_file, thread, reads, outdir, prefix)
        File "/usr/local/lib/python2.7/site-packages/CIRIquant/pipeline.py", line 52, in align_genome
          raise utils.PipelineError('Empty hisat2 bam generated, please re-run CIRIquant with -v and check the fastq and hisat2-index.')
      CIRIquant.utils.PipelineError: Empty hisat2 bam generated, please re-run CIRIquant with -v and check the fastq and hisat2-index.

DCC (own fasta/gtf)

    ERROR ~ Error executing process > 'NFCORE_CIRCRNA:CIRCRNA:CIRCRNA_DISCOVERY:DCC (sample1)'

    Caused by:
      Process `NFCORE_CIRCRNA:CIRCRNA:CIRCRNA_DISCOVERY:DCC (sample1)` terminated with an error exit status (1)

    Command executed:

      sed -i 's/^chr//g' Homo_sapiens.GRCh37.87.gtf

      mkdir sample1 && mv sample1.Chimeric.out.junction sample1 && printf "sample1/sample1.Chimeric.out.junction" > samplesheet
      mkdir sample1_mate1 && mv sample1_mate1.Chimeric.out.junction sample1_mate1 && printf "sample1_mate1/sample1_mate1.Chimeric.out.junction" > mate1file
      mkdir sample1_mate2 && mv sample1_mate2.Chimeric.out.junction sample1_mate2 && printf "sample1_mate2/sample1_mate2.Chimeric.out.junction" > mate2file

      DCC @samplesheet -mt1 @mate1file -mt2 @mate2file -D -an Homo_sapiens.GRCh37.87.gtf -Pi -ss -F -M -Nr 1 1 -fg -A Homo_sapiens.GRCh37.dna.primary_assembly.fa -N -T 12

      awk '{print $6}' CircCoordinates >> strand
      paste CircRNACount strand | tail -n +2 | awk -v OFS=" " '{print $1,$2,$3,$5,$4}' >> 20096b003L2_Q001H283AC.txt

      cat <<-END_VERSIONS > versions.yml
      "NFCORE_CIRCRNA:CIRCRNA:CIRCRNA_DISCOVERY:DCC":
          dcc: $(DCC --version)
      END_VERSIONS

    Command exit status:
      1

    Command output:
      Output folder ./ already exists, reusing
      DCC 0.5.0 started
      256 CPU cores available, using 12
      WARNING: non-stranded data, the strand of circRNAs guessed from the strand of host genes
      Please make sure that the read pairs have been mapped both, combined and on a per mate basis
      Collecting chimera information from mates-separate mapping
      Combining individual circRNA read counts
      Using files _tmp_DCC/tmp_circCount and _tmp_DCC/tmp_coordinates for filtering
      Filtering by read counts
      Remove ChrM
      Count CircSkip junctions

    Command error:
      INFO:    Environment variable SINGULARITYENV_TMPDIR is set, but APPTAINERENV_TMPDIR is preferred
      INFO:    Environment variable SINGULARITYENV_NXF_TASK_WORKDIR is set, but APPTAINERENV_NXF_TASK_WORKDIR is preferred
      INFO:    Environment variable SINGULARITYENV_NXF_DEBUG is set, but APPTAINERENV_NXF_DEBUG is preferred
      Traceback (most recent call last):
        File "/usr/local/bin/DCC", line 10, in <module>
          sys.exit(main())
        File "/usr/local/lib/python3.10/site-packages/DCC/main.py", line 490, in main
          CircSkipfiles = findCircSkipJunction(output_coordinates, options.tmp_dir,
        File "/usr/local/lib/python3.10/site-packages/DCC/main.py", line 679, in findCircSkipJunction
          circStartAdjacentExons, circStartAdjacentExonsIv = CCEM.findcircAdjacent(circStartExons, Custom_exon_id2Iv,
        File "/usr/local/lib/python3.10/site-packages/DCC/Circ_nonCirc_Exon_Match.py", line 281, in findcircAdjacent
          interval = Custom_exon_id2Iv[self.getAdjacent(ids, start=start)]
        File "/usr/local/lib/python3.10/site-packages/DCC/Circ_nonCirc_Exon_Match.py", line 222, in getAdjacent
          exon_number = int(custom_exon_id.split(':')[1]) - 1
      ValueError: invalid literal for int() with base 10: '4"'

Relevant files

No response

System information

Nextflow Version: 23.10.0 Hardware: HPC/Cluster Executor: Slurm Container: Singularity OS: Ubuntu nf-core/circrna version: dev

nictru commented 8 months ago

STAR

I encountered this problem before, but have not had time to fix it yet. The problem is that the iGenomes STAR index is not compatible with the STAR version used in the pipeline. A workaround is manually setting star = null, which prevents usage of the iGenomes index and thus forces the pipeline to build its own index. Opened #91 for this.
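
To see which version an existing index was built for before launching a run, the index metadata can be inspected. This is a minimal sketch, not part of the pipeline: it assumes the index directory contains a `genomeParameters.txt` file with a whitespace-separated `versionGenome` line, and `read_star_index_version` is a hypothetical helper name.

```python
from pathlib import Path
from typing import Optional


def read_star_index_version(index_dir: str) -> Optional[str]:
    """Return the versionGenome value recorded in a STAR index, or None.

    Assumption: genomeParameters.txt holds whitespace-separated key/value
    lines, and comment lines start with '#'.
    """
    params_file = Path(index_dir) / "genomeParameters.txt"
    if not params_file.is_file():
        return None
    for line in params_file.read_text().splitlines():
        if line.startswith("#"):
            continue  # skip the command-line header comments
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "versionGenome":
            return fields[1]
    return None
```

Calling `read_star_index_version("STARIndex")` on the directory from the failing command would show what the index reports. If that differs from what the running STAR accepts, rebuilding the index (for example via the star = null workaround) is probably the safer route.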

CIRIquant

This is most probably due to missing escape characters in the Nextflow scripts and will be fixed via #83.

DCC

Could potentially also be fixed via #83

What can you do now?

I expect #83 to be merged into dev within the next few days. If you want to try it sooner, you can use the caching branch of the pipeline (nextflow run -r caching ...). Keep me posted in case something works better then. If the problems (especially with DCC) persist, you could also try to fix the scripts yourself and open a PR.

MariekeVromman commented 8 months ago

Hi, just letting you know I ran with the merged bug fix #83, and also got a similar DCC error.

ValueError: invalid literal for int() with base 10: '2"'

details:

-[nf-core/circrna] Pipeline completed with errors-
ERROR ~ Error executing process > 'NFCORE_CIRCRNA:CIRCRNA:CIRCRNA_DISCOVERY:DCC (control_1)'

Caused by:
  Process `NFCORE_CIRCRNA:CIRCRNA:CIRCRNA_DISCOVERY:DCC (control_1)` terminated with an error exit status (1)

Command executed:

  sed -i 's/^chr//g' gencode.v44.chr_patch_hapl_scaff.annotation.gtf

  mkdir control_1 && mv control_1.Chimeric.out.junction control_1 && printf "control_1/control_1.Chimeric.out.junction" > samplesheet
  mkdir control_1_mate1 && mv control_1_mate1.Chimeric.out.junction control_1_mate1 && printf "control_1_mate1/control_1_mate1.Chimeric.out.junction" > mate1file
  mkdir control_1_mate2 && mv control_1_mate2.Chimeric.out.junction control_1_mate2 && printf "control_1_mate2/control_1_mate2.Chimeric.out.junction" > mate2file

  DCC @samplesheet -mt1 @mate1file -mt2 @mate2file -D -an gencode.v44.chr_patch_hapl_scaff.annotation.gtf -Pi -ss -F -M -Nr 1 1 -fg -A GRCh38.p14.genome.fa -N -T 12

  awk '{print $6}' CircCoordinates >> strand
  paste CircRNACount strand | tail -n +2 | awk -v OFS="\t" '{print $1,$2,$3,$5,$4}' >> control_1.txt

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_CIRCRNA:CIRCRNA:CIRCRNA_DISCOVERY:DCC":
      dcc: $(DCC --version)
  END_VERSIONS

Command exit status:
  1

Command output:
  Output folder ./ already exists, reusing
  DCC 0.5.0 started
  24 CPU cores available, using 12
  WARNING: non-stranded data, the strand of circRNAs guessed from the strand of host genes
  Please make sure that the read pairs have been mapped both, combined and on a per mate basis
  Collecting chimera information from mates-separate mapping
  Combining individual circRNA read counts
  Using files _tmp_DCC/tmp_circCount and _tmp_DCC/tmp_coordinates for filtering
  Filtering by read counts
  Remove ChrM
  Count CircSkip junctions
  started circRNA detection from file _tmp_DCC/control_1.Chimeric.out.junction.4ZYC4G
    => separating duplicates [_tmp_DCC/control_1.Chimeric.out.junction.4ZYC4G]
    => locating small circRNAs [_tmp_DCC/control_1.Chimeric.out.junction.4ZYC4G]
    => locating circRNAs (unstranded mode) [_tmp_DCC/control_1.Chimeric.out.junction.4ZYC4G]
    => merging circRNAs [_tmp_DCC/control_1.Chimeric.out.junction.4ZYC4G]
    => sorting circRNAs (unstranded mode) [_tmp_DCC/control_1.Chimeric.out.junction.4ZYC4G]
  finished circRNA detection from file _tmp_DCC/control_1.Chimeric.out.junction.4ZYC4G

Command error:
  Unable to find image 'quay.io/biocontainers/circtools:1.2.1--pyh7cba7a3_0' locally
  1.2.1--pyh7cba7a3_0: Pulling from biocontainers/circtools
  73349e34840e: Already exists
  acab339ca1e8: Already exists
  425fd6205dc3: Pulling fs layer
  425fd6205dc3: Download complete
  425fd6205dc3: Pull complete
  Digest: sha256:7317627874031c4c9924d40b76602662a2d400c9ee4c1c626998c287e5e7bd65
  Status: Downloaded newer image for quay.io/biocontainers/circtools:1.2.1--pyh7cba7a3_0
  Output folder ./ already exists, reusing
  DCC 0.5.0 started
  24 CPU cores available, using 12
  Traceback (most recent call last):
    File "/usr/local/bin/DCC", line 10, in <module>
  WARNING: non-stranded data, the strand of circRNAs guessed from the strand of host genes
  Please make sure that the read pairs have been mapped both, combined and on a per mate basis
  Collecting chimera information from mates-separate mapping
  Combining individual circRNA read counts
  Using files _tmp_DCC/tmp_circCount and _tmp_DCC/tmp_coordinates for filtering
  Filtering by read counts
  Remove ChrM
  Count CircSkip junctions
      sys.exit(main())
    File "/usr/local/lib/python3.10/site-packages/DCC/main.py", line 490, in main
      CircSkipfiles = findCircSkipJunction(output_coordinates, options.tmp_dir,
    File "/usr/local/lib/python3.10/site-packages/DCC/main.py", line 679, in findCircSkipJunction
      circStartAdjacentExons, circStartAdjacentExonsIv = CCEM.findcircAdjacent(circStartExons, Custom_exon_id2Iv,
    File "/usr/local/lib/python3.10/site-packages/DCC/Circ_nonCirc_Exon_Match.py", line 281, in findcircAdjacent
      interval = Custom_exon_id2Iv[self.getAdjacent(ids, start=start)]
    File "/usr/local/lib/python3.10/site-packages/DCC/Circ_nonCirc_Exon_Match.py", line 222, in getAdjacent
      exon_number = int(custom_exon_id.split(':')[1]) - 1
  ValueError: invalid literal for int() with base 10: '2"'
  started circRNA detection from file _tmp_DCC/control_1.Chimeric.out.junction.4ZYC4G
    => separating duplicates [_tmp_DCC/control_1.Chimeric.out.junction.4ZYC4G]
    => locating small circRNAs [_tmp_DCC/control_1.Chimeric.out.junction.4ZYC4G]
    => locating circRNAs (unstranded mode) [_tmp_DCC/control_1.Chimeric.out.junction.4ZYC4G]
    => merging circRNAs [_tmp_DCC/control_1.Chimeric.out.junction.4ZYC4G]
    => sorting circRNAs (unstranded mode) [_tmp_DCC/control_1.Chimeric.out.junction.4ZYC4G]
  finished circRNA detection from file _tmp_DCC/control_1.Chimeric.out.junction.4ZYC4G

Work dir:
  /Users/marieke/Documents/work/66/1152ceffd34d8324ab17adcbacdcf6

nictru commented 5 months ago

Hey @dmgie, could you check whether your errors still occur with the latest version of the pipeline? Some PRs were merged in the meantime and I am not sure if they also addressed your problems.

The DCC issue looks like there is an additional " somewhere; we might be able to strip it out.
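
The failure can be reproduced outside the pipeline: the traceback shows `int(custom_exon_id.split(':')[1])` receiving `4"`. Below is a minimal sketch of a more tolerant parse. It assumes the ID has the form `<name>:<exon_number>` and that the only contamination is stray quote characters. `parse_exon_index` and the example ID are hypothetical and not part of DCC.

```python
def parse_exon_index(custom_exon_id: str) -> int:
    """Return the zero-based exon index from an ID such as 'ENST0001:4'.

    Mirrors the failing DCC expression int(custom_exon_id.split(':')[1]) - 1,
    but strips surrounding whitespace and stray quotes first
    (assumed to be the only contamination).
    """
    raw_number = custom_exon_id.split(":")[1]
    cleaned = raw_number.strip().strip("\"'")
    return int(cleaned) - 1


# The plain int() call fails on the value from the traceback.
try:
    int('4"')
except ValueError as err:
    print(f"reproduced: {err}")

# Hypothetical ID with a trailing quote, parsed after cleaning.
print(parse_exon_index('ENST0001:4"'))
```

Cleaning the GTF attribute before it reaches DCC would be an alternative. Which one fits depends on where the extra quote is introduced, and this thread does not establish that.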

dmgie commented 4 months ago

Hi @nictru, sorry for the late reply! I would've been happy to test it, but sadly I no longer have access to the data I initially ran the pipeline with (which had led to the errors mentioned in the first post). I could possibly test it with some other data, but I'm currently not working on circRNA-related projects anymore, so I can't promise to be able to test it anytime soon. If I do, I'll report back here in case there are any changes. Would you like me to close the issue in the meantime, or should it be left open?

nictru commented 4 months ago

Hey, no problem - I will close the issue in this case, feel free to open a new one if you encounter new problems some time in the future :)