nf-core / circrna

circRNA quantification, differential expression analysis and miRNA target prediction of RNA-Seq data
https://nf-co.re/circrna
MIT License

ERROR ~ Error executing process > 'NFCORE_CIRCRNA:CIRCRNA:CIRCRNA_DISCOVERY:ANNOTATION' #150

Open ZabalaAitor opened 2 weeks ago

ZabalaAitor commented 2 weeks ago

Description of the bug

Hello,

I am trying to run nf-core/circRNA on sncRNA samples, and I encountered an error during the annotation part for some of the samples. I noticed that the samples with errors have an empty intersect.bed file.

I am wondering what information is supposed to be in the intersect.bed file and what biological reasons could cause it to be empty.

Thank you very much,

Aitor Zabala

Command used and terminal output

nextflow run nf-core/circRNA \
    -r dev \
    -profile apptainer \
    --input /data/azabala/NIM_005/samplesheet.csv \
    --phenotype /data/azabala/NIM_005/phenotype.csv \
    --module circrna_discovery,mirna_prediction \
    --outdir /scratch/azabala/sncRNA/results_circRNA \
    --tool 'circrna_finder' \
    --max_cpus 36 \
    --max_memory 512GB \
    -w /scratch/azabala/work_sncRNA_circRNA \
    --genome GRCh38 \
    --save_reference false \
    -resume

...............................

Caused by:
  Process `NFCORE_CIRCRNA:CIRCRNA:CIRCRNA_DISCOVERY:ANNOTATION (HC19)` terminated with an error exit status (1)

Command executed:

  annotation.py --input HC19.intersect.bed --exon_boundary 200 --output HC19.annotation.bed

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_CIRCRNA:CIRCRNA:CIRCRNA_DISCOVERY:ANNOTATION":
      python: $(python --version | sed 's/Python //g')
      pandas: $(python -c "import pandas; print(pandas.__version__)")
      numpy: $(python -c "import numpy; print(numpy.__version__)")
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  Traceback (most recent call last):
    File "/home/azabala/.nextflow/assets/nf-core/circRNA/bin/annotation.py", line 55, in <module>
      df = df.groupby(['chr', 'start', 'end', 'strand']).aggregate({
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.11/site-packages/pandas/core/groupby/generic.py", line 894, in aggregate
      result = op.agg()
               ^^^^^^^^
    File "/usr/local/lib/python3.11/site-packages/pandas/core/apply.py", line 169, in agg
      return self.agg_dict_like()
             ^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.11/site-packages/pandas/core/apply.py", line 478, in agg_dict_like
      arg = self.normalize_dictlike_arg("agg", selected_obj, arg)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.11/site-packages/pandas/core/apply.py", line 601, in normalize_dictlike_arg
      raise KeyError(f"Column(s) {cols_sorted} do not exist")
  KeyError: "Column(s) ['gene_id', 'transcript_id'] do not exist"

Work dir:
  /scratch/azabala/work_sncRNA_circRNA/83/3b958d1d7194efaa23a82450c6e7f5

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

 -- Check '.nextflow.log' file for details

Relevant files

No response

System information

Nextflow: 23.04.2
Hardware: HPC
Executor: slurm
Container: Apptainer
OS: Linux
nf-core/circrna: dev

nictru commented 2 weeks ago

Hey, this happens if the GTF file does not meet the pipeline's expectations. In this case, the gene_id and transcript_id fields in the attributes column are missing. Please make sure to use an appropriate GTF file. Also, the pipeline version seems to be a bit outdated - please update using nextflow pull nf-core/circrna

ZabalaAitor commented 2 weeks ago

Hey,

I used the default GTF file provided by iGenomes, which I believe should have the correct format. Regarding the pipeline, I did update it using nextflow pull nf-core/circrna, but it's possible that the update didn't complete properly due to issues with the HPC environment. I'll look into it to ensure the pipeline is fully updated.

Thanks,

nictru commented 2 weeks ago

I am sure the GTF will have the correct format; otherwise, errors will look different. The problem occurs because the GTF contains regions on sequences not present in the FASTA file.

This problem will also occur on the latest pipeline version, as I have not yet had time to fix it - this was just a side note.

EDIT: This message was a mixup - forget about it

ZabalaAitor commented 2 weeks ago

The FASTA file is also provided by iGenomes...

nictru commented 2 weeks ago

Oh I'm sorry, I got mixed up between two issues. This issue does not have anything to do with the FASTA file. The one with the FASTA file compatibility problems is #151.

Still, the error you encounter is due to missing gene_id and transcript_id entries in the GTF file. nf-core also discourages the usage of iGenomes as stated here. Maybe look inside the GTF file and see for yourself, but I can also add a check to the pipeline, which will give a user-friendly message if this happens again. To fix this I can recommend reference data from here.
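A minimal sketch of what looking inside the GTF file could amount to, assuming a standard nine-column, tab-separated GTF whose ninth column holds `key "value";` pairs. The function name `records_missing_attributes` is hypothetical and is not part of nf-core/circrna; it only reports records that lack gene_id or transcript_id.

```python
# Hypothetical helper (not part of nf-core/circrna): list GTF records whose
# attributes column lacks gene_id or transcript_id.
REQUIRED = ("gene_id", "transcript_id")


def records_missing_attributes(lines, required=REQUIRED):
    """Return (line_number, missing_keys) for each offending GTF record."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            # No attributes column at all, so every required key is missing.
            problems.append((number, list(required)))
            continue
        if fields[2] == "gene":
            continue  # gene-level records normally carry no transcript_id
        # Each attribute looks like: key "value"; take the key of every pair.
        keys = set()
        for part in fields[8].split(";"):
            part = part.strip()
            if part:
                keys.add(part.split(None, 1)[0])
        missing = [key for key in required if key not in keys]
        if missing:
            problems.append((number, missing))
    return problems
```

Passing an open file handle, for example `records_missing_attributes(open("genes.gtf"))` with a placeholder file name, returns an empty list when every transcript-level record carries both attributes.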

ZabalaAitor commented 1 week ago

I tried using another GTF file and encountered an error while running CIRIquant because it is unable to find the GTF file, whereas other tools, such as circRNA_finder, can find it.

I have written about the issue in #155. Please feel free to delete or close that entry if you prefer to resolve the issue here.

Thank you very much for your time and assistance.

ZabalaAitor commented 6 days ago

This error persists despite using different GTF files. Could it be because there are no circRNAs in those samples?

nictru commented 6 days ago

You are absolutely right, this can also occur if no circRNAs are found. I should have thought about this earlier. You can confirm this is the case by switching to /scratch/azabala/work_sncRNA_circRNA/83/3b958d1d7194efaa23a82450c6e7f5 and investigating the GTF file there.

If it is really the case, I will implement a clear error message pointing this out for future users.
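As a rough sketch of what such a message could look like, assuming the check runs on the intersect.bed lines before any grouping takes place: the function name `require_bed_records` and the wording of the message are hypothetical, and this is not the fix that was actually implemented in the pipeline.

```python
# Hypothetical guard (not the pipeline's actual code): fail early with a
# descriptive message when the intersect.bed input contains no records.
def require_bed_records(lines, sample):
    """Return the number of BED records, or raise if there are none."""
    count = sum(
        1 for line in lines
        if line.strip() and not line.startswith("#")  # ignore blanks/comments
    )
    if count == 0:
        raise ValueError(
            f"intersect.bed for sample {sample} contains no records; "
            "possibly no circRNAs were detected, so there is nothing to annotate."
        )
    return count
```

Called with an open file handle and the sample name (for example `HC19`), this would replace the KeyError from the traceback above with a message that names the sample and the likely cause.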

ZabalaAitor commented 5 days ago

I cannot find the GTF file in that directory, but the intersect.bed file is empty.

nictru commented 5 days ago

Yes okay, this is the reason then. Is the data you used confidential? If not, I would like to use it as test data for coming up with a clean solution.