nf-core / circrna

circRNA quantification, differential expression analysis and miRNA target prediction of RNA-Seq data
https://nf-co.re/circrna
MIT License
45 stars 24 forks source link

GTF entries on sequence which is not present in FASTA #151

Closed yshen16 closed 4 months ago

yshen16 commented 5 months ago

Description of the bug

recently I tried to help a user to run the nf-core/circRNA pipeline, but met tons of errors. And this is one of the latest after I downloaded the reference genome from aws, and rerun everything from scratch, that is, I removed all the residual ~/.nextflow, ~/.singularity and all that in my working directory when I run the command. Here is the command line I ran:

module purge module load java/17.0.8 module load nextflow/23.04.3 # old 21.10.6 doesn't work

#Set the new Singularity cache directory path export SING_DIR=/projectnb/zhangh/Arshiya_Saxena/HumanBrain_circRNA/nf-core_circrna_pipeline-NAC/yun_test3

# Export the NXF_SINGULARITY_CACHEDIR environment variable to point to the new cache directory export NXF_SINGULARITY_CACHEDIR=$SING_DIR/.singularity

# Create the new Singularity cache directory if it doesn't already exist mkdir -p $NXF_SINGULARITY_CACHEDIR

# Remove the existing ~/.singularity symbolic link rm -rf ~/.singularity

# Create a new symbolic link for ~/.singularity that points to the new cache directory ln -s $NXF_SINGULARITY_CACHEDIR ~/.singularity

cd $SING_DIR

# Yun Test_rerun on 06/16/2024: nextflow run nf-core/circrna -profile singularity --genome GRCh37 --igenomes_base /projectnb/zhangh/Arshiya_Saxena/HumanBrain_circRNA/ref/references --input samples.csv --module 'circrna_discovery' --outdir test3_out -r dev

And here is my local references folder structure: ./Homo_sapiens ./Homo_sapiens/Ensembl ./Homo_sapiens/Ensembl/GRCh37 ./Homo_sapiens/Ensembl/GRCh37/Annotation ./Homo_sapiens/Ensembl/GRCh37/Annotation/SmallRNA ./Homo_sapiens/Ensembl/GRCh37/Annotation/SmallRNA/mature.fa ./Homo_sapiens/Ensembl/GRCh37/Annotation/SmallRNA/hairpin.fa ./Homo_sapiens/Ensembl/GRCh37/Annotation/Genes ./Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf ./Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.bed ./Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/refFlat.txt.gz ./Homo_sapiens/Ensembl/GRCh37/Sequence ./Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta ./Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.dict ./Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa ./Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa.index ./Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa.fai ./Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/GenomeSize.xml ./Homo_sapiens/Ensembl/GRCh37/Sequence/BowtieIndex ./Homo_sapiens/Ensembl/GRCh37/Sequence/BowtieIndex/genome.rev.2.ebwt ./Homo_sapiens/Ensembl/GRCh37/Sequence/BowtieIndex/genome.2.ebwt ./Homo_sapiens/Ensembl/GRCh37/Sequence/BowtieIndex/genome.1.ebwt ./Homo_sapiens/Ensembl/GRCh37/Sequence/BowtieIndex/genome.fa ./Homo_sapiens/Ensembl/GRCh37/Sequence/BowtieIndex/genome.3.ebwt ./Homo_sapiens/Ensembl/GRCh37/Sequence/BowtieIndex/genome.4.ebwt ./Homo_sapiens/Ensembl/GRCh37/Sequence/BowtieIndex/genome.rev.1.ebwt ./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex ./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/chrName.txt ./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/exonInfo.tab ./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/SAindex ./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/Genome ./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/geneInfo.tab ./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/sjdbList.out.tab ./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/sjdbInfo.txt ./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/transcriptInfo.tab ./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/sjdbList.fromGTF.out.tab ./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/genome.fa ./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/chrLength.txt ./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/Log.out ./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/genes.gtf ./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/genomeParameters.txt ./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/chrStart.txt ./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/SA ./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/chrNameLength.txt ./Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/exonGeTrInfo.tab ./Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index ./Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome.3.bt2 ./Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome.1.bt2 ./Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome.rev.1.bt2 ./Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome.fa ./Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome.4.bt2 ./Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome.2.bt2 ./Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome.rev.2.bt2 ./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex ./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa.pac ./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa.ann ./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa ./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0 ./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/genome.fa.pac ./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/genome.fa.ann ./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/genome.fa ./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/genome.fa.sa ./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/genome.fa.amb ./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.6.0/genome.fa.bwt ./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.5.x ./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.5.x/genome.fa.pac ./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.5.x/genome.fa.ann ./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.5.x/genome.fa ./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.5.x/genome.fa.rpac ./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.5.x/genome.fa.rsa ./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.5.x/genome.fa.sa ./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.5.x/genome.fa.amb ./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.5.x/genome.fa.rbwt ./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/version0.5.x/genome.fa.bwt ./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa.sa ./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa.amb ./Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa.bwt

and the error I met is as below:

Jun-17 17:41:27.965 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 11; name: NFCORE_CIRCRNA:CIRCRNA:QUANTIFICATION:TRANSCRIPTOME (transcriptome); status: COMPLETED; exit: 1; error: -; workDir: /projectnb/zhangh/Arshiya_Saxena/HumanBrain_circRNA/nf-core_circrna_pipeline-NAC/yun_test3/work/82/1063296a24d1d4c319a6bd85bdeb20] Jun-17 17:41:27.970 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for task: name=NFCORE_CIRCRNA:CIRCRNA:QUANTIFICATION:TRANSCRIPTOME (transcriptome); work-dir=/projectnb/zhangh/Arshiya_Saxena/HumanBrain_circRNA/nf-core_circrna_pipeline-NAC/yun_test3/work/82/1063296a24d1d4c319a6bd85bdeb20 error [nextflow.exception.ProcessFailedException]: Process NFCORE_CIRCRNA:CIRCRNA:QUANTIFICATION:TRANSCRIPTOME (transcriptome) terminated with an error exit status (1) Jun-17 17:41:27.987 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_CIRCRNA:CIRCRNA:QUANTIFICATION:TRANSCRIPTOME (transcriptome)'

Caused by: Process NFCORE_CIRCRNA:CIRCRNA:QUANTIFICATION:TRANSCRIPTOME (transcriptome) terminated with an error exit status (1)

Command executed:

gffread \ -g genome.fa \ -w transcriptome.fasta \ \ transcriptome.filtered.gtf

cat <<-END_VERSIONS > versions.yml "NFCORE_CIRCRNA:CIRCRNA:QUANTIFICATION:TRANSCRIPTOME": gffread: $(gffread --version 2>&1) END_VERSIONS

Command exit status: 1

Command output: (empty)

Command error: FASTA index file genome.fa.fai created. Warning: couldn't find fasta record for 'GL000191.1'! Error: no genomic sequence available (check -g option!).

Work dir: /projectnb/zhangh/Arshiya_Saxena/HumanBrain_circRNA/nf-core_circrna_pipeline-NAC/yun_test3/work/82/1063296a24d1d4c319a6bd85bdeb20

Looking into a big details about 'gffread' command, I think it is expecting a 'gff' file, while the pipeline provided a 'gtf' as an input file (transcriptome.filtered.gtf) and would this be the cause of the error? if so, how could this happen? is there anything missing in my command line? We used to be able to run the pipeline around Oct 2023, or maybe even this April, but now everything seems broken.

Please help. Thank you very much!!!

nictru commented 5 months ago

Hey - thanks for the issue. The error occurs in a relatively new pipeline section which probably was not there when you last executed the pipeline. It was tested a bit already, but there are always edge cases that have not been considered yet.

Breaking this down

  1. The TRANSCRIPTOME process tries to extract the sequences of the detected circRNAs into a FASTA file based on an input GTF file. This GTF file is pipeline-internal and not provided by you.
  2. The gffread command is used for this and receives the FASTA file (genome.fa) and the transcriptome GTF file (transcriptome.filtered.gtf) as input. The differences between GTF and GFF should not be relevant here, works in other test cases.
  3. The error says, Warning: couldn't find fasta record for 'GL000191.1'!. This suggests that there are entries in the GTF file that are located on a chromosome that are not present in the FASTA file. So now the question arises: How can we detect circRNAs that are not present in the provided reference genome sequence?
  4. Answer: They can't. But in the QUANTIFICATION subworkflow, a common quantification of the circular and linear transcriptome is performed (the goal is to allow unbiased correlation analyses). So before the GFFREAD step, the detected circRNAs are merged with all the linear transcript locations you provide in the input GTF file. Most likely, the problematic regions originate from the input GTF file.

TL,DR:

The GTF file you provide as input most likely contains entries that are not present in the input FASTA file.

How to fix?

This is something that the pipeline should be able to handle, and it is not complicated to implement. Can probably do this later this week. If you need this more urgently, you could manually remove the problematic entries from the GTF file.

Anything else?

As the pipeline is still under a lot of development, please make sure you use the latest version (nextflow pull nf-core/circrna). Recently, the module parameter has been deprecated (still present in your CLI call).

yshen16 commented 5 months ago

Thank you very much for taking time to write me back. Really appreciate it. I wonder if you can provide me some programable way to exclude those transcriptome entries in gtf file that are not in genome.fa? I feel there must be some tools to handle it. I'll look for such tool myself as well. And I'll look forward to hearing the good news from you that the pipeline is fixed.

nictru commented 5 months ago

Hey, could you check out the branch associated with this issue and give it a try? You can do this by adding -r 151-error-no-genomic-sequence-available-check-g-option-error-in-gffread-command-for-task-nfcore_circrnacircrnaquantificationtranscriptome-error to the nextflow run command. Sorry for the long name

yshen16 commented 5 months ago

Hey, could you check out the branch associated with this issue and give it a try? You can do this by adding -r 151-error-no-genomic-sequence-available-check-g-option-error-in-gffread-command-for-task-nfcore_circrnacircrnaquantificationtranscriptome-error to the nextflow run command. Sorry for the long name

Thank you, Nictru, I will test it out and let you know how it goes.

nictru commented 5 months ago

Any updates?

yshen16 commented 5 months ago

Sorry for the delay, and I ran the new dev version of the pipeline and here are the results:

yshen16 commented 5 months ago
  1. the test only run finished successfully: nextflow run nf-core/circrna -profile test,singularity --outdir test_afterfix -r 151-error-no-genomic-sequenc e-available-check-g-option-error-in-gffread-command-for-task-nfcore_circrnacircrnaquantificationtranscriptome- error
nictru commented 5 months ago

Okay great! Will create a PR

yshen16 commented 5 months ago

However, the actual pipeline run with the user data and configuration failed at the 'trimgalore', or more specifically, 'cutadapt' stage: error info: `ERROR: Traceback (most recent call last): File "/usr/local/lib/python3.9/site-packages/cutadapt/pipeline.py", line 626, in run raise e EOFError: Compressed file ended before the end-of-stream marker was reached

cutadapt: error: Compressed file ended before the end-of-stream marker was reached

Cutadapt terminated with exit signal: '256'. Terminating Trim Galore run, please check error message(s) to get an idea what went wrong... forDebug.zip ` I attached a zip file which contains the .command.sh and .command.err in it. The actual sequence files are too big to attach as a file here. If needed I can send it to you some other way.

nictru commented 5 months ago

Hey, could you verify that the corresponding input fastq file has not been corrupted? Maybe a download problem or a problem during compression?

yshen16 commented 4 months ago

Yes, you are right. It turned out the region2 read file was truncated. I will try to find if I can get the good sequence files and rerun the test. Thank you for pointing this out to me.