nf-core / circrna

circRNA quantification, differential expression analysis and miRNA target prediction of RNA-Seq data
https://nf-co.re/circrna
MIT License

Issues generating, and pointing to, genome reference data lead to early crashes during configuration (w/ fix(?)) #74

Closed (rreggiar closed 9 months ago)

rreggiar commented 10 months ago

Description of the bug

Intro

Thank you for the effort of putting this together -- combining many tools seems in line with the consensus of the field and is a lot of work. Unfortunately, 60cbad737a7db28ddd0399bf48d399d076ec5e3d does not work (on my system; see below) outside of the test profile. Based on other issues (#68; #70 seems potentially related as well), I believe this is a general error in how reference pointers are constructed in the configuration. For each scenario below I have uploaded the command and nextflow log in this issue where requested. This pipeline may no longer be actively maintained, but I hope to have a working solution for others in my situation.

test_full

As detailed in #68, this results in a null assignment for a process path. This is difficult to debug as a user because the error happens before the work directory is populated with anything, suggesting it's a problem in the config setup.

user data (w/ remote igenomes)

As above, it appears that the igenomes S3 sync is unsuccessful and does not map a path to the process, resulting in null paths for every process and failure (though slightly different).

user data (w/ local igenomes)

with:

aws s3 --no-sign-request --region eu-west-1 sync s3://ngi-igenomes/igenomes/Homo_sapiens/UCSC/hg38/ /reference/hg38/

I synced the igenomes directory to a local path ($REF_PATH). Paths still come up as 'null' and things break during configuration; however, genome.fa is now located and split.

user data (w/ hard-coded genome params)

(See command below.) New error: Cannot get property 'fasta' on null object, with a stdout log suggesting:

-- Check script './workflows/circrna.nf' at line: 48 or see '.nextflow.log' file for more details

The ternary param assignment in workflows/circrna.nf (starting at line 48) leads to empty param paths. This can be fixed by commenting out lines 48 - 55, which avoids the broken reassignment of genome params based on the igenomes object that 1) doesn't exist in this use case and 2) seemed to fail in the others anyway (?):

// Genome params
// params.fasta   = params.genome  ? params.genomes[ params.genome ].fasta ?: false : false
// params.gtf     = params.genome  ? params.genomes[ params.genome ].gtf ?: false : false
// params.bwa     = params.genome && params.tool.contains('ciriquant') ? params.genomes[ params.genome ].bwa ?: false : false
// params.star    = params.genome && ( params.tool.contains('circexplorer2') || params.tool.contains('dcc') || params.tool.contains('circrna_finder') ) ? params.genomes[ params.genome ].star ?: false : false
// params.bowtie  = params.genome && params.tool.contains('mapsplice') ? params.genomes[ params.genome ].bowtie ?: false : false
// params.bowtie2 = params.genome && params.tool.contains('find_circ') ? params.genomes[ params.genome ].bowtie2 ?: false : false
params.mature  = params.genome && params.module.contains('mirna_prediction') ? params.genomes[ params.genome ].mature ?: false : false
// params.species = params.genome  ? params.genomes[ params.genome ].species_id ?: false : false

conclusion

I am currently running a successful instance of the pipeline with hard-coded genome params that are not reassigned, thanks to the commented lines above.

I wanted to get this up once I knew the pipeline was working. I'm certain there is a way to meaningfully fix the param reassignment that I disabled in the fix above (e.g. have the ternary fall back to the existing param if false? skip it if igenomes_ignore==TRUE? debug the generation of the igenomes config object?), but for now I just hacked this into functional shape and will update on the overall success of the run.
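
A minimal sketch of the first two ideas in plain Groovy, not the pipeline's actual code. `resolveRef` is a hypothetical helper, the map literals stand in for Nextflow's `params`, and the igenomes path in the second example is made up. The helper keeps a value that is already set, skips igenomes when `igenomes_ignore` is true or no genome is given, and uses safe navigation so that a missing genome entry yields false instead of "Cannot get property 'fasta' on null object":

```groovy
// Hypothetical helper, not part of nf-core/circrna: resolve one reference
// param without dereferencing a missing igenomes entry.
def resolveRef(Map params, String key, boolean needed = true) {
    // Keep a value the user already supplied (e.g. --fasta on the command line).
    if (params[key]) return params[key]
    // Nothing to look up if the selected tools do not need this reference,
    // igenomes is ignored, or no genome was requested.
    if (!needed || params.igenomes_ignore || !params.genome) return false
    // Safe navigation (?.) returns null instead of throwing when
    // params.genomes or the genome entry is missing.
    return params.genomes?.get(params.genome)?.get(key) ?: false
}

// Stand-in for the hard-coded run: igenomes ignored, so params.genomes is absent.
def hardCoded = [genome: 'hg38', igenomes_ignore: true,
                 fasta: '/reference/hg38/Sequence/WholeGenomeFasta/genome.fa']
assert resolveRef(hardCoded, 'fasta') == '/reference/hg38/Sequence/WholeGenomeFasta/genome.fa'
assert resolveRef(hardCoded, 'gtf') == false

// Stand-in for an igenomes run (illustrative path only).
def viaIgenomes = [genome: 'hg38', genomes: [hg38: [fasta: '/igenomes/hg38/genome.fa']]]
assert resolveRef(viaIgenomes, 'fasta') == '/igenomes/hg38/genome.fa'
assert resolveRef(viaIgenomes, 'bwa', false) == false
```

Under this approach each ternary would become a call such as `resolveRef(params, 'bwa', params.tool.contains('ciriquant'))`. How the result should be wired back into the workflow is untested here and would need checking against how Nextflow handles params that were already set on the command line.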

Command used and terminal output

# test_full
nextflow run $OUTPUT_PATH/nf-core-circrna_dev/dev \
    -profile test_full,singularity \
    --input "$OUTPUT_PATH/data/samplesheet.csv" \
    --outdir "$OUTPUT_PATH/data/results/" \
    --module "circrna_discovery" \
    --tool 'ciriquant,circexplorer2,find_circ,circrna_finder' \
    --bsj_reads 2
# errors (see testFull.nextflow.log):
ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_CIRCRNA:CIRCRNA:PREPARE_GENOME:HISAT2_EXTRACTSPLICESITES'

Caused by:
  Not a valid path value: 'null'

#######################

# user data, remote igenome
nextflow run $OUTPUT_PATH/nf-core-circrna_dev/dev \
    -profile singularity \
    --input "$OUTPUT_PATH/data/samplesheet.csv" \
    --outdir "$OUTPUT_PATH/data/results/" \
    --genome "hg38" \
    --module "circrna_discovery" \
    --tool 'ciriquant,circexplorer2,find_circ,circrna_finder' \
    --bsj_reads 2
# errors (see iGenome.nextflow.log):
ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_CIRCRNA:CIRCRNA:CIRCRNA_DISCOVERY:CIRCEXPLORER2_REF'
# plus a bunch more 'null' paths

Caused by:
  Not a valid path value: 'null'

#######################

# user data, local igenome
nextflow run $OUTPUT_PATH/nf-core-circrna_dev/dev \
    -profile singularity \
    --input "$OUTPUT_PATH/data/samplesheet.csv" \
    --outdir "$OUTPUT_PATH/data/results/" \
    --genome "hg38" \
    --igenomes_base "$REF_PATH" \
    --module "circrna_discovery" \
    --tool 'ciriquant,circexplorer2,find_circ,circrna_finder' \
    --bsj_reads 2
# errors (see localGenome.nextflow.log):
ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_CIRCRNA:CIRCRNA:CIRCRNA_DISCOVERY:MAPSPLICE_REFERENCE'
...
Sep-21 12:23:00.324 [Actor Thread 21] INFO  nextflow.Session - Execution cancelled -- Finishing pending tasks before exit
Sep-21 12:23:00.331 [Actor Thread 9] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=NFCORE_CIRCRNA:CIRCRNA:INPUT_CHECK:SAMPLESHEET_CHECK; work-dir=null
  error [java.lang.InterruptedException]: java.lang.InterruptedException
Sep-21 12:23:00.341 [Actor Thread 2] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=NFCORE_CIRCRNA:CIRCRNA:PREPARE_GENOME:BOWTIE2_BUILD; work-dir=null
  error [java.lang.InterruptedException]: java.lang.InterruptedException
Sep-21 12:23:00.341 [Actor Thread 10] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=NFCORE_CIRCRNA:CIRCRNA:PREPARE_GENOME:BWA_INDEX; work-dir=null
  error [java.lang.InterruptedException]: java.lang.InterruptedException
Sep-21 12:23:00.353 [Actor Thread 10] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=NFCORE_CIRCRNA:CIRCRNA:CIRCRNA_DISCOVERY:STAR_1ST_PASS; work-dir=null
  error [java.lang.InterruptedException]: java.lang.InterruptedException

#######################

# user data, hard coded 
nextflow run $OUTPUT_PATH/nf-core-circrna_dev/dev \
    -profile singularity \
    --input "$OUTPUT_PATH/data/samplesheet.csv" \
    --outdir "$OUTPUT_PATH/data/results/" \
    --genome "hg38" \
    --igenomes_ignore "true" \
    --fasta "$REF_PATH/Sequence/WholeGenomeFasta/genome.fa" \
    --bowtie2 "$REF_PATH/Sequence/Bowtie2Index/" \
    --bowtie "$REF_PATH/Sequence/BowtieIndex/" \
    --bwa "$REF_PATH/Sequence/BWAIndex/version0.6.0/" \
    --star "$REF_PATH/Sequence/STARIndex/" \
    --gtf "$REF_PATH/Annotation/Genes/genes.gtf" \
    --species "hsa" \
    --module "circrna_discovery" \
    --tool 'ciriquant,circexplorer2,find_circ,circrna_finder' \
    --bsj_reads 2
# errors (see hardGenome.nextflow.log):
nextflow.Session - Session aborted -- Cause: Cannot get property 'fasta' on null object

Relevant files

hardGenome.nextflow.log iGenome.nextflow.log localGenome.nextflow.log testFull.nextflow.log

System information

nictru commented 10 months ago

Hey, I just started working on this pipeline today, but I have great interest in making it work properly. I had a look at #68 and got it to work by updating the igenomes config. I opened a corresponding PR, #77.

However, since I am not a maintainer of this repository, the changes I made are only available on the update-igenomes branch so far. You would do me a great favor if you could check whether the changes also solve your problem (using the integrated igenomes).

rreggiar commented 9 months ago

This run isn't complete yet, but this appears to have solved the reference failures. Was the config missing indentation? Looking through the PR, it seems that was the major change.

nictru commented 9 months ago

The indentation was more a byproduct of updating the igenomes config - nextflow is not indentation sensitive. The main reason I did this was that the R64-1-1 genome used in the test_full configuration was not fully supported by the old version. Depending on how exotic the genome you are using is, this may or may not have affected you.

Another thing I encountered is that the gtf config, when set via igenomes, was overwritten by the pipeline's default null value. Fixed this here.

In case you encounter any further problems feel free to contact me again :)