sanger-tol / genomenote

Nextflow DSL2 pipeline to generate a Genome Note, including assembly statistics, quality metrics, and Hi-C contact maps. This workflow is part of the Tree of Life production suite.
https://pipelines.tol.sanger.ac.uk/genomenote
MIT License

ERROR ~ toIndex = 2 -- Check script 'genomenote/./workflows/genomenote.nf' at line: 98 #87

Closed (mdozmorov closed 1 month ago)

mdozmorov commented 1 year ago

Description of the bug

Hi. I'm interested in trying this pipeline, but running into issues. The main one is the error immediately after running:

ERROR ~ toIndex = 2

 -- Check script 'genomenote/./workflows/genomenote.nf' at line: 98 or see '.nextflow.log' file for more details

My commands and the sample sheet are below. Thanks!

Command used and terminal output

DIRIN=/mount/home/user/data/WorkData/proj1/2023-08.HiC_test
INPUT=${DIRIN}/samplesheet_genomenote.csv
DIROUT=${DIRIN}/OUT_genomenote
GENOME=/mount/home/user/data/ExtData/UCSC/hg38/hg38.fa
nextflow run sanger-tol/genomenote --input ${INPUT} \
  --outdir ${DIROUT} \
  --fasta ${GENOME} \
  -profile singularity

Relevant files

sample,datatype,datafile
sample1,hic,/mount/home/user/data/WorkData/proj1/2023-08.HiC_test/sample1/data/aligned/merged_dedup.bam
sample2,hic,/mount/home/user/data/WorkData/proj1/2023-08.HiC_test/sample2/data/aligned/merged_dedup.bam

System information

Nextflow version 23.04.3 build 5875
HPC
local
singularity
CentOS
latest cloned sanger-tol/genomenote

muffato commented 1 year ago

Hi @mdozmorov . Thanks for the report.

I think this is because the pipeline assumes the input Fasta file is named with a dot in the part before the extension, something like ${A}.${B}.*. It then uses ${A}.${B} to name some output files.
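To make the naming assumption concrete, here is a minimal sketch of a pre-flight check, based only on the description above and not on the pipeline's actual code. The function name has_two_part_stem is hypothetical; it simply tests whether a file name has at least three dot-separated fields (A, B, and an extension).

```shell
#!/usr/bin/env bash

# Hypothetical pre-flight check (not part of genomenote): succeed only if the
# file name has at least three dot-separated fields, i.e. A.B.<extension>.
has_two_part_stem() {
    local name
    local -a fields
    name=$(basename "$1")
    # Split the file name on "." into an array.
    IFS=. read -r -a fields <<< "$name"
    [ "${#fields[@]}" -ge 3 ]
}

for fasta in hg38.fa hg38.1.fa; do
    if has_two_part_stem "$fasta"; then
        echo "$fasta: name matches A.B.<extension>"
    else
        echo "$fasta: no dot before the extension, may trigger 'toIndex = 2'"
    fi
done
```

Under this assumption, hg38.fa fails the check and hg38.1.fa passes it.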

We can improve that in the pipeline.

In the meantime, I believe the simplest workaround would be to name your input file something like hg38.1.fa. It looks a bit weird, but it should work 🤞🏼
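If renaming the shared genome file is not an option, a symlink with the suggested name may be enough. This is a sketch only: the helper link_fasta_alias and the demo paths are hypothetical, and it assumes the pipeline and the container can resolve the link (copying the file is the fallback if they cannot).

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of genomenote): expose FASTA under an alias
# named <stem>.1.<ext> inside LINK_DIR, leaving the original file untouched.
link_fasta_alias() {
    local fasta=$1 link_dir=$2
    local name stem ext
    name=$(basename "$fasta")
    stem=${name%.*}   # "hg38.fa" -> "hg38"
    ext=${name##*.}   # "hg38.fa" -> "fa"
    mkdir -p "$link_dir"
    ln -sfn "$fasta" "$link_dir/$stem.1.$ext"
    printf '%s\n' "$link_dir/$stem.1.$ext"
}

# Demo on a throwaway file so the sketch runs anywhere.
demo_dir=$(mktemp -d)
printf '>chr_demo\nACGT\n' > "$demo_dir/hg38.fa"
alias_path=$(link_fasta_alias "$demo_dir/hg38.fa" "$demo_dir/links")
echo "$alias_path"
```

The printed path is what would then be passed to --fasta in place of the original file. Pass an absolute path as the first argument so the link target stays valid from any working directory.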

mdozmorov commented 1 year ago

Hi @muffato, thanks for the suggestion. Yes, renaming the genome file to hg38.1.fa resolved this error. But I then ran into a samplesheet validation error, which may also be related to file naming.

[CRITICAL] The HiC file has an unrecognized extension: /vcu_gpfs2/home/user/data/WorkData/proj1/2023-08.HiC_test/sample1/data/aligned/merged_dedup.bam
  It should be one of: .cram On line 2.

My samplesheet is above. I tested the following tweaks:

I've been checking https://raw.githubusercontent.com/sanger-tol/genomenote/main/bin/check_samplesheet.py but cannot immediately see what is wrong. This may be a separate issue, but could it be solved by renaming the BAM files, and if so, how?

muffato commented 11 months ago

Hi @mdozmorov . Sorry we didn't reply sooner. We've got a new release coming out very soon, and I'll make sure this bug is fixed. The problem is simply that the pipeline currently only accepts CRAM. I'll make it take BAM, which shouldn't be a problem, since internally it converts the CRAM to BAM anyway :)
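Until BAM input is supported, converting the BAM files to CRAM is one possible workaround. It is not something proposed in this thread, so treat it as an assumption: the sketch below only prints the samtools command for review, the helper name bam_to_cram_cmd and the example paths are hypothetical, and the reference passed to -T must be the genome the reads were aligned to.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of genomenote): print the samtools command
# that would convert BAM to CRAM against the reference REF.
bam_to_cram_cmd() {
    local bam=$1 ref=$2
    local cram=${bam%.bam}.cram   # same path, ".bam" swapped for ".cram"
    # -C writes CRAM, -T names the reference FASTA, -o names the output file.
    printf 'samtools view -C -T %q -o %q %q\n' "$ref" "$cram" "$bam"
}

# Dry run: shows the command without executing it.
bam_to_cram_cmd sample1/merged_dedup.bam hg38.1.fa
```

Once the printed command looks right, it can be run by hand, and the samplesheet's datafile column would then point at the resulting .cram files.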

mdozmorov commented 10 months ago

Noted the release, will test