nf-core / hic

Analysis of Chromosome Conformation Capture data (Hi-C)
https://nf-co.re/hic
MIT License

samplesheet check too stringent for header check #152

Open askol-lurie opened 1 year ago

askol-lurie commented 1 year ago

Description of the bug

I'm starting to use v2.0.0 of nf-core/hic. I used the previous version but always submitted one sample at a time. This time, I created a samplesheet and am running into an issue where hic doesn't think the file has a header. It does. The has_header() method of csv.Sniffer, from the csv module used in check_samplesheet.py, is overly stringent in how it defines headers and seems like it would fail for most samplesheets, as it does for mine.

The following sample sheets will succeed and fail, respectively:

sample,fastq_1,fastq_2
RH41_B6,1,2
SMS_A3,1,2

sample,fastq_1,fastq_2
RH41_B6,1,2
SMS_A3,p1,q2

Command used and terminal output

nextflow run /home/ass6094/bin/nextflow_modules/hic_v2.0.0/main.nf \
  --digestion 'qiagen' \
  --input /projects/b1103/HIC_Macquarrie/hic_round2/samplesheet.csv \
  --outdir $outdir \
  --fasta /projects/genomicsshare/AWS_iGenomes/references/Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.fa \
  --bwt2_index /projects/genomicsshare/AWS_iGenomes/references/Homo_sapiens/NCBI/GRCh38/Sequence/Bowtie2Index/ \
  --split_fastq \
  --fastq_chunks_size 10000000 \
  --max_memory 64.GB \
  --bin_size 20000,40000,150000,500000,1000000 \
  --bwt2_opts_end2end '--very-sensitive -L 30 --score-min L,-0.6,-0.2 --end-to-end --reorder -p 14' \
  --bwt2_opts_trimmed '--very-sensitive -L 20 --score-min L,-0.6,-0.2 --end-to-end --reorder -p 14' \
  -profile singularity,slurmshort \
  -with-report hic_report.html \
  -with-trace \
  -with-timeline hic_timeline.html \
  -with-dag hic_dag.png \
  -bg \
  -w $scratch

Output:

------------------------------------------------------
                                        ,--./,-.
        ___     __   __   __   ___     /,-._.--~'
  |\ | |__  __ /  ` /  \ |__) |__         }  {
  | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                        `._,._,'
  nf-core/hic v2.0.0
------------------------------------------------------
Core Nextflow options
  runName                      : magical_hamilton
  containerEngine              : singularity
  launchDir                    : /projects/b1103/HIC_Macquarrie/hic_round2
  workDir                      : /scratch/ass6094/hic/nextflow
  projectDir                   : /home/ass6094/bin/nextflow_modules/hic_v2.0.0
  userName                     : ass6094
  profile                      : singularity,slurmshort
  configFiles                  : /home/ass6094/bin/nextflow_modules/hic_v2.0.0/nextflow.config

Input/output options
  input                        : /projects/b1103/HIC_Macquarrie/hic_round2/samplesheet.csv
  outdir                       : /projects/b1103/HIC_Macquarrie/hic_round2/NextflowResults/

Reference genome options
  fasta                        : /projects/genomicsshare/AWS_iGenomes/references/Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.fa
  bwt2_index                   : /projects/genomicsshare/AWS_iGenomes/references/Homo_sapiens/NCBI/GRCh38/Sequence/Bowtie2Index/

Digestion Hi-C
  digestion                    : qiagen

DNAse Hi-C
  min_cis_dist                 : 0

Alignments
  split_fastq                  : true
  fastq_chunks_size            : 10000000
  bwt2_opts_end2end            : --very-sensitive -L 30 --score-min L,-0.6,-0.2 --end-to-end --reorder -p 14
  bwt2_opts_trimmed            : --very-sensitive -L 20 --score-min L,-0.6,-0.2 --end-to-end --reorder -p 14

Valid Pairs Detection
  max_insert_size              : 0
  min_insert_size              : 0
  max_restriction_fragment_size: 0
  min_restriction_fragment_size: 0

Contact maps
  bin_size                     : 20000,40000,150000,500000,1000000
  ice_filter_high_count_perc   : 0
  res_zoomify                  : null

Downstream Analysis
  res_dist_decay               : 250000
  tads_caller                  : insulation
  res_tads                     : 40000

Max job request options
  max_cpus                     : 14
  max_memory                   : 64.GB
  max_time                     : 10d

!! Only displaying parameters that differ from the pipeline defaults !!
------------------------------------------------------
If you use nf-core/hic for your analysis please cite:

* The pipeline
  https://doi.org/10.5281/zenodo.2669513

* The nf-core framework
  https://doi.org/10.1038/s41587-020-0439-x

* Software dependencies
  https://github.com/nf-core/hic/blob/master/CITATIONS.md
------------------------------------------------------
WARN: A process with name 'BOWTIE2_ALIGN_TRIMMED' is defined more than once in module script: /home/ass6094/bin/nextflow_modules/hic_v2.0.0/./workflows/../subworkflows/local/./hicpro_mapping.nf -- Make sure to not define the same function as process
[65/1f640c] Submitted process > NFCORE_HIC:HIC:PREPARE_GENOME:GET_RESTRICTION_FRAGMENTS (^GATC)
[11/c26455] Submitted process > NFCORE_HIC:HIC:INPUT_CHECK:SAMPLESHEET_CHECK (samplesheet.csv)
[10/5e7883] Submitted process > NFCORE_HIC:HIC:PREPARE_GENOME:CUSTOM_GETCHROMSIZES (genome.fa)
Error executing process > 'NFCORE_HIC:HIC:INPUT_CHECK:SAMPLESHEET_CHECK (samplesheet.csv)'

Caused by:
  Process `NFCORE_HIC:HIC:INPUT_CHECK:SAMPLESHEET_CHECK (samplesheet.csv)` terminated with an error exit status (1)

Command executed:

  check_samplesheet.py \
      samplesheet.csv \
      samplesheet.valid.csv

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_HIC:HIC:INPUT_CHECK:SAMPLESHEET_CHECK":
      python: $(python --version | sed 's/Python //g')
  END_VERSIONS

Command exit status:
  1

Command output:

Command error:
  WARNING: While bind mounting '/projects/b1103/HIC_Macquarrie/hic_round2:/projects/b1103/HIC_Macquarrie/hic_round2': destination is already in the mount point list
  WARNING: While bind mounting '/home/ass6094/bin/nextflow_modules/hic_v2.0.0/bin:/home/ass6094/bin/nextflow_modules/hic_v2.0.0/bin': destination is already in the mount point list
  WARNING: While bind mounting '/scratch/ass6094/hic/nextflow/11/c26455104fe4b102d7953120fa3a65:/scratch/ass6094/hic/nextflow/11/c26455104fe4b102d7953120fa3a65': destination is already in the mount point list
  WARNING: Skipping mount /hpc/software/singularity/3.8.1/var/singularity/mnt/session/etc/resolv.conf [files]: /etc/resolv.conf doesn't exist in container
  [CRITICAL] The given sample sheet does not appear to contain a header.

Relevant files

No response

System information

Nextflow version: 22.10.5.5840
Hardware: Slurm HPC
Executor: slurm
Container engine: Singularity
OS: Redhat Linux 7.9
Version of nf-core/hic: 2.0.0

nservant commented 1 year ago

So for sure, this is linked to the csv.Sniffer.has_header function, which returns False. No idea why.

nservant commented 1 year ago

I checked whether the csv package is able to detect the delimiter in both cases, and it is. Both files report ',' as the delimiter.

Line 60

d = sniffer.sniff(peek)
print(repr(d.delimiter))
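
A self-contained version of this delimiter check could look like the sketch below. It assumes the two small sample sheets from the bug description, inlined as strings; the names SHEET_NUMERIC, SHEET_MIXED and detected_delimiter are illustrative and not part of check_samplesheet.py. Only csv.Sniffer.sniff from the standard library is used.

```python
import csv

# The two sample sheets from the bug description, inlined for illustration.
SHEET_NUMERIC = "sample,fastq_1,fastq_2\nRH41_B6,1,2\nSMS_A3,1,2\n"
SHEET_MIXED = "sample,fastq_1,fastq_2\nRH41_B6,1,2\nSMS_A3,p1,q2\n"


def detected_delimiter(text):
    """Return the delimiter that csv.Sniffer guesses for the given text."""
    return csv.Sniffer().sniff(text).delimiter


for name, sheet in [("numeric", SHEET_NUMERIC), ("mixed", SHEET_MIXED)]:
    # Print the guessed delimiter for each sheet.
    print(name, repr(detected_delimiter(sheet)))
```

If both sheets report a comma here, delimiter detection can be ruled out as the cause and the problem has to be in the header heuristic itself.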

nservant commented 1 year ago

So I think I have the solution for the provided example! has_header returns False because the two lines don't have the same column types!

In

RH41_B6,1,2
SMS_A3,p1,q2

the 1 and 2 are seen as integers, while the p1 and q2 are seen as strings.

https://github.com/python/cpython/issues/87791

nservant commented 1 year ago

To continue on that, and still based on the thread here https://github.com/python/cpython/issues/87791
It seems that the has_header function automatically infers the type of each column from its content (numbers/letters?). When two rows have different column typing patterns, has_header returns False.
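
The behaviour discussed in these comments can be reproduced with the standard library alone. The sketch below is a minimal reproduction, not the pipeline's code: it feeds two small sheets to csv.Sniffer.has_header, one whose fastq columns are numeric in every row and one that mixes 1,2 with p1,q2. The constant and function names are made up for this example. With the heuristic as implemented in current CPython, the all-numeric sheet is expected to be reported as having a header and the mixed one as having none.

```python
import csv

# Sheet whose fastq columns are numeric in every data row.
SHEET_NUMERIC = "sample,fastq_1,fastq_2\nRH41_B6,1,2\nSMS_A3,1,2\n"
# Sheet whose fastq columns mix numbers (1, 2) and strings (p1, q2).
SHEET_MIXED = "sample,fastq_1,fastq_2\nRH41_B6,1,2\nSMS_A3,p1,q2\n"


def looks_like_header(text):
    """Return csv.Sniffer's verdict on whether the first row is a header."""
    return csv.Sniffer().has_header(text)


for name, sheet in [("numeric", SHEET_NUMERIC), ("mixed", SHEET_MIXED)]:
    print(name, looks_like_header(sheet))
```

In the mixed sheet no column keeps a consistent type across rows, so the heuristic has nothing left to compare the first row against and falls back to "no header".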

nservant commented 1 year ago

sample,fastq_1,fastq_2
101-male-brain,/data/file1_R1.fastq.gz,/data/file1_R2.fastq.gz
12-female-liver,/data/013649718184/file2_R1.fastq.gz,/data/013649718184/file2_R2.fastq.gz

is detected as having no header and crashes, whereas

sample,fastq_1,fastq_2
101-male-brain,/data/file1_R1.fastq.gz,/data/file1_R2.fastq.gz
120-male-liver,/data/013649718184/file2_R1.fastq.gz,/data/013649718184/file2_R2.fastq.gz

works! That's crazy :)
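
Why a single extra character matters can be sketched as follows. This is a simplified re-implementation of the voting heuristic as I understand it from the CPython source, not the library code itself, and all function names are hypothetical. The key assumption is that a cell which cannot be parsed as a number is "typed" by its string length, so a column of strings only stays in consideration if every value has the same length.

```python
def cell_type(cell):
    """Numeric cells are typed as complex, everything else by string length."""
    try:
        complex(cell)
        return complex
    except (ValueError, OverflowError):
        return len(cell)


def consistent_column_types(rows):
    """Return {column index: type} for columns whose cells all share one type."""
    types = {}
    dropped = set()
    for row in rows:
        for col, cell in enumerate(row):
            if col in dropped:
                continue
            this_type = cell_type(cell)
            if col not in types:
                types[col] = this_type
            elif types[col] != this_type:
                # Inconsistent column: remove it from consideration for good.
                del types[col]
                dropped.add(col)
    return types


def vote_has_header(header, rows):
    """Each surviving column votes for or against the first row being a header."""
    score = 0
    for col, col_type in consistent_column_types(rows).items():
        if isinstance(col_type, int):  # the "type" is a string length
            score += 1 if len(header[col]) != col_type else -1
        else:  # the type is numeric: a header cell should fail to parse
            try:
                col_type(header[col])
            except (ValueError, TypeError):
                score += 1
            else:
                score -= 1
    return score > 0


HEADER = ["sample", "fastq_1", "fastq_2"]
ROW_1 = ["101-male-brain", "/data/file1_R1.fastq.gz", "/data/file1_R2.fastq.gz"]
PATHS_2 = ["/data/013649718184/file2_R1.fastq.gz", "/data/013649718184/file2_R2.fastq.gz"]
ROWS_FAIL = [ROW_1, ["12-female-liver"] + PATHS_2]
ROWS_OK = [ROW_1, ["120-male-liver"] + PATHS_2]

print(consistent_column_types(ROWS_FAIL), vote_has_header(HEADER, ROWS_FAIL))
print(consistent_column_types(ROWS_OK), vote_has_header(HEADER, ROWS_OK))
```

Under these assumptions, 101-male-brain and 120-male-liver are both 14 characters long, so the sample column survives and votes for a header. 12-female-liver is 15 characters long, so every column is dropped and the vote ends at zero.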

nservant commented 1 year ago

This will be fixed in the next version.

https://github.com/nf-core/tools/pull/2194
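
For anyone who needs a stopgap in the meantime, one possible approach is to drop the heuristic and check the first row against the expected column names directly. The sketch below is only an illustration of that idea and is not necessarily what the linked pull request does. REQUIRED_COLUMNS and has_expected_header are hypothetical names, and the column list is taken from the sample sheets shown in this thread.

```python
import csv
import io

# Column names taken from the sample sheets in this thread (assumption).
REQUIRED_COLUMNS = ("sample", "fastq_1", "fastq_2")


def has_expected_header(text, required=REQUIRED_COLUMNS, delimiter=","):
    """Return True if the first row contains every required column name."""
    first_row = next(csv.reader(io.StringIO(text), delimiter=delimiter), [])
    return set(required).issubset(cell.strip() for cell in first_row)


with_header = "sample,fastq_1,fastq_2\nRH41_B6,1,2\nSMS_A3,p1,q2\n"
without_header = "RH41_B6,1,2\nSMS_A3,p1,q2\n"

print(has_expected_header(with_header))
print(has_expected_header(without_header))
```

An explicit comparison like this does not depend on the length or type of the data rows, which is the property the sniffing heuristic lacks.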

maxulysse commented 1 year ago

Just because this 12-female-liver -> 120-male-liver in the sample column?

nservant commented 1 year ago

Yes, but this will be fixed in the next nf-core template.