nf-core / fastquorum

Pipeline to produce consensus reads using unique molecular indexes/barcodes (UMIs)
https://nf-co.re/fastquorum
MIT License
20 stars 9 forks source link

Error at GroupReadsByUmi due to read order? #52

Open davidmasp opened 5 months ago

davidmasp commented 5 months ago

Description of the bug

Hi

I have hit an error trying to run one of the test datasets that merges multiple fastq files per sample (samplesheet.multi_lanes.csv).

I have to say first that when running with the testing dataset that includes only the chr17.fa the pipeline runs great. When i run it with my local version of the hg38 igenomes fasta however it hits an error at the GroupReadsByUmi step. The file should be accessible by downloading the hg38 version in this link - here.

I download the sample sheet from the test-dataset repo and I run the pipeline using:

nextflow run nf-core/fastquorum -r 1.0.0 -params-file params.yaml -w work.nobackup -resume

And the params.yaml file is:

input: ./samplesheet.multi_lanes.csv
outdir: ./results.nobackup
duplex_seq: yes
fasta: <<path to>>/iGenomes/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa
call_min_reads: 6 3
filter_min_reads: 3 3

The error is in the GroupReadsByUmi step, it claims that a given read is not sorted correctly (SRR6109255.382418).

Command error:
  INFO:    Environment variable SINGULARITYENV_TMP is set, but APPTAINERENV_TMP is preferred
  INFO:    Environment variable SINGULARITYENV_TMPDIR is set, but APPTAINERENV_TMPDIR is preferred
  INFO:    Environment variable SINGULARITYENV_NXF_TASK_WORKDIR is set, but APPTAINERENV_NXF_TASK_WORKDIR is preferred
  INFO:    Environment variable SINGULARITYENV_NXF_DEBUG is set, but APPTAINERENV_NXF_DEBUG is preferred
  [2024/05/27 03:00:20 | FgBioMain | Info] Executing GroupReadsByUmi from fgbio version 2.0.2 as <<USER>>@<<machine>> on JRE 11.0.9.1-internal+0-adhoc..src with snappy, IntelInflater, and IntelDeflater
  [2024/05/27 03:00:20 | GroupReadsByUmi | Info] Filtering the input.
  [2024/05/27 03:00:20 | GroupReadsByUmi | Info] Assigning reads to UMIs and outputting.
  [2024/05/27 03:00:20 | FgBioMain | Fatal] ########################################################################################
  [2024/05/27 03:00:20 | FgBioMain | Fatal] Execution failed!
  [2024/05/27 03:00:20 | FgBioMain | Fatal] Template SRR6109255.382418 has only one read, paired-reads required for paired strategy.
  [2024/05/27 03:00:20 | FgBioMain | Fatal] ########################################################################################
  [2024/05/27 03:00:20 | FgBioMain | Info] GroupReadsByUmi failed. Elapsed time: 0.04 minutes.

When going into the directory I can see that the problematic read (the first read in the file) is actually properly paired but and not in order?

> samtools view SRR6109255_three_lanes.bam | cut -f 1,2,3,4 | head
SRR6109255.382418       99      chr1    10466962
SRR6109255.143407       163     chr1    1913705
SRR6109255.143407       83      chr1    1913705
SRR6109255.441042       99      chr1    18777854
SRR6109255.382418       147     chr1    10467504
SRR6109255.441042       147     chr1    18777853
SRR6109255.58832        163     chr1    26281528
SRR6109255.58832        83      chr1    26281528

When I track down the merged bam one level more I reach the MERGE_BAM process. The individual bams seem to be all sorted correctly but the merged one (see above) may not be?

> samtools view 1/SRR6109255_three_lanes.mapped.bam |  cut -f 1,2,3,4 | head
SRR6109255.382418       99      chr1    10466962
SRR6109255.382418       147     chr1    10467504
SRR6109255.145917       99      chr1    28242786
SRR6109255.145917       147     chr1    28243430

> samtools view 2/SRR6109255_three_lanes.mapped.bam |  cut -f 1,2,3,4 | head
SRR6109255.143407       163     chr1    1913705
SRR6109255.143407       83      chr1    1913705
SRR6109255.764263       113     chr1    30571796
SRR6109255.764263       177     chr1    197441624

> samtools view 3/SRR6109255_three_lanes.mapped.bam |  cut -f 1,2,3,4 | head
SRR6109255.441042       99      chr1    18777854
SRR6109255.441042       147     chr1    18777853
SRR6109255.58832        163     chr1    26281528
SRR6109255.58832        83      chr1    26281528

I am not sure if I am doing smth wrong.

Command used and terminal output

nextflow run nf-core/fastquorum -r 1.0.0 -params-file params.yaml -w work.nobackup -resume

(information in the description)

Relevant files

No response

System information

HPC, running on SGE

Ubuntu Linux

N E X T F L O W version 24.04.1 build 5913 created 20-05-2024 09:48 UTC (02:48 PDT) cite doi:10.1038/nbt.3820 http://nextflow.io

nf-core/fastquorum = 1.0.0

nh13 commented 5 months ago

@davidmasp looks like a bug (that I introduced) in samtools: https://github.com/samtools/samtools/pull/2062.

nh13 commented 3 months ago

Fixed in https://github.com/samtools/samtools/pull/2062, but we'll need a samtools release before we can fix it here

SPPearce commented 3 months ago

Fixed in #68

SPPearce commented 3 months ago

But going to keep this one around until we have the samtools release and we can undo this fix.

SPPearce commented 1 month ago

Samtools v1.21 has now been released, so we can remove this fix