Can't align PacBio/ONT reads on genomes > 4Gbp

muffato commented 10 months ago

Description of the bug

By default, minimap2 can only deal with genomes up to 4 Gbp. For larger genomes, it needs either the --split-prefix option or the -I option. See more information at https://github.com/lh3/minimap2/blob/master/FAQ.md#3-the-output-sam-doesnt-have-a-header

Without that, samtools complain and the process crashes.

Command used and terminal output

Error executing process > 'SANGERTOL_READMAPPING:READMAPPING:ALIGN_HIFI:MINIMAP2_ALIGN (iqMecThal1_T2)'

Caused by:
  Process `SANGERTOL_READMAPPING:READMAPPING:ALIGN_HIFI:MINIMAP2_ALIGN (iqMecThal1_T2)` terminated with an error exit status (1)

Command executed:

  minimap2 \
      -ax map-hifi --cs=short \
      -t 6 \
      "GCA_946902985.2.unmasked.fasta" \
      "iqMecThal1_T2_other.fastq.gz" \
       \
       \
      -a | samtools sort | samtools view -@ 6 -b -h -o iqMecThal1_T2.bam

  cat <<-END_VERSIONS > versions.yml
  "SANGERTOL_READMAPPING:READMAPPING:ALIGN_HIFI:MINIMAP2_ALIGN":
      minimap2: $(minimap2 --version 2>&1)
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  [M::mm_idx_gen::73.626*1.36] collected minimizers
  [M::mm_idx_gen::82.689*1.85] sorted minimizers
  [WARNING][1;31m For a multi-part index, no @SQ lines will be outputted. Please use --split-prefix.[0m
  [M::main::82.689*1.85] loaded/built the index for 28 target sequence(s)
  [M::mm_mapopt_update::85.125*1.83] mid_occ = 500
  [M::mm_idx_stat] kmer size: 19; skip: 19; is_hpc: 0; #seq: 28
  [M::mm_idx_stat::86.619*1.82] distinct minimizers: 130653147 (82.35% are singletons); average occurrences: 3.431; average spacing: 9.970; total length: 4469457077
  [E::sam_parse1] no SQ lines present in the header
  [W::sam_read1_sam] Parse error at line 2
  samtools sort: truncated file. Aborting
  [main_samview] fail to read the header from "-".

Relevant files

No response

System information

No response

muffato commented 10 months ago

I've started both commands. Will report on the runtime and memory usage when they complete.

First snag: [ERROR] --cs or --MD doesn't work with --split-prefix. --cs=short was required by HiMut for somatic variant calling.

muffato commented 10 months ago

-I mode

5h 50 min runtime, 45 GB RAM

--split-prefix mode

19h 30 min runtime, 29 GB RAM. I can see in the log file how the target genome was virtually split into three chunks, each under 4 Gbp, and reads were aligned to each chunk. That's why it was about 3 times as slow as the -I mode.

For this pipeline, I would go with the -I option as i) the extra memory requirement isn't hard to fulfil and is worth the speed increase, ii) --split-prefix clashes with -cs

sanger-tol / readmapping