rki-mf1 / clean

A nextflow pipeline for decontamination of short reads, long reads and contigs
BSD 3-Clause "New" or "Revised" License
30 stars 3 forks source link

ONT clean minimap2/samtools error "query name too long" #35

Closed hoelzer closed 3 years ago

hoelzer commented 3 years ago

command:

 nextflow run clean/clean.nf --nano data/raw/fastq_pass/2021-05-11_histoplasma_ont.fastq.gz --host hsa --control dcs --cores 36 --max_cores 72 --output clean_gridion-basecalled_fastq -profile local,singularity --singularityCacheDir singularity_images -resume

error:


Error executing process > 'clean_nano:minimap2_nano (1)'

Caused by:
  Process `clean_nano:minimap2_nano (1)` terminated with an error exit status (1)

Command executed:

  PARAMS="-ax map-ont"
  if [[ false != 'false' ]]; then
    PARAMS="-ax splice -k14"
  fi

  minimap2 $PARAMS -N 5 --secondary=no -t 36 -o 2021-05-11_histoplasma_ont.sam db.fa.gz R.fastq

  samtools fastq -f 4 -0 R.clean.fastq 2021-05-11_histoplasma_ont.sam
  samtools fastq -F 4 -0 R.contamination.fastq 2021-05-11_histoplasma_ont.sam

  samtools view -b -F 2052 2021-05-11_histoplasma_ont.sam | samtools sort -o 2021-05-11_histoplasma_ont.contamination.sorted.bam --threads 36
  samtools index 2021-05-11_histoplasma_ont.contamination.sorted.bam
  samtools idxstats  2021-05-11_histoplasma_ont.contamination.sorted.bam > idxstats.tsv

Command exit status:
  1

Command output:
  (empty)

Command error:
  [M::mm_idx_gen::82.008*2.02] collected minimizers
  [M::mm_idx_gen::93.121*2.84] sorted minimizers
  [M::main::93.121*2.84] loaded/built the index for 195 target sequence(s)
  [M::mm_mapopt_update::95.299*2.80] mid_occ = 705
  [M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 195
  [M::mm_idx_stat::96.508*2.78] distinct minimizers: 100159343 (38.79% are singletons); average occurrences: 5.540; average spacing: 5.586
  [M::worker_pipeline::107.705*4.83] mapped 236008 sequences
  [M::worker_pipeline::114.048*6.56] mapped 213499 sequences
  [M::worker_pipeline::119.300*7.87] mapped 212885 sequences
  [M::worker_pipeline::126.440*9.50] mapped 198236 sequences
  [M::worker_pipeline::132.862*10.43] mapped 208394 sequences
  [M::worker_pipeline::137.395*11.30] mapped 198667 sequences
  [M::worker_pipeline::142.667*12.23] mapped 209456 sequences
  [M::worker_pipeline::149.662*13.36] mapped 198352 sequences
  [M::worker_pipeline::153.842*13.82] mapped 208217 sequences
  [M::worker_pipeline::159.454*14.61] mapped 199491 sequences
  [M::worker_pipeline::164.832*15.32] mapped 208509 sequences
  [M::worker_pipeline::170.117*15.98] mapped 199400 sequences
  [M::worker_pipeline::175.694*16.62] mapped 208714 sequences
  [M::worker_pipeline::180.648*17.16] mapped 200316 sequences
  [M::worker_pipeline::186.048*17.72] mapped 211112 sequences
  [M::worker_pipeline::191.191*18.22] mapped 215964 sequences
  [M::worker_pipeline::199.764*18.88] mapped 227796 sequences
  [M::worker_pipeline::200.759*18.79] mapped 166159 sequences
  [M::main] Version: 2.17-r941
  [M::main] CMD: minimap2 -ax map-ont -N 5 --secondary=no -t 36 -o 2021-05-11_histoplasma_ont.sam db.fa.gz R.fastq
  [M::main] Real time: 200.978 sec; CPU: 3772.524 sec; Peak RSS: 12.177 GB
  [E::sam_parse1] query name too long
  [W::sam_read1] Parse error at line 197
  [bam2fq_mainloop] Failed to read bam record.

Work dir:
  /scratch/hoelzerm/projects/2021-05-10_BI116_Histoplasma_assembly/work/0a/fa9afda443b96162b47a043d51b748

I think the problem is that the read names are too long for samtools? But if so, why did we not discover that before?

head /scratch/hoelzerm/projects/*_assembly/work/51/d2f4a7b54b7c814bdf573db09af375/R.fastq
@11dbdf9c-e46f-4afd-b1b1-b7c3cdd2e938DECONTAMINATErunid=8842aa4e3c221ff265498dfdfdfe183253ab4499DECONTAMINATEread=9DECONTAMINATEch=247DECONTAMINATEstart_time=2021-04-28T12:20:06ZDECONTAMINATEflow_cell_id=FAP92766DECONTAMINATEprotocol_group_id=210428_GI5_Run21-086DECONTAMINATEsample_id=21-03679
GTTATGCCGTTCAGTTGCGTATTGCTAATAAGATTGGTAGAGAAAATGGAACATAAGCAACATTCTTCAAAATCACTTTGAACTGCTTGCCAACCTCATCAATTATATAGGTCATCACTGTTCCATATTCTTCAACCTAGTTGTTGCATCTTCAGTATACACTTGTTCTTCTGCAGATTTGAATAATAGAAATCTTGATCAATTATTGCATATATAGCAATACGTAAC
+
&$($&%$%-A6E);'+*&&%%$&-;3E=6:84114;2.'&*(+2B@11$0521/26*8A)$$%507<CFA@-)1))39<<:A<</..6>?=>4-*14:8B2515538486<8@4/6$"),%7$4,065:9>?D5.=?7)*.13=:,+8?FK6D.01)$'2757FBMEKRH9<;:==<@E/75412,))%).;=><<=;9@<45C)),96689%,**',8656-,-,%#
hoelzer commented 3 years ago

I updated minimap2 and samtools to newer versions, lets see if this helps... pushed the version updates to the open PR as well #34

MarieLataretu commented 3 years ago

I think I stumbled across this some time ago.

The thing is, that the length of read names in the SAM format specification is limited to 254. So samtools does basically the right thing, cause the format is incorrect.

If we want to avoid this we'd have to check every input read... Not sure if it's possible to cut the read names only if this error occures.

hoelzer commented 3 years ago

Do you actually remember why we do this renaming in the first place? :)

We insert DECONTAMINATE strings into the read ID to replace whitspaces I think. Likely, to restore the full read ID after mapping/read extraction?

see:

@11dbdf9c-e46f-4afd-b1b1-b7c3cdd2e938DECONTAMINATErunid=8842aa4e3c221ff265498dfdfdfe183253ab4499DECONTAMINATEread=9DECONTAMINATEch=247DECONTAMINATEstart_time=2021-04-28T12:20:06ZDECONTAMINATEflow_cell_id=FAP92766DECONTAMINATEprotocol_group_id=210428_GI5_Run21-086DECONTAMINATEsample_id=21-03679

Maybe, during the step where we anyway rename reads, we could check for the length and cut the string of at 254 characters?

Or we should really think if this renaming is necessary at all... I think the idea was to have the original (and complete) read IDs in the decontaminated FASTQ files.

MarieLataretu commented 3 years ago

Yeah, I also think it was because of the whitespaces and the mapper messing with the readnames!

Maybe, during the step where we anyway rename reads, we could check for the length and cut the string of at 254 characters?

and we could use something shorter than DECONTAMINATE ^^

hoelzer commented 3 years ago

and we could use something shorter than DECONTAMINATE ^^

haha, true ^^ we just need to be sure that it can be safely replaced... what we could also do: rename to something like

read1 read2 ... readXYZ

and save a map tsv between original name and re-name and finally use that to restore the original names.

But we should also not blow up this task...

here is a simple rename/restor py script for FASTAs:

https://github.com/EBI-Metagenomics/emg-viral-pipeline/blob/master/bin/rename_fasta.py

but maybe that's also easily doable w/ seqkit or so... https://bioinf.shenwei.me/seqkit/

MarieLataretu commented 3 years ago

So Basti says minimap does the right thing, because only the first part (til the first whitespace) is the actual read name, the rest is just additional information.

From this point of view, we could remove all the renaming :sweat_smile:

hoelzer commented 3 years ago

yeah... basically true. I think I wanted to keep all information bc/ e.g. I am not sure what happens if someone uses then the cleaned FASTQ w/ the missing description in e.g. a QC tool like NanoPlot etc...

But basically right... we can also skip the renaming and see if this causes trouble... this will also save time and disk usage. I'm fine w/ that