rust-bio / rust-bio-tools

A set of command line utilities based on Rust-Bio.
MIT License
182 stars 24 forks source link

bam-anonymize may require soft clipping for supplementary alignments #240

Closed jelber2 closed 2 years ago

jelber2 commented 2 years ago

Hi,

For the sequence in the link below aligned to the reference in the link below, bam-anonymize throws an error that I guess means there is no soft clipping for a supplementary alignment when using minimap2. My suggestion might be to produce an error message that might suggest turning on soft clipping for supplementary alignments in the aligner if the goal is to output all sequences from the BAM file later (i.e., I would not suggest changing bam-anonymize to ignore such a sequence).

get sequence and reference

wget https://www.dropbox.com/s/3r7eozi23dzrifk/test.fasta
wget https://www.dropbox.com/s/9vf8sa56aecgypk/reference.fasta.gz
gzip -d reference.fasta.gz
samtools faidx reference.fasta

map read to reference with minimap2 without the -Y option -Y use soft clipping for supplementary alignments

~/bin/minimap2-2.24_x64-linux/minimap2 -ax map-hifi reference.fasta test.fasta |samtools sort \
> 9.pbccs-6.3.0.fasta-to-reference.bam && samtools index 9.pbccs-6.3.0.fasta-to-reference.bam

[M::mm_idx_gen::0.165*1.01] collected minimizers
[M::mm_idx_gen::0.188*1.25] sorted minimizers
[M::main::0.188*1.25] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.203*1.24] mid_occ = 50
[M::mm_idx_stat] kmer size: 19; skip: 19; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.210*1.23] distinct minimizers: 454545 (99.24% are singletons); average occurrences: 1.021; average spacing: 9.997; total length: 4641652
[M::worker_pipeline::0.223*1.21] mapped 1 sequences
[M::main] Version: 2.24-r1122
[M::main] CMD: /nfs/scistore16/itgrp/jelbers/bin/minimap2-2.24_x64-linux/minimap2 -ax map-hifi reference.fasta test.fasta
[M::main] Real time: 0.232 sec; CPU: 0.280 sec; Peak RSS: 0.042 GB

run bam-anonymize - returns error

/nfs/scistore16/itgrp/jelbers/.cargo/bin/rbt bam-anonymize 9.pbccs-6.3.0.fasta-to-reference.bam reference.fasta \
anonymous-reads.bam anonymous-reference.fasta NC_000913.3 1 4641652

thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', /nfs/scistore16/itgrp/jelbers/.cargo/registry/src/github.com-1ecc6299db9ec823/rust-bio-tools-0.39.0/src/bam/anonymize_reads.rs:139:60
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

map read to reference with minimap2 with the -Y option

~/bin/minimap2-2.24_x64-linux/minimap2 -ax map-hifi -t 34 -Y reference.fasta test.fasta |samtools sort \
> 9.pbccs-6.3.0.fasta-to-reference.bam && samtools index 9.pbccs-6.3.0.fasta-to-reference.bam

[M::mm_idx_gen::0.166*1.02] collected minimizers
[M::mm_idx_gen::0.182*2.89] sorted minimizers
[M::main::0.182*2.89] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.191*2.80] mid_occ = 50
[M::mm_idx_stat] kmer size: 19; skip: 19; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.197*2.74] distinct minimizers: 454545 (99.24% are singletons); average occurrences: 1.021; average spacing: 9.997; total length: 4641652
[M::worker_pipeline::0.212*2.63] mapped 1 sequences
[M::main] Version: 2.24-r1122
[M::main] CMD: /nfs/scistore16/itgrp/jelbers/bin/minimap2-2.24_x64-linux/minimap2 -ax map-hifi -t 34 -Y reference.fasta test.fasta
[M::main] Real time: 0.220 sec; CPU: 0.566 sec; Peak RSS: 0.045 GB

run bam-anonymize - runs without error using -Y with minimap2

/nfs/scistore16/itgrp/jelbers/.cargo/bin/rbt bam-anonymize 9.pbccs-6.3.0.fasta-to-reference.bam reference.fasta \
anonymous-reads.bam anonymous-reference.fasta NC_000913.3 1 4641652

Rust-Bio-Tools version

/nfs/scistore16/itgrp/jelbers/.cargo/bin/rbt --version

Rust-Bio-Tools 0.39.0
FelixMoelder commented 2 years ago

Hi @jelber2,

thank's for notifying about this. I was able to reproduce this behavior and will have at look at it.

FelixMoelder commented 2 years ago

I found the bug which should be fixed in PR https://github.com/rust-bio/rust-bio-tools/pull/241. The error did not occur due to soft clips but due to empty sequences set by minimap2. Applying the -Y causes minimap2 to annotate the full sequence to each record which is why the subcommand worked fine. Nevertheless, this should be fixed in the next release of rust-bio-tools.

jelber2 commented 2 years ago

thanks @FelixMoelder!