sstadick / perbase

Per-base per-nucleotide depth analysis
MIT License
115 stars 13 forks source link

weird encoding for reference name #66

Open twaddlac opened 1 year ago

twaddlac commented 1 year ago

I'm trying to run perbase depth-only in.bam and I am getting the following output:

image

The bam file isn't corrupt (according to samtools) and works with other parts of my pipeline.

This reproduced when running base-depth as well, and replicated by other users.

Running perbase v0.8.5

Please let me know if I can provide any more information!

sstadick commented 11 months ago

Hi! could you share the bamfile / part of the bamfile that leads to this? Or paste in the bam header output from samtools when used on that bamfile as well.

davidecarlson commented 9 months ago

I'm experiencing the same issue with a few different references I'm trying to get perbase to work with:

REF     POS     DEPTH   A       C       G       T       N       INS     DEL     REF_SKIP        FAIL    NEAR_MAX_DEPTH
^@^@^@^@^@^@^@^P<9B>^C<F0><FF>^?^@^@!   9       26      0       0       22      0       4       0       0       0       0       false
^@^@^@^@^@^@^@<E0>}^A<F0><FF>^?^@^@!    10      98      0       0       0       83      15      0       0       0       0       false
^@^@^@^@^@^@^@^P~^A<F0><FF>^?^@^@!      11      154     0       123     0       0       31      0       0       0       0       false
^@^@^@^@^@^@^@p<E1>^@<F0><FF>^?^@^@!    12      161     0       0       0       123     38      0       0       0       0       false
^@^@^@^@^@^@^@<80><CC>^@<F0><FF>^?^@^@! 13      167     134     0       0       0       33      0       0       0       0       false
^@^@^@^@^@^@^@`%^E<F0><FF>^?^@^@!       14      675     0       546     0       1       128     0       0       0       0       false
^@^@^@^@^@^@^@^@<B9>^C<F0><FF>^?^@^@!   15      716     0       0       581     0       135     0       0       0       0       false
^@^@^@^@^@^@^@<E0>:^E<F0><FF>^?^@^@!    16      782     0       0       1       588     193     0       0       0       0       false
^@^@^@^@^@^@^@^P;^E<F0><FF>^?^@^@!      17      870     0       1       556     0       313     0       0       0       0       false
^@^@^@^@^@^@^@^P^U^L<F0><FF>^?^@^@!     18      940     0       0       710     0       230     0       0       0       0       false
^@^@^@^@^@^@^@^P^F^L<F0><FF>^?^@^@!     19      973     832     0       0       0       141     0       0       0       0       false

Here is my bam file header:

@HD VN:1.6  SO:coordinate
@SQ SN:DENV3-E-W_Min_preMVS_P3_CDX430_v2    LN:10708
@PG ID:bwa  PN:bwa  VN:0.7.17-r1188 CL:bwa mem -t 96 /references/E3093-3_S3_ref.fasta /bbmerge_out/E3093-3_S3_merged_perfect.fastq
@PG ID:samtools PN:samtools PP:bwa  VN:1.18 CL:samtools view -bS -
@PG ID:samtools.1   PN:samtools PP:samtools VN:1.18 CL:samtools sort -@ 96 --output-fmt BAM -o ../bwa-mem_out/bbmerged/E3093-3_S3.sorted.bam ../bwa-mem_out/bbmerged/E3093-3_S3.bam
@PG ID:MarkDuplicates   VN:3.1.0    CL:MarkDuplicates MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=10000 INPUT=[../bwa-mem_out/bbmerged/E3093-3_S3.sorted.bam] OUTPUT=../bwa-mem_out/bbmerged/E3093-3_S3.sorted.markdup.bam METRICS_FILE=../bwa-mem_out/bbmerged/E3093-3_S3.sorted.markdup.metrics REMOVE_DUPLICATES=false ASSUME_SORTED=true TMP_DIR=[/gpfs/projects/GenomicsCore/codagenix/scripts/tmp] VALIDATION_STRINGENCY=LENIENT    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false FLOW_MODE=false FLOW_QUALITY_SUM_STRATEGY=false USE_END_IN_UNPAIRED_READS=false USE_UNPAIRED_CLIPPED_END=false UNPAIRED_END_UNCERTAINTY=0 FLOW_SKIP_FIRST_N_FLOWS=0 FLOW_Q_IS_KNOWN_END=false FLOW_EFFECTIVE_QUALITY_THRESHOLD=15 ADD_PG_TAG_TO_READS=true DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false USE_JDK_DEFLATER=false USE_JDK_INFLATER=false  PN:MarkDuplicates
@PG ID:samtools.2   PN:samtools PP:samtools.1   VN:1.18 CL:samtools view -H E3093-3_S3.sorted.markdup.bam

I've also noticed that if I supply the reference fasta file in my perbase command, I get the following error:

[2023-11-13T19:38:56Z INFO  perbase_lib::par_granges] Using 94 worker threads.
[2023-11-13T19:38:56Z INFO  perbase_lib::par_granges] Creating channel of length 189246910 (* 120 bytes to get mem)
[2023-11-13T19:39:04Z INFO  perbase_lib::par_granges] Reading from "bwa-mem_out/bbmerged/E3093-3_S3.sorted.markdup.bam"
[2023-11-13T19:39:04Z INFO  perbase_lib::par_granges] Processing TID 0:0-10708
thread '<unnamed>' panicked at 'Fetched reference sequence: Unknown sequence name: ���!.', src/commands/base_depth.rs:275:30
stack backtrace:
   0:     0x55555575ea10 - std::backtrace_rs::backtrace::libunwind::trace::ha9053a9a07ca49cb
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:     0x55555575ea10 - std::backtrace_rs::backtrace::trace_unsynchronized::h9c2852a457ad564e
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x55555575ea10 - std::sys_common::backtrace::_print_fmt::h457936fbfaa0070f
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/sys_common/backtrace.rs:65:5
   3:     0x55555575ea10 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h5779d7bf7f70cb0c
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/sys_common/backtrace.rs:44:22
   4:     0x5555556b9bfe - core::fmt::write::h5a4baaff1bcd3eb5
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/fmt/mod.rs:1232:17
   5:     0x555555738ea4 - std::io::Write::write_fmt::h4bc1f301cb9e9cce
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/io/mod.rs:1684:15
   6:     0x55555576023f - std::sys_common::backtrace::_print::h5fcdc36060f177e8
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/sys_common/backtrace.rs:47:5
   7:     0x55555576023f - std::sys_common::backtrace::print::h54ca9458b876c8bf
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/sys_common/backtrace.rs:34:9
   8:     0x55555575fe3f - std::panicking::default_hook::{{closure}}::hbe471161c7664ed6
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:271:22
   9:     0x555555760e58 - std::panicking::default_hook::ha3500da57aa4ac4f
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:290:9
  10:     0x555555760e58 - std::panicking::rust_panic_with_hook::h50c09d000dc561d2
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:692:13
  11:     0x555555760914 - std::panicking::begin_panic_handler::{{closure}}::h9e2b2176e00e0d9c
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:583:13
  12:     0x555555760876 - std::sys_common::backtrace::__rust_end_short_backtrace::h5739b8e512c09d02
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/sys_common/backtrace.rs:150:18
  13:     0x555555760861 - rust_begin_unwind
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:579:5
  14:     0x5555555d60b2 - core::panicking::panic_fmt::hf33a1475b4dc5c3e
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/panicking.rs:64:14
  15:     0x5555555d63e2 - core::result::unwrap_failed::hdff5465d74574b44
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/result.rs:1750:5
  16:     0x5555555f669a - <core::iter::adapters::flatten::FlatMap<I,U,F> as core::iter::traits::iterator::Iterator>::next::h4ec7ffc6086dfa1a
  17:     0x555555624d34 - rayon::iter::plumbing::bridge_producer_consumer::helper::h40d19706cb30b716
  18:     0x5555555ef2ca - rayon_core::join::join_context::{{closure}}::he92c561eced12757
  19:     0x5555555e6b56 - rayon_core::thread_pool::ThreadPool::install::{{closure}}::h185c0d76b416e80d
  20:     0x555555632daf - <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute::hb3e400f8a6fd84fd
  21:     0x5555555da5ed - rayon_core::registry::WorkerThread::wait_until_cold::h441bdd8e1d8196e3
  22:     0x5555556dc9ba - std::sys_common::backtrace::__rust_begin_short_backtrace::h933f37f9b851e73f
  23:     0x5555556dc74d - core::ops::function::FnOnce::call_once{{vtable.shim}}::h1b5aaf29acd6a35f
  24:     0x555555761b45 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::h39990b24eedef2ab
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/alloc/src/boxed.rs:1987:9
  25:     0x555555761b45 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::h01a027258444143b
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/alloc/src/boxed.rs:1987:9
  26:     0x555555761b45 - std::sys::unix::thread::Thread::new::thread_start::ha4f1cdd9c25884ba
                               at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/sys/unix/thread.rs:108:17
  27:     0x7ffff79af1cf - start_thread
  28:     0x7ffff7094e73 - clone
  29:                0x0 - <unknown>

As far as I can tell, there is nothing wrong with either my reference fasta file or the input bam file. Other programs (samtools, bwa, bbmap, etc.) don't report any issues with them.

Any idea what the problem could be? I'm happy to share the full fasta and/or bam files if it would be useful.

Best, Dave

sstadick commented 9 months ago

If you had a minimal example with a bamfile and reference that would be awesome!

I'm really not sure what the issue is. This function is what would generate the ref_seq name, which in these outputs looks very mangled (breadcrumb for myself looking at this again later)

davidecarlson commented 9 months ago

Sure thing. The reference is very small (~10 Kb) and I've extracted a subset of 10k BAM entries that produce the weird results.

Thanks for taking a look!

Best, Dave example.zip

davidecarlson commented 9 months ago

Quick update - I reinstalled the latest version of perbase, and I'm no longer seeing the same errors on the exact same reference fastas and bam files.

I don't know what caused the issue in the first place, but it's working now, so I'm happy! :)

Thanks! Dave

twaddlac commented 9 months ago

@sstadick Sorry for the delay here! @davidecarlson thanks for chiming in as well. I installed 0.9.0 with no avail. I'll share the data here once I have clearance to do so (hopefully I will soon).

sstadick commented 9 months ago

@davidecarlson I'm glad that worked! @twaddlac 👍