Closed skanwal closed 1 year ago
Hi Sehrish,
Thanks for reporting this bug. We can solve this in three ways (in decreasing order of preference):
samtools view PRJ222361.bam | grep NAME_OF_READ_PRINTED_BY_MODIFIED_ARRIBA
. Hopefully, this single read will be sufficient to reproduce the error on my end so that I can find a fix.Regards, Sebastian
Hi Sebastian,
Thanks for the response. Point 1 is indeed not possible as this is patients data. Lets first try number 2 and see if this helps you narrow down the issue. We can then decide if we need to look into option 3.
How would you share the modified version of Arriba?
You can download a debug version here: arriba_debug_substr.zip
There is a precompiled binary in the zip package as well as the source code in case you need to recompile, because the precompiled version does not run on your system.
This debug version should print the name of the offending read as the last line. You can then extract this read by name from the BAM and send it to me.
Hi Sebastian and happy new year,
We are running our analysis on a cloud environment using CWL that uses docker containers. For arriba, we are currently using uhrigs/arriba:2.3.0
.
Is it possible for you to push the dev binary to docker hub? This will simplify the testing process for me as our primary data (BAM) and reference data for arriba all sit it the same cloud environment. I suspect running into memory and resource issues with trying to run/debug this locally.
Happy new year to you, too.
A docker image with the debug version is now available under the tag: uhrigs/arriba-debug:2.3.0
Thank you. I have used the new tag uhrigs/arriba-debug:2.3.0
on PRJ222361.bam
.
The std output says:
[2023-01-03T23:13:36] Launching Arriba 2.3.0
[2023-01-03T23:13:36] Loading assembly from '/data/cwl-stage-dir/stg65e81d99-0d88-438a-99b7-1f868f405794/hg38.fa'
[2023-01-03T23:13:50] Loading annotation from '/data/cwl-stage-dir/stgaaddd15b-d4cd-486c-b54c-74362b8eed29/ref-transcripts.non-zero-length.gtf'
WARNING: 9874243 SAM records were malformed and ignored
[2023-01-03T23:13:56] Reading chimeric alignments from '/data/cwl-stage-dir/stgcb6098ea-0d1c-4bab-92ec-37ed1a76928d/PRJ222361.bam' (total=1658357)
[2023-01-03T23:19:08] Marking multi-mapping alignments (marked=352821)
[2023-01-03T23:19:08] Detecting strandedness (reverse)
[2023-01-03T23:19:08] Assigning strands to alignments
[2023-01-03T23:19:09] Annotating alignments
[2023-01-03T23:19:15] Filtering duplicates (remaining=603378)
[2023-01-03T23:19:17] Filtering mates which do not map to interesting contigs (1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y AC_* NC_*) (remaining=579984)
[2023-01-03T23:19:17] Filtering mates which only map to viral contigs (AC_* NC_*) (remaining=579984)
[2023-01-03T23:19:17] Filtering viral contigs with expression lower than the top 5 (remaining=579984)
[2023-01-03T23:19:18] Filtering viral contigs with less than 5% coverage (remaining=579984)
[2023-01-03T23:19:19] Estimating fragment length (mate gap mean=-70.4499, mate gap stddev=39.5602, read length mean=141.7)
[2023-01-03T23:19:19] Filtering read-through fragments with a distance <=10000bp (remaining=558267)
[2023-01-03T23:19:20] Filtering inconsistently clipped mates (remaining=483835)
[2023-01-03T23:19:20] Filtering breakpoints adjacent to homopolymers >=6nt (remaining=430906)
[2023-01-03T23:19:20] Filtering fragments with small insert size (remaining=396970)
[2023-01-03T23:19:21] Filtering alignments with long gaps (remaining=396970)
[2023-01-03T23:19:21] Filtering fragments with both mates in the same gene (remaining=377177)
[2023-01-03T23:19:21] Filtering fusions arising from hairpin structures (remaining=312550)
[2023-01-03T23:19:21] Filtering reads with a mismatch p-value <=0.01 (remaining=184339)
[2023-01-03T23:19:23] Filtering reads with low entropy (k-mer content >=60%) (remaining=150112)
WARNING: some fusions were subsampled, because they have more than 300 supporting reads
[2023-01-03T23:19:24] Finding fusions and counting supporting reads (total=178362)
[2023-01-03T23:19:29] Merging adjacent fusion breakpoints (remaining=178319)
[2023-01-03T23:19:30] Filtering multi-mapping fusions by alignment score and read support (remaining=132433)
[2023-01-03T23:19:34] Estimating expected number of fusions by random chance (e-value)
[2023-01-03T23:19:35] Filtering fusions with both breakpoints in adjacent non-coding/intergenic regions (remaining=127706)
[2023-01-03T23:19:35] Filtering intragenic fusions with both breakpoints in exonic regions (remaining=126006)
[2023-01-03T23:19:36] Filtering fusions with <2 supporting reads (remaining=66249)
[2023-01-03T23:19:36] Filtering fusions with an e-value >=0.3 (remaining=51145)
[2023-01-03T23:19:36] Searching for internal tandem duplications <=100bp with >=10 supporting reads and >=7% allele fraction (remaining=51151)
[2023-01-03T23:19:36] Filtering fusions with both breakpoints in intronic/intergenic regions (remaining=33573)
[2023-01-03T23:19:36] Filtering in vitro-generated fusions between genes with an expression above the 99.8% quantile (remaining=767)
[2023-01-03T23:19:38] Searching for fusions with spliced split reads (remaining=770)
[2023-01-03T23:19:39] Selecting best breakpoints from genes with multiple breakpoints (remaining=180)
[2023-01-03T23:19:40] Filtering read-through fusions with breakpoints near the gene boundary (remaining=179)
[2023-01-03T23:19:40] Searching for fusions with >=4 spliced events (remaining=183)
[2023-01-03T23:19:40] Filtering blacklisted fusions in '/data/cwl-stage-dir/stg23f32962-e077-423a-a47e-2855121dcdef/blacklist_hg38_GRCh38_v2.3.0.tsv.gz' (remaining=136)
[2023-01-03T23:20:01] Filtering fusions with anchors <=23nt (remaining=133)
[2023-01-03T23:20:01] Filtering end-to-end fusions with low support (remaining=116)
[2023-01-03T23:20:01] Filtering fusions with no coverage around the breakpoints (remaining=114)
[2023-01-03T23:20:01] Indexing gene sequences
[2023-01-03T23:20:04] Filtering genes with >=30% identity (remaining=29)
[2023-01-03T23:20:04] Re-aligning chimeric reads to filter fusions with >=80% mis-mappers
A01052:137:H7Y5NDSX5:1:1332:18005:3740
Grepping the offending alignment from bam returns following results:
$ samtools view PRJ222361.bam | grep "A01052:137:H7Y5NDSX5:1:1332:18005:3740"
A01052:137:H7Y5NDSX5:1:1332:18005:3740 83 chr10 28590725 0 61S25M65S = 28590725 -25 GCAAACAAGATAGAGACAAAAAGAAGCAAACAAGATAGAGACAAAAAGAAGCAAACAAGATAGAGACAAAAAGAAGCAAACAAGATAGAGACAAAAAGAAGCAAACAAGATAGAGACAAAAAGAAGCAAACAAGATAGAGACAAAAAGAAG ,FFFFFFFF:FFFF:FFFFF,F,FFFFF:FFFFFFFF:F,F,FFFFFFFFFFF:F:FFFF::,FFF:FFFFFFF,FFFFFFFFF:FFF::FFF:FFFFFFFFFF,FFFFFFF:FF:FFFF:FFF::FFFF:FFFFF:FFFFFFFFFFFFFF RG:Z:TCCTGGAC.CTGCTTAA.1.221215_A01052_0137_BH7Y5NDSX5.PRJ222361_L2201824 AS:i:25 ZS:Z:R ps:i:50 HI:i:2 NH:i:2 NM:i:0 SD:f:0 SA:Z:chr10,28590725,-,36H25M90H,8,0;chr10,28590725,-,11H25M115H,8,0;chr10,28590725,-,86H25M40H,8,0;
A01052:137:H7Y5NDSX5:1:1332:18005:3740 163 chr10 28590725 0 56S25M70S = 28590725 25 CAAGATAGAGACAAAAAGAAGCAAACAAGATAGAGACAAAAAGAAGCAAACAAGATAGAGACAAAAAGAAGCAAACAAGATAGAGACAAAAAGAAGCAAACAAGATAGAGACAAAAAGAAGCAAACAAGATAGAGACAAAAAGAAGCAAAC FFFFF:FFFFFFF:FFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF,FFFFFFF,F:FFF,FFF:FFF:FF::F:FFF:F::FF:FFFFFFFFFFFF,FF::FFFF,:FFFF:,FFFF,F:F,FFFFFFF:FFFF:FF:::F RG:Z:TCCTGGAC.CTGCTTAA.1.221215_A01052_0137_BH7Y5NDSX5.PRJ222361_L2201824 AS:i:25 ps:i:50 HI:i:1 NH:i:1 NM:i:0 SD:f:0 SA:Z:chr10,28590725,+,106H25M20H,0,0;chr10,28590725,+,81H25M45H,8,0;chr10,28590725,+,31H25M95H,0,0;
A01052:137:H7Y5NDSX5:1:1332:18005:3740 339 chr10 28590725 0 111H25M15H = 28590725 -25 AGAGACAAAAAGAAGCAAACAAGAT F:FF:FFFF:FFF::FFFF:FFFFF RG:Z:TCCTGGAC.CTGCTTAA.1.221215_A01052_0137_BH7Y5NDSX5.PRJ222361_L2201824 AS:i:25 ps:i:50 HI:i:1 NH:i:2 NM:i:0 SD:f:0
A01052:137:H7Y5NDSX5:1:1332:18005:3740 2129 chr10 28590725 8 86H25M40H = 28590725 -25 AGAGACAAAAAGAAGCAAACAAGAT FF::FFF:FFFFFFFFFF,FFFFFF RG:Z:TCCTGGAC.CTGCTTAA.1.221215_A01052_0137_BH7Y5NDSX5.PRJ222361_L2201824 AS:i:25 ps:i:-8388608 NM:i:0 SD:f:0 SA:Z:chr10,28590725,-,61S25M65S,0,0;chr10,28590725,-,36H25M90H,8,0;chr10,28590725,-,11H25M115H,8,0;
A01052:137:H7Y5NDSX5:1:1332:18005:3740 2129 chr10 28590725 8 36H25M90H = 28590725 -25 AGAGACAAAAAGAAGCAAACAAGAT F:F,F,FFFFFFFFFFF:F:FFFF: RG:Z:TCCTGGAC.CTGCTTAA.1.221215_A01052_0137_BH7Y5NDSX5.PRJ222361_L2201824 AS:i:25 ps:i:-8388608 NM:i:0 SD:f:0 SA:Z:chr10,28590725,-,61S25M65S,0,0;chr10,28590725,-,11H25M115H,8,0;chr10,28590725,-,86H25M40H,8,0;
A01052:137:H7Y5NDSX5:1:1332:18005:3740 2129 chr10 28590725 8 11H25M115H = 28590725 -25 AGAGACAAAAAGAAGCAAACAAGAT FFF:FFFFF,F,FFFFF:FFFFFFF RG:Z:TCCTGGAC.CTGCTTAA.1.221215_A01052_0137_BH7Y5NDSX5.PRJ222361_L2201824 AS:i:25 ps:i:-8388608 NM:i:0 SD:f:0 SA:Z:chr10,28590725,-,61S25M65S,0,0;chr10,28590725,-,36H25M90H,8,0;chr10,28590725,-,86H25M40H,8,0;
A01052:137:H7Y5NDSX5:1:1332:18005:3740 2209 chr10 28590725 0 106H25M20H = 28590725 25 AGAGACAAAAAGAAGCAAACAAGAT ,FF::FFFF,:FFFF:,FFFF,F:F RG:Z:TCCTGGAC.CTGCTTAA.1.221215_A01052_0137_BH7Y5NDSX5.PRJ222361_L2201824 AS:i:25 ps:i:-8388608 NM:i:0 SD:f:0 SA:Z:chr10,28590725,+,56S25M70S,0,0;chr10,28590725,+,81H25M45H,8,0;chr10,28590725,+,31H25M95H,0,0;
A01052:137:H7Y5NDSX5:1:1332:18005:3740 2209 chr10 28590725 8 81H25M45H = 28590725 25 AGAGACAAAAAGAAGCAAACAAGAT :F:FFF:F::FF:FFFFFFFFFFFF RG:Z:TCCTGGAC.CTGCTTAA.1.221215_A01052_0137_BH7Y5NDSX5.PRJ222361_L2201824 AS:i:25 ps:i:-8388608 NM:i:0 SD:f:0 SA:Z:chr10,28590725,+,56S25M70S,0,0;chr10,28590725,+,106H25M20H,0,0;chr10,28590725,+,31H25M95H,0,0;
A01052:137:H7Y5NDSX5:1:1332:18005:3740 2209 chr10 28590725 0 31H25M95H = 28590725 25 AGAGACAAAAAGAAGCAAACAAGAT FFFFFF:FFFFFFFFFFFFFFFFF, RG:Z:TCCTGGAC.CTGCTTAA.1.221215_A01052_0137_BH7Y5NDSX5.PRJ222361_L2201824 AS:i:25 ps:i:-8388608 NM:i:0 SD:f:0 SA:Z:chr10,28590725,+,56S25M70S,0,0;chr10,28590725,+,106H25M20H,0,0;chr10,28590725,+,81H25M45H,8,0;
I'll be interested to know if this helps you narrow down the issue. Please let me know if further information is needed.
Thank you for providing the details. Was this alignment generated using STAR? If so, what were the alignment parameters? You can get this info from the BAM header using samtools view -H PRJ222361.bam | grep STAR
.
We use Dragen pipeline for processing of WTS data and the alignment was produced using Dragen's aligner, which we believe (based on our assessment) is STAR-like.
In case of interest, the internal Dragen parameters used to generate this bam were (copied from the header):
@PG ID: DRAGEN SW build VN: 05.021.609.3.9.3 CL: /opt/edico/bin/dragen --fastq-list fastq_list.csv --intermediate-results-dir /ephemeral/intermediate-results/ --enable-rna true --annotation-file /data/cwl-stage-dir/stg087f7dcd-7b91-4706-9e51-8b1e67d19284/ref-transcripts.non-zero-length.gtf --enable-duplicate-marking false --enable-map-align-output true --enable-rna-gene-fusion true --enable-rna-quantification true --rrna-filter-enable true --lic-instance-id-location /opt/instance-identity --output-directory L2201824_dragen --output-file-prefix PRJ222361 --ref-dir /ephemeral/ref/hg38-v8-altaware-cnv-anchored/ DS: Aligner.align-direction: 4 Aligner.aln-en-mask: 15 Aligner.aln-enable: 1 Aligner.backtrace-delay: 8 Aligner.bed-score-bonus: 11 Aligner.chicken: 0 Aligner.dedup-min-qual: 15 Aligner.disable-lfsr: 0 Aligner.en-alt-hap-aln: 1 Aligner.en-chimeric-aln: 1 Aligner.exon-jump-en: 1 Aligner.filt-clip1-pair: 10 Aligner.filt-clip1-unpr: -5 Aligner.filt-clip2-pair: 5 Aligner.filt-clip2-unpr: -10 Aligner.filt-min-qual: 0 Aligner.fix-overlap-mapq: 0 Aligner.gap-ext-pen: 1 Aligner.gap-open-pen: 6 Aligner.global: 0 Aligner.hard-clips: 6 Aligner.hw-mapq-max: 250 Aligner.leftalign-mode: 2 Aligner.mapq-coeff: 152 Aligner.mapq-floor-1snp: 0 Aligner.mapq-max: 60 Aligner.mapq-min-len: 50 Aligner.mapq-strict-sjs: 0 Aligner.match-score: 1 Aligner.max-rescues: 1023 Aligner.max-stitch-olap: 48 Aligner.min-overhang: 6 Aligner.min-overhang-ann: 4 Aligner.min-score-coeff: 0.0 Aligner.mismatch-pen: 4 Aligner.no-align-score: -8388608 Aligner.no-ambig-strand: 1 Aligner.no-noncan-motifs: 1 Aligner.no-unclip-score: 1 Aligner.no-unpaired: 0 Aligner.paired-mate-info: 1 Aligner.pe-max-penalty: 255 Aligner.pe-orientation: 0 Aligner.pe-sample-max-insert: 65535 Aligner.resc-ifpair-len: 48 Aligner.resc-nopair-len: 0 Aligner.rescue-ceil-factor: 3 Aligner.rescue-hifreq: 0 Aligner.rescue-kmer-len: 32 Aligner.rescue-max-snps: 7 Aligner.rna-max-insert: 4000000 Aligner.rna-pair-pen-rat: 5.71875 Aligner.sample-mapq0: 1 Aligner.sec-aligns: 9 Aligner.sec-aligns-hard: 1 Aligner.sec-phred-delta: 0 Aligner.sec-score-delta: 0 Aligner.supp-aligns: 3 Aligner.supp-as-sec: 0 Aligner.sw-all: 0 Aligner.sw-burst-diffs: 1 Aligner.sw-early-max: 256 Aligner.sw-extra-intvl: 1 Aligner.unclip-score: 5 Aligner.unpaired-pen: 80 Aligner.xs-pair-penalty: 25 Mapper.ann-sj-max-indel: 10 Mapper.chain-diam-lim: 8 Mapper.chain-rad-lim: 5 Mapper.dark-base: 2 Mapper.edit-chain-limit: 29 Mapper.edit-mode: 0 Mapper.edit-read-len: 100 Mapper.edit-seed-num: 6 Mapper.filt-good-qual: 0 Mapper.filter-len-ratio: 4 Mapper.intvl-max-hits: 16 Mapper.intvl-min-chains: 8 Mapper.intvl-sample-hits: 16 Mapper.intvl-seed-length: 60 Mapper.intvl-seed-longer: 8 Mapper.intvl-target-hits: 32 Mapper.map-orientations: 0 Mapper.max-dram-reqs: 0 Mapper.max-hifreq-hits: 16 Mapper.max-intron-bases: 200000 Mapper.max-lowq-bases: 4294967295 Mapper.max-lowq-ratio: 4294967295 Mapper.max-seed-chains: 511 Mapper.max-seed-freq: 32 Mapper.max-splice-gap: 150 Mapper.max-splice-olap: 16 Mapper.min-intron-bases: 20 Mapper.n-base-qual: 2 Mapper.read-edge-seeds: 0 Mapper.reduce-seed-ext: 1 Mapper.rna-filt-ratio: 25 Mapper.rna-max-covg-gap: 150 Mapper.rna-max-recurs: 65536 Mapper.rna-span-log-min: 13 Mapper.seed-density: 0.5 Mapper.seed-max-age: 31 Mapper.seed-old-age: 9 Mapper.splice-olap-adj: 4 Mapper.stats-source-sel: 0 Mapper.trace-mode: 0 Mapper.trace-offset: 0 Mapper.trace-read-id: 0 annotation-file: /data/cwl-stage-dir/stg087f7dcd-7b91-4706-9e51-8b1e67d19284/ref-transcripts.non-zero-length.gtf append-read-index-to-name: false assert-valid-cigar: false bam2dbam-threads: 12 bin-split-target-size: 104857600 bin-split-threshold: 7158278826 binner-use-odirect: true bqsr-context-low-quality-tail: 2 bqsr-cycle-indel-context: 3 bqsr-cycle-mismatch-context: 2 bqsr-emit-indel-tags: true bqsr-enable-recal-indels: true bqsr-match-gatk: true bqsr-max-cycle-value: 500 c2s_aligner_packet_size: 524288 c2s_aligner_pool_size: 64 c2s_decomp_packet_size: 1048576 c2s_decomp_pool_size: 64 c2s_graph_packet_size: 65536 c2s_graph_pool_size: 256 c2s_hmm_packet_size: 4096 c2s_hmm_pool_size: 128 c2s_smw_packet_size: 4096 c2s_smw_pool_size: 128 cgvcf-num-file-scan-threads: 8 cgvcf-save-tmp-files: false cgvcf-split-chromosomes: false combine-samples-by-name: false credentials-1: 6HH1bOskRJM= credentials-2: 63gs0Eg5se3p58IwLJlJhSlhzFmWTl2Q credentials-3: dbam2bam_threads: 32 debug: 0 disable_reg_validation: 0 distinct-dbam-input-format: true dump-hang-diag-first: true dump-map-align-registers: 0 dump_config: 0 dump_registers: 0 dupmark-version: hash echo_aligner_log: 0 echo_general_log: 0 echo_mapper_log: 0 enable-auto-multifile: true enable-bqsr: false enable-deterministic-sort: true enable-duplicate-marking: false enable-hang-diag: true enable-hla: false enable-http-server: false enable-map-align: true enable-map-align-output: true enable-methylation-calling: true enable-pstack: true enable-public-bitstream: false enable-rna: true enable-rna-gene-fusion: true enable-rna-quantification: true enable-sampling: true enable-single-cell-rna: false enable-sort: true enable-spin: false enable-umi-stat-estimator: false enable-variant-caller: false enable-variant-deduplication: false enable-vcf-indexing: true enable-watchdog: true enable-write-input-dbam: false evict_all_intermediate_results: false fastq-list: fastq_list.csv fastq-n-quality: 2 fastq-offset: 33 fastq2dbam_ratio: 2 fastq_block_size: 1048576 fastq_pool_size: 4 fastqc-adapter-file: adapter_sequences.fasta fastqc-granularity: 7 fastqc-only: false filter-flags-from-output: 0 gc-metrics-cover-percent: 75 gc-metrics-enable: false gc-metrics-num-bins: 5 gc-metrics-only-covered: false gc-metrics-window-size: 100 generate-en-tags: false generate-md-tags: false generate-xq-tags: true generate-zs-tags: false hla-allele-frequency-file: hla_classI_allele_frequency.csv hla-reference-file: hla_classI_ref_freq.fasta hla-tiebreaker-threshold: 0.899999976 hla-unmapped: false hla-zygosity-threshold: 0.150000006 ht-anchor-bin-bits: 0 ht-cost-coeff-seed-freq: 0.5 ht-cost-coeff-seed-len: 1 ht-cost-penalty: 0 ht-cost-penalty-incr: 0.69999999999999996 ht-crc-extended: 0 ht-crc-primary: 0 ht-dump-int-params: 0 ht-ext-rec-cost: 4 ht-ext-table-alloc: 0 ht-max-dec-factor: 1 ht-max-ext-incr: 12 ht-max-ext-seed-len: 0 ht-max-seed-freq: 16 ht-max-seed-freq-len: 98 ht-max-table-chunks: 0 ht-mem-limit: 0GB ht-methylated: false ht-methylated-combined: false ht-min-repair-prob: 0.20000000000000001 ht-override-size-check: 0 ht-pri-max-seed-freq: 0 ht-rand-hit-extend: 8 ht-rand-hit-hifreq: 1 ht-ref-seed-interval: 1 ht-repair-strategy: 0 ht-seed-len: 21 ht-size: 0GB ht-soft-seed-freq-cap: 12 ht-target-seed-freq: 4 ht-test-only: 0 ht-write-hash-bin: 0 http-server-port: 7993 ignore-version-check: false input-qname-suffix-delimiter: / intermediate-results-dir: /ephemeral/intermediate-results/ lic-instance-id-location: /opt/instance-identity linkedreads-correction-table1: linkedreads_corrections_1.txt linkedreads-correction-table2: linkedreads_corrections_2.txt linkedreads-correction-table3: linkedreads_corrections_3.txt linkedreads-enable: false logfile_prefix: /opt/edico/logs/ mapper_cigar: 0 max_bin_size: 943718400 max_ios_inflight: 1024 methylation-TAPS: false methylation-compress-cx-report: false methylation-generate-cytosine-report: false methylation-generate-mbias-report: false methylation-keep-ref-cytosine: false methylation-mapping-implementation: single-pass methylation-match-bismark: false methylation-protocol: none methylation-reports-only: false min-predicted-output-gb: 200 multiplier: 1 no-reset: false output-directory: L2201824_dragen output-file-prefix: PRJ222361 output-format: bam pair-by-name: true pair-suffix-delimiter: / partition-on-compression-bottleneck: false pe-stats-continuous-update: false pe-stats-interval-delay: 5 pe-stats-interval-memory: 10 pe-stats-interval-size: 25000 pe-stats-sample-size: 100000 pe-stats-update-log-only: false preserve-bqsr-tags: false preserve-map-align-order: false qc-indel-denovo-quality-threshold: 0.02 qc-snp-denovo-quality-threshold: 0.050000000000000003 read-trimmers: none recordset-memory: 1073741824 ref-dir: /ephemeral/ref/hg38-v8-altaware-cnv-anchored/ reg_errors_are_warnings: 0 remove-duplicates: false repeat-genotype-enable: false repeat-genotype-min-anchor-mapq: 60 repeat-genotype-min-baseq: 20 repeat-genotype-min-score: 0.90000000000000002 repeat-genotype-read-depth: 0 repeat-genotype-read-length: 0 repeat-genotype-region-extension-length: 1000 repeat-genotype-skip-unaligned: true repeat-genotype-specs: rna-ann-sj-min-len: 6 rna-cv-min-expression: 0 rna-gf-aggressive-filters: false rna-gf-blast-pairs: /opt/edico/config/RNA_blast_pairs_human.tsv.gz rna-gf-coverage-lookup-window: 1000 rna-gf-enriched-genes: rna-gf-enriched-only: true rna-gf-enriched-regions: rna-gf-exon-snap: 50 rna-gf-mate-overhang: 8 rna-gf-max-partners: 3 rna-gf-merge-calls: true rna-gf-min-alt-to-ref: 0.00999999978 rna-gf-min-anchor: 12 rna-gf-min-blast-pairs-eval: 1e-100 rna-gf-min-breakpoint-distance: 0 rna-gf-min-breakpoint-mapq: 20 rna-gf-min-cis-distance: 200000 rna-gf-min-covered-bases: 125 rna-gf-min-covered-bases-uncaptured: 79 rna-gf-min-neighbor-dist: 15 rna-gf-min-nonref-split-support: 3 rna-gf-min-score: 0.5 rna-gf-min-score-ratio: 0.150000006 rna-gf-min-support: 2 rna-gf-min-support-be: 10 rna-gf-min-unique-alignments: 2 rna-gf-num-threads: 4 rna-gf-ref-anchor: 8 rna-gf-report-antisense: false rna-gf-report-intergenic: false rna-gf-report-intronic: false rna-gf-restrict-genes: true rna-gf-score-model: rna-gf-split-percent-align: 0.00100000005 rna-library-type: A rna-mapq-unique: 0 rna-quantification-fld-max: 1000 rna-quantification-fld-mean: 250 rna-quantification-fld-sd: 25 rna-quantification-full-concordance: false rna-quantification-gc-bias: true rna-quantification-inference-max: 10000 rna-quantification-inference-min: 100 rna-quantification-init-uniform: 0 rna-quantification-tlen-min: 500 rna-quantification-use-em: 0 rna-repeat-genes: rna-repeat-intervals: /opt/edico/config/rna_repeats_GRCh38.bed rna_aligner_buffer_size: 67108864 rna_aligner_buffers: 0 rrna-filter-enable: true rrna-filter-mode: 3 rrna-filter-score-bonus-coeff: 0.025000000000000001 rrna-filter-score-bonus-const: 2 s2c_dbam_block_size: 65536 s2c_dbam_pool_size: 32 s2c_decomp_block_size: 262144 s2c_decomp_pool_size: 64 s2c_graph_block_size: 16384 s2c_graph_pool_size: 256 s2c_hmm_block_size: 16384 s2c_hmm_pool_size: 16384 s2c_phase1_packet_size: 16384 s2c_phase2_packet_size: 16384 s2c_smw_block_size: 16384 s2c_smw_pool_size: 2048 single-cell-count-introns: false single-cell-global-umi: false single-cell-number-cells: 3000 single-cell-threshold: ratio soft-read-trimmers: polyg sort_buffer_size: 1048576 stop-at-read: 0 strip-input-qname-suffixes: true sv-denovo-threshold: 20 sv-enable-rrm-for-insertions-in-cancer-calling-modes: true sv-enable-rrm-for-insertions-in-germline-calling-modes: true sv-filtered-contigs: *_random chrUn_* *_alt HLA-* sv-min-diploid-variant-score: 10 sv-min-pass-diploid-gt-score: 5 sv-min-pass-diploid-variant-score: 20 sv-min-pass-somatic-score: 30 sv-min-somatic-score: 10 sv-report-small-dup-as-ins: true sv-rna-min-candidate-variant-size: 1000 trim-disable-mapping: false trim-filter-dummy-len: 10 trim-filter-set-flag: true trim-min-len-read1: 20 trim-min-len-read2: 20 trim-min-length: 20 trim-min-quality: 0 trim-polyg-early-exit-threshold: -500 trim-polyg-g-score-r1-3prime: 15 trim-polyg-g-score-r1-5prime: 0 trim-polyg-g-score-r2-3prime: 15 trim-polyg-g-score-r2-5prime: 0 trim-polyg-kmer-len: 25 trim-polyg-kmer-non-g: 2 trim-polyg-min-trim-r1-3prime: 6 trim-polyg-min-trim-r1-5prime: 6 trim-polyg-min-trim-r2-3prime: 6 trim-polyg-min-trim-r2-5prime: 6 umi-base-representation-min-ratio: 0.5 umi-correction-scheme: lookup umi-correction-table: umi_correction_table.txt umi-emit-multiplicity: both umi-enable: false umi-enable-contextual-corrections: true umi-enable-probability-model-merging: false umi-enable-shift-corrections: true umi-enable-trimming: true umi-end-mask-length: 0 umi-generate-bam-tags: true umi-generate-error-rate-bam-tags: false umi-masked-base-qual: 1 umi-max-base-quality: 63 umi-max-collision-rate: 1000 umi-max-read-error-rate: 1 umi-mem-throttle-gb: 30 umi-min-consensus-base-quality: 0 umi-min-input-base-quality: 0 umi-min-map-quality: 0 umi-min-reads-per-region: 4096 umi-min-supporting-reads: 2 umi-padding: A umi-preserve-input-tags: false umi-probability-merging-duplex-merging-thres: 1 umi-probability-merging-max-transition-ratio: 100000000 umi-probability-merging-min-isize-freq: 0.001 umi-probability-merging-seq-error: 0.001 umi-probability-merging-simplex-fuzzy-merging-thres: 1 umi-probability-merging-simplex-merging-thres: 1 umi-r1-length: 0 umi-random-merge-factor: 2 umi-read-minority-min-ratio: 0.5 umi-soft-clip-ratio: 0.5 umi-source: qname umi-start-mask-length: 0 umi-stat-estimation-max-fragment-count: 10 umi-stat-estimation-max-fragment-size: 1000 umi-stat-estimation-max-interval-number: 10 umi-stat-estimation-min-probability-unique-fragment: 0.998 umi-stat-estimation-umi-jumping-estimation-method: simplex umi-trim-allowed-mismatches: 1 umi-verbose-metrics: false use-mock-config: false vc-active-only: false vc-decoy-contigs: NC_007605 hs37d5 chrUn_KN707*v1_decoy chrUn_JTFH0100*v1_decoy KN707*.1 JTFH0100*.1 chrEBV CMV HBV HCV* HIV* KSHV HTLV* MCV SV40 HPV* vc-emit-ref-confidence: NONE vc-emit-zero-coverage-intervals: true vc-enable-basecall-filter: false vc-enable-deterministic-run: true vc-enable-hw-hmm: true vc-enable-hw-hmm-dump-receiver-data: false vc-enable-hw-hmm-dump-sender-data: false vc-enable-hw-hmm-dump-worker-data: false vc-enable-hw-smw: true vc-enable-hw-smw-dump-receiver-data: false vc-enable-hw-smw-dump-sender-data: false vc-enable-hw-smw-dump-worker-data: false vc-hw-hmm-timeout: 100000 vc-limit-genomecov-output: false vc-max-alternate-alleles: 6 vc-max-haps-per-job: 2 vc-sw-cell-width: 16 vc-sw-instruction-set: sse2 vc-sw-mode: 0 vd-output-match-log: false vqsr-lod-cutoff: -5.0 vqsr-num-gaussians: 8,2,4,2 watchdog-active-timeout: 600 watchdog-dump-processes: true; watchdog-exit-on-hang: true watchdog-freemem-threshold: 2GB watchdog-idle-timeout: 600 watchdog-max-threads: 8388608 watchdog-poll-interval: 1 watchdog-resources-monitored: THREADS IO MEMORY watchdog-verbose-logging: true num-engines: 4 Aligner.aln-min-score: 22 Aligner.bed-base-div64: 531016768 Aligner.clip-pe-overhang: 0 Aligner.give-pair-score: 1 Aligner.idx-base-div64: 530247420 Aligner.intr-motif0-pen: 30 Aligner.intr-motif12-pen: 5 Aligner.intr-motif34-pen: 15 Aligner.intr-motif56-pen: 20 Aligner.mapq-coeff-scaled: 38912 Aligner.match-n-score: 31 Aligner.methyl-aln-mode: 0 Aligner.ref-base-div64: 505628316 Aligner.rrna-bonus-coeff: 512 Aligner.rrna-bonus-const: 640 Aligner.rrna-mode: 3 Aligner.rrna-ref-id: 170 Aligner.seed-len-inv: 195 Aligner.steer-delta: 12 Aligner.stitch-skip-bias: 25 Aligner.supp-min-score-adj: 0 Mapper.adapter-detect-0: 2284439564 0 Mapper.adapter-detect-1: 2809932556 0 Mapper.adapter-detect-10: 0 0 Mapper.adapter-detect-11: 0 0 Mapper.adapter-detect-12: 0 0 Mapper.adapter-detect-13: 0 0 Mapper.adapter-detect-14: 0 0 Mapper.adapter-detect-15: 0 0 Mapper.adapter-detect-2: 3533599244 0 Mapper.adapter-detect-3: 870182156 0 Mapper.adapter-detect-4: 0 0 Mapper.adapter-detect-5: 0 0 Mapper.adapter-detect-6: 0 0 Mapper.adapter-detect-7: 0 0 Mapper.adapter-detect-8: 0 0 Mapper.adapter-detect-9: 0 0 Mapper.adapter-min-trim: 4 Mapper.adapter-trim-r1-3p-s0-p0: 0 Mapper.adapter-trim-r1-3p-s0-p1: 0 Mapper.adapter-trim-r1-3p-s0-p2: 0 Mapper.adapter-trim-r1-3p-s0-p3: 0 Mapper.adapter-trim-r1-3p-s1-p0: 0 Mapper.adapter-trim-r1-3p-s1-p1: 0 Mapper.adapter-trim-r1-3p-s1-p2: 0 Mapper.adapter-trim-r1-3p-s1-p3: 0 Mapper.adapter-trim-r1-3p-s2-p0: 0 Mapper.adapter-trim-r1-3p-s2-p1: 0 Mapper.adapter-trim-r1-3p-s2-p2: 0 Mapper.adapter-trim-r1-3p-s2-p3: 0 Mapper.adapter-trim-r1-3p-s3-p0: 0 Mapper.adapter-trim-r1-3p-s3-p1: 0 Mapper.adapter-trim-r1-3p-s3-p2: 0 Mapper.adapter-trim-r1-3p-s3-p3: 0 Mapper.adapter-trim-r1-3p-s4-p0: 0 Mapper.adapter-trim-r1-3p-s4-p1: 0 Mapper.adapter-trim-r1-3p-s4-p2: 0 Mapper.adapter-trim-r1-3p-s4-p3: 0 Mapper.adapter-trim-r1-3p-s5-p0: 0 Mapper.adapter-trim-r1-3p-s5-p1: 0 Mapper.adapter-trim-r1-3p-s5-p2: 0 Mapper.adapter-trim-r1-3p-s5-p3: 0 Mapper.adapter-trim-r1-3p-s6-p0: 0 Mapper.adapter-trim-r1-3p-s6-p1: 0 Mapper.adapter-trim-r1-3p-s6-p2: 0 Mapper.adapter-trim-r1-3p-s6-p3: 0 Mapper.adapter-trim-r1-3p-s7-p0: 0 Mapper.adapter-trim-r1-3p-s7-p1: 0 Mapper.adapter-trim-r1-3p-s7-p2: 0 Mapper.adapter-trim-r1-3p-s7-p3: 0 Mapper.adapter-trim-r1-5p-s0-p0: 0 Mapper.adapter-trim-r1-5p-s0-p1: 0 Mapper.adapter-trim-r1-5p-s0-p2: 0 Mapper.adapter-trim-r1-5p-s0-p3: 0 Mapper.adapter-trim-r1-5p-s1-p0: 0 Mapper.adapter-trim-r1-5p-s1-p1: 0 Mapper.adapter-trim-r1-5p-s1-p2: 0 Mapper.adapter-trim-r1-5p-s1-p3: 0 Mapper.adapter-trim-r1-5p-s2-p0: 0 Mapper.adapter-trim-r1-5p-s2-p1: 0 Mapper.adapter-trim-r1-5p-s2-p2: 0 Mapper.adapter-trim-r1-5p-s2-p3: 0 Mapper.adapter-trim-r1-5p-s3-p0: 0 Mapper.adapter-trim-r1-5p-s3-p1: 0 Mapper.adapter-trim-r1-5p-s3-p2: 0 Mapper.adapter-trim-r1-5p-s3-p3: 0 Mapper.adapter-trim-r1-5p-s4-p0: 0 Mapper.adapter-trim-r1-5p-s4-p1: 0 Mapper.adapter-trim-r1-5p-s4-p2: 0 Mapper.adapter-trim-r1-5p-s4-p3: 0 Mapper.adapter-trim-r1-5p-s5-p0: 0 Mapper.adapter-trim-r1-5p-s5-p1: 0 Mapper.adapter-trim-r1-5p-s5-p2: 0 Mapper.adapter-trim-r1-5p-s5-p3: 0 Mapper.adapter-trim-r1-5p-s6-p0: 0 Mapper.adapter-trim-r1-5p-s6-p1: 0 Mapper.adapter-trim-r1-5p-s6-p2: 0 Mapper.adapter-trim-r1-5p-s6-p3: 0 Mapper.adapter-trim-r1-5p-s7-p0: 0 Mapper.adapter-trim-r1-5p-s7-p1: 0 Mapper.adapter-trim-r1-5p-s7-p2: 0 Mapper.adapter-trim-r1-5p-s7-p3: 0 Mapper.adapter-trim-r2-3p-s0-p0: 0 Mapper.adapter-trim-r2-3p-s0-p1: 0 Mapper.adapter-trim-r2-3p-s0-p2: 0 Mapper.adapter-trim-r2-3p-s0-p3: 0 Mapper.adapter-trim-r2-3p-s1-p0: 0 Mapper.adapter-trim-r2-3p-s1-p1: 0 Mapper.adapter-trim-r2-3p-s1-p2: 0 Mapper.adapter-trim-r2-3p-s1-p3: 0 Mapper.adapter-trim-r2-3p-s2-p0: 0 Mapper.adapter-trim-r2-3p-s2-p1: 0 Mapper.adapter-trim-r2-3p-s2-p2: 0 Mapper.adapter-trim-r2-3p-s2-p3: 0 Mapper.adapter-trim-r2-3p-s3-p0: 0 Mapper.adapter-trim-r2-3p-s3-p1: 0 Mapper.adapter-trim-r2-3p-s3-p2: 0 Mapper.adapter-trim-r2-3p-s3-p3: 0 Mapper.adapter-trim-r2-3p-s4-p0: 0 Mapper.adapter-trim-r2-3p-s4-p1: 0 Mapper.adapter-trim-r2-3p-s4-p2: 0 Mapper.adapter-trim-r2-3p-s4-p3: 0 Mapper.adapter-trim-r2-3p-s5-p0: 0 Mapper.adapter-trim-r2-3p-s5-p1: 0 Mapper.adapter-trim-r2-3p-s5-p2: 0 Mapper.adapter-trim-r2-3p-s5-p3: 0 Mapper.adapter-trim-r2-3p-s6-p0: 0 Mapper.adapter-trim-r2-3p-s6-p1: 0 Mapper.adapter-trim-r2-3p-s6-p2: 0 Mapper.adapter-trim-r2-3p-s6-p3: 0 Mapper.adapter-trim-r2-3p-s7-p0: 0 Mapper.adapter-trim-r2-3p-s7-p1: 0 Mapper.adapter-trim-r2-3p-s7-p2: 0 Mapper.adapter-trim-r2-3p-s7-p3: 0 Mapper.adapter-trim-r2-5p-s0-p0: 0 Mapper.adapter-trim-r2-5p-s0-p1: 0 Mapper.adapter-trim-r2-5p-s0-p2: 0 Mapper.adapter-trim-r2-5p-s0-p3: 0 Mapper.adapter-trim-r2-5p-s1-p0: 0 Mapper.adapter-trim-r2-5p-s1-p1: 0 Mapper.adapter-trim-r2-5p-s1-p2: 0 Mapper.adapter-trim-r2-5p-s1-p3: 0 Mapper.adapter-trim-r2-5p-s2-p0: 0 Mapper.adapter-trim-r2-5p-s2-p1: 0 Mapper.adapter-trim-r2-5p-s2-p2: 0 Mapper.adapter-trim-r2-5p-s2-p3: 0 Mapper.adapter-trim-r2-5p-s3-p0: 0 Mapper.adapter-trim-r2-5p-s3-p1: 0 Mapper.adapter-trim-r2-5p-s3-p2: 0 Mapper.adapter-trim-r2-5p-s3-p3: 0 Mapper.adapter-trim-r2-5p-s4-p0: 0 Mapper.adapter-trim-r2-5p-s4-p1: 0 Mapper.adapter-trim-r2-5p-s4-p2: 0 Mapper.adapter-trim-r2-5p-s4-p3: 0 Mapper.adapter-trim-r2-5p-s5-p0: 0 Mapper.adapter-trim-r2-5p-s5-p1: 0 Mapper.adapter-trim-r2-5p-s5-p2: 0 Mapper.adapter-trim-r2-5p-s5-p3: 0 Mapper.adapter-trim-r2-5p-s6-p0: 0 Mapper.adapter-trim-r2-5p-s6-p1: 0 Mapper.adapter-trim-r2-5p-s6-p2: 0 Mapper.adapter-trim-r2-5p-s6-p3: 0 Mapper.adapter-trim-r2-5p-s7-p0: 0 Mapper.adapter-trim-r2-5p-s7-p1: 0 Mapper.adapter-trim-r2-5p-s7-p2: 0 Mapper.adapter-trim-r2-5p-s7-p3: 0 Mapper.alt-aware: 1 Mapper.chain-filt-ratio: 1024 Mapper.pri-crc-poly: 2C991CE6A8DD55 Mapper.sec-crc-poly: 1524CA66E8D39 Mapper.cut-bases: 0 Mapper.edit-pattern: 1 Mapper.edit-period: 15 Mapper.edit-seeds-limit: 6 Mapper.enable-anchors: 0 Mapper.enable-ann-sjs: 1 Mapper.extab-base-div64: 444596224 Mapper.filter-dummy-len: 10 Mapper.max-read-len: 4294967295 Mapper.methyl-map-mode: 0 Mapper.min-read-len: 0 Mapper.min-trim-bases: 0 Mapper.polyg-early-exit: 4294966796 Mapper.polyg-g-score: 251662080 Mapper.polyg-kmer-len: 25 Mapper.polyg-kmer-non-g: 2 Mapper.polyg-min-trim: 101058054 Mapper.pos-bin-gran-sel: 7 Mapper.pri-crc-bits: 54 Mapper.pri-seed-bases: 21 Mapper.qual-cutoff: 16711935 Mapper.ref-alt-seed: 3038827520 Mapper.ref-seed-interval: 16 Mapper.rna-mode: 1 Mapper.sec-crc-bits: 49 Mapper.seed-pattern: 1 Mapper.seed-period: 2 Mapper.sjs-base-div64: 531112940 Mapper.table-base-div64: 0 Mapper.table-size-64ths: 53 Mapper.trim-flags: 256 Mapper.trim-steps: 4294967281 Mapper.trim-steps-soft: 2
Dragen represents chimeric alignments in a slightly different way than STAR does. Arriba is not prepared for this change. I could adapt Arriba, but the easier solution is to run Arriba as intended. Would it be acceptable for you to implement your workflow to use the run_arriba.sh
script as described in the manual? This script takes FastQs as input and aligns them with STAR.
Thanks for looking into it, Sebastian. Interesting that we have just hit this issue as we have run >100 samples through Arriba using Dragen BAMs and we have found it to be compatible. I assume this wouldn't be the first time we have chimeric reads in the data.
Would it be acceptable for you to implement your workflow to use the run_arriba.sh script as described in the manual?
We can but the issue here is Dragen bam is used by curation team for manual inspection and confirmation of WTS results. This was the major motivation for using Dragen bam as an input to Arriba, so that we are consistent across results.
Dragen bam is used by curation team for manual inspection
This is a good argument to get it to work with Dragen. Another general argument is that Arriba builds on open standards (SAM format) and it would be nice to see it work with different aligners which are also standards-compliant.
I tried to reproduce the issue on my side, but failed to do so. In my tests, these alignments are discarded as malformed. This makes sense given that there are several issues with the SAM records. Most notably, the HI
tags are inconsistent and not compliant with the SAM specification as far as I am aware. I have no explanation why my observations are different from yours. This read should never make it to the mismappers filter, when you process the sample. It should be discarded in the very beginning.
I don't have a Dragen license. I played around with the open-source dragmap implementation, instead. But I found that it is not suited for fusion detection from RNA-Seq data for various reasons. I am not sure if the commercial dragmap implementation is better in this regard. But given the malformed HI
tag in your output and my bad experience with the open-source implementation, I am concerned that dragmap may not be suitable for fusion detection from RNA-Seq after all. This would be a strong argument against using the Dragen alignments. I get your point about consistency, but I would rather have good inconsistent alignments than poor-quality consistent alignments. To get a better feeling for the commercial dragmap implementation, may I ask you to do me a favor and align a public sample for me using Dragen and send me the BAM? For example this one from the MCF-7 cell line:
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR358/ERR358487/ERR358487_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR358/ERR358487/ERR358487_2.fastq.gz
If it turns out that the alignments from the commercial dragmap are just as unsuitable for fusion detection as those from the open-source implementation, I don't think it makes sense to invest a lot of time into fixing problems arising from these alignments. Instead, I would strongly advise to stick to the recommended workflow based on STAR. In contrast, if the commercial implementation is better, then we can dig deeper and try to solve your original problem. I'm hoping for the latter, because it would be really cool to see Arriba work with more than just one aligner (STAR).
I have got hold of the SRA data and ran it through our Dragen workflow. The workflow has completed successfully, including Arriba step, which is positive, I think.
I am attaching a file with presigned urls for the BAM and its index file. Please let me know if you've any access issues. presigned_urls.csv
Most notably, the HI tags are inconsistent and not compliant with the SAM specification as far as I am aware.
May I ask which specific issues are you referring to? Is it the i'
in tag or another format issue? We are happy to also open up a discussion with Dragen team, in case they are willing to make adjustment to their alignment format. Think, matching standard SAM specification is a good enough argument. However, will need specific pointers to what we would like to see adjusted/fixed.
Thanks.
I have got hold of the SRA data and ran it through our Dragen workflow.
Thanks for providing the alignments. This was helpful. I had a look at them and I can confirm that the commercial dragmap is much better for fusion detection than the open-source version. Compared to STAR, there are two short-comings:
In summary, it should be fine to use dragmap (after I have implemented a fix for issue 1). You will get superior results with STAR, however. I need to run some more benchmarks to tell how big the difference is.
May I ask which specific issues are you referring to? Is it the i' in tag or another format issue?
After carefully studying the SAM format specification, I came to the conclusion that the alignments are in accordance with the spec. However, I came across some alignments in the MCF-7 sample you sent me, which did not seem correct: They had an NH
tag with a value of 0, even though they were not unmapped. IMO, this makes no sense. Here is an example:
ERR358487.453798 163 chr1 2448986 60 100M = 2449132 246 GGCCCTGGAACGCATCCCAGGATCAGTGGCTCTGCAGGCTCCAGGCCCCACCAGCTGCTCTGAGGAGGCTCAGATGGGTCTGGGAGTCCTGGTCGGAGCT CCCFFFFFHHHHHJJIJJIJJJJJJIIJJJJJJJJJJJJJJJJIIIJJJGGIGEHCCEHHEEFEFFDDDDDDDDCDDD?CCCDDDB?CCCCDCDDCBDBB RG:Z:ERR358487 AS:i:100 ps:i:195 HI:i:0 NH:i:0 NM:i:0 SD:f:0 XQ:i:250 SA:Z:chr1_KI270762v1_alt,176,+,100M,60,0;
This problem is irrelevant to fusion detection with Arriba, however. So don't bother opening a ticket with the dragmap developers.
Coming back to the original error: As mentioned previously, I was unable to reproduce it on my end. But the MCF-7 data you sent me gave me an idea about what could be the issue on your end. I have implemented a fix for this and made it available under the new docker tag uhrigs/arriba-debug:2.3.0fix
. Can you please run this version on your problematic sample PRJ222361.bam
and let me know if it goes through? This new debug version also generates some debug output, which will possibly help me locate the issue on your end. For this reason, I am asking you to also send me the full command-line output of Arriba. Thanks.
That's incredibly helpful, thanks so much! We'd previously ran a comparison of DRAGEN vs STAR alignments and their impact on Arriba and didn't notice any difference, but that was a small sample set and likely did not include multi-mapped chimeric reads.
Thanks Sebastian. I have rerun our internal problematic sample. Arriba completed successfully this time ✨ and produced following commandline output:
[2023-01-12T00:14:11] Launching Arriba 2.3.0
[2023-01-12T00:14:11] Loading assembly from '/data/cwl-stage-dir/stg05543d6c-2114-4828-80ea-d7f9aec127ac/hg38.fa'
[2023-01-12T00:14:26] Loading annotation from '/data/cwl-stage-dir/stg886a76f2-e7f2-49bd-adfd-dc9dd9ae79d0/ref-transcripts.non-zero-length.gtf'
[2023-01-12T00:14:32] Reading chimeric alignments from '/data/cwl-stage-dir/stgcda142a9-eddc-4965-9a0d-c8a913c3820f/PRJ222361.bam' clipping 111
clipping 56
clipping 86
clipping 36
clipping 11
clipping 106
clipping 81
clipping 31
433
501
513
clipping 111
clipping 56
clipping 81
433
436
443
488
513
WARNING: 9879264 SAM records were malformed and ignored
(total=1653336)
[2023-01-12T00:19:50] Marking multi-mapping alignments (marked=352011)
[2023-01-12T00:19:51] Detecting strandedness (reverse)
[2023-01-12T00:19:51] Assigning strands to alignments
[2023-01-12T00:19:51] Annotating alignments
[2023-01-12T00:19:58] Filtering duplicates (remaining=601197)
[2023-01-12T00:20:00] Filtering mates which do not map to interesting contigs (1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y AC_* NC_*) (remaining=577930)
[2023-01-12T00:20:00] Filtering mates which only map to viral contigs (AC_* NC_*) (remaining=577930)
[2023-01-12T00:20:00] Filtering viral contigs with expression lower than the top 5 (remaining=577930)
[2023-01-12T00:20:01] Filtering viral contigs with less than 5% coverage (remaining=577930)
[2023-01-12T00:20:02] Estimating fragment length (mate gap mean=-70.5273, mate gap stddev=39.5252, read length mean=141.795)
[2023-01-12T00:20:02] Filtering read-through fragments with a distance <=10000bp (remaining=556251)
[2023-01-12T00:20:03] Filtering inconsistently clipped mates (remaining=482111)
[2023-01-12T00:20:03] Filtering breakpoints adjacent to homopolymers >=6nt (remaining=429376)
[2023-01-12T00:20:03] Filtering fragments with small insert size (remaining=395425)
[2023-01-12T00:20:04] Filtering alignments with long gaps (remaining=395425)
[2023-01-12T00:20:04] Filtering fragments with both mates in the same gene (remaining=375629)
[2023-01-12T00:20:04] Filtering fusions arising from hairpin structures (remaining=312072)
[2023-01-12T00:20:05] Filtering reads with a mismatch p-value <=0.01 (remaining=184324)
[2023-01-12T00:20:06] Filtering reads with low entropy (k-mer content >=60%) (remaining=150111)
WARNING: some fusions were subsampled, because they have more than 300 supporting reads
[2023-01-12T00:20:08] Finding fusions and counting supporting reads (total=178359)
[2023-01-12T00:20:13] Merging adjacent fusion breakpoints (remaining=178317)
[2023-01-12T00:20:13] Filtering multi-mapping fusions by alignment score and read support (remaining=132442)
[2023-01-12T00:20:18] Estimating expected number of fusions by random chance (e-value)
[2023-01-12T00:20:20] Filtering fusions with both breakpoints in adjacent non-coding/intergenic regions (remaining=127707)
[2023-01-12T00:20:20] Filtering intragenic fusions with both breakpoints in exonic regions (remaining=126007)
[2023-01-12T00:20:20] Filtering fusions with <2 supporting reads (remaining=66263)
[2023-01-12T00:20:20] Filtering fusions with an e-value >=0.3 (remaining=51170)
[2023-01-12T00:20:20] Searching for internal tandem duplications <=100bp with >=10 supporting reads and >=7% allele fraction (remaining=51174)
[2023-01-12T00:20:21] Filtering fusions with both breakpoints in intronic/intergenic regions (remaining=33583)
[2023-01-12T00:20:21] Filtering in vitro-generated fusions between genes with an expression above the 99.8% quantile (remaining=765)
[2023-01-12T00:20:22] Searching for fusions with spliced split reads (remaining=768)
[2023-01-12T00:20:24] Selecting best breakpoints from genes with multiple breakpoints (remaining=178)
[2023-01-12T00:20:24] Filtering read-through fusions with breakpoints near the gene boundary (remaining=177)
[2023-01-12T00:20:24] Searching for fusions with >=4 spliced events (remaining=177)
[2023-01-12T00:20:25] Filtering blacklisted fusions in '/data/cwl-stage-dir/stg63d2173b-6ec9-42e2-9016-57884a41141d/blacklist_hg38_GRCh38_v2.3.0.tsv.gz' (remaining=131)
[2023-01-12T00:20:46] Filtering fusions with anchors <=23nt (remaining=128)
[2023-01-12T00:20:46] Filtering end-to-end fusions with low support (remaining=111)
[2023-01-12T00:20:46] Filtering fusions with no coverage around the breakpoints (remaining=109)
[2023-01-12T00:20:46] Indexing gene sequences
[2023-01-12T00:20:49] Filtering genes with >=30% identity (remaining=28)
[2023-01-12T00:20:49] Re-aligning chimeric reads to filter fusions with >=80% mis-mappers (remaining=5)
[2023-01-12T00:20:49] Selecting best breakpoints from genes with multiple breakpoints (remaining=5)
[2023-01-12T00:20:49] Searching for additional isoforms (remaining=5)
[2023-01-12T00:20:50] Assigning confidence scores to events
[2023-01-12T00:20:50] Writing fusions to file 'fusions.tsv'
WARNING: encountered early stop codon in transcript ENST00000646881 at amino acid 49 (error in GTF file?) => predicted peptide sequence may be wrong
WARNING: encountered early stop codon in transcript ENST00000646881 at amino acid 49 (error in GTF file?) => predicted peptide sequence may be wrong
[2023-01-12T00:20:50] Writing discarded fusions to file 'fusions-discarded.tsv'
[2023-01-12T00:20:56] Freeing resources
[2023-01-12T00:21:02] Done (elapsed time=00:06:51, CPU time=00:06:50, peak memory=7.97gb)
Good to know that you believe dragen alignments are in accordance with the SAM spec.
In summary, it should be fine to use dragmap (after I have implemented a fix for issue 1). You will get superior results with STAR, however. I need to run some more benchmarks to tell how big the difference is.
Again, incredibly helpful indeed. We hope comparison between STAR and Dragen aligner isn't completely off/one sided.
I'm glad to hear that this fix solved the problem. The debug output is also what I expected. The fix drops the offending read for the reason I suspected. I will soon release the next version of Arriba, which will incorporate the fix. I will close this issue as solved for now.
I also had a look at the impact of using Dragen instead of STAR on fusion detection. Apart from the decreased sensitivity for multi-mapping chimeric reads, I also noticed that Dragen is in general inferior when it comes to aligning split reads. By running Arriba on the BAM file you sent me, I realized that the split reads for some of the fusions are much lower than when I use STAR for alignment. I'm not sure if there are Dragen parameters that could be tweaked; and even if, I don't have the means to benchmark such tweaks, because I do not have a Dragen license. So there is not much I can do. On the plus side, Dragen seems to be better at aligning discordant mates (or maybe it's just that some reads which STAR aligns as split reads are aligned as discordant mates by Dragen - I did not check this). Nevertheless, split reads are more useful for fusion detection, so I prefer to have more of those. That being said, I think it should be okay to use Dragen if someone does not want to use STAR, because the whole pipeline is built on Dragen. Sure, there is a bit of a sensitivity loss, but Dragen is certainly better than some other aligners that I have tried out in the past for fusion detection. When it comes to detecting driving fusions (which are usually expressed at a high level and have good read support), I doubt you will notice much of difference between STAR and Dragen. It's only for fusions with low expression, such as some loss-of-function fusions, or reads that are hard to align (multi-mapping, IG loci/TCR loci) that I expect to see a notable difference in the detection rate.
Thanks again for taking the time to report this and making me aware of Dragen as an alternative to STAR. Having the flexibility to use one's favorite aligner is precisely the reason why I built Arriba in a way which is compliant with open standards, namely the SAM format, rather than some proprietary file format, which is what most other fusion detection tools do. It's good to see that this idea finally comes to fruition.
To add some more details about the STAR vs. Dragen comparison. Here is the output from the latest development version of Arriba on the alignments from sample ERR358487 MCF-7 from both aligners:
fusions_star.txt fusions_dragen.txt
Note how the split read counts are consistently lower for Dragen than for STAR, e.g., for the fusion FOXA1-TTC6 STAR finds 288+218 split reads, whereas Dragen only finds 153+141. It's similar for BCAS4-BCAS3 or ARFGEF2-SULF2 to name a few. This short-coming of Dragen will ultimately translate into poorer sensitivity, especially for fusions with just a few reads as support. On the other hand, Dragen seems to find more discordant mates, e.g., SMARCA4-CARM1 has just 60 discordant mates from STAR, but 81 from Dragen. It seems Dragen prefers discordant mate alignment where it should make a split read alignment.
Hi Sebastian,
Thanks so much for adding information to the issue. As mentioned in my email, we have got specific pointers to initiate further discussion with Dragen team.
Will let you know if there are other questions.
Hello,
At UMCCR, we use arriba as a fusion caller for WTS clinical samples for cancer patients. Arriba is part of our WTS production pipeline setup and works great.
We have recently encountered a case (using docker image
uhrigs/arriba:2.3.0
), where it crashes with following error:Any idea, what might be causing it? Quick google search suggests it might be a C++ library issue. This sample does have low quality (effective coverage is ~50% of raw coverage between duplication, unmapped reads and chimeric reads.), we were wondering if it was related to chimeric fusion artefacts? Help will be appreciated.
Best, Sehrish