kcleal / dysgu

Toolkit for calling structural variants using short or long reads
MIT License
100 stars 12 forks source link

Zero Division Error During Call Component #16

Closed pblaney closed 2 years ago

pblaney commented 2 years ago

Hello Kez,

Exciting work you've put together here with dysgu. Very happy to have come across your preprint as I've been looking for a tool that would generate such useful read metrics for the major SV classes. I have been trying to get the software to complete on one of my samples and continue to run into this same error upon running the call step.

dysgu fetch working_dir PD26400a_T.final.bam
2021-12-17 11:44:15,327 [INFO   ]  [dysgu-fetch] Version: 1.3.0
2021-12-17 11:59:25,375 [INFO   ]  dysgu fetch PD26400a_T.final.bam written to working_dir/PD26400a_T.final.dysgu_reads.bam, n=6183390, time=0:15:10 h:m:s

dysgu call --ibam PD26400a_T.final.bam --sites PD26400a_T_vs_PD26400b_N.consensus.somatic.sv.vcf Homo_sapiens_assembly38.fasta working_dir workin
g_dir/PD26400a_T.final.dysgu_reads.bam > svs.vcf
2021-12-17 12:01:57,321 [INFO   ]  [dysgu-call] Version: 1.3.0
2021-12-17 12:01:57,321 [INFO   ]  Input file is: working_dir/PD26400a_T.final.dysgu_reads.bam
2021-12-17 12:01:57,321 [INFO   ]  call --ibam PD26400a_T.final.bam --sites PD26400a_T_vs_PD26400b_N.consensus.somatic.sv.vcf Homo_sapiens_assembly38.fasta working_dir working_dir/PD26400a_T.final.dysgu_reads.bam
[W::hts_idx_load3] The index file is older than the data file: PD26400a_T.final.bai
2021-12-17 12:01:57,573 [INFO   ]  Sample name: PD26400a_T
2021-12-17 12:01:57,573 [INFO   ]  Writing vcf to stdout
2021-12-17 12:01:57,573 [INFO   ]  Running pipeline
2021-12-17 12:01:58,283 [INFO   ]  Calculating insert size. Removed 27 outliers with insert size >= 938.0
2021-12-17 12:01:58,299 [INFO   ]  Inferred read length 151.0, insert median 458, insert stdev 89
2021-12-17 12:01:58,301 [INFO   ]  Max clustering dist 903
2021-12-17 12:01:58,301 [INFO   ]  Minimum support 3
2021-12-17 12:01:58,301 [INFO   ]  Reading --sites
[W::vcf_parse_info] INFO/END=34143939 is smaller than POS at chr1:84232901
2021-12-17 12:01:58,351 [INFO   ]  Building graph with clustering 903 bp
2021-12-17 12:03:57,234 [INFO   ]  Total input reads 6183142
2021-12-17 12:03:59,688 [INFO   ]  Added 57 variants from input sites
2021-12-17 12:04:01,797 [INFO   ]  Graph constructed
Traceback (most recent call last):
  File "/conda/bin/dysgu", line 8, in <module>
    sys.exit(cli())
  File "/conda/lib/python3.7/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/conda/lib/python3.7/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/conda/lib/python3.7/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/conda/lib/python3.7/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/conda/lib/python3.7/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/conda/lib/python3.7/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/conda/lib/python3.7/site-packages/dysgu/main.py", line 442, in call_events
    cluster.cluster_reads(ctx.obj)
  File "dysgu/cluster.pyx", line 1115, in dysgu.cluster.cluster_reads
  File "dysgu/cluster.pyx", line 898, in dysgu.cluster.pipe1
  File "dysgu/cluster.pyx", line 633, in dysgu.cluster.component_job
  File "dysgu/call_component.pyx", line 1914, in dysgu.call_component.call_from_block_model
  File "dysgu/call_component.pyx", line 1928, in dysgu.call_component.call_from_block_model
  File "dysgu/call_component.pyx", line 1880, in dysgu.call_component.multi
  File "dysgu/call_component.pyx", line 988, in dysgu.call_component.single
  File "dysgu/call_component.pyx", line 608, in dysgu.call_component.make_single_call
  File "dysgu/call_component.pyx", line 286, in dysgu.call_component.count_attributes2
ZeroDivisionError: float division

To troubleshoot, I first ran dysgu test to determine if my build was fully successful and this completed without error. Next, I checked the output bam from the fetch step with some samtools commands. It's wasn't corrupt and had a reasonable amount of reads (6183390). Then I went into the script call_component.pyx and looked into the code block where the error originated:

if clipped_bases > 0 and aligned_bases > 0:
        er.clip_qual_ratio = (aligned_base_quals / aligned_bases) / (clipped_base_quals / clipped_bases)
    else:
        er.clip_qual_ratio = 0

From that I can see it's one of aligned_bases, clipped_base_quals, or clipped_bases I couldn't see a reason any to these values would be zero based on the BAM that was produced so I wanted to present the issue here.

Let me know if there is any other information or tests you would like me to try. Best regards, Patrick

Below are packages installed for build via Anaconda and pip:

Cython                 0.29.26
click                  8.0.3
numpy                  1.21.2
pandas                 1.3.5
pysam                  0.18.0
networkx               2.6.3
scikit-learn           1.0.1
ncls                   0.0.62
scikit-bio             0.5.6
edlib                  1.3.9
sortedcontainers       2.4.0
lightgbm               3.3.1
kcleal commented 2 years ago

Hi Patrick, Thanks for raising this and providing a detailed report. It looks like it is clipped_base_quals, when that is 0 then a ZeroDivisionError arises. Sorry for missing this! I have pushed a fix for this to github. A release will be uploaded to pypi shortly (v1.3.3).

Could I ask what sequencing platform you are using? The error implies that there are some clipped_bases (soft-clipped bases) but the sum of their base qualities is zero.

pblaney commented 2 years ago

Hi Kez, I greatly appreciate the fast response. This data was sequenced on Illumina. Also, I just checked the output BAM in IGV and I can see the soft-clipped bases. It appears their individual base qualities are non-zero so the sum most certainly shouldn't be.

image

I am unsure if this is a correct interpretation so let me know if you have a preferred method for evaluating these metrics.

EDIT: After searching the BAM there are some reads that have no clipping. Perhaps it's these reads that are being caught and thus the sum would be zero?

image
kcleal commented 2 years ago

Thanks for the extra info, I think your data looks 'normal' as far as I can see. I have released v1.3.3 to pypi, so hopefully that resolves issue. I will close this for now, but please re-open if this doesn't fix it.

Also, as you mentioned in your comment that you are interested in the metrics dysgu provides, you might want to add --metrics flag to your run/call command to get the full list of metrics in the output, without this a more limited set is reported.

pblaney commented 2 years ago

Hey Kez,

Just wanted to update you and confirm the error has been resolved and the call function successfully moved past this part of the function. However there was an additional error that arose so I will open a new issue to detail this for you. I think it'll be a small fix so hopefully it won't take up too much of your time.

Thanks again for addressing this, Patrick