single-cell-genetics / cellSNP

Pileup biallelic SNPs from single-cell and bulk RNA-seq data
Apache License 2.0
71 stars 11 forks source link

RuntimeWarning: divide by zero encountered in log #4

Closed Griffan closed 5 years ago

Griffan commented 5 years ago
[cellSNP] mode 2: pileup 22 whole chromosomes in 23111 single cells.
Traceback (most recent call last):
  File "/net/1000g/fanzhang/bin/miniconda2/bin/cellSNP", line 9, in <module>
    load_entry_point('cellSNP==0.1.4', 'console_scripts', 'cellSNP')()
  File "/net/1000g/fanzhang/bin/miniconda2/lib/python2.7/site-packages/cellSNP-0.1.4-py2.7.egg/cellSNP/cellSNP.py", line 196, in main
    max_FLAG, min_LEN, doubletGL, True)
  File "/net/1000g/fanzhang/bin/miniconda2/lib/python2.7/site-packages/cellSNP-0.1.4-py2.7.egg/cellSNP/utils/pileup_regions.py", line 111, in pileup_regions
    cell_list, UMIs_list, barcodes)
  File "/net/1000g/fanzhang/bin/miniconda2/lib/python2.7/site-packages/cellSNP-0.1.4-py2.7.egg/cellSNP/utils/pileup_utils.py", line 309, in map_barcodes
    qual_cells[_idx][BASE_IDX[_base]] += qual_vector(_qual)
  File "/net/1000g/fanzhang/bin/miniconda2/lib/python2.7/site-packages/cellSNP-0.1.4-py2.7.egg/cellSNP/utils/pileup_utils.py", line 101, in qual_vector
    RV = [np.log(1-Q), np.log(3/4 - 2/3*Q), np.log(1/2 - 1/3*Q), np.log(Q)]
RuntimeWarning: divide by zero encountered in log

Hello, I encountered this error, what could be the reason?

huangyh09 commented 5 years ago

Hi, it might be related to the warning: "RuntimeWarning: divide by zero encountered in log". Before that you can try a few diagnoses: 1) change to single CPU mode by setting -p 1, see if you see the same error. 2) try it in Python 3 environment, which seems not likely the case though

Alternatively, I wonder which species you are working on. If human, I would normally pile up a list candidate SNPs, e.g., from 1000 genome project, to avoid potential RNA editing variants contamination.

Griffan commented 5 years ago

I tried all of these suggestions, none of them helped. Would you mind to explain a little bit what this part of code exactly do?

huangyh09 commented 5 years ago

one possibility is that your reads have some bases with lowest quality, hence Q=0: https://support.illumina.com/help/BaseSpace_OLH_009008/Content/Source/Informatics/BS/QualityScoreEncoding_swBS.htm

I'm not sure if this is the case? can you provide more details on the logs? is this failed from the begining or it runs a while and failed in the middle?

Griffan commented 5 years ago

The log I provided in the very beginning is all the information it outputs. If Q=0 is the case, is it possible to release a hot-fix to defend this situation?

huangyh09 commented 5 years ago

OK, I have added a minQ to avoid 0 value: https://github.com/huangyh09/cellSNP/blob/master/cellSNP/utils/pileup_utils.py#L100

Could you re-install from this repo (not on pypi yet), and see if it works? You may need to pip uninstall cellSNP first.

Griffan commented 5 years ago

I double checked the code, after pip uninstall and pip install, the line you modified didn't sync with your branch. I manually added this line, I will try again.

Griffan commented 5 years ago

After trying various ways to update and confirm that the code has included this patch, cellSNP still complains the very beginning message:

[cellSNP] mode 1: fetch given SNPs in 23111 single cells. [cellSNP] loading the VCF file for given SNPs ... [cellSNP] fetching 36312488 candidate variants ... Traceback (most recent call last): File "/net/1000g/fanzhang/bin/miniconda2/bin/cellSNP", line 9, in load_entry_point('cellSNP==0.1.4', 'console_scripts', 'cellSNP')() File "/net/1000g/fanzhang/bin/miniconda2/lib/python2.7/site-packages/cellSNP-0.1.4-py2.7.egg/cellSNP/cellSNP.py", line 209, in main min_MAPQ, max_FLAG, min_LEN, doubletGL, True) File "/net/1000g/fanzhang/bin/miniconda2/lib/python2.7/site-packages/cellSNP-0.1.4-py2.7.egg/cellSNP/utils/pileup_utils.py", line 240, in fetch_positions qual_list, cell_list, UMIs_list, barcodes) File "/net/1000g/fanzhang/bin/miniconda2/lib/python2.7/site-packages/cellSNP-0.1.4-py2.7.egg/cellSNP/utils/pileup_utils.py", line 309, in map_barcodes qual_cells[_idx][BASE_IDX[_base]] += qual_vector(_qual) File "/net/1000g/fanzhang/bin/miniconda2/lib/python2.7/site-packages/cellSNP-0.1.4-py2.7.egg/cellSNP/utils/pileup_utils.py", line 101, in qual_vector RV = [np.log(1-Q), np.log(3/4 - 2/3Q), np.log(1/2 - 1/3Q), np.log(Q)] RuntimeWarning: divide by zero encountered in log

I think this means Q also should be banned from more values. is it possible to explain how do these variables map to your paper formula? Thanks!

huangyh09 commented 5 years ago

That's very strange. I have given a lower bound 0.00001. This is a developing feature and has been used for the donor deconvolution.

In principle, you can turn off some of these codes for diagnosis. Alternatively, you can send me a few reads in your bam file, and I may unsterstand this issue better. Thanks.

Griffan commented 5 years ago

I mean it is not necessarily because of the term np.log(Q), I think np.log(1-Q), np.log(3/4 - 2/3Q), np.log(1/2 - 1/3Q) also potentially could cause the problem, especially log(1-Q), because base quality 1 is also very common.