ZeroDivisionError: float division by zero

tayebwajb commented 10 years ago

Hi, when I run the preprocess command of PyLOH with a bedfile of target regions from exome sequencing, I get the error below. When I remove the bedfile, it runs fine. What could be the issue here? I tried different tumor-normal pairs but got the same error.

Traceback (most recent call last):
  File "PyLOH.py", line 110, in <module>
    args.func(args)
  File "/home/stud-jta/Downloads/PyLOH-1.2.1/pyloh/preprocess/run_preprocess.py", line 31, in run_preprocess
    segments.segmentation_by_bed(normal_bam, tumor_bam, args.segments_bed)
  File "/home/stud-jta/Downloads/PyLOH-1.2.1/pyloh/preprocess/data.py", line 164, in segmentation_by_bed
    self.log2_ratio.append(np.log2(1.0*tumor_reads_num/normal_reads_num))
ZeroDivisionError: float division by zero

yil8 commented 10 years ago

Hi, thanks for your interest in PyLOH. I assume you used the target regions from the sequencing provider, e.g. Illumina TruSeq Exome. The problem with this bedfile is that most exome regions are quite short, often less than 200bp. It is therefore quite possible that no reads were mapped to some of those small regions in either the tumor or the normal sample. In that case normal_reads_num can be zero, which gives the division by zero error.
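To make the failure mode concrete, here is a minimal Python 3 sketch of a guarded log2 ratio. It is not PyLOH's code: the helper name `log2_ratio_or_none` and the read counts are hypothetical, and returning `None` for empty regions is just one possible policy (a pseudocount would be another).

```python
import math

def log2_ratio_or_none(tumor_reads_num, normal_reads_num):
    """Return log2(tumor/normal), or None when either count is zero.

    Hypothetical helper for illustration only; not part of PyLOH.
    """
    if tumor_reads_num <= 0 or normal_reads_num <= 0:
        # A zero normal count would divide by zero; a zero tumor count
        # would ask for log2(0). Both regions are flagged as unusable.
        return None
    return math.log2(tumor_reads_num / normal_reads_num)

# Made-up (tumor, normal) read counts for three short target regions.
counts = [(120, 100), (35, 0), (0, 12)]
ratios = [log2_ratio_or_none(t, n) for t, n in counts]

# Keep only the regions that produced a usable ratio.
usable = [r for r in ratios if r is not None]
```

Skipping such regions avoids the crash, but it does not address the short-region problems with using target regions as segments.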

Even if all the small regions have non-zero reads mapped, it is still not recommended to use exome/targeted regions directly as the input bedfile. Those regions are too short, which brings two major disadvantages: (1) read depth has much larger variance than in longer segments (as long as 100K bp); (2) most regions will have only one or even zero heterozygous SNP sites, which is not enough to support the probabilistic model of PyLOH.
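As a rough illustration of point (1), the sketch below compares the relative noise of a read count in a 200bp region and a 100K bp segment. It assumes read counts are Poisson distributed, which is a simplification and not necessarily PyLOH's model; the 50x mean coverage and 100bp read length are made-up values, and both function names are hypothetical.

```python
import math

def expected_reads(region_length_bp, mean_coverage, read_length_bp):
    """Approximate number of reads covering a region at a given mean coverage."""
    return mean_coverage * region_length_bp / read_length_bp

def relative_noise(region_length_bp, mean_coverage=50.0, read_length_bp=100):
    """Coefficient of variation of a Poisson count: 1 / sqrt(expected count)."""
    n = expected_reads(region_length_bp, mean_coverage, read_length_bp)
    return 1.0 / math.sqrt(n)

short_cv = relative_noise(200)       # typical short exome target
long_cv = relative_noise(100_000)    # long segment, as in whole genome data
noise_ratio = short_cv / long_cv     # how much noisier the short region is
```

Under these assumptions the relative noise scales with the inverse square root of the region length, so the short target is about sqrt(500) times noisier than the long segment.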

People at other institutes have also run into this problem with exome/targeted sequencing, and we are now working on it.

Currently, there are two possible solutions for this: (1). Run a segmentation algorithm (e.g. BICseq) directly on the exome sequencing bam file and use the large segments in its output as the bedfile. However, the accuracy of segmentation on exome sequencing is significantly lower than on whole genome sequencing, due to its targeted nature.

(2). Use the log2 read depth ratio between the tumor and normal samples for each small region as the input to a segmentation algorithm designed for array-based data (e.g. circular binary segmentation by DNAcopy in R). Then concatenate the small regions that fall within the same segment output by the segmentation algorithm (which suggests they share the same copy number). Use each concatenated region as a pseudo segment in the input bedfile; it should have larger read depth and more heterozygous SNP sites. This should give better segmentation accuracy than (1), but it requires some modification of the original PyLOH code to handle the new data format, and we are currently working on this.
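The concatenation step in (2) could look roughly like the Python sketch below. It assumes each small region already carries a segment label from the segmentation step (the `segment_id` field and the function name are hypothetical), and it represents a pseudo segment as the span from the first region's start to the last region's end, which is a simplification. It is not the pipeline shipped with PyLOH.

```python
from itertools import groupby

def merge_regions_by_segment(regions):
    """Collapse consecutive regions sharing (chrom, segment_id) into one pseudo segment.

    `regions` is a list of (chrom, start, end, segment_id) tuples, sorted by
    chromosome and start position. Hypothetical helper, not part of PyLOH.
    """
    merged = []
    for (chrom, _seg_id), group in groupby(regions, key=lambda r: (r[0], r[3])):
        group = list(group)
        # Span from the first region's start to the last region's end.
        merged.append((chrom, group[0][1], group[-1][2]))
    return merged

# Made-up target regions with segment labels from a segmentation run.
regions = [
    ("chr1", 1000, 1180, 1),
    ("chr1", 5000, 5150, 1),
    ("chr1", 9000, 9200, 2),
    ("chr2", 300, 480, 3),
]
pseudo_segments = merge_regions_by_segment(regions)

# Tab-separated BED-style lines (chrom, start, end) for the pseudo segments.
bed_lines = ["%s\t%d\t%d" % seg for seg in pseudo_segments]
```

The first two chr1 regions share a segment label, so they are merged into a single, longer pseudo segment, while the other two regions stay as they are.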

Hope the answer helps.

Thanks

yil8 commented 10 years ago

A new analysis pipeline for WES data, based on DNAcopy segmentation, is included in release 1.4.0.

uci-cbcl / PyLOH
