jackh726 / bigtools

A high-performance BigWig and BigBed library in Rust
MIT License
70 stars 5 forks source link

Implement `-minMax` option of bigWigAverageOverBed. #38

Closed ghuls closed 4 months ago

ghuls commented 5 months ago

Implement -minMax option of bigWigAverageOverBed in bigwigaverageoverbed- (as eg: -minmax)

$ bigWigAverageOverBed 
bigWigAverageOverBed v2 - Compute average score of big wig over each bed, which may have introns.
usage:
   bigWigAverageOverBed in.bw in.bed out.tab
The output columns are:
   name - name field from bed, which should be unique
   size - size of bed (sum of exon sizes
   covered - # bases within exons covered by bigWig
   sum - sum of values over all bases covered
   mean0 - average over bases with non-covered bases counting as zeroes
   mean - average over just covered bases
Options:
   -stats=stats.ra - Output a collection of overall statistics to stat.ra file
   -bedOut=out.bed - Make output bed that is echo of input bed but with mean column appended
   -sampleAroundCenter=N - Take sample at region N bases wide centered around bed item, rather
                     than the usual sample in the bed item.
   -minMax - include two additional columns containing the min and max observed in the area.
LukasMahieu commented 4 months ago

Seems like a minor change but would be a major improvement for us since we use max and count often. Any plans to implement this @jackh726 ?

jackh726 commented 4 months ago

Yes, I can try to get to this over the weekend! Should be a small and easy change.

ghuls commented 4 months ago

@jackh726 I almost have a pull request ready for bigwigaverageoverbed and pybigtools:

In [21]: ?pybig.bigWigAverageOverBed
Signature: pybig.bigWigAverageOverBed(bigwig, bed, names=None, stats=None)
Docstring:
Gets the average values from a bigWig over the entries of a bed file.

Parameters:
    bigWig (str): The path to the bigWig.
    bed (str): The path to the bed.
    names (None, bool, or int):
        If `None`, then each return value will be a single `float` or `int`,
            the calculated statistics value over an interval in the bed file.
        If `True`, then each return value will be a tuple of the value of column 4 and the
            calculated statistics value over the interval with that name in the bed file.
        If `False`, then each return value will be a tuple of the interval in the format
            `{chrom}:{start}-{end}` and the calculated statistics value over that interval.
        If `0`, then each return value will match as if `False` was passed.
        If a `1+`, then each return value will be a tuple of the value of column of this
            parameter (1-based) and the calculated statistics value over the interval.
    stats (None, str): Calculate specific statistics for each bed entry:
        - `size`: Size of bed entry (`int`)
        - `bases`: Bases covered by bigWig (`int`)
        - `sum`: Sum of values over all bases covered (`float`)
        - `mean0`: Average over bases with non-covered bases counting as zeroes (`float`)
        - `mean` or `None`: Average over just covered bases (`float`)
        - `min`: Minimum over all bases covered (`float`)
        - `max`: Maximum over all bases covered (`float`)
        - `all`: `size`, `bases`, `sum`, `mean0`, `mean`, `min` and `max`

Returns:
    This returns a generator of values. (Therefore, to save to a list, do `list(bigWigAverageOverBed(...))`)
    If no name is specified (see the `names` parameter above), then returns a generator of `float`s or `int`s
    depending on the chosen statistics to calculate.
    If a name column is specified (see above), then returns a generator of tuples `({name}, {stats})`
Type:      builtin_function_or_method

In [7]: bwaob = pybig.bigWigAverageOverBed(bigwig="Astro.bw", bed="consensus_peaks_bicnn.bed", names=True, stats="all")

In [8]: for i, entry in enumerate(bwaob):
   ...:     if i == 10:
   ...:         break
   ...:     print(entry)
   ...: 
('chr1:3094805-3095305', 500, 500, 31.629608038812876, 0.06325921607762575, 0.06325921607762575, 0.044780001044273376, 0.08956000208854675)
('chr1:3095470-3095970', 500, 347, 7.746940311975777, 0.015493880623951555, 0.022325476403388406, 0.0, 0.044780001044273376)
('chr1:3112174-3112674', 500, 197, 3.239086809568107, 0.006478173619136214, 0.01644206502318836, 0.0, 0.0298533346503973)
('chr1:3113534-3114034', 500, 168, 2.5076801106333733, 0.005015360221266746, 0.01492666732519865, 0.0, 0.01492666732519865)
('chr1:3119746-3120246', 500, 500, 18.73296721186489, 0.03746593442372978, 0.03746593442372978, 0.01492666732519865, 0.044780001044273376)
('chr1:3120272-3120772', 500, 500, 23.28560095373541, 0.04657120190747082, 0.04657120190747082, 0.01492666732519865, 0.07463333755731583)
('chr1:3121251-3121751', 500, 348, 13.463853831402957, 0.026927707662805916, 0.038689235147709646, 0.0, 0.07463333755731583)
('chr1:3134586-3135086', 500, 385, 8.045473688282073, 0.016090947376564146, 0.02089733425527811, 0.0, 0.0298533346503973)
('chr1:3165708-3166208', 500, 294, 4.388440193608403, 0.008776880387216806, 0.01492666732519865, 0.0, 0.01492666732519865)
('chr1:3166923-3167423', 500, 99, 1.4777400651946664, 0.002955480130389333, 0.01492666732519865, 0.0, 0.01492666732519865)

In [9]: bwaob = pybig.bigWigAverageOverBed(bigwig="Astro.bw", bed="consensus_peaks_bicnn.bed", names=False, stats="all")

In [10]: for i, entry in enumerate(bwaob):
    ...:     if i == 10:
    ...:         break
    ...:     print(entry)
    ...: 
('chr1\t3094805\t3095305\tchr1:3094805-3095305', 500, 500, 31.629608038812876, 0.06325921607762575, 0.06325921607762575, 0.044780001044273376, 0.08956000208854675)
('chr1\t3095470\t3095970\tchr1:3095470-3095970', 500, 347, 7.746940311975777, 0.015493880623951555, 0.022325476403388406, 0.0, 0.044780001044273376)
('chr1\t3112174\t3112674\tchr1:3112174-3112674', 500, 197, 3.239086809568107, 0.006478173619136214, 0.01644206502318836, 0.0, 0.0298533346503973)
('chr1\t3113534\t3114034\tchr1:3113534-3114034', 500, 168, 2.5076801106333733, 0.005015360221266746, 0.01492666732519865, 0.0, 0.01492666732519865)
('chr1\t3119746\t3120246\tchr1:3119746-3120246', 500, 500, 18.73296721186489, 0.03746593442372978, 0.03746593442372978, 0.01492666732519865, 0.044780001044273376)
('chr1\t3120272\t3120772\tchr1:3120272-3120772', 500, 500, 23.28560095373541, 0.04657120190747082, 0.04657120190747082, 0.01492666732519865, 0.07463333755731583)
('chr1\t3121251\t3121751\tchr1:3121251-3121751', 500, 348, 13.463853831402957, 0.026927707662805916, 0.038689235147709646, 0.0, 0.07463333755731583)
('chr1\t3134586\t3135086\tchr1:3134586-3135086', 500, 385, 8.045473688282073, 0.016090947376564146, 0.02089733425527811, 0.0, 0.0298533346503973)
('chr1\t3165708\t3166208\tchr1:3165708-3166208', 500, 294, 4.388440193608403, 0.008776880387216806, 0.01492666732519865, 0.0, 0.01492666732519865)
('chr1\t3166923\t3167423\tchr1:3166923-3167423', 500, 99, 1.4777400651946664, 0.002955480130389333, 0.01492666732519865, 0.0, 0.01492666732519865)

In [11]: bwaob = pybig.bigWigAverageOverBed(bigwig="Astro.bw", bed="consensus_peaks_bicnn.bed", names=None, stats="all")

In [12]: for i, entry in enumerate(bwaob):
    ...:     if i == 10:
    ...:         break
    ...:     print(entry)
    ...: 
(500, 500, 31.629608038812876, 0.06325921607762575, 0.06325921607762575, 0.044780001044273376, 0.08956000208854675)
(500, 347, 7.746940311975777, 0.015493880623951555, 0.022325476403388406, 0.0, 0.044780001044273376)
(500, 197, 3.239086809568107, 0.006478173619136214, 0.01644206502318836, 0.0, 0.0298533346503973)
(500, 168, 2.5076801106333733, 0.005015360221266746, 0.01492666732519865, 0.0, 0.01492666732519865)
(500, 500, 18.73296721186489, 0.03746593442372978, 0.03746593442372978, 0.01492666732519865, 0.044780001044273376)
(500, 500, 23.28560095373541, 0.04657120190747082, 0.04657120190747082, 0.01492666732519865, 0.07463333755731583)
(500, 348, 13.463853831402957, 0.026927707662805916, 0.038689235147709646, 0.0, 0.07463333755731583)
(500, 385, 8.045473688282073, 0.016090947376564146, 0.02089733425527811, 0.0, 0.0298533346503973)
(500, 294, 4.388440193608403, 0.008776880387216806, 0.01492666732519865, 0.0, 0.01492666732519865)
(500, 99, 1.4777400651946664, 0.002955480130389333, 0.01492666732519865, 0.0, 0.01492666732519865)

In [13]: bwaob = pybig.bigWigAverageOverBed(bigwig="Astro.bw", bed="consensus_peaks_bicnn.bed", names=None, stats="max")

In [14]: for i, entry in enumerate(bwaob):
    ...:     if i == 10:
    ...:         break
    ...:     print(entry)
    ...: 
0.08956000208854675
0.044780001044273376
0.0298533346503973
0.01492666732519865
0.044780001044273376
0.07463333755731583
0.07463333755731583
0.0298533346503973
0.01492666732519865
0.01492666732519865

In [15]: bwaob = pybig.bigWigAverageOverBed(bigwig="Astro.bw", bed="consensus_peaks_bicnn.bed", names=None, stats="mean0")

In [16]: for i, entry in enumerate(bwaob):
    ...:     if i == 10:
    ...:         break
    ...:     print(entry)
    ...: 
0.06325921607762575
0.015493880623951555
0.006478173619136214
0.005015360221266746
0.03746593442372978
0.04657120190747082
0.026927707662805916
0.016090947376564146
0.008776880387216806
0.002955480130389333

In [17]: bwaob = pybig.bigWigAverageOverBed(bigwig="Astro.bw", bed="consensus_peaks_bicnn.bed", names=None, stats="mean")

In [18]: for i, entry in enumerate(bwaob):
    ...:     if i == 10:
    ...:         break
    ...:     print(entry)
    ...: 
0.06325921607762575
0.022325476403388406
0.01644206502318836
0.01492666732519865
0.03746593442372978
0.04657120190747082
0.038689235147709646
0.02089733425527811
0.01492666732519865
0.01492666732519865

In [19]: bwaob = pybig.bigWigAverageOverBed(bigwig="Astro.bw", bed="consensus_peaks_bicnn.bed", names=None, stats=None)

In [20]: for i, entry in enumerate(bwaob):
    ...:     if i == 10:
    ...:         break
    ...:     print(entry)
    ...: 
0.06325921607762575
0.022325476403388406
0.01644206502318836
0.01492666732519865
0.03746593442372978
0.04657120190747082
0.038689235147709646
0.02089733425527811
0.01492666732519865