jackh726 / bigtools

A high-performance BigWig and BigBed library in Rust
MIT License

Add more bigwigaverageoverbed stats #42

Closed ghuls closed 4 months ago

ghuls commented 4 months ago

Closes #38

jackh726 commented 4 months ago

Just btw, I have a rebase over #44 at https://github.com/jackh726/bigtools/tree/ghuls/add_more_bigwigaverageoverbed_stats.

(Once I get that other PR merged, I'll get this updated for review comments - if you can't get to it in time - so we can get it merged)

ghuls commented 4 months ago

Just btw, I have a rebase over #44 at https://github.com/jackh726/bigtools/tree/ghuls/add_more_bigwigaverageoverbed_stats.

(Once I get that other PR merged, I'll get this updated for review comments - if you can't get to it in time - so we can get it merged)

Looks like quite a bit of work in #44.

I was trying out returning a named tuple and got it to work, but noticed that my command was taking 5 seconds instead of around 4 seconds before.

So I decided to test creation times of tuples, dicts, named tuples and class instances. Named tuples seem to be the slowest to create:


In [1]: %timeit tuple_style = ('chr1\t3094805\t3095305\tchr1:3094805-3095305', 0.06325921607762575, 0.06325921607
   ...: 762575, 0.044780001044273376, 0.044780001044273376, 0.08956000208854675, 0.08956000208854675, 500, 500, 3
   ...: 1.629608038812876)
6.97 ns ± 0.0829 ns per loop (mean ± std. dev. of 7 runs, 100,000,000 loops each)

In [3]: %timeit dict_style = {"name": 'chr1\t3094805\t3095305\tchr1:3094805-3095305', "mean": 0.06325921607762575
   ...: , "mean0": 0.06325921607762575, "min": 0.044780001044273376, "min0": 0.044780001044273376, "max": 0.08956
   ...: 000208854675, "max0": 0.08956000208854675, "size": 500, "bases": 500, "sum": 31.629608038812876}
202 ns ± 6.27 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [5]: %timeit named_tuple_style = pybigtools.BigWigAverageOverBedSummaryStatistics(name='chr1\t3094805\t3095305
   ...: \tchr1:3094805-3095305', mean=0.06325921607762575, mean0=0.06325921607762575, min=0.044780001044273376, m
   ...: in0=0.044780001044273376, max=0.08956000208854675, max0=0.08956000208854675, size=500, bases=500, sum=31.
   ...: 629608038812876)
548 ns ± 23.2 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

class BigWigAverageOverBedSummaryStatisticsClass:
    def __init__(self, name, mean, mean0, min, min0, max, max0, size, bases, sum):
        self.name = name
        self.mean = mean
        self.mean0 = mean0
        self.min = min
        self.min0 = min0
        self.max = max
        self.max0 = max0
        self.size = size
        self.bases = bases
        self.sum = sum

In [12]: %timeit class_style = BigWigAverageOverBedSummaryStatisticsClass(name='chr1\t3094805\t3095305\tchr1:3094
    ...: 805-3095305', mean=0.06325921607762575, mean0=0.06325921607762575, min=0.044780001044273376, min0=0.0447
    ...: 80001044273376, max=0.08956000208854675, max0=0.08956000208854675, size=500, bases=500, sum=31.629608038
    ...: 812876)
449 ns ± 7.09 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
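
One caveat on these numbers (an aside, not part of the original benchmark): CPython constant-folds a tuple literal made entirely of constants, so the ~7 ns tuple case mostly measures loading a cached constant rather than constructing a tuple. A sketch that forces real construction on every call, for a fairer comparison against the dict/namedtuple cases:

```python
from timeit import timeit

# A tuple literal of constants is folded at compile time, so this
# mostly measures a single LOAD_CONST, not tuple construction.
t_const = timeit("(1.0, 2.0, 3.0)", number=1_000_000)

# Building the tuple from variables forces an actual BUILD_TUPLE on
# each call, which is what really happens when returning computed stats.
ns = {"a": 1.0, "b": 2.0, "c": 3.0}
t_build = timeit("(a, b, c)", globals=ns, number=1_000_000)

print(f"constant tuple: {t_const:.3f}s, built tuple: {t_build:.3f}s")
```

Even with real construction, plain tuples remain cheaper than dicts or namedtuples, so the ranking above still holds; only the gap is smaller.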

It would be nice if bigwigaverageoverbed would return mean, min, max (only considering positions that have actual values, returning NaN if needed) and mean0, min0 and max0 (assuming 0.0 for all positions that don't have an entry in the bigWig). This would change the behavior of mean compared to Kent tools (NaN instead of 0.0 when no values are found), but feels more consistent.
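
The requested semantics can be sketched in plain NumPy (a hypothetical helper for illustration, not pybigtools API): uncovered positions are NaN, the plain stats ignore them (returning NaN when nothing is covered), and the 0-suffixed stats count them as 0.0:

```python
import numpy as np

def summary_stats(values: np.ndarray) -> dict:
    """values: one entry per base of the BED region; NaN where the
    bigWig has no data. Returns both stat flavors described above."""
    covered = values[~np.isnan(values)]
    size = values.size    # total bases in the BED region
    bases = covered.size  # bases that have data in the bigWig
    total = float(covered.sum())  # sum over an empty array is 0.0

    nan = float("nan")
    # Plain stats: covered positions only; NaN when nothing is covered.
    mean = float(covered.mean()) if bases else nan
    vmin = float(covered.min()) if bases else nan
    vmax = float(covered.max()) if bases else nan

    # 0-suffixed stats: uncovered positions contribute 0.0.
    filled = np.nan_to_num(values, nan=0.0)
    return {
        "size": size, "bases": bases, "sum": total,
        "mean": mean, "min": vmin, "max": vmax,
        "mean0": total / size if size else nan,
        "min0": float(filled.min()) if size else nan,
        "max0": float(filled.max()) if size else nan,
    }
```

For example, `summary_stats(np.array([1.0, np.nan, 3.0, np.nan]))` gives mean 2.0 but mean0 1.0, and min0 drops to 0.0 while min stays 1.0.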

-fn pybigtools(_py: Python, m: &PyModule) -> PyResult<()> {
+fn pybigtools(py: Python, m: &PyModule) -> PyResult<()> {
     m.add("__version__", env!("CARGO_PKG_VERSION"))?;
     m.add_wrapped(wrap_pyfunction!(open))?;
     m.add_wrapped(wrap_pyfunction!(bigWigAverageOverBed))?;
@@ -1366,5 +1382,25 @@ fn pybigtools(_py: Python, m: &PyModule) -> PyResult<()> {

     m.add_class::<EntriesIterator>()?;

+    let collections = PyModule::import(py, "collections").unwrap();
+    let namedtuple = collections.getattr("namedtuple").unwrap();
+    let bigwigaverageoverbed_summary_statistics = namedtuple
+        .call(
+            (
+                "BigWigAverageOverBedSummaryStatistics".to_object(py),
+                (
+                    "name", "mean", "mean0", "min", "min0", "max", "max0", "size", "bases", "sum",
+                )
+                    .to_object(py),
+            ),
+            None,
+        )
+        .unwrap();
+    m.add(
+        "BigWigAverageOverBedSummaryStatistics",
+        bigwigaverageoverbed_summary_statistics,
+    )
+    .unwrap();
+
     Ok(())
 }
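
For readers less familiar with pyo3: the diff above registers, at module import time, the same object this pure-Python sketch builds (the field values here are placeholders):

```python
from collections import namedtuple

# Equivalent of the namedtuple the Rust code creates and adds to the module.
BigWigAverageOverBedSummaryStatistics = namedtuple(
    "BigWigAverageOverBedSummaryStatistics",
    ("name", "mean", "mean0", "min", "min0", "max", "max0",
     "size", "bases", "sum"),
)

# Instances behave like tuples but also expose each stat by attribute name.
stats = BigWigAverageOverBedSummaryStatistics(
    name="chr1:3094805-3095305", mean=0.0633, mean0=0.0633,
    min=0.0448, min0=0.0448, max=0.0896, max0=0.0896,
    size=500, bases=500, sum=31.63,
)
print(stats.mean, stats[1])  # attribute and positional access agree
```

Note that `min`, `max`, and `sum` are valid namedtuple field names even though they shadow builtins; only Python keywords and underscore-prefixed names are rejected.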
jackh726 commented 4 months ago

I'm simultaneously both surprised and not that both dict and namedtuple have that much worse perf than a tuple.

I think, at least for now, I'm not so worried about the time difference. I imagine if speed is needed at that level, people would either drop down to Rust, or use some future API that allows batching and returning a pre-allocated array or something.

I think I probably agree that fully uncovered regions should return nan instead of 0. I'm less inclined to add more stats or change the semantics of those that exist compared to the ucsc tools, since that will either just get confusing or too much. For now, I'd recommend just calling values (which will have a robust enough API) and doing to summary stats on that.

I think I'm pretty happy with the branch I have at https://github.com/jackh726/bigtools/tree/ghuls/add_more_bigwigaverageoverbed_stats. I updated the API to match the comments above. Would love to hear your thoughts on it.

ghuls commented 4 months ago

Looks good. Do you know how I can check that branch out to try it locally?

jackh726 commented 4 months ago
git fetch https://github.com/jackh726/bigtools.git bigtools_refactor_rebase2:bwaob_refactor
git checkout bwaob_refactor

should work

ghuls commented 4 months ago

Managed to get the correct branch:

git fetch https://github.com/jackh726/bigtools.git ghuls/add_more_bigwigaverageoverbed_stats:bigtools_ghuls_add_more_bigwigaverageoverbed_stats

Help needs to be fixed:

    If ``"all"`` is specified, all summary statistics are returned in a named tuple. 

    If a single statistic is provided as a string, that statistic is returned
    as a float.
    If a single statistic is provided as a string, that statistic is returned
    as a float or int depending on the statistic.

Giving an unknown statistic (like "all") does not trigger an exception or an error message:

In [32]: %%time
    ...: with pybigtools.open("test.bw", "r") as bw:
    ...:     x = list(bw.average_over_bed("consensus_peaks.bed", stats=["mean0"]))
    ...: x[0]
    ...: 
    ...: 
CPU times: user 4.93 s, sys: 36.2 ms, total: 4.97 s
Wall time: 4.96 s
Out[32]: 0.06325921607762575

In [33]: %%time
    ...: with pybigtools.open("test.bw", "r") as bw:
    ...:     x = list(bw.average_over_bed("consensus_peaks.bed", stats=["all"]))
    ...: x[0]
    ...: 
    ...: 
CPU times: user 90 µs, sys: 2 µs, total: 92 µs
Wall time: 95.8 µs
Out[33]: 0.06325921607762575

In [34]: %%time
    ...: with pybigtools.open("test.bw", "r") as bw:
    ...:     x = list(bw.average_over_bed("consensus_peaks.bed", stats=["all2"]))
    ...: x[0]
    ...: 
    ...: 
CPU times: user 49 µs, sys: 1e+03 ns, total: 50 µs
Wall time: 52.9 µs
Out[34]: 0.06325921607762575

In [35]: %%time
    ...: with pybigtools.open("test.bw", "r") as bw:
    ...:     x = list(bw.average_over_bed("consensus_peaks.bed", stats=["mean0", "mean", "min", "max", "size", "bases", "sum"]))
    ...: x[0]
    ...: 
    ...: 
CPU times: user 5.13 s, sys: 80 ms, total: 5.21 s
Wall time: 5.2 s
Out[35]: 
(0.06325921607762575,
 0.06325921607762575,
 0.044780001044273376,
 0.08956000208854675,
 500,
 500,
 31.629608038812876)
ghuls commented 4 months ago

It would be nice to have a method on BBIRead to reinitialize it (similar to seek(0)), so that you can run e.g. multiple average_over_bed commands on the same bigWig filehandle.

with pybigtools.open("test.bw", "r") as bw:
    x = list(bw.average_over_bed("consensus_peaks_set1.bed", stats=["mean0"]))
    y = list(bw.average_over_bed("consensus_peaks_set2.bed", stats=["mean0"]))
ghuls commented 4 months ago

Opening BED files as a Path does not work (works for opening the bigwig):

bw.average_over_bed(Path("file.bed"))
jackh726 commented 4 months ago

Help needs to be fixed:

    If ``"all"`` is specified, all summary statistics are returned in a named tuple. 

    If a single statistic is provided as a string, that statistic is returned
    as a float.
    If a single statistic is provided as a string, that statistic is returned
    as a float or int depending on the statistic.

Done

Giving an unknown statistic (like "all") does not trigger an exception or an error message:

Strange...there is a test that shows that it does:

assert pytest.raises(ValueError, bw.average_over_bed, path, None, 'bad')
assert pytest.raises(ValueError, bw.average_over_bed, path, None, ['min', 'bad'])
assert pytest.raises(ValueError, bw.average_over_bed, path, None, ['min', 'all'])

It would be nice to have a method on BBIRead to reinitialize it (similar to seek(0)), so that you can run e.g. multiple average_over_bed commands on the same bigWig filehandle.

with pybigtools.open("test.bw", "r") as bw:
    x = list(bw.average_over_bed("consensus_peaks_set1.bed", stats=["mean0"]))
    y = list(bw.average_over_bed("consensus_peaks_set2.bed", stats=["mean0"]))

This already should have worked - added a test to show.

Opening BED files as a Path does not work (works for opening the bigwig):

bw.average_over_bed(Path("file.bed"))

Added this + test

jackh726 commented 4 months ago

(feel free to just force push that branch to this PR branch - I can do it directly if you don't know how. Do want to give you credit for your work here)

ghuls commented 4 months ago

Running average_over_bed multiple times on the same BBIRead handle and the "all" statistics now work. Not sure what was wrong with my installation.

In [14]: %%time
    ...: with pybigtools.open(Path("test.bw"), "r") as bw:
    ...:     y_list = list(bw.average_over_bed("consensus_peaks.bed", names=None, stats="mean0"))
    ...:     x_list = list(bw.average_over_bed(Path("consensus_peaks.bed"), names=None, stats="all"))
    ...: print(y_list[0], x_list[0])
    ...: 
    ...: 
0.06325921607762575 SummaryStatistics(size=500, bases=500, sum=31.629608038812876, mean0=0.06325921607762575, mean=0.06325921607762575, min=0.044780001044273376, max=0.08956000208854675)
CPU times: user 8.71 s, sys: 188 ms, total: 8.9 s
Wall time: 8.89 s

Raising an exception from a with block does not work for me:

import pybigtools
from pathlib import Path

# Don't import numpy to show with block does not pass exception.
# import numpy as np

In [2]: %%time
   ...: with pybigtools.open(Path("test.bw"), "r") as bw:
   ...:     y_list = list(bw.average_over_bed("consensus_peaks.bed", names=None, stats="mean0"))
   ...:     x_list = list(bw.average_over_bed(Path("consensus_peaks.bed"), names=None, stats="max"))
   ...: print(y_list[0], x_list[0])
   ...: 
   ...: 
0.06325921607762575 0.08956000208854675
CPU times: user 8.1 s, sys: 228 ms, total: 8.33 s
Wall time: 8.32 s

# Numpy is not imported, so the code should throw an exception, but doesn't:
In [3]: %%time
   ...: with pybigtools.open(Path("test.bw"), "r") as bw:
   ...:     y_array = np.fromiter(bw.average_over_bed("consensus_peaks.bed", names=None, stats="mean0"), dtype=np.float32)
   ...:     x_array = np.fromiter(bw.average_over_bed(Path("consensus_peaks.bed"), names=None, stats="max"), dtype=np.float32)
   ...: print(y_array[0], x_array[0])
   ...: 
   ...: 
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
File <timed exec>:4

NameError: name 'y_array' is not defined

In [4]: %%time
   ...: with pybigtools.open(Path("test.bw"), "r") as bw:
   ...:     y_array = np.fromiter(bw.average_over_bed("consensus_peaks.bed", names=None, stats="mean0"), dtype=np.float32)
   ...:     x_array = np.fromiter(bw.average_over_bed(Path("consensus_peaks.bed"), names=None, stats="max"), dtype=np.float32)
   ...: #print(y_array[0], x_array[0])
   ...: 
   ...: 
CPU times: user 47 μs, sys: 1 μs, total: 48 μs
Wall time: 50.5 μs

# In a normal with block it throws an exception:
In [5]: %%time
   ...: with open(Path("test.bw"), "r") as bw:
   ...:     y_array = np.fromiter(bw.read(), dtype=np.float32)
   ...:     x_array = np.fromiter(bw.read(), dtype=np.float32)
   ...: print(y_array[0], x_array[0])
   ...: 
   ...: 
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
File <timed exec>:2

NameError: name 'np' is not defined

# After importing numpy
import numpy as np

In [9]: %%time
   ...: with pybigtools.open(Path("test.bw"), "r") as bw:
   ...:     y_array = np.fromiter(bw.average_over_bed("consensus_peaks.bed", names=None, stats="mean0"), dtype=np.float32)
   ...:     x_array = np.fromiter(bw.average_over_bed(Path("consensus_peaks.bed"), names=None, stats="max"), dtype=np.float32)
   ...: print(y_array[0], x_array[0])
   ...: 
   ...: 
0.063259214 0.08956
CPU times: user 8.24 s, sys: 160 ms, total: 8.4 s
Wall time: 8.4 s

Minimal example:

In [11]: with pybigtools.open(Path("test.bw"), "r") as bw:
    ...:     raise ValueError("Some error")
    ...: 

In [12]: with open(Path("test.bw"), "r") as bw:
    ...:     raise ValueError("Some error")
    ...: 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[12], line 2
      1 with open(Path("test.bw"), "r") as bw:
----> 2     raise ValueError("Some error")

ValueError: Some error
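
As background on how an exception can vanish inside a with block (a general Python rule, not a confirmed diagnosis of the pybigtools bug): `__exit__` suppresses an in-flight exception whenever it returns a truthy value, so a context manager that unconditionally returns True reproduces the silent In [11] behavior above:

```python
class SwallowingHandle:
    """Buggy: returning True from __exit__ suppresses any exception
    raised inside the with block."""
    def __enter__(self):
        return self
    def __exit__(self, exc_type, exc, tb):
        return True  # swallows the ValueError below

class CorrectHandle:
    """Returning False (or None, by falling off the end) lets
    exceptions propagate normally, like the built-in open()."""
    def __enter__(self):
        return self
    def __exit__(self, exc_type, exc, tb):
        return False

# No traceback is printed here, matching the In [11] behavior:
with SwallowingHandle():
    raise ValueError("Some error")
```

The same mechanism also explains the earlier NameError transcripts: the exception died inside the with block, so the later `print` simply found the variables undefined.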
jackh726 commented 4 months ago

Raising an exception from a with block does not work for me:

Thanks, this was a problem with #44, which is fixed now (with a test)

ghuls commented 4 months ago

Raising an exception from a with block does not work for me:

Thanks, this was a problem with #44, which is fixed now (with a test)

I now assume that this was the reason why invoking average_over_bed twice on the same bigWig filehandle didn't work, as I might have used some code that raised an exception.

Exceptions are now raised properly.

Is there a timeframe for a 0.2.0 release with this pull request merged? We hope to use it soon in a package we want to publish.

ghuls commented 4 months ago

Seems like bigWigAverageOverBed is semi-deprecated: https://groups.google.com/a/soe.ucsc.edu/g/genome/c/fh9nNK52Qrw/m/ITy1wlCBBQAJ

If that is really the case, could the min0, max0 options be considered in the future? (could be behind a different flag than --min-max, so --min-max stays compatible with Kent tools).

I think I probably agree that fully uncovered regions should return nan instead of 0. I'm less inclined to add more stats or change the semantics of those that exist compared to the ucsc tools, since that will either just get confusing or too much. For now, I'd recommend just calling values (which will have a robust enough API) and doing to summary stats on that.

jackh726 commented 4 months ago

I now assume that this was the reason why invoking average_over_bed twice on the same bigWig filehandle didn't work, as I might have used some code that raised an exception.

Exceptions are now raised properly.

Good to hear

Is there a timeframe for a 0.2.0 release with this pull request merged? We hope to use it soon in a package we want to publish.

Maybe today, most likely sometime this weekend

Seems like bigWigAverageOverBed is semi-deprecated: https://groups.google.com/a/soe.ucsc.edu/g/genome/c/fh9nNK52Qrw/m/ITy1wlCBBQAJ

If that is really the case, could the min0, max0 options be considered in the future? (could be behind a different flag than --min-max, so --min-max stays compatible with Kent tools).

Potentially! I do think that the general question of whether uncovered is treated as 0 or not is important - it's relevant to values too.

I think I probably agree that fully uncovered regions should return nan instead of 0. I'm less inclined to add more stats or change the semantics of those that exist compared to the ucsc tools, since that will either just get confusing or too much. For now, I'd recommend just calling values (which will have a robust enough API) and doing to summary stats on that.

Just pushed a commit for this.

jackh726 commented 4 months ago

Going to go ahead and merge this - happy to follow up on min0/max0.

ghuls commented 4 months ago

Is there a timeframe for a 0.2.0 release with this pull request merged? We hope to use it soon in a package we want to publish.

Maybe today, most likely sometime this weekend

Great!

Seems like bigWigAverageOverBed is semi-deprecated: https://groups.google.com/a/soe.ucsc.edu/g/genome/c/fh9nNK52Qrw/m/ITy1wlCBBQAJ If that is really the case, could the min0, max0 options be considered in the future? (could be behind a different flag than --min-max, so --min-max stays compatible with Kent tools).

Potentially! I do think that the general question of whether uncovered is treated as 0 or not is important - it's relevant to values too.

I think pyBigWig returns np.nan for those positions.

I think I probably agree that fully uncovered regions should return nan instead of 0. I'm less inclined to add more stats or change the semantics of those that exist compared to the ucsc tools, since that will either just get confusing or too much. For now, I'd recommend just calling values (which will have a robust enough API) and doing to summary stats on that.

Just pushed a commit for this.

Thanks.

ghuls commented 4 months ago

Seems like UCSC does not have time to fix bigWigAverageOverBed: https://groups.google.com/a/soe.ucsc.edu/g/genome/c/fh9nNK52Qrw/m/ITy1wlCBBQAJ

Thank you for providing those details. Unfortunately, we don't have the bandwidth to fix this issue with bigWigAverageOverBed at this time.

However, we can point people to the bigtools suite in the future if they ask about tools with similar functionality.

So I guess adding min0/max0 to bigtools should not be a big issue if bigWigAverageOverBed is basically abandoned.