**ghuls** closed this 4 months ago
Closes: #38
Just btw, I have a rebase over #44 at https://github.com/jackh726/bigtools/tree/ghuls/add_more_bigwigaverageoverbed_stats.
(Once I get that other PR merged, I'll get this updated for review comments - if you can't get to it in time - so we can get it merged)
Looks like quite a bit of work in #44.
I was trying out returning a named tuple and got it to work, but noticed that my command was taking 5 seconds instead of around 4 seconds before.
So I decided to test creation times of tuples, dicts, named tuples and class instances. Named tuples seem to be the slowest to create:
```python
In [1]: %timeit tuple_style = ('chr1\t3094805\t3095305\tchr1:3094805-3095305', 0.06325921607762575, 0.06325921607762575, 0.044780001044273376, 0.044780001044273376, 0.08956000208854675, 0.08956000208854675, 500, 500, 31.629608038812876)
6.97 ns ± 0.0829 ns per loop (mean ± std. dev. of 7 runs, 100,000,000 loops each)

In [3]: %timeit dict_style = {"name": 'chr1\t3094805\t3095305\tchr1:3094805-3095305', "mean": 0.06325921607762575, "mean0": 0.06325921607762575, "min": 0.044780001044273376, "min0": 0.044780001044273376, "max": 0.08956000208854675, "max0": 0.08956000208854675, "size": 500, "bases": 500, "sum": 31.629608038812876}
202 ns ± 6.27 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [5]: %timeit named_tuple_style = pybigtools.BigWigAverageOverBedSummaryStatistics(name='chr1\t3094805\t3095305\tchr1:3094805-3095305', mean=0.06325921607762575, mean0=0.06325921607762575, min=0.044780001044273376, min0=0.044780001044273376, max=0.08956000208854675, max0=0.08956000208854675, size=500, bases=500, sum=31.629608038812876)
548 ns ± 23.2 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
```
```python
class BigWigAverageOverBedSummaryStatisticsClass:
    def __init__(self, name, mean, mean0, min, min0, max, max0, size, bases, sum):
        self.name = name
        self.mean = mean
        self.mean0 = mean0
        self.min = min
        self.min0 = min0
        self.max = max
        self.max0 = max0
        self.size = size
        self.bases = bases
        self.sum = sum

In [12]: %timeit class_style = BigWigAverageOverBedSummaryStatisticsClass(name='chr1\t3094805\t3095305\tchr1:3094805-3095305', mean=0.06325921607762575, mean0=0.06325921607762575, min=0.044780001044273376, min0=0.044780001044273376, max=0.08956000208854675, max0=0.08956000208854675, size=500, bases=500, sum=31.629608038812876)
449 ns ± 7.09 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
```
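For reference, the same comparison can be reproduced outside IPython with the `timeit` module. This is a sketch that substitutes a plain `collections.namedtuple` for the `pybigtools` class; the field layout is taken from the snippets above, and the values are shortened stand-ins:

```python
import timeit
from collections import namedtuple

FIELDS = ("name", "mean", "mean0", "min", "min0", "max", "max0", "size", "bases", "sum")
VALUES = ("chr1:3094805-3095305", 0.0633, 0.0633, 0.0448, 0.0448, 0.0896, 0.0896, 500, 500, 31.63)

Stats = namedtuple("Stats", FIELDS)

class StatsClass:
    def __init__(self, name, mean, mean0, min, min0, max, max0, size, bases, sum):
        (self.name, self.mean, self.mean0, self.min, self.min0,
         self.max, self.max0, self.size, self.bases, self.sum) = (
            name, mean, mean0, min, min0, max, max0, size, bases, sum)

candidates = {
    "tuple": lambda: VALUES,  # a literal tuple is even cheaper (constant-folded)
    "dict": lambda: dict(zip(FIELDS, VALUES)),
    "namedtuple": lambda: Stats(*VALUES),
    "class": lambda: StatsClass(*VALUES),
}

# Time each constructor; absolute numbers will differ from the IPython runs above.
for label, make in candidates.items():
    t = timeit.timeit(make, number=100_000)
    print(f"{label:>10}: {t / 100_000 * 1e9:.0f} ns per instance")
```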
It would be nice if bigwigaverageoverbed would return mean, min, max (only considering positions that have actual values, returning NaN if needed) and mean0, min0, and max0 (assuming 0.0 for all positions that don't have an entry in the bigWig). This would change the behavior of mean compared to Kent tools (NaN instead of 0.0 when no values are found), but feels more consistent.
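A sketch of the proposed semantics (the function name and dict layout here are hypothetical, not pybigtools API): mean/min/max are computed over covered bases only and return NaN when nothing is covered, while mean0/min0/max0 treat uncovered bases as 0.0:

```python
import math

def summary_stats(values):
    """values: per-base values over a BED interval; math.nan marks uncovered bases."""
    covered = [v for v in values if not math.isnan(v)]
    size = len(values)
    bases = len(covered)
    total = math.fsum(covered)
    # Zero-filled view: uncovered bases counted as 0.0 (the *0 variants).
    zero_filled = [0.0 if math.isnan(v) else v for v in values]
    return {
        "mean": total / bases if bases else math.nan,
        "min": min(covered) if covered else math.nan,
        "max": max(covered) if covered else math.nan,
        "mean0": total / size if size else math.nan,
        "min0": min(zero_filled) if size else math.nan,
        "max0": max(zero_filled) if size else math.nan,
    }
```

For an interval `[1.0, nan, 3.0, nan]` this gives mean = 2.0 but mean0 = 1.0, and min0 = 0.0 while min = 1.0, matching the distinction described above.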
```diff
-fn pybigtools(_py: Python, m: &PyModule) -> PyResult<()> {
+fn pybigtools(py: Python, m: &PyModule) -> PyResult<()> {
     m.add("__version__", env!("CARGO_PKG_VERSION"))?;
     m.add_wrapped(wrap_pyfunction!(open))?;
     m.add_wrapped(wrap_pyfunction!(bigWigAverageOverBed))?;
@@ -1366,5 +1382,25 @@ fn pybigtools(_py: Python, m: &PyModule) -> PyResult<()> {
     m.add_class::<EntriesIterator>()?;
+    let collections = PyModule::import(py, "collections").unwrap();
+    let namedtuple = collections.getattr("namedtuple").unwrap();
+    let bigwigaverageoverbed_summary_statistics = namedtuple
+        .call(
+            (
+                "BigWigAverageOverBedSummaryStatistics".to_object(py),
+                (
+                    "name", "mean", "mean0", "min", "min0", "max", "max0", "size", "bases", "sum",
+                )
+                    .to_object(py),
+            ),
+            None,
+        )
+        .unwrap();
+    m.add(
+        "BigWigAverageOverBedSummaryStatistics",
+        bigwigaverageoverbed_summary_statistics,
+    )
+    .unwrap();
+
     Ok(())
 }
```
I'm simultaneously both surprised and not that both dict and namedtuple have that much worse perf than a tuple.
I think, at least for now, I'm not so worried about the time difference. I imagine if speed is needed at that level, people would either drop down to Rust, or use some future API that allows batching and returning a pre-allocated array or something.
I think I probably agree that fully uncovered regions should return nan instead of 0. I'm less inclined to add more stats or change the semantics of those that exist compared to the ucsc tools, since that will either just get confusing or too much. For now, I'd recommend just calling `values` (which will have a robust enough API) and doing the summary stats on that.
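As a sketch of that recommendation: assuming `values` yields a per-base NumPy array with NaN marking uncovered positions (an assumption about the API, demonstrated here on a synthetic array rather than a real bigWig), the summary stats follow from NumPy's nan-aware reductions:

```python
import numpy as np

# Synthetic stand-in for what a values(chrom, start, end) call might return:
# per-base values with NaN marking uncovered positions.
vals = np.array([0.5, np.nan, 1.5, 2.0, np.nan])

bases = int(np.count_nonzero(~np.isnan(vals)))  # number of covered bases
stats = {
    "size": int(vals.size),
    "bases": bases,
    "sum": float(np.nansum(vals)),
    "mean": float(np.nanmean(vals)),              # covered bases only
    "mean0": float(np.nansum(vals) / vals.size),  # uncovered bases treated as 0
    "min": float(np.nanmin(vals)),
    "max": float(np.nanmax(vals)),
}
print(stats)
```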
I think I'm pretty happy with the branch I have at https://github.com/jackh726/bigtools/tree/ghuls/add_more_bigwigaverageoverbed_stats. I updated the API to match the comments above. Would love to hear your thoughts on it.
Looks good. Do you know how I can check that branch out to try it locally?
```shell
git fetch https://github.com/jackh726/bigtools.git bigtools_refactor_rebase2:bwaob_refactor
git checkout bwaob_refactor
```

should work
Managed to get the correct branch:
```shell
git fetch https://github.com/jackh726/bigtools.git ghuls/add_more_bigwigaverageoverbed_stats:bigtools_ghuls_add_more_bigwigaverageoverbed_stats
```
Help needs to be fixed. Current docstring:

> If ``"all"`` is specified, all summary statistics are returned in a named tuple. If a single statistic is provided as a string, that statistic is returned as a float.

Suggested:

> If a single statistic is provided as a string, that statistic is returned as a float or int depending on the statistic.
Giving an unknown statistic (like `"all2"` below) does not trigger an exception or an error message:
```python
In [32]: %%time
    ...: with pybigtools.open("test.bw", "r") as bw:
    ...:     x = list(bw.average_over_bed("consensus_peaks.bed", stats=["mean0"]))
    ...:     x[0]
CPU times: user 4.93 s, sys: 36.2 ms, total: 4.97 s
Wall time: 4.96 s
Out[32]: 0.06325921607762575

In [33]: %%time
    ...: with pybigtools.open("test.bw", "r") as bw:
    ...:     x = list(bw.average_over_bed("consensus_peaks.bed", stats=["all"]))
    ...:     x[0]
CPU times: user 90 µs, sys: 2 µs, total: 92 µs
Wall time: 95.8 µs
Out[33]: 0.06325921607762575

In [34]: %%time
    ...: with pybigtools.open("test.bw", "r") as bw:
    ...:     x = list(bw.average_over_bed("consensus_peaks.bed", stats=["all2"]))
    ...:     x[0]
CPU times: user 49 µs, sys: 1e+03 ns, total: 50 µs
Wall time: 52.9 µs
Out[34]: 0.06325921607762575

In [35]: %%time
    ...: with pybigtools.open("test.bw", "r") as bw:
    ...:     x = list(bw.average_over_bed("consensus_peaks.bed", stats=["mean0", "mean", "min", "max", "size", "bases", "sum"]))
    ...:     x[0]
CPU times: user 5.13 s, sys: 80 ms, total: 5.21 s
Wall time: 5.2 s
Out[35]:
(0.06325921607762575,
 0.06325921607762575,
 0.044780001044273376,
 0.08956000208854675,
 500,
 500,
 31.629608038812876)
```
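For what it's worth, the kind of validation one might expect looks roughly like this (the function name, the rule that `"all"` cannot be mixed with other statistics, and the stat set are assumptions for illustration, not pybigtools's actual code):

```python
VALID_STATS = {"size", "bases", "sum", "mean0", "mean", "min", "max", "all"}

def validate_stats(stats):
    """Normalize stats to a list and reject unknown or contradictory requests."""
    if isinstance(stats, str):
        stats = [stats]
    unknown = [s for s in stats if s not in VALID_STATS]
    if unknown:
        raise ValueError(f"unknown summary statistic(s): {unknown}")
    if "all" in stats and len(stats) > 1:
        raise ValueError('"all" cannot be combined with other statistics')
    return stats
```

With this in place, `stats=["all2"]` would fail loudly instead of silently falling back to a default.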
It would be nice to have a method on BBIRead to reinitialize it (similar to `seek(0)`), so that you can run e.g. multiple average_over_bed commands on the same bigWig filehandle.

```python
with pybigtools.open("test.bw", "r") as bw:
    x = list(bw.average_over_bed("consensus_peaks_set1.bed", stats=["mean0"]))
    y = list(bw.average_over_bed("consensus_peaks_set2.bed", stats=["mean0"]))
```
Opening BED files as a `Path` does not work (works for opening the bigwig):

```python
bw.average_over_bed(Path("file.bed"))
```
> Help needs to be fixed:
>
> Current: If ``"all"`` is specified, all summary statistics are returned in a named tuple. If a single statistic is provided as a string, that statistic is returned as a float.
>
> Suggested: If a single statistic is provided as a string, that statistic is returned as a float or int depending on the statistic.

Done
> Giving an unknown statistic (like `"all2"`) does not trigger an exception or an error message:

Strange... there is a test that shows that it does:

```python
assert pytest.raises(ValueError, bw.average_over_bed, path, None, 'bad')
assert pytest.raises(ValueError, bw.average_over_bed, path, None, ['min', 'bad'])
assert pytest.raises(ValueError, bw.average_over_bed, path, None, ['min', 'all'])
```
> It would be nice to have a method on BBIRead to reinitialize it (similar to `seek(0)`), so that you can run e.g. multiple average_over_bed commands on the same bigwig filehandle.
>
> ```python
> with pybigtools.open("test.bw", "r") as bw:
>     x = list(bw.average_over_bed("consensus_peaks_set1.bed", stats=["mean0"]))
>     y = list(bw.average_over_bed("consensus_peaks_set2.bed", stats=["mean0"]))
> ```

This already should have worked - added a test to show.
> Opening BED files as a `Path` does not work (works for opening the bigwig):
>
> `bw.average_over_bed(Path("file.bed"))`

Added this + test
(feel free to just force push that branch to this PR branch - I can do it directly if you don't know how. Do want to give you credit for your work here)
Calculating multiple `average_over_bed` calls on the same BBIRead handle and `"all"` statistics now works. Not sure what was wrong here with my installation.
```python
In [14]: %%time
    ...: with pybigtools.open(Path("test.bw"), "r") as bw:
    ...:     y_list = list(bw.average_over_bed("consensus_peaks.bed", names=None, stats="mean0"))
    ...:     x_list = list(bw.average_over_bed(Path("consensus_peaks.bed"), names=None, stats="all"))
    ...:     print(y_list[0], x_list[0])
0.06325921607762575 SummaryStatistics(size=500, bases=500, sum=31.629608038812876, mean0=0.06325921607762575, mean=0.06325921607762575, min=0.044780001044273376, max=0.08956000208854675)
CPU times: user 8.71 s, sys: 188 ms, total: 8.9 s
Wall time: 8.89 s
```
Raising an exception from a with block does not work for me:
```python
import pybigtools
from pathlib import Path

# Don't import numpy, to show the with block does not pass the exception through.
# import numpy as np
```
```python
In [2]: %%time
   ...: with pybigtools.open(Path("test.bw"), "r") as bw:
   ...:     y_list = list(bw.average_over_bed("consensus_peaks.bed", names=None, stats="mean0"))
   ...:     x_list = list(bw.average_over_bed(Path("consensus_peaks.bed"), names=None, stats="max"))
   ...:     print(y_list[0], x_list[0])
0.06325921607762575 0.08956000208854675
CPU times: user 8.1 s, sys: 228 ms, total: 8.33 s
Wall time: 8.32 s

# Numpy is not imported, so the code should throw an exception, but doesn't:
In [3]: %%time
   ...: with pybigtools.open(Path("test.bw"), "r") as bw:
   ...:     y_array = np.fromiter(bw.average_over_bed("consensus_peaks.bed", names=None, stats="mean0"), dtype=np.float32)
   ...:     x_array = np.fromiter(bw.average_over_bed(Path("consensus_peaks.bed"), names=None, stats="max"), dtype=np.float32)
   ...:     print(y_array[0], x_array[0])
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
File <timed exec>:4
NameError: name 'y_array' is not defined

In [4]: %%time
   ...: with pybigtools.open(Path("test.bw"), "r") as bw:
   ...:     y_array = np.fromiter(bw.average_over_bed("consensus_peaks.bed", names=None, stats="mean0"), dtype=np.float32)
   ...:     x_array = np.fromiter(bw.average_over_bed(Path("consensus_peaks.bed"), names=None, stats="max"), dtype=np.float32)
   ...:     #print(y_array[0], x_array[0])
CPU times: user 47 μs, sys: 1 μs, total: 48 μs
Wall time: 50.5 μs

# In a normal with block it throws an exception:
In [5]: %%time
   ...: with open(Path("test.bw"), "r") as bw:
   ...:     y_array = np.fromiter(bw.read(), dtype=np.float32)
   ...:     x_array = np.fromiter(bw.read(), dtype=np.float32)
   ...:     print(y_array[0], x_array[0])
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
File <timed exec>:2
NameError: name 'np' is not defined

# After importing numpy
import numpy as np

In [9]: %%time
   ...: with pybigtools.open(Path("test.bw"), "r") as bw:
   ...:     y_array = np.fromiter(bw.average_over_bed("consensus_peaks.bed", names=None, stats="mean0"), dtype=np.float32)
   ...:     x_array = np.fromiter(bw.average_over_bed(Path("consensus_peaks.bed"), names=None, stats="max"), dtype=np.float32)
   ...:     print(y_array[0], x_array[0])
0.063259214 0.08956
CPU times: user 8.24 s, sys: 160 ms, total: 8.4 s
Wall time: 8.4 s
```
Minimal example:

```python
In [11]: with pybigtools.open(Path("test.bw"), "r") as bw:
    ...:     raise ValueError("Some error")

In [12]: with open(Path("test.bw"), "r") as bw:
    ...:     raise ValueError("Some error")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[12], line 2
      1 with open(Path("test.bw"), "r") as bw:
----> 2     raise ValueError("Some error")

ValueError: Some error
```
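For context, Python only suppresses an exception raised inside a `with` body when the context manager's `__exit__` returns a truthy value; returning `None` or `False` lets it propagate. The silent behavior seen with `pybigtools.open` is what an `__exit__` that accidentally returns something truthy would produce. A minimal illustration (both classes here are invented for the demo):

```python
class Suppressing:
    """Buggy: __exit__ returns True, so exceptions in the body are swallowed."""
    def __enter__(self):
        return self
    def __exit__(self, exc_type, exc, tb):
        return True  # truthy: tells Python the exception was handled

class Propagating:
    """Correct: __exit__ returns False/None, so exceptions propagate."""
    def __enter__(self):
        return self
    def __exit__(self, exc_type, exc, tb):
        return False  # falsy: exception continues to propagate

try:
    with Suppressing():
        raise ValueError("swallowed")
    swallowed = True  # reached because the exception was suppressed
except ValueError:
    swallowed = False

try:
    with Propagating():
        raise ValueError("propagated")
    propagated = False
except ValueError:
    propagated = True  # reached because the exception escaped the with block
```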
> Raising an exception from a with block does not work for me:

Thanks, this was a problem with #44, which is fixed now (with a test)
> Raising an exception from a with block does not work for me:
>
> Thanks, this was a problem with #44, which is fixed now (with a test)

I now assume that this was the reason why invoking `average_over_bed` twice on the bigWig filehandle didn't work, as I might have used some code that triggered an exception. Exceptions are now raised properly.
Is there a timeframe for a 0.2.0 release with this pull request merged? We hope to use it soon in a package we want to publish.
Seems like bigWigAverageOverBed is semi-deprecated: https://groups.google.com/a/soe.ucsc.edu/g/genome/c/fh9nNK52Qrw/m/ITy1wlCBBQAJ

If that is really the case, could the `min0`, `max0` options be considered in the future? (They could be behind a different flag than --min-max, so --min-max stays compatible with Kent tools.)
> I think I probably agree that fully uncovered regions should return nan instead of 0. I'm less inclined to add more stats or change the semantics of those that exist compared to the ucsc tools, since that will either just get confusing or too much. For now, I'd recommend just calling `values` (which will have a robust enough API) and doing the summary stats on that.
> I now assume that this was the reason why invoking `average_over_bed` twice on the bigWig filehandle didn't work, as I might have used some code that triggered an exception. Exceptions are now raised properly.

Good to hear
> Is there a timeframe for a 0.2.0 release with this pull request merged? We hope to use it soon in a package we want to publish.

Maybe today, most likely sometime this weekend
> Seems like bigWigAverageOverBed is semi-deprecated: https://groups.google.com/a/soe.ucsc.edu/g/genome/c/fh9nNK52Qrw/m/ITy1wlCBBQAJ
>
> If that is really the case, could the `min0`, `max0` options be considered in the future? (They could be behind a different flag than --min-max, so --min-max stays compatible with Kent tools.)

Potentially! I do think that the general question of whether uncovered is treated as 0 or not is important - it's relevant to `values` too.
> I think I probably agree that fully uncovered regions should return nan instead of 0. I'm less inclined to add more stats or change the semantics of those that exist compared to the ucsc tools, since that will either just get confusing or too much. For now, I'd recommend just calling `values` (which will have a robust enough API) and doing the summary stats on that.

Just pushed a commit for this.

Going to go ahead and merge this - happy to follow up on min0/max0.
> Is there a timeframe for a 0.2.0 release with this pull request merged? We hope to use it soon in a package we want to publish.
>
> Maybe today, most likely sometime this weekend

Great!
> Seems like bigWigAverageOverBed is semi-deprecated: https://groups.google.com/a/soe.ucsc.edu/g/genome/c/fh9nNK52Qrw/m/ITy1wlCBBQAJ If that is really the case, could the `min0`, `max0` options be considered in the future? (They could be behind a different flag than --min-max, so --min-max stays compatible with Kent tools.)
>
> Potentially! I do think that the general question of whether uncovered is treated as 0 or not is important - it's relevant to `values` too.

I think pyBigWig returns np.nan for those positions.
> I think I probably agree that fully uncovered regions should return nan instead of 0. I'm less inclined to add more stats or change the semantics of those that exist compared to the ucsc tools, since that will either just get confusing or too much. For now, I'd recommend just calling `values` (which will have a robust enough API) and doing the summary stats on that.
>
> Just pushed a commit for this.

Thanks.
Seems like UCSC does not have time to fix bigWigAverageOverBed: https://groups.google.com/a/soe.ucsc.edu/g/genome/c/fh9nNK52Qrw/m/ITy1wlCBBQAJ

> Thank you for providing those details. Unfortunately, we don't have the bandwidth to fix this issue with bigWigAverageOverBed at this time.
> However, we can point people to the bigtools suite in the future if they ask about tools with similar functionality.

So I guess adding min0/max0 to bigtools should not be a big issue if `bigWigAverageOverBed` is basically abandoned.