jmschrei / bpnet-lite

This repository hosts a minimal version of a Python API for BPNet.
MIT License
32 stars 14 forks source link

Error in running bpnet negatives / tangermeme match.py issues #10

Closed gregorydonahue closed 4 months ago

gregorydonahue commented 4 months ago

Hi Jacob,

I'm running the latest version of bpnet-lite on some ATAC-seq data and encountering a problem in the first step, the background region calls. To preface this long question, I suspect my problems are due to an incompatible version of numpy or some other library - are there version dependencies the user should know about? I installed everything through a clean conda environment (I'm using python=3.8) with pip, as recommended, and I wound up installing numpy v1.24.3.

Anyway, when I run:

$ bpnet negatives -i data/ATAC.OCRs.narrowPeak.bed -f reference/hg19.genome.fa -o data/ATAC.OCRs.MatchedLoci.bed -l 0.02 -w 2114 -v -b data/ATAC.OCRs.bw

...I get an error after the "Loading Loci" and "Get GC content of background regions" progress bars both hit 100%, and the GC% distribution table prints:

/home/gdonahue/software/miniconda3/envs/bpnetlite2/lib/python3.8/site-packages/tangermeme/ersatz.py:448:
NumbaDeprecationWarning: The keyword argument 'nopython=False' was supplied. From Numba 0.59.0 the default
is being changed to True and use of 'nopython=False' will raise a warning as the argument will have no
effect. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit
for details.
@numba.jit(params, nopython=False)
Loading Loci: 100%|████████████████████| 125326/125326 [02:28<00:00, 844.75it/s]
100%|███████████████████████████████████████████| 25/25 [28:50<00:00, 69.23s/it]
GC Bin  Background Count    Peak Count  Chosen Count
0.00:        0         0           0
0.02:        0         0           0
0.04:        0         0           0
0.06:        0         0           0
0.08:        0         0           0
0.10:        0         0           0
0.12:        0         0           0
0.14:        0         0           0
0.16:        0         0           0
0.18:        0         0           0
0.20:        0         1           0
0.22:        0         7           0
0.24:        0         6           0
0.26:        0        10           0
0.28:        0        73           0
0.30:        0       543           0
0.32:        0      2184           0
0.34:        0      5367           0
0.36:        0      9583           0
0.38:        0     12679           0
0.40:        0     14000           0
0.42:        0     13886           0
0.44:        0     12036           0
0.46:        0     10163           0
0.48:        0      8207           0
0.50:        0      7082           0
0.52:        0      5816           0
0.54:        0      4936           0
0.56:        0      4259           0
0.58:        0      3719           0
0.60:        0      2977           0
0.62:        0      2420           0
0.64:        0      1944           0
0.66:        0      1342           0
0.68:        0       907           0
0.70:        0       505           0
0.72:        0       328           0
0.74:        0       127           0
0.76:        0        44           0
0.78:        0        17           0
0.80:        0         1           0
0.82:        0         0           0
0.84:        0         0           0
0.86:        0         0           0
0.88:        0         0           0
0.90:        0         0           0
0.92:        0         0           0
0.94:        0         0           0
0.96:        0         0           0
0.98:        0         0           0
1.00:        0         0           0
Loading Loci: 0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/home/gdonahue/software/MiniConda/miniconda3/envs/bpnetlite2/bin/bpnet", line 347, in <module>
    matched_loci = extract_matching_loci(
  File "/home/gdonahue/software/MiniConda/miniconda3/envs/bpnetlite2/lib/python3.8/site-packages/tangermeme/match.py", line 370, in extract_matching_loci
    X_matched, y_matched = extract_loci(matched_loci, fasta, 
  File "/home/gdonahue/software/MiniConda/miniconda3/envs/bpnetlite2/lib/python3.8/site-packages/tangermeme/io.py", line 366, in extract_loci
    seqs = torch.from_numpy(numpy.stack(seqs))
  File "<__array_function__ internals>", line 200, in stack
  File "/home/gdonahue/.local/lib/python3.8/site-packages/numpy/core/shape_base.py", line 460, in stack
    raise ValueError('need at least one array to stack')
ValueError: need at least one array to stack

So, as you can see, it is correctly binning the input ATAC-seq peaks by their GC content but has apparently not found any background regions. It definitely calculates the per-chromosome GC content scores because that second progress bar takes a while to complete. After reading the traceback and poking around in tangermeme's code, I discovered that this might be fixed by editing tangermeme/match.py in the following way:

# Line 139
#values = values.sum(axis=-1)
values = numpy.nansum(values, axis=-1)

Breaking this down, the 'values' array is meant to hold chromosome-specific data from the input bigWig file. Doing a numpy Array.sum() with NaN values in the array will cause the result to also be NaN. Since NaNs are present in the initial pyBigWig load of the data, we wind up with 'values' equal to an array of NaNs. This is not peculiar to my data, as the supplied example data presents the same way:

>>> import math, numpy, pyBigWig
>>> bw = pyBigWig.open("ENCSR000AKO_minus.bigWig", 'r')
>>> values = bw.values("chr19", 0, -1, numpy=True)
>>> values
array([nan, nan, nan, ..., nan, nan, nan], dtype=float32)

It seemed like doing the nansum() was the better option, as this treats all NaNs as zero, and indeed this seems to resolve the problem.

Then I encountered another issue:

/home/gdonahue/software/miniconda3/envs/bpnetlite2/lib/python3.8/site-packages/tangermeme/ersatz.py:448: 
NumbaDeprecationWarning: The keyword argument 'nopython=False' was supplied. From Numba 0.59.0 the default
is being changed to True and use of 'nopython=False' will raise a warning as the argument will have no
effect. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit
for details.
@numba.jit(params, nopython=False)
Loading Loci: 100%|████████████████████| 125326/125326 [02:29<00:00, 838.96it/s]
100%|███████████████████████████████████████████| 25/25 [28:56<00:00, 69.47s/it]
GC Bin  Background Count    Peak Count  Chosen Count
0.00:        0         0           0
0.02:        0         0           0
0.04:        2         0           0
0.06:        4         0           0
0.08:        7         0           0
0.10:       16         0           0
0.12:       16         0           0
0.14:       21         0           0
0.16:       42         0           0
0.18:       51         0           0
0.20:       67         1           1
0.22:      126         7           7
0.24:      245         6           6
0.26:     1367        10          10
0.28:     8234        73          73
0.30:    30260       543         543
0.32:    70297      2184        2184
0.34:   115684      5367        5367
0.36:   155482      9583        9583
0.38:   169836     12679       12679
0.40:   145501     14000       14000
0.42:   120904     13886       13886
0.44:    96091     12036       12036
0.46:    74239     10163       10163
0.48:    50949      8207        8207
0.50:    33259      7082        7082
0.52:    21527      5816        5816
0.54:    14442      4936        4936
0.56:     9834      4259        4259
0.58:     6801      3719        4633
0.60:     4292      2977        4292
0.62:     2566      2420        2566
0.64:     1564      1944        1564
0.66:      734      1342         734
0.68:      320       907         320
0.70:      120       505         120
0.72:       60       328          60
0.74:       24       127          24
0.76:       11        44          11
0.78:        4        17           4
0.80:        2         1           2
0.82:        1         0           1
0.84:        0         0           0
0.86:        0         0           0
0.88:        0         0           0
0.90:        0         0           0
0.92:        0         0           0
0.94:        0         0           0
0.96:        0         0           0
0.98:        0         0           0
1.00:        0         0           0
Loading Loci: 100%|████████████████████| 125169/125169 [05:07<00:00, 407.56it/s]
Traceback (most recent call last):
  File "/home/gdonahue/software/MiniConda/miniconda3/envs/bpnetlite2/bin/bpnet", line 347, in <module>
    matched_loci = extract_matching_loci(
  File "/home/gdonahue/software/MiniConda/miniconda3/envs/bpnetlite2/lib/python3.8/site-packages/tangermeme/match.py", line 378, in extract_matching_loci
    matched_gc = X_matched.mean(axis=-1)[:, [1, 2]].sum(axis=1).numpy()
RuntimeError: mean(): could not infer output dtype. Input dtype must be either a floating point or complex dtype. Got: Char

So, at this point we have background regions too, but we're running into some kind of data type issue while extracting the reference GC bins. Not sure why there's character data in there, but I fixed this by casting the array to float (also in tangermeme/match.py):

# Line 375
#matched_gc = X_matched.mean(axis=-1)[:, [1, 2]].sum(axis=1).numpy()
matched_gc = X_matched.float().mean(axis=-1)[:, [1, 2]].sum(axis=1).numpy()

This resolved the issue, and I now have matched background loci. The final outputs were:

GC-bin KS test stat:0.019, p-value 5.29e-20
Peak Robust Signal Minimum: 54.0
Matched Signal Maximum: 27.0

Can you see any problem with what I've done? All of that looks copacetic, right? My instinct is that some other version of numpy supports the original code, and I don't want to deviate too much from that and risk deranging the rest of the pipeline (or the results).

Thanks, Greg

jmschrei commented 4 months ago

Hi Greg

Thanks for looking into this. I'm not sure why you're encountering these issues because I thought I just moved working code from bpnet-lite to tangermeme, but both of these issues make sense. When pyBigWig stores data it doesn't store 0s, for compression reasons, and when reading data it returns NaN instead of 0 because it can't remember if there's an actual 0 there or some other unobserved marker. Usually I have a numpy.nan_to_num in there, but your method would work just as well. I'm going to make both changes and release a new version of tangermeme by the end of the week. Hopefully that will fix your issue.

jmschrei commented 4 months ago

Actually, looks like @adamyhe already encountered and fixed the issue. I added in nansum just to make sure that wouldn't be an issue.

gregorydonahue commented 4 months ago

Hi Jacob, thanks for getting back to me and good to hear that my orientation on this is correct. I don't know why my installation of the software is using an older version of tangermeme - I just ran a clean install now and I see the same problem. Does it have to do with pip's repository, maybe? I don't actually know how GitHub:pip crosstalk/updating works. Anyway, I'll close the issue. Thanks, -G

jmschrei commented 4 months ago

I may not have released a new version on pip yet. I'll be doing so tomorrow after completing my addition of other features.