Closed gregorydonahue closed 4 months ago
Hi Greg
Thanks for looking into this. I'm not sure why you're encountering these issues because I thought I just moved working code from bpnet-lite to tangermeme, but both of these issues make sense. When pyBigWig stores data it doesn't store 0s, for compression reasons, and when reading data it returns NaN instead of 0 because it can't tell whether there's an actual 0 there or the position was simply unobserved. Usually I have a numpy.nan_to_num in there, but your method would work just as well. I'm going to make both changes and release a new version of tangermeme by the end of the week. Hopefully that will fix your issue.
Actually, looks like @adamyhe already encountered and fixed the issue. I added in nansum just to make sure that wouldn't be an issue.
Hi Jacob, thanks for getting back to me, and good to hear that my read on this is correct. I don't know why my installation is using an older version of tangermeme: I just ran a clean install and I see the same problem. Does it have to do with pip's repository, maybe? I don't actually know how changes on GitHub propagate to what pip installs. Anyway, I'll close the issue. Thanks, -G
I may not have released a new version on pip yet. I'll be doing so tomorrow after completing my addition of other features.
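For anyone hitting the same confusion: pip installs the latest release published to the package index, not the current state of the GitHub repository, so a fix can be merged on GitHub without yet being available through pip. A minimal sketch for checking which versions are actually installed is below. The helper name installed_version is made up for illustration, and the distribution names passed to it are assumed to match the names used with pip install.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string for a distribution, or None if it is absent.

    Hypothetical helper for illustration; not part of tangermeme or bpnet-lite.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names are assumptions; adjust them to whatever `pip list` shows.
for name in ("tangermeme", "bpnet-lite", "numpy"):
    print(name, installed_version(name))
```

Comparing the printed tangermeme version with the latest release listed on the package index shows whether the fix has been published yet.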
Hi Jacob,
I'm running the latest version of bpnet-lite on some ATAC-seq data and encountering a problem in the first step, the background region calls. To preface this long question, I suspect my problems are due to an incompatible version of numpy or some other library - are there version dependencies the user should know about? I installed everything through a clean conda environment (I'm using python=3.8) with pip, as recommended, and I wound up installing numpy v1.24.3.
Anyway, when I run:
...I get an error after the "Loading Loci" and "Get GC content of background regions" progress bars both hit 100%, and the GC% distribution table prints:
So, as you can see, it is correctly binning the input ATAC-seq peaks by their GC content but has apparently not found any background regions. It definitely calculates the per-chromosome GC content scores because that second progress bar takes a while to complete. After reading the traceback and poking around in tangermeme's code, I discovered that this might be fixed by editing tangermeme/match.py in the following way:
Breaking this down, the 'values' array is meant to hold chromosome-specific data from the input bigWig file. Calling sum() on a numpy array that contains NaN values returns NaN. Since NaNs are present in the initial pyBigWig load of the data, every sum comes out as NaN and we wind up with 'values' equal to an array of NaNs. This is not peculiar to my data, as the supplied example data presents the same way:
It seemed like doing the nansum() was the better option, as this treats all NaNs as zero, and indeed this seems to resolve the problem.
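To make the NaN behaviour concrete, here is a small standalone sketch. It does not touch tangermeme or pyBigWig; the array below is a made-up stand-in for values read from a bigWig file in which some positions were never stored.

```python
import numpy as np

# Hypothetical stand-in for a pyBigWig read: NaN marks positions with no stored value.
values = np.array([np.nan, 1.0, 2.0, np.nan, 3.0])

plain_total = values.sum()                    # NaN propagates through an ordinary sum
nan_total = np.nansum(values)                 # NaNs are treated as zero
cleaned_total = np.nan_to_num(values).sum()   # replace NaN with 0 first, then sum

print(plain_total, nan_total, cleaned_total)
```

On this input, nansum and nan_to_num followed by sum give the same total, which is why either approach should work for this step.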
Then I encountered another issue:
So, at this point we have background regions too, but we're running into some kind of data type issue while extracting the reference GC bins. Not sure why there's character data in there, but I fixed this by casting the array to float (also in tangermeme/match.py):
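As a rough illustration of the kind of cast described (not the actual edit to tangermeme/match.py), the sketch below assumes the GC-bin values arrived as strings; the array contents and variable names are invented for the example.

```python
import numpy as np

# Hypothetical GC-bin values that were read in as strings rather than numbers.
gc_bins = np.array(["0.42", "0.55", "0.61"])
print(gc_bins.dtype.kind)  # 'U' means a unicode string array

# Casting to float converts each string to its numeric value;
# it raises ValueError if any entry is not a valid number.
gc_as_float = gc_bins.astype(float)
print(gc_as_float.dtype, gc_as_float)
```

A cast like this only masks the symptom if the strings come from an upstream parsing step, so it is worth checking where the character data originates.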
This resolved the issue, and I now have matched background loci. The final outputs were:
Can you see any problem with what I've done? All of that looks copacetic, right? My instinct is that some other version of numpy supports the original code, and I don't want to deviate too much from that and risk deranging the rest of the pipeline (or the results).
Thanks, Greg