Feature request: option to not bin and not rescale bedgraph tracks

jnmaloof commented 1 year ago

There are times where binning the data in a bedgraph track may not be the correct thing to do. I hardcoded the changes that I needed to make this plot (that I needed...). Other uses might want this use case as well. If of interest I might be able to work this into a more general solution for a pull request (but probably not any time soon... a lot on my plate).

Anyway, thanks again for this package, it works really well and was pretty easy to modify for my needs.

plotsr

mnshgl0110 commented 1 year ago

Hi Julin. You are correct that binning might not be always optimal. If I recall correctly, the reason I had to that was to improve memory requirements. As BEDGRAPH files can have values for each base, it can take a lot of memory without binning. And for users inexperienced in informatics, it can become a challenge to run it locally.

Of course, it would be cool to have a general solution which can handle large volume of data and be more accurate.

jnmaloof commented 1 year ago

Thanks, that makes sense. Although in some cases averaging instead of binning might be better. I understand that it is trickier than I originally thought because the bedgraph file may or may not have windows already, and those may or may not correspond to what plotsr is doing. Maybe it would be possible to have an "as-is" option for the bedgraph file with a warning about the memory. Anyway I'll think about it and see if I can come up with something generalizable.

On Mon, Nov 7, 2022 at 2:52 AM Manish Goel @.***> wrote:

Hi Julin. You are correct that binning might not be always optimal. If I recall correctly, the reason I had to that was to improve memory requirements. As BEDGRAPH files can have values for each base, it can take a lot of memory without binning. And for users inexperienced in informatics, it can become a challenge to run it locally.

Of course, it would be cool to have a general solution which can handle large volume of data and be more accurate.

— Reply to this email directly, view it on GitHub https://github.com/schneebergerlab/plotsr/issues/34#issuecomment-1305427171, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAMB7GCFPOO64WX2GAN7IO3WHDNN5ANCNFSM6AAAAAARYFRNBQ . You are receiving this because you authored the thread.Message ID: @.***>

fishercera commented 1 year ago

I actually need this for my bed file of already-binned SNPs. @jnmaloof could you possibly share how you hardcoded not scaling the y axis on bed files?

I really need to be able to show that the max 183-count SNPs in a 10k region on Chr1 is a taller peak than the max 12-count SNPs in a 10k region on Chr2

jnmaloof commented 1 year ago

Sadly I didn't comment my code, but I believe the relevant changes were in the readbedgraph function in the file func.py. My modified code for that function is at the bottom of this email. Changes either follow a commented line that has the original code, or in some cases the original code is commented at the end of a line.

My fully changed files is available at https://github.com/MaloofLab/Davis_B_napus_assembly_2023/blob/main/HE_Analysis/plotsr/modified_func.py , but we changed a lot of other things as well (custom colors for translocations based on the chromosome of origin, etc).

If you want a working example you can clone the whole repo and then run the code in https://github.com/MaloofLab/Davis_B_napus_assembly_2023/blob/main/HE_Analysis/Julin_plotsr.Rmd (it is an R file that first creates the plotting files and eventually calls the python plotsr)

    # Read input bedgraph file
    def _readbedgraph(self, chrlengths):
        from collections import deque, defaultdict
        import numpy as np
        from math import ceil

        bw = int(self.bw)
        _chrs = set([c for c in chrlengths[0][1].keys()])
        bincnt = defaultdict(deque)
        skipchrs = []
        curchr = ''
        added_chrs = list()
        with open(self.f, 'r') as fin:
            for line in fin:
                line = line.strip().split()
                try:
                    v = float(line[3])
                except ValueError:
                    if len(line) < 4:
                        self.logger.warning("Incomplete information in bedgraph file at line: {}. Skipping it.".format("\t".join(line)))
                        continue
                if line[0] not in _chrs:
                    if line[0] == '#': continue
                    if line[0] == 'track': continue
                    if line[0] not in skipchrs:
                        self.logger.warning("Chromosome in BEDGRAPH is not present in FASTA or not selected for plotting. Skipping it. BED line: {}".format("\t".join(line)))
                        skipchrs.append(line[0])
                    continue
                if curchr == '':
                    curchr = line[0]
                    #binv = np.zeros(ceil(chrlengths[0][1][curchr]/bw), dtype=float)
                    binv = np.full(ceil(chrlengths[0][1][curchr]/bw), np.nan, dtype=float)
                    s = int(line[1])
                    e = int(line[2])
                    if s//bw == e//bw:
                        binv[s//bw] = v #+= (e-s)*v
                    else:
                        binv[s//bw] = v #+= (bw-(s%bw))*v
                        binv[e//bw] = v #+= (e%bw)*v
                elif curchr == line[0]:
                    s = int(line[1])
                    e = int(line[2])
                    if s//bw == e//bw:
                        binv[s//bw] = v #+= (e-s)*v
                    else:
                        binv[s//bw] = v #+= (bw-(s%bw))*v
                        binv[e//bw] = v #+= (e%bw)*v
                else:
                    if line[0] in added_chrs:
                        self.logger.error("BedGraph file: {} is not sorted. For plotting tracks, sorted BedGraph file is required. Exiting.".format(self.f))
                        sys.exit()
                    bins = np.concatenate((np.arange(0, chrlengths[0][1][curchr], bw), np.array([chrlengths[0][1][curchr]])), axis=0)
                    bins = [(bins[i] + bins[i+1])/2 for i in range(len(bins) - 1)]
                    bincnt[curchr] = deque([(bins[i], binv[i]) for i in range(len(bins))])
                    added_chrs.append(curchr)
                    # Set the new chromosome
                    curchr = line[0]
                    #binv = np.zeros(ceil(chrlengths[0][1][curchr]/bw), dtype=float)
                    binv = np.full(ceil(chrlengths[0][1][curchr]/bw), np.nan, dtype=float)
                    s = int(line[1])
                    e = int(line[2])
                    if s//bw == e//bw:
                        binv[s//bw] = v # += (e-s)*v
                    else:
                        binv[s//bw] = v # += (bw-(s%bw))*v
                        binv[e//bw] = v # += (e%bw)*v
        bins = np.concatenate((np.arange(0, chrlengths[0][1][curchr], bw), np.array([chrlengths[0][1][curchr]])), axis=0)
        bins = [(bins[i] + bins[i+1])/2 for i in range(len(bins) - 1)]
        bincnt[curchr] = deque([(bins[i], binv[i]) for i in range(len(bins))])
        ## Scale count values
        # maxv = 0
        # for k, v in bincnt.items():
        #     for r in v:
        #         if r[1] > maxv:
        #             maxv = r[1]
        # for k, v in bincnt.items():
        #     bincnt[k] = deque([(r[0], r[1]/maxv) for r in v])
        self.bincnt = bincnt
        return
    # END

schneebergerlab / plotsr

Feature request: option to not bin and not rescale bedgraph tracks #34