nvictus / pybbi

Python bindings to UCSC BigWig and BigBed library
MIT License
29 stars 3 forks source link

BigBed scores: fetch vs fetch_intervals #15

Closed LeoWelter closed 1 year ago

LeoWelter commented 3 years ago

I'm trying to work with bigbed files from JASPAR: http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2020/

The files contain the following columns:

*bb (bigbed) files files can be transformed to *bed files using the bigBedToBed tool from UCSC (http://hgdownload.cse.ucsc.edu/admin/exe/).
    *.tsv.gz files contain both the transformed relative scores and the transformed p-values for TF binding profile matches. Note that their coordinates are 0-based just like in the BED/bigBED files.
        Columns:
            chr
            start (0-based)
            end
            TF name
            rel_score * 1000
            -1 * log10(p_value) * 100
            strand
    To download all files at the same time, use a tool like rsync or wget.

expanded bigbed file using bigbedToBed contains:
chr1    10000   10008   VDR     240     -
chr1    10001   10009   SOX18   175     +
chr1    10001   10018   Dmbx1   328     -
chr1    10002   10010   HOXB6   179     -

The expanded bed file is ~460 GB. The bigbed file is 86 GB and I can read it using:

import bbi

url="http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2020/JASPAR2020_hg19.bb"

with bbi.open(url) as f:
    data = f.fetch('chr1', 94586705, 94587705)`

data now contains an array with the relative scores.

Is it possible to get the corresponding TF name column together with the scores using bbi?

nvictus commented 3 years ago

data now contains an array with the relative scores.

Not quite. Unlike bigwig/bedgraph which stores non-overlapping quantitative intervals along the genome, bigbed/bed can contain overlapping intervals with lots of attributes. Therefore, the flat signal obtained from .fetch() for a bigBed file corresponds to the interval coverage (number of intervals in the file overlapping a given base or bin), not the relative scores. This is also what bigBedSummary does.

If you want the scores and any other BED fields, you'll have to read out the actual interval records themselves:

In [1]: import bbi                                                                                                                                                                                                        

In [2]: f = bbi.open('http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2020/JASPAR2020_hg19.bb')                                                                                                                   

In [3]: f.schema                                                                                                                                                                                                          
Out[3]: 
{'name': 'bed',
 'comment': 'Browser Extensible Data',
 'columns': ['chrom', 'start', 'end', 'name', 'score', 'strand'],
 'dtypes': {'chrom': 'object',
  'start': 'uint32',
  'end': 'uint32',
  'name': 'object',
  'score': 'uint32',
  'strand': 'object'},
 'description': {'chrom': 'Reference sequence chromosome or scaffold',
  'start': 'Start position in chromosome',
  'end': 'End position in chromosome',
  'name': 'Name of item.',
  'score': 'Score (0-1000)',
  'strand': '+ or - for strand'}}

In [4]: f.fetch_intervals('chr1', 94586705, 94587705)                                                                                                                                                                     
Out[4]: 
     chrom     start       end    name  score strand
0     chr1  94586689  94586706   NR3C1    382      +
1     chr1  94586689  94586706   NR3C1    397      -
2     chr1  94586689  94586706   NR3C2    396      +
3     chr1  94586689  94586706   NR3C2    415      -
4     chr1  94586697  94586708   NR2F2    294      -
...    ...       ...       ...     ...    ...    ...
3769  chr1  94587702  94587709  NFATC2    285      -
3770  chr1  94587703  94587713    Sox3    298      -
3771  chr1  94587704  94587711   FOXP3    206      +
3772  chr1  94587704  94587714  BARHL2    201      +
3773  chr1  94587704  94587714    SOX4    253      +

[3774 rows x 6 columns]