kundajelab / ChromDragoNN

Code for the paper "Integrating regulatory DNA sequence and gene expression to predict genome-wide chromatin accessibility across cellular contexts"
MIT License

Structure of dnase.chr.packbited.tar.gz data #7

Closed sb-vsfishman closed 3 years ago

sb-vsfishman commented 3 years ago

Dear Kundaje lab,

I'm exploring the data associated with your paper "Integrating regulatory DNA sequence and gene expression to predict genome-wide chromatin accessibility across cellular contexts". I've downloaded the dnase.chr.packbited.tar.gz file from http://mitra.stanford.edu/kundaje/projects/seqxgene/ and loaded it using the joblib library.

Could you please explain the structure of these data? I see that num_cell_types is set to 123, so I would expect the labels to have shape N_seqs × N_cell_types; however, the shape of labels is (398270, 16).

Thanks in advance,
Veniamin

suragnair commented 3 years ago

Hi Veniamin

I apologise for the inconvenient legacy format. The labels are stored by calling np.packbits on the binary label vector:

https://github.com/kundajelab/ChromDragoNN/blob/cd25818514ab90fad53c64550eea9f43908f2fbb/preprocess/make_accessibility_joblib.py#L78

The 123 values can be recovered by calling np.unpackbits and slicing out the first 123 values.

The file linked above also shows how this format is created from a much simpler input format.
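To illustrate why 123 binary labels end up as 16 columns, here is a minimal sketch on synthetic data. It assumes the labels are packed along the last axis, which is consistent with the (398270, 16) shape reported above but is not copied from the preprocessing script; the array names and the random labels are made up for illustration.

```python
import numpy as np

num_cell_types = 123  # value reported in this thread

# Synthetic binary labels for 4 hypothetical sequences (not real data).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=(4, num_cell_types), dtype=np.uint8)

# Assumption: packing happens along the last axis, 8 labels per byte.
# 123 bits do not fill a whole number of bytes, so np.packbits pads
# the final byte with zeros: ceil(123 / 8) = 16 bytes per row.
packed = np.packbits(labels, axis=-1)

print(packed.shape)  # (4, 16)
print(packed.dtype)  # uint8
```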

sb-vsfishman commented 3 years ago

Hi Surag,

Thank you for your comment. When I run np.unpackbits, I obtain an array of length 50978560, which is not divisible by 123. Am I doing something wrong?

import joblib
import numpy as np

a = joblib.load("dnase.chr19.packbit.joblib")
# Length of the unpacked label array, and its remainder modulo 123
print(len(np.unpackbits(a["labels"])))
print(len(np.unpackbits(a["labels"])) % 123)


suragnair commented 3 years ago

In [5]: a=joblib.load("dnase.chr19.packbit.joblib")

In [6]: a['labels'].shape
Out[6]: (398270, 16)

In [7]: np.unpackbits(a['labels']).shape
Out[7]: (50978560,)

In [8]: np.unpackbits(a['labels'], axis=-1).shape
Out[8]: (398270, 128)

Pass axis=-1 to np.unpackbits and keep the first 123 of the 128 values in each row; the last 5 bits are zero padding added by np.packbits.
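The recovery step can be sketched as a small helper and checked with a round trip on synthetic data. The function name unpack_labels and the random labels are hypothetical and not part of the ChromDragoNN code; only np.packbits, np.unpackbits, and the value 123 come from this thread.

```python
import numpy as np

NUM_CELL_TYPES = 123  # value reported in this thread


def unpack_labels(packed, num_cell_types=NUM_CELL_TYPES):
    """Unpack a (n_seqs, n_bytes) uint8 array into (n_seqs, num_cell_types) bits.

    Hypothetical helper: unpacks each row separately (axis=-1) and drops
    the zero-padding bits at the end of each row.
    """
    bits = np.unpackbits(packed, axis=-1)  # shape (n_seqs, n_bytes * 8)
    return bits[:, :num_cell_types]


# Round-trip check on synthetic labels (not real data).
rng = np.random.default_rng(0)
original = rng.integers(0, 2, size=(5, NUM_CELL_TYPES), dtype=np.uint8)
packed = np.packbits(original, axis=-1)
recovered = unpack_labels(packed)

print(packed.shape, recovered.shape)        # (5, 16) (5, 123)
print(np.array_equal(original, recovered))  # True
```

With the file from this thread, the equivalent call would be unpack_labels(a["labels"]), which should give an array of shape (398270, 123). In NumPy 1.17 and later, np.unpackbits(packed, axis=-1, count=123) should do the unpacking and slicing in a single call.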