Closed sb-vsfishman closed 3 years ago
Hi Veniamin
I apologise for the inconvenient legacy format. np.packbits is called on the binary label vector:
https://github.com/kundajelab/ChromDragoNN/blob/cd25818514ab90fad53c64550eea9f43908f2fbb/preprocess/make_accessibility_joblib.py#L78
The 123 values can be recovered by calling np.unpackbits and slicing the first 123 values.
See the file above for how the format is created from a much simpler input format.
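As a minimal sketch of why a 123-column label matrix ends up 16 bytes wide: np.packbits stores 8 binary labels per uint8 and zero-pads the last byte, so 123 bits need ceil(123 / 8) = 16 bytes. The toy array and the row-wise `axis=-1` argument below are assumptions for illustration and are not copied from the linked script.

```python
import numpy as np

num_cell_types = 123  # value mentioned in this thread

# Toy binary label matrix: 4 sequences x 123 cell types (synthetic data).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=(4, num_cell_types), dtype=np.uint8)

# Pack each row's bits into bytes; the final byte is zero-padded (assumed row-wise packing).
packed = np.packbits(labels, axis=-1)

print(packed.shape)  # (4, 16)
print(packed.dtype)  # uint8
```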
Hi Sugag,
Thank you for your comment. When I run np.unpackbits I obtain an array of length 50978560, which is not divisible by 123. Am I doing something wrong?
import joblib
import numpy as np
a = joblib.load("dnase.chr19.packbit.joblib")
print(len(np.unpackbits(a["labels"])))
print(len(np.unpackbits(a["labels"])) % 123)
In [5]: a=joblib.load("dnase.chr19.packbit.joblib")
In [6]: a['labels'].shape
Out[6]: (398270, 16)
In [7]: np.unpackbits(a['labels']).shape
Out[7]: (50978560,)
In [8]: np.unpackbits(a['labels'], axis=-1).shape
Out[8]: (398270, 128)
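Without an axis argument, np.unpackbits flattens the array, which is why you get 398270 * 128 = 50978560 values. Pass in axis=-1 and slice the first 123 values out of the 128 in each row; the last 5 are padding.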
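The recovery step can be sketched as a small helper. The name `unpack_labels` is hypothetical, and the round-trip check uses synthetic data under the assumption that the labels were packed row-wise; it is an illustration, not the repository's own loader.

```python
import numpy as np


def unpack_labels(packed, num_cell_types=123):
    """Hypothetical helper: undo row-wise np.packbits and drop the padding bits."""
    bits = np.unpackbits(packed, axis=-1)  # shape (N, 8 * packed.shape[-1])
    return bits[:, :num_cell_types]        # keep only the real label columns


# Round-trip check on synthetic data (not the real DNase labels).
rng = np.random.default_rng(0)
original = rng.integers(0, 2, size=(5, 123), dtype=np.uint8)
packed = np.packbits(original, axis=-1)    # shape (5, 16)
restored = unpack_labels(packed)

assert restored.shape == (5, 123)
assert np.array_equal(restored, original)
```

Applied to the file from this thread, `unpack_labels(a["labels"])` would be expected to return an array of shape (398270, 123).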
Dear Kundaje lab,
I'm exploring the data associated with your paper "Integrating regulatory DNA sequence and gene expression to predict genome-wide chromatin accessibility across cellular contexts". I've downloaded the dnase.chr.packbited.tar.gz file from http://mitra.stanford.edu/kundaje/projects/seqxgene/ and loaded it using the joblib library. Could you please explain the structure of these data? I see that num_cell_types is set to 123, so I would expect the shape of labels to be N_seqs * N_cell_types; however, the shape of labels is (398270, 16).
Thanks in advance, Veniamin