liu-bioinfo-lab / EPCOT

Example of pre-training dataset #3

Closed KazeDog closed 2 months ago

KazeDog commented 11 months ago

Hello. Your code dataset.py shows the process of building your data, but you don't seem to have given a simple example of temp_seq, temp_dnase, and temp_labels, e.g., what kind of information should be included in train_labels.pickle. Would it be possible to provide a simple example showing train_seqs.pickle, train_%s.pickle, and train_labels.pickle?

zzh24zzh commented 11 months ago

Hi, you can access the binary epigenome labels for the four pre-training cell lines at https://zenodo.org/records/7485616, under the filename pretrain_target.pickle.

Description of the data structure of the pretrain_target.pickle file (a loading sketch follows the list):

  1. The label matrix for each chromosome in a cell line, referred to below as 'd', is saved in a dictionary of sparse matrices.
  2. The matrix 'd' has shape m x 245, where m is the number of 1kb bins and 245 is the number of predicted epigenomes. For example, d[102, 8] is the binary label of TF 9 (column index 8) in the region 102,000-103,000.
  3. The order of the TFs is listed at https://github.com/liu-bioinfo-lab/EPCOT/blob/main/Data/epigenomes.txt.
  4. If a TF was not collected for a specific cell line, the corresponding column of 'd' is all zeros.
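
For concreteness, here is a minimal sketch of loading and inspecting these labels. The nested dictionary layout (cell line, then chromosome) is an assumption; inspect the keys before indexing.

import pickle

# Assumed structure: {cell_line: {chromosome: scipy sparse matrix}};
# check the actual keys with .keys() first.
with open('pretrain_target.pickle', 'rb') as f:
    targets = pickle.load(f)
print(targets.keys())

d = targets['GM12878']['1'].toarray()  # dense m x 245 label matrix for chr1
print(d.shape)
print(d[102, 8])  # binary label of TF 9 (column 8) in the 1kb bin 102,000-103,000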

Regarding the 'train_seq' and 'train_dnase' data:

The 'train_seq' data is an m x 4 x 1600 matrix.

  • 'm': the number of 1kb bins.
  • '4': the four nucleotides (ACGT).
  • '1600': a 1kb bin with a 300bp flanking region on each side.

The 'train_dnase' data is an m x 1 x 1600 matrix. The following code may be useful.

import pickle, os
import numpy as np
from scipy.sparse import load_npz

# Chromosome list and reference path are illustrative; adjust to your setup.
chromosomes = [str(i) for i in range(1, 23)] + ['X']
ref_path = 'reference_genome'  # directory holding the per-chromosome '.npz' files

def pad_seq_matrix(matrix, pad_len=300):
    # Add a 300bp flanking region from the neighboring bins to each sample,
    # turning each 4 x 1000 bin into 4 x 1600.
    paddings = np.zeros((1, 4, pad_len)).astype('int8')
    # dmatrix[i] holds the last pad_len bp of bin i-1 (zeros for the first bin)
    dmatrix = np.concatenate((paddings, matrix[:, :, -pad_len:]), axis=0)[:-1, :, :]
    # umatrix[i] holds the first pad_len bp of bin i+1 (zeros for the last bin)
    umatrix = np.concatenate((matrix[:, :, :pad_len], paddings), axis=0)[1:, :, :]
    return np.concatenate((dmatrix, matrix, umatrix), axis=2)

def pad_signal_matrix(matrix, pad_len=300):
    # Same flanking scheme for the 1D DNase signal (each bin becomes length 1600)
    paddings = np.zeros(pad_len).astype('float32')
    dmatrix = np.vstack((paddings, matrix[:, -pad_len:]))[:-1, :]
    umatrix = np.vstack((matrix[:, :pad_len], paddings))[1:, :]
    return np.hstack((dmatrix, matrix, umatrix))

# The DNase file can be obtained following the procedure at
# https://github.com/liu-bioinfo-lab/EPCOT/tree/main/Input
with open('GM12878_dnase.pickle', 'rb') as f:
    dnases = pickle.load(f)

seq_data = {}
dnase_data = {}

temp_seq_list = []
temp_dnase_list = []

for chrom in chromosomes:
    # The '.npz' reference genome files can be generated with
    # https://github.com/liu-bioinfo-lab/EPCOT/blob/main/Input/reference_genome.py
    ref_file = os.path.join(ref_path, 'chr%s.npz' % chrom)
    ref_gen_data = load_npz(ref_file).toarray().reshape(4, -1, 1000).swapaxes(0, 1)
    ref_gen_data = pad_seq_matrix(ref_gen_data)

    chrom_dnase = dnases[chrom].toarray()
    chrom_dnase = np.expand_dims(pad_signal_matrix(chrom_dnase.reshape(-1, 1000)), axis=1)
    temp_seq_list.append(ref_gen_data)
    temp_dnase_list.append(chrom_dnase)

dnase_data['GM12878'] = np.vstack(temp_dnase_list)
seq_data['GM12878'] = np.vstack(temp_seq_list)

The data can then be randomly split into training, validation, and testing sets. During training, we use only the regions that exhibit at least one positive binding event across the 245 epigenomes, as sketched below.
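
A minimal sketch of that filtering and splitting, continuing from the variables built above and assuming 'labels' is the m x 245 sparse label matrix for GM12878 aligned with those rows (names are illustrative):

# Keep only the 1kb bins with at least one positive label across the 245 epigenomes
mask = np.asarray(labels.sum(axis=1)).ravel() > 0
seqs = seq_data['GM12878'][mask]
dnase = dnase_data['GM12878'][mask]
targets = labels.toarray()[mask]

# Random 80/10/10 split
idx = np.random.permutation(len(seqs))
n_train, n_val = int(0.8 * len(idx)), int(0.1 * len(idx))
train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]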

Hope this answers your questions.

KazeDog commented 10 months ago

Thank you for your answer. We were surprised by the excellent performance of EPCOT on the cross-cell-type task. However, how did you mix data from different cell types for training and make predictions on unseen cell types? I noticed that the BatchNorm layers are also updated during the prediction task, which might allow the model to adapt to new cell types. However, does it confuse the model when multiple cell types are trained together?

zzh24zzh commented 10 months ago

During the pre-training task, we shuffle samples from different cell types and choose a large batch size to ensure that each batch contains a diverse set of training samples from different cell types (see the sketch below). This allows BatchNorm to better estimate the population statistics for unseen cell types.
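
For illustration, a minimal PyTorch sketch of this mixing, assuming per-cell-line arrays like those built earlier in this thread ('labels', the cell-line list, and the batch size are illustrative):

import torch
from torch.utils.data import TensorDataset, ConcatDataset, DataLoader

cell_lines = ['GM12878', 'K562']  # illustrative; use all four pre-training cell lines
datasets = [
    TensorDataset(
        torch.as_tensor(seq_data[c], dtype=torch.float32),
        torch.as_tensor(dnase_data[c], dtype=torch.float32),
        torch.as_tensor(labels[c], dtype=torch.float32),  # dense m x 245 labels per cell line
    )
    for c in cell_lines
]
# A large batch size plus shuffle=True mixes cell types within every batch,
# so BatchNorm statistics are estimated across cell types.
loader = DataLoader(ConcatDataset(datasets), batch_size=256, shuffle=True)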

However, for downstream tasks like Micro-C prediction, where the input genomic region is large and the batch size is necessarily small, we indeed switch to using batch statistics during cross-cell-type inference by setting the batch size to 1. In this scenario, BatchNorm effectively becomes InstanceNorm, which normalizes each input sample independently and helps the model deal with new cell types. Here is the code for forcing BatchNorm to use batch statistics at inference, which you can find in usage_util.py.

model.eval()
for m in model.modules():
    # Putting only the BatchNorm layers back in train mode makes them normalize
    # with the current batch's statistics instead of the stored running averages.
    if m.__class__.__name__.startswith('BatchNorm'):
        m.train()

Hope this answers your question.

KazeDog commented 10 months ago

OK, maybe I still need to understand your processing strategy in more depth. Thank you for your reply.