kundajelab / deeplift

Public facing deeplift repo
MIT License
805 stars 162 forks

genomics training example #122

Open saralinker opened 2 years ago

saralinker commented 2 years ago

Do you have an example script for training the genomics model? I am attempting to apply this approach more broadly, and am starting by trying to replicate your example, but my weights are not correct. Any help would be much appreciated!

I'm including the code here that I've written (repurposed from your code) to try to train the model in case that is helpful:

#####################################
# Training Genomics Model
#####################################

from __future__ import print_function
import tensorflow
print("Tensorflow version:", tensorflow.__version__)
import keras
print("Keras version:", keras.__version__)
import numpy as np
print("Numpy version:", np.__version__)

from tensorflow.keras.models import model_from_json
import simdna.synthetic as synthetic

#####################################
# Import Model Architecture from the original DeepLift code
#####################################
keras_model_json = "keras2_conv1d_record_5_model_PQzyq_modelJson.json"
keras_model = model_from_json(open(keras_model_json).read())
keras_model_config = keras_model.get_config()

model_empty = tensorflow.keras.Sequential.from_config(keras_model_config)

#####################################
# Convert Training Set to One Hot Encoding
####################################
def one_hot_encode_along_channel_axis(sequence):
    to_return = np.zeros((len(sequence),4), dtype=np.int8)
    seq_to_one_hot_fill_in_array(zeros_array=to_return,
                                 sequence=sequence, one_hot_axis=1)
    return to_return

def seq_to_one_hot_fill_in_array(zeros_array, sequence, one_hot_axis):
    assert one_hot_axis==0 or one_hot_axis==1
    if (one_hot_axis==0):
        assert zeros_array.shape[1] == len(sequence)
    elif (one_hot_axis==1):
        assert zeros_array.shape[0] == len(sequence)
    #will mutate zeros_array

for (i,char) in enumerate(sequence):
    if (char=="A" or char=="a"):
        char_idx = 0
    elif (char=="C" or char=="c"):
        char_idx = 1
    elif (char=="G" or char=="g"):
        char_idx = 2
    elif (char=="T" or char=="t"):
        char_idx = 3
    elif (char=="N" or char=="n"):
        continue #leave that pos as all 0's
    else:
        raise RuntimeError("Unsupported character: "+str(char))
    if (one_hot_axis==0):
        zeros_array[char_idx,i] = 1
    elif (one_hot_axis==1):
        zeros_array[i,char_idx] = 1

# read in the data in the training set
data_filename = "sequences.simdata"
train_ids_fh = open("test.txt","r")
ids_to_load = [x.rstrip("\n") for x in train_ids_fh]

# read_simdata_file returns ids, sequences, embeddings, and labels
data = synthetic.read_simdata_file(data_filename, ids_to_load=ids_to_load)

onehot_data = np.array([one_hot_encode_along_channel_axis(seq) for seq in data.sequences])

#####################################
# Train Model
####################################

model_empty.compile(loss="mse", optimizer="sgd")
model_empty.fit(onehot_data, data.labels)
model_empty.save_weights("new_model.h5", save_format='h5')
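For reference, the one-hot encoding step in the script above can also be written with a NumPy lookup table instead of a per-character loop. This is only a minimal sketch, not code from the deeplift or simdna repos: the names `ONE_HOT_LOOKUP` and `one_hot_encode` are made up for illustration, and it assumes the same A/C/G/T column order as the script, with N left as all zeros.

```python
import numpy as np

# Hypothetical helper, not part of deeplift or simdna.
# Row i of the table holds the encoding for ASCII code i;
# rows that are not set below (including N/n) stay all-zero.
ONE_HOT_LOOKUP = np.zeros((256, 4), dtype=np.int8)
for idx, base in enumerate("ACGT"):
    ONE_HOT_LOOKUP[ord(base), idx] = 1
    ONE_HOT_LOOKUP[ord(base.lower()), idx] = 1


def one_hot_encode(sequence):
    """Return a (len(sequence), 4) int8 array with columns A, C, G, T."""
    # Reject anything the original loop would also reject.
    unsupported = set(sequence) - set("ACGTNacgtn")
    if unsupported:
        raise RuntimeError(
            "Unsupported character(s): " + "".join(sorted(unsupported)))
    # Turn the string into ASCII codes and index the table with them.
    codes = np.frombuffer(sequence.encode("ascii"), dtype=np.uint8)
    return ONE_HOT_LOOKUP[codes]
```

Stacking the per-sequence arrays, e.g. `np.array([one_hot_encode(s) for s in sequences])`, gives an array of shape (number of sequences, sequence length, 4) when all sequences have the same length, which is the layout the script passes to `fit`.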

AvantiShri commented 2 years ago

Hi @saralinker, sorry for the slow response - I was on medical leave last quarter.

In terms of a tutorial for training genomics models, I think this notebook by Ziga Avsec is a good place to start; it trains a very simple model with 1 convolutional layer, but hopefully it's enough to give you a grounding: https://colab.research.google.com/github/Avsecz/DL-genomics-exercise/blob/master/Simulated.ipynb. Note that colab notebooks currently default to tensorflow version 2, and if you want to force an earlier version of tensorflow you need to execute the command %tensorflow_version 1.x at the beginning of the notebook.

When you say your "weights are not correct", can you be more specific? In case you were running into an hdf5 error with reading the model weights, this was because the model weights were saved with an earlier version of the h5py library; you have to use h5py < 3.0.0 for reading the weights to work. I have updated the example colab notebook in the deeplift repo to reflect this: https://colab.research.google.com/github/kundajelab/deeplift/blob/master/examples/genomics/genomics_simulation.ipynb
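The h5py constraint mentioned above can be checked before trying to load the weights. The following is a minimal sketch that assumes plain dotted version strings such as `2.10.0`; the helper name `h5py_can_read_old_weights` is hypothetical, and the 3.0.0 cutoff is simply the one stated in the comment above.

```python
def h5py_can_read_old_weights(version_string):
    """Return True if the given h5py version string is below 3.0.0."""
    # Only the major version matters for a "< 3.0.0" check.
    major = int(version_string.split(".")[0])
    return major < 3


if __name__ == "__main__":
    try:
        import h5py
    except ImportError:
        print("h5py is not installed")
    else:
        if h5py_can_read_old_weights(h5py.__version__):
            print("h5py", h5py.__version__, "is below 3.0.0")
        else:
            print("h5py", h5py.__version__,
                  "is 3.0.0 or newer; try: pip install 'h5py<3.0.0'")
```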

In terms of interpretation, if you have trouble using this particular deeplift repository, then you might have more luck using the DeepSHAP implementation (DeepSHAP is an extension of deeplift, and the implementation is done in a more flexible way such that it works with a wider array of models). I have an example notebook using DeepSHAP here: https://colab.research.google.com/github/AvantiShri/shap/blob/5fdad0651176cdbf1acd6c697604a71529695211/notebooks/deep_explainer/Tensorflow%20DeepExplainer%20Genomics%20Example%20With%20Hypothetical%20Importance%20Scores.ipynb. I also have detailed slides from a lab meeting I gave on using DeepSHAP, in case those are helpful: https://docs.google.com/presentation/d/1JCLMTW7ppA3Oaz9YA2ldDgx8ItW9XHASXM1B3regxPw/edit?usp=sharing