kundajelab / tfmodisco

TF MOtif Discovery from Importance SCOres
MIT License

question about hypothetical contribs + "cannot do a non-empty take from an empty axes" error #95

Sakurag1l opened this issue 3 years ago

Sakurag1l commented 3 years ago

Hi Avanti, when I modify part of the example code I get the error reported below. I don't know why. Could you help me out?

tfmodisco_results = modisco.tfmodisco_workflow.workflow.TfModiscoWorkflow(
    # Slight modifications from the default settings
    sliding_window_size=5,
    flank_size=5,
    target_seqlet_fdr=0.15,
    seqlets_to_patterns_factory=modisco.tfmodisco_workflow.seqlets_to_patterns.TfModiscoSeqletsToPatternsFactory(
        # Note: as of version 0.5.6.0, it's possible to use the results of a motif discovery
        # software like MEME to improve the TF-MoDISco clustering. To use the MEME-based
        # initialization, you would specify the initclusterer_factory as shown in the
        # commented-out code below:
        # initclusterer_factory=modisco.clusterinit.memeinit.MemeInitClustererFactory(
        #     meme_command="meme", base_outdir="meme_out",
        #     max_num_seqlets_to_use=10000, nmotifs=10, n_jobs=1),
        trim_to_window_size=5,
        initial_flank_to_add=5,
        final_flank_to_add=5,
        final_min_cluster_size=6,
        n_cores=10)
)(
    task_names=["0"],  # , "task1", "task2"],
    contrib_scores=task_to_scores,
    hypothetical_contribs=task_to_hyp_scores,
    one_hot=one_hot_seqs,
    null_per_pos_scores=null_per_pos_scores)

IndexError: cannot do a non-empty take from an empty axes.

AvantiShri commented 3 years ago

Hi @Sakurag1l, can you post the full stack trace (i.e. the full error message, which will pinpoint the line of code that threw the error)?

Sakurag1l commented 2 years ago

Thank you. I've solved that problem. But I have a question: does it matter if my hypothetical_contribs are the same as my contrib_scores?

AvantiShri commented 2 years ago

You will still get motifs, but you should expect they will be more split up. Say a motif can have either an A or a T at a particular position; without hypothetical scores, the version with A may be reported as a separate motif from the version with T. The hypothetical scores are meant to help tf-modisco recognize when the underlying motif is the same even though the actual kmer sequence may be different, because the hypothetical scores act as an "autocomplete" of the sequence (in that they reveal what type of pattern the network was detecting at a particular location).

Sakurag1l commented 2 years ago

Thanks! Excuse me, are there other ways to generate hypothetical scores?

AvantiShri commented 2 years ago

There are. What is the method you are currently using to generate your contribution scores?

Sakurag1l commented 2 years ago

Saliency, DeepLIFT, and Saturation Mutagenesis.

Sakurag1l commented 2 years ago

task_to_hyp_contrib_scores = hypothetical_contribs_many_refs_func(
    task_idx=1,
    input_data_sequences=seqs,
    num_refs_per_seq=10,
    batch_size=50,
    progress_update=4000,
)

Does the seqs one-hot encoding only support an N x 4 shape when I use DeepLIFT to generate hypothetical scores? When I trained the model, my one-hot seqs had shape 4 x N x 1.

AvantiShri commented 2 years ago

So the basic idea of hypothetical importance scores is that they are an estimate of what the contribution score would be if a different base were present in the sequence at a given position.

If "saliency" = "gradient * input", then the gradient on all the possible bases (i.e. not just the bases that are present in the actual sequence) would be your hypothetical scores. I assume you compute your saliency score right now by calculating the gradients on the input and then masking by the actual one-hot encoded sequence; just don't do the masking step and you have "hypothetical" scores. These scores literally give you the model's sensitivity to the other bases that could be present in the sequence.
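A minimal NumPy sketch of the "skip the masking step" idea (the gradient and one-hot values below are made-up illustration data, not from this thread):

```python
import numpy as np

# `grads` holds the model's gradient at every possible base (shape: seq_len x 4),
# whether or not that base is present in the actual sequence.
onehot = np.array([[1, 0, 0, 0],
                   [0, 0, 1, 0]], dtype=float)   # the sequence "AG"
grads = np.array([[ 0.5, -0.1,  0.2,  0.0],
                  [-0.3,  0.1,  0.8, -0.2]])

# Actual contribution scores: mask the gradients by the observed bases
contrib_scores = grads * onehot

# Hypothetical scores: simply don't mask; the gradients on the absent
# bases give the model's sensitivity to the other bases at each position
hypothetical_contribs = grads
```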

If using DeepLIFT, then there are a few ways to generate hypothetical scores:

If you used the original deeplift repository, you can get hypothetical scores as demonstrated in: https://github.com/kundajelab/tfmodisco/blob/master/examples/simulated_TAL_GATA_deeplearning/Generate%20Importance%20Scores.ipynb

To your question, yes, I assumed "N x channels". Can you add some permute/reshape layers to the beginning of your model to get it to take inputs of the "N x 4" shape? If you do that, then you can use the following DeepSHAP notebook to get hypothetical scores (you won't be able to use the deeplift implementation in the deeplift repository as I haven't implemented permute and reshape layers in it, but the DeepSHAP implementation of deeplift will work): https://colab.research.google.com/github/AvantiShri/shap/blob/5fdad0651176cdbf1acd6c697604a71529695211/notebooks/deep_explainer/Tensorflow%20DeepExplainer%20Genomics%20Example%20With%20Hypothetical%20Importance%20Scores.ipynb
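In NumPy terms, the suggested Permute/Reshape prelude amounts to the following shape transformation (a sketch with an assumed sequence length and batch size; in Keras this would be `Permute((2, 1))` followed by `Reshape((4, seq_len, 1))` at the front of the model):

```python
import numpy as np

seq_len = 100
batch = np.zeros((8, seq_len, 4))      # batch of 8 inputs in N x 4 layout
x = np.transpose(batch, (0, 2, 1))     # permute -> batch x 4 x N
x = x.reshape(8, 4, seq_len, 1)        # add channel dim -> batch x 4 x N x 1
```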

If using Saturation Mutagenesis: assuming you mutate each base to all 3 other possible bases, and you record the model output for all the bases, then it should be straightforward to compute what the mutagenesis score would have been if the other possible bases were present. You should compute your saturation mutagenesis scores in a way that retains sign information. What I like to do is to compute the value of the output logit over all 4 bases at each position (for one of the 4 bases this will be the same as the original model output), and then subtract the mean logit over all 4 from each of the 4 values. Bases that increase the output relative to the mean will get a positive score, and bases that decrease the output will get a negative score.
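The mean-subtraction step described above can be sketched as follows (the logit values are illustrative, not real model outputs):

```python
import numpy as np

# `logits` holds the output logit with each of the 4 possible bases substituted
# at each position (shape: 4 bases x seq_len); at every position, one of the
# 4 values equals the original model output for the unmutated sequence.
logits = np.array([[2.0, 1.0],    # A substituted at position 0, 1
                   [0.0, 1.0],    # C
                   [1.0, 3.0],    # G
                   [1.0, 1.0]])   # T

# Subtract the per-position mean over the 4 bases; bases that raise the output
# relative to the mean get positive scores, the rest get negative scores.
hyp_scores = logits - logits.mean(axis=0, keepdims=True)
```

Note that the per-position scores sum to zero by construction, which is what preserves the sign information.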

Does that make sense?

Sakurag1l commented 2 years ago

OK! Thanks!! I'm going to try to use DeepLIFT.