kundajelab / fastISM

In-silico Saturation Mutagenesis implementation with 10x or more speedup for certain architectures.
MIT License

single sequence input fails with IndexError #4

Closed by kiramt 3 years ago

kiramt commented 3 years ago

Hi - I saw your MLCB talk and was hoping to try out fastISM. I was working through the code in your tutorial, as below:

        import numpy as np
        import tensorflow as tf

        model = tf.keras.models.load_model("deepseabeluga.h5")
        chr3_enhancer = "CCGGTGATTTTCTGGAGTCTATATCCTTCATCAGATTTTCCAAGGGGTGTCTGTCCCCTCAAAAGAATGATTGTCATTATTTGAAAGACTAG" \
                        "TTCCAGACAGATATTTTATACAAATTTTCCCAGCATTGACATCCCTGAACCAAACTGTTTTTCTTCCCAACATTACTGTTTTCTTCCTTTCT" \
                        "GTCGAGTTTGTTGTTTTGTAATATCAGAATCTCCAGCTCACCTGAGTAAATGGTAACAAGGTGCCACCACCTTTGAATTCTCCCAGAATCCA" \
                        "CCCCACCCTCCGTCAGAGCCACTGCCAAGGCACTCTTACTGATTTCTCCCACACTGCTGGCTCATTGCAAGTGGGAAGACAGCATGTGGAGT" \
                        "GGGTGTGCGGCTTATTAAAGTGAGAACTCAGGGTCAGGGCAGAACCAGGAAGAGAGCAGTGAGATATCCTGCTACCTAATCCAATTCTCCTT" \
                        "TTTGTGCATTTAGCACCCTCCCCTCCGCCTGCATAACAATGGAAGGAAAGAGGAAGTGGGAAAAAAGAAAGTCATGTAATTGAGTTAGAAGA" \
                        "GGTAATGACCAAGACCCTGGAGCAGAGGGAAAGCGGGTTACAAAAGGTGGGTTAAAGAAATCACAAGAGTATGAAGAGCTGGGAAATTACTA" \
                        "ACAAATATTTGCTTGTGTGGGAAAGCAAAAAAGTAAAAACTTCAGTGCTGAATTGGGGCGCTGAGCCACCAGGGAAATTTGAGATTGGCATC" \
                        "AAGGACCGTGTTGAAGCAGGGTGGGCGGAGAAGGAGGGAAAACTACCAGCCAGCTGAGATTTTGCAGCTAGGCTGTGGCCTGATACCGAGTA" \
                        "TCGATGCCGCAAGGGAGGGATGAGTCAGTCCTAGCACGTCCAAGTTTAGAATAATAGACTGTTTGCCACTGGGAAGGCAAACACCTTTCCTG" \
                        "TGAGAGGGCTTGCTGACAGTTCCAATGTCCAAAGTCCAATGCCGACCCAGAAAACTGAGGAGGCCCTGGCCCCTGCAGGAAGGGCTCATTTA" \
                        "CATGGAGACTGAGTAAAGTGCTGTCTTAAACCCTCCTTCCTTCCCCCACTGGGAGGTTTCAGCCAGATATGCCACCCTTTGTAGGATTTCAT" \
                        "AGGGTTGTCTAAAGCCAGGGTTGGCACAGAGCAGAAGCCACAGGGCTAAGTACCAGATTATAATTGTCAATGTCACACCTTACTGCAGAAGC" \
                        "CAGGGAAGGGAGCTAGGAAACTGAAGAGCTTTCTTGGTTATGGGCGGGGCTGTAAATGCAGAGTGTGCCCTGGTGACTCATGGGAGACAGTG" \
                        "AGAAACACTGTGGGGATCTGGTCAACCGGGTACTGATTCCTTTGAGGAAGGTATACTCCACATGCCAACCTGATACTCATGGCTAGTGAAGA" \
                        "GATGGCAGGATTGGGTTGCATCAGCCAGCCTAACTCGACTTGGAAACACAGAAAATAACCCAGAGCAGGTCTCAAGCACTGTGTAACTTTAT" \
                        "TAGTTCATAGTGGCTGAACAGCCATGTTTAGGGCCTCTCAGAAGAAAGAGTTTCATCTTTGGGAAGAAATTTGTGTTGGGTGATTTTGTTCA" \
                        "TATAATTTTGTGTTTTTTGTTTTGTTTTGGTGTTTGAGACAGGGCCTCACTCTCTCACACAGGCTGGAGTGCAGTGGCACCATCTTAGCTCA" \
                        "CTGCAACCTCTACCTTCCTGCCTCAAGCGATCCTCCTACTTCAGCCTCCTGCATAGCTGGGACTACAGGCACGTATCACTCAACCCAGCTAA" \
                        "TTTTTTTTTTTTCGAGATGCAGTCTTGCTCTGTCACCCAGGCTGGAGAGCAATGGCACTATCTTGGCTCACTGTAACCCCCGCCTCCCAGTC" \
                        "TCTGCCTCCTGAGTAGCTGGGATTACAGGCTCCTGCCACCACCCCCGGCTCAGCTAATTATTTCTTTCTTTCTTTTTTCTGAGATGAAGTTT" \
                        "CACTCTTGTTGCCCAGGCTGGAGTGCAATGGCACGATCTCAGCTCACTGCAATGTCTGCTTCTGGGGT"

        sequences = [chr3_enhancer]*1

        # Define the one-hot encoding: a base-to-vector mapping and an encoder function
        onehot_mapping = {
            'A': [1,0,0,0],
            'C': [0,1,0,0],
            'G': [0,0,1,0],
            'T': [0,0,0,1],
            'N': [0,0,0,0],
            'a': [1,0,0,0],
            'c': [0,1,0,0],
            'g': [0,0,1,0],
            't': [0,0,0,1],
        }
        def one_hot_encode(sequence):
            return np.array([onehot_mapping[x] for x in sequence])

        onehot_sequences = np.array([one_hot_encode(x) for x in sequences])

        x = tf.constant(onehot_sequences, dtype=model.input.dtype)
        mutations = [[1,0,0,0],
                     [0,1,0,0],
                     [0,0,1,0],
                     [0,0,0,1]]

        from fastism import FastISM

        fast_ism_model = FastISM(model, test_correctness=False)

        fast_ism_out = [fast_ism_model(x, replace_with=mut) for mut in mutations]

It runs fine when I supply a batch of 5 copies of chr3_enhancer, but with a batch of 1 sequence I get the following error:

Traceback (most recent call last):
  File "...test.py", line 328, in test_example
    fast_ism_out = [fast_ism_model(x, replace_with=mut) for mut in mutations]
  File "...test.py", line 328, in <listcomp>
    fast_ism_out = [fast_ism_model(x, replace_with=mut) for mut in mutations]
  File "...python3.7/site-packages/fastism/ism_base.py", line 78, in __call__
    ism_ith_output = self.get_ith_output(inp_batch, i, idxs_to_mutate)
  File "...python3.7/site-packages/fastism/fast_ism.py", line 68, in get_ith_output
    fast_ism_inputs = self.prepare_ith_input(self.padded_inputs, i, idxs_to_mutate)
  File "...python3.7/site-packages/fastism/fast_ism.py", line 73, in prepare_ith_input
    num_to_mutate = idxs_to_mutate.shape[0]
  File "...python3.7/site-packages/tensorflow/python/framework/tensor_shape.py", line 887, in __getitem__
    return self._dims[key].value
IndexError: list index out of range
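
The last frame is the informative one: indexing dimension 0 of the shape of idxs_to_mutate raised IndexError, which suggests that shape had no dimensions at all, i.e. the tensor was a scalar. The NumPy sketch below reproduces the same kind of failure. It is an illustration only: the squeeze step is an assumed way of ending up with a scalar, not something taken from fastISM's source, and `leading_dim` is a hypothetical helper.

```python
import numpy as np

# Hypothetical stand-in for idxs_to_mutate; this is NOT fastISM's actual code.
idxs = np.array([3])        # a single index, shape (1,)
scalar = np.squeeze(idxs)   # squeeze drops the length-1 axis, leaving shape ()


def leading_dim(arr):
    """Return arr.shape[0], or None when arr has no axes to index."""
    try:
        return arr.shape[0]
    except IndexError:
        # A rank-0 array has an empty shape tuple, so shape[0] does not exist.
        return None


# np.atleast_1d restores a leading axis, so shape[0] is valid again.
safe = np.atleast_1d(scalar)
```

Under that assumption, any step that leaves exactly one index behind would trigger the error, while zero or several indices would not.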

suragnair commented 3 years ago

Hi Kira, thanks for trying fastISM out! You're right, it seems to be bugging out for a batch size of 1. I'll look into it.

fastISM performs best when each batch holds as many sequences as fit in GPU memory. For small batch sizes it may well end up slower than a standard implementation because of overheads. If you could roughly describe your use case, I may be able to offer more help.

kiramt commented 3 years ago

Thanks Surag. I expect I'd mostly be running with larger batch sizes anyway, and could fall back on the standard implementation if a small batch were required. I was using a single sequence (with my own model etc.) just to check that my input and output processing was set up correctly. When I got the error, I went back to the tutorial to see whether I had done something wrong.
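
For reference, the standard fallback mentioned here can be sketched in a few lines of NumPy. This is a minimal sketch, not fastISM's own code: `naive_ism` and `predict_fn` are hypothetical names, `predict_fn` is assumed to map a (batch, length, 4) array to a (batch, n_outputs) array, and the (batch, length, n_outputs) output layout is simply a choice made for the sketch.

```python
import numpy as np


def naive_ism(predict_fn, onehot_batch, replace_with):
    """Standard ISM: mutate one position at a time and re-run the model.

    predict_fn   -- hypothetical callable, (batch, length, 4) -> (batch, n_outputs)
    onehot_batch -- one-hot sequences, shape (batch, length, 4)
    replace_with -- length-4 vector written at the mutated position
    Returns an array of shape (batch, length, n_outputs).
    """
    onehot_batch = np.asarray(onehot_batch, dtype=np.float32)
    replace_with = np.asarray(replace_with, dtype=np.float32)
    per_position = []
    for pos in range(onehot_batch.shape[1]):
        mutated = onehot_batch.copy()        # leave the caller's array untouched
        mutated[:, pos, :] = replace_with    # overwrite this position in every sequence
        per_position.append(np.asarray(predict_fn(mutated)))
    # Stack the per-position predictions along a new "position" axis.
    return np.stack(per_position, axis=1)
```

With a Keras model, something like `naive_ism(model.predict, onehot_sequences, [1,0,0,0])` should work for any batch size, including 1, at the cost of one forward pass per position.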

suragnair commented 3 years ago

Sounds good! Please don't hesitate to reach out if you get stuck. I'll get to the batch size 1 case soon.

kiramt commented 3 years ago

Hi Surag, I also get the same error if I input 2 sequences that are not identical, e.g. if I set chr3_enhancer_a to chr3_enhancer with the first base changed to G, and have sequences = [chr3_enhancer, chr3_enhancer_a].
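
The two-sequence input described above can be built roughly as follows. The short string here is only a placeholder for the full chr3_enhancer sequence from the first comment, and `diff_positions` is a hypothetical name used to confirm that the two sequences differ at the first base only.

```python
# Short placeholder standing in for the full chr3_enhancer string.
chr3_enhancer = "CCGGTGATTTTCTGGAGTC"

# Python strings are immutable, so build a new string with the first base set to G.
chr3_enhancer_a = "G" + chr3_enhancer[1:]

sequences = [chr3_enhancer, chr3_enhancer_a]

# Positions at which the two sequences differ.
diff_positions = [
    i for i, (a, b) in enumerate(zip(chr3_enhancer, chr3_enhancer_a)) if a != b
]
```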

suragnair commented 3 years ago

Hi Kira, I've pushed some fixes to v0.4.2. Please give it a try and let me know if it works. Thanks!

kiramt commented 3 years ago

Thanks Surag - that seems to have fixed it!

suragnair commented 3 years ago

Great, thanks!