jmschrei / tfmodisco-lite

A lite implementation of tfmodisco, a motif discovery algorithm for genomics experiments.
MIT License
56 stars 16 forks source link

Trouble constructing minimal example where motifs are found #50

Closed nickackerman42 closed 5 months ago

nickackerman42 commented 5 months ago

Hey! Thanks so much for all your hard work on this project. I'm having some trouble constructing a minimal example where motifs are found. I'm sure I'm making some kind of silly mistake (and therefore am not actually identifying an 'issue'), but I'm putting this here in case it's helpful to others.

I tried constructing some sequences where 'AAAAA' is an important motif but am getting back an error. I'm using the format (num sequences, 4, length) because that's the format mentioned in the README. I'm on modiscolite version 2.2.0.

import numpy as np
import modiscolite

num_sequences = 1000
length = 50

sequences = np.zeros((num_sequences, 4, length))
# let AAAAA be present in the first 5 chars of all sequences
sequences[:, 0, 0:5] = 1
sequences[:, 1, 5:] = 1

hyp_contrib_scores = np.zeros((num_sequences, 4, length))
# let AAAAA have high contribution scores in all sequences
hyp_contrib_scores[:, 0, 0:5] = 1

pos_sequences, neg_sequences = modiscolite.tfmodisco.TFMoDISco(
    hypothetical_contribs=hyp_contrib_scores,
    one_hot=sequences,
    verbose=True,
)

I get the following error:

Traceback (most recent call last):
  File "/Users/nickackerman/code/TFBind/issue.py", line 16, in <module>
    pos_sequences, neg_sequences = modiscolite.tfmodisco.TFMoDISco(
  File "/Users/nickackerman/.local/share/virtualenvs/TFBind-XRJQk_3F/lib/python3.10/site-packages/modiscolite/tfmodisco.py", line 281, in TFMoDISco
    seqlet_coords, threshold = extract_seqlets.extract_seqlets(
  File "/Users/nickackerman/.local/share/virtualenvs/TFBind-XRJQk_3F/lib/python3.10/site-packages/modiscolite/extract_seqlets.py", line 158, in extract_seqlets
    pos_null_values, neg_null_values = _laplacian_null(track=smoothed_tracks, 
  File "/Users/nickackerman/.local/share/virtualenvs/TFBind-XRJQk_3F/lib/python3.10/site-packages/modiscolite/extract_seqlets.py", line 34, in _laplacian_null
    (np.percentile(a=pos_values, q=percentiles_to_use)-mu))
  File "/Users/nickackerman/.local/share/virtualenvs/TFBind-XRJQk_3F/lib/python3.10/site-packages/numpy/lib/function_base.py", line 4283, in percentile
    return _quantile_unchecked(
  File "/Users/nickackerman/.local/share/virtualenvs/TFBind-XRJQk_3F/lib/python3.10/site-packages/numpy/lib/function_base.py", line 4555, in _quantile_unchecked
    return _ureduce(a,
  File "/Users/nickackerman/.local/share/virtualenvs/TFBind-XRJQk_3F/lib/python3.10/site-packages/numpy/lib/function_base.py", line 3823, in _ureduce
    r = func(a, **kwargs)
  File "/Users/nickackerman/.local/share/virtualenvs/TFBind-XRJQk_3F/lib/python3.10/site-packages/numpy/lib/function_base.py", line 4722, in _quantile_ureduce_func
    result = _quantile(arr,
  File "/Users/nickackerman/.local/share/virtualenvs/TFBind-XRJQk_3F/lib/python3.10/site-packages/numpy/lib/function_base.py", line 4831, in _quantile
    slices_having_nans = np.isnan(arr[-1, ...])
IndexError: index -1 is out of bounds for axis 0 with size 0

Can you help me construct a minimal example similar to this where motifs are identified? Thanks!

jmschrei commented 5 months ago

I'm not sure what exactly the issue is but the first two things I would try are putting your motif in the middle of the sequence so that you don't encounter any weird edge effects and setting the background attribution value to something small, like np.random.randn(num_sequences, 4, length) * 0.1 instead of zeroes.

If you use the tangermeme package (https://github.com/jmschrei/tangermeme) you can easily make random one hot encodings to use as background with

from tangermeme.utils import random_one_hot

X = random_one_hot((num_sequences, 4, length), random_state=0).float()

that should account for any weird artifacts of only having two nucleotides in your sequence

nickackerman42 commented 5 months ago

Thank you for the response!

I gave your suggestions a try like this, but seem to be getting the same error:

import numpy as np
import modiscolite
from tangermeme.utils import random_one_hot

num_sequences = 1000
length = 50
start = 25
end = 30

sequences = random_one_hot((num_sequences, 4, length), random_state=0, dtype='float')
sequences = sequences.detach().numpy()

# let AAAAA be present in the first 5 chars of all sequences
sequences[:, :, start:end] = 0
sequences[:, 0, start:end] = 1

hyp_contrib_scores = np.random.randn(num_sequences, 4, length) * 0.1
# let AAAAA have high contribution scores in all sequences
hyp_contrib_scores[:, 0, start:end] = 1

pos_sequences, neg_sequences = modiscolite.tfmodisco.TFMoDISco(
    hypothetical_contribs=hyp_contrib_scores,
    one_hot=sequences,
    verbose=True,
)

Something about the input dimensions seems wrong. After running this, I tried printing some things in the tf-modiscolite codebase.

In the extract_seqlets function of extract_seqlets.py, I printed the following two variables:

def extract_seqlets(attribution_scores, window_size, flank, suppress, 
    target_fdr, min_passing_windows_frac, max_passing_windows_frac, 
    weak_threshold_for_counting_sign):

    print(attribution_scores.shape)

    pos_values, neg_values, smoothed_tracks = _smooth_and_split(
        attribution_scores, window_size)

    print(smoothed_tracks.shape)

I get:

(1000, 4)
(1000, 0)

So, somehow, after smoothing, we end up with a 1000 x 0 array. I tried different dimension orders for the sequences and contribution scores and am able to get past the error when I add these lines reordering the dimensions:

sequences = np.transpose(sequences, (1, 2, 0))
hyp_contrib_scores = np.transpose(hyp_contrib_scores, (1, 2, 0))

However, pos_sequences and neg_sequences (returned from TFModisco) are both None after this change.

Any ideas? I know this is a lot. No worries if you're not sure what's going on.

jmschrei commented 5 months ago

Oh, right. I keep getting mixed up internally because PyTorch has the length dimension be the last dim but tfmodisco's internals were written with TensorFlow/numpy in mind where the length dimension is the second to last one. See this line which re-orders the dimensions for the command-line tool: https://github.com/jmschrei/tfmodisco-lite/blob/main/modisco#L183

When I run your example I get it identifying seqlets -- so there isn't a problem up to that point.

I think the issue is that modisco gets confused when there is only one pattern and, because it can't really distinguish a cluster of seqlets with only one pattern from just background noise and it doesn't want to oversegment it, it doesn't do anything. If you add more patterns it seems to identify at least one pattern. Note that you need to space the seqlets out a little bit if you want it to identify all of them. Seqlet calling identifies contiguous 30bp regions regardless of the actual length of the seqlet, and if a seqlet call overlaps with the neighbors 30bp block the neighbor doesn't get called.

import numpy as np
import modiscolite
from tangermeme.utils import random_one_hot

num_sequences = 1000
length = 200
start1, end1 = 55, 65
start2, end2 = 90, 100
start3, end3 = 120, 130

sequences = random_one_hot((num_sequences, 4, length), random_state=0, dtype='float')
sequences = sequences.detach().numpy()

# let AAAAA be present in the first 5 chars of all sequences
sequences[:, :, start1:end1] = 0
sequences[:, 0, start1:end1] = 1

sequences[:, :, start2:end2] = 0
sequences[:, 1, start2:end2] = 1

sequences[:, :, start3:end3] = 0
sequences[:, 2, start3:end3] = 1

hyp_contrib_scores = np.random.randn(num_sequences, 4, length) * 0.1
# let AAAAA have high contribution scores in all sequences
hyp_contrib_scores[:, 0, start1:end1] = 1
hyp_contrib_scores[:, 1, start2:end2] = 1
hyp_contrib_scores[:, 2, start3:end3] = 1

sequences = sequences.transpose(0, 2, 1)
hyp_contrib_scores = hyp_contrib_scores.transpose(0, 2, 1)

pos_sequences, neg_sequences = modiscolite.tfmodisco.TFMoDISco(
    hypothetical_contribs=hyp_contrib_scores,
    one_hot=sequences,
    verbose=True,
)

print(pos_sequences)
print(neg_sequences)
nickackerman42 commented 5 months ago

Thank you, Jacob! This is very helpful.