Closed nickackerman42 closed 5 months ago
I'm not sure what exactly the issue is but the first two things I would try are putting your motif in the middle of the sequence so that you don't encounter any weird edge effects and setting the background attribution value to something small, like np.random.randn(num_sequences, 4, length) * 0.1
instead of zeroes.
If you use the tangermeme
package (https://github.com/jmschrei/tangermeme) you can easily make random one hot encodings to use as background with
from tangermeme.utils import random_one_hot
X = random_one_hot((num_sequences, 4, length), random_state=0).float()
that should account for any weird artifacts of only having two nucleotides in your sequence
Thank you for the response!
I gave your suggestions a try like this, but seem to be getting the same error:
import numpy as np
import modiscolite
from tangermeme.utils import random_one_hot
num_sequences = 1000
length = 50
start = 25
end = 30
sequences = random_one_hot((num_sequences, 4, length), random_state=0, dtype='float')
sequences = sequences.detach().numpy()
# let AAAAA be present in the first 5 chars of all sequences
sequences[:, :, start:end] = 0
sequences[:, 0, start:end] = 1
hyp_contrib_scores = np.random.randn(num_sequences, 4, length) * 0.1
# let AAAAA have high contribution scores in all sequences
hyp_contrib_scores[:, 0, start:end] = 1
pos_sequences, neg_sequences = modiscolite.tfmodisco.TFMoDISco(
hypothetical_contribs=hyp_contrib_scores,
one_hot=sequences,
verbose=True,
)
Something about the input dimensions seems wrong. After running this, I tried printing some things in the tf-modiscolite codebase.
In the extract_seqlets
function of extract_seqlets.py
, I printed the following two variables:
def extract_seqlets(attribution_scores, window_size, flank, suppress,
target_fdr, min_passing_windows_frac, max_passing_windows_frac,
weak_threshold_for_counting_sign):
print(attribution_scores.shape)
pos_values, neg_values, smoothed_tracks = _smooth_and_split(
attribution_scores, window_size)
print(smoothed_tracks.shape)
I get:
(1000, 4)
(1000, 0)
So, somehow, after smoothing, we end up with a 1000 x 0 array. I tried different dimension orders for the sequences and contribution scores and am able to get past the error when I add these lines reordering the dimensions:
sequences = np.transpose(sequences, (1, 2, 0))
hyp_contrib_scores = np.transpose(hyp_contrib_scores, (1, 2, 0))
However, pos_sequences
and neg_sequences
(returned from TFModisco) are both None after this change.
Any ideas? I know this is a lot. No worries if you're not sure what's going on.
Oh, right. I keep getting mixed up internally because PyTorch has the length dimension be the last dim but tfmodisco's internals were written with TensorFlow/numpy in mind where the length dimension is the second to last one. See this line which re-orders the dimensions for the command-line tool: https://github.com/jmschrei/tfmodisco-lite/blob/main/modisco#L183
When I run your example I get it identifying seqlets -- so there isn't a problem up to that point.
I think the issue is that modisco gets confused when there is only one pattern and, because it can't really distinguish a cluster of seqlets with only one pattern from just background noise and it doesn't want to oversegment it, it doesn't do anything. If you add more patterns it seems to identify at least one pattern. Note that you need to space the seqlets out a little bit if you want it to identify all of them. Seqlet calling identifies contiguous 30bp regions regardless of the actual length of the seqlet, and if a seqlet call overlaps with the neighbors 30bp block the neighbor doesn't get called.
import numpy as np
import modiscolite
from tangermeme.utils import random_one_hot
num_sequences = 1000
length = 200
start1, end1 = 55, 65
start2, end2 = 90, 100
start3, end3 = 120, 130
sequences = random_one_hot((num_sequences, 4, length), random_state=0, dtype='float')
sequences = sequences.detach().numpy()
# let AAAAA be present in the first 5 chars of all sequences
sequences[:, :, start1:end1] = 0
sequences[:, 0, start1:end1] = 1
sequences[:, :, start2:end2] = 0
sequences[:, 1, start2:end2] = 1
sequences[:, :, start3:end3] = 0
sequences[:, 2, start3:end3] = 1
hyp_contrib_scores = np.random.randn(num_sequences, 4, length) * 0.1
# let AAAAA have high contribution scores in all sequences
hyp_contrib_scores[:, 0, start1:end1] = 1
hyp_contrib_scores[:, 1, start2:end2] = 1
hyp_contrib_scores[:, 2, start3:end3] = 1
sequences = sequences.transpose(0, 2, 1)
hyp_contrib_scores = hyp_contrib_scores.transpose(0, 2, 1)
pos_sequences, neg_sequences = modiscolite.tfmodisco.TFMoDISco(
hypothetical_contribs=hyp_contrib_scores,
one_hot=sequences,
verbose=True,
)
print(pos_sequences)
print(neg_sequences)
Thank you, Jacob! This is very helpful.
Hey! Thanks so much for all your hard work on this project. I'm having some trouble constructing a minimal example where motifs are found. I'm sure I'm making some kind of silly mistake (and therefore am not actually identifying an 'issue'), but I'm putting this here in case it's helpful to others.
I tried constructing some sequences where 'AAAAA' is an important motif but am getting back an error. I'm using the format (num sequences, 4, length) because that's the format mentioned in the README. I'm on modiscolite version 2.2.0.
I get the following error:
Can you help me construct a minimal example similar to this where motifs are identified? Thanks!