kundajelab / tfmodisco

TF MOtif Discovery from Importance SCOres

self.seqlet_aggregator returns empty list #6

Open · s6juncheng opened 6 years ago

s6juncheng commented 6 years ago

Hi @AvantiShri, I'm getting the following output and error when running the workflow:

Computed threshold 2.1903742329729723
378 coords remaining after thresholding
After resolving overlaps, got 378 seqlets
1 activity patterns with support >= 200 out of 3 possible patterns
Metacluster sizes:  [378]
Idx to activities:  {0: '-1'}
On metacluster 0
Metacluster size 378
Relevant tasks:  ('task0',)
Relevant signs:  (-1,)
(Round 1) num seqlets: 378
(Round 1) Computing coarse affmat
Beginning embedding computation
Computing embeddings
Finished embedding computation in 17.45 s
Starting affinity matrix computations
Normalization computed in 0.34 s
Cosine similarity mat computed in 0.56 s
Normalization computed in 0.34 s
Cosine similarity mat computed in 0.55 s
Finished affinity matrix computations in 1.33 s
(Round 1) Compute nearest neighbors from coarse affmat
Computed nearest neighbors in 0.01 s
(Round 1) Computing affinity matrix on nearest neighbors
Launching nearest neighbors affmat calculation job
Job completed in: 4.04 s
Launching nearest neighbors affmat calculation job
Job completed in: 4.01 s
(Round 1) Computed affinity matrix on nearest neighbors in 8.64 s
Filtered down to 309 of 378
(Round 1) Retained 309 rows out of 378 after filtering
(Round 1) Computing density adapted affmat
[t-SNE] Computing 31 nearest neighbors...
[t-SNE] Indexed 309 samples in 0.000s...
[t-SNE] Computed neighbors for 309 samples in 0.001s...
[t-SNE] Computed conditional probabilities for sample 309 / 309
[t-SNE] Mean sigma: 0.000000
(Round 1) Computing clustering
Beginning preprocessing + Louvain
Wrote graph to binary file in 0.027967453002929688 seconds
Running Louvain modularity optimization
[Parallel(n_jobs=20)]: Done  10 tasks      | elapsed:    0.2s
Louvain completed 200 runs in 2.220554828643799 seconds
[Parallel(n_jobs=20)]: Done 200 out of 200 | elapsed:    1.5s finished
Wrote graph to binary file in 0.02801656723022461 seconds
Running Louvain modularity optimization
After 1 runs, maximum modularity is Q = 0.887512
Louvain completed 51 runs in 2.2180330753326416 seconds
Preproc + Louvain took 4.560196876525879 s
Got 14 clusters after round 1
Counts:
{0: 39, 1: 35, 2: 33, 3: 29, 4: 25, 5: 24, 6: 23, 7: 22, 8: 19, 9: 18, 10: 16, 11: 10, 12: 9, 13: 7}
(Round 1) Aggregating seqlets in each cluster
Aggregating for cluster 0 with 39 seqlets
Trimmed 0 out of 39
Skipped 39 seqlets
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-107-9e507e90d86f> in <module>()
     42     contrib_scores={'task0': scores},
     43     hypothetical_contribs={'task0': hyp_scores},
---> 44     one_hot=exon)

/opt/modules/i12g/anaconda/3-5.0.1/envs/splicing/lib/python3.5/site-packages/modisco/tfmodisco_workflow/workflow.py in __call__(self, task_names, contrib_scores, hypothetical_contribs, one_hot)
    216                 other_comparison_track_names=[])
    217 
--> 218             seqlets_to_patterns_result = seqlets_to_patterns(metacluster_seqlets)
    219             metacluster_idx_to_submetacluster_results[metacluster_idx] =\
    220                 SubMetaclusterResults(

/opt/modules/i12g/anaconda/3-5.0.1/envs/splicing/lib/python3.5/site-packages/modisco/tfmodisco_workflow/seqlets_to_patterns.py in __call__(self, seqlets)
    598                     sys.stdout.flush()
    599                 motifs = self.seqlet_aggregator(cluster_to_seqlets[i])
--> 600                 assert len(motifs)==1
    601                 motif = motifs[0]
    602                 if (self.sign_consistency_func(motif)):

AssertionError: 

When debugging, len(motifs) is 0. cluster_to_seqlets looks normal, but seqlet_aggregator returns an empty list.

AvantiShri commented 6 years ago

Hi @s6juncheng,

You'll notice that it says "Skipped 39 seqlets". A seqlet is skipped when, after the aggregator tries to expand the seqlet on either side, the seqlet coordinates end up running off the edge of the sequence. Can you give me a bit more context, such as the lengths of the regions you are running MoDISco on?

The original size of the seqlets is given by sliding_window_size+flank_to_add (these are arguments to modisco.tfmodisco_workflow.workflow.TfModiscoWorkflow). These seqlets are clustered, and then the clusters are supplied to the aggregator. The aggregator greedily aligns the seqlets and averages them. One issue that comes up here is that seqlets often only partially overlap, so how do you take an average when not all seqlets overlap all positions? The aggregator solves this by expanding the seqlets as needed so that every seqlet overlaps every position in the final aggregation.

Because you supply the original importance score track data, the aggregator is normally able to do this expansion without issue. The only exception is when expanding a seqlet would require going outside the edge of the provided sequence data; in that situation, the aggregator just skips the seqlet. This is usually fine, as very few seqlets get discarded. Unfortunately, in your case it looks like all the seqlets wound up being discarded during this expansion step (it says "Skipped 39 seqlets", and there were only 39 seqlets in that cluster).
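Schematically, the skip condition amounts to something like the following (a simplified sketch with made-up names, not the actual tfmodisco source):

```python
# Minimal sketch of the expansion/skip logic described above.
# All names here are hypothetical, not the actual tfmodisco internals.
def expand_seqlet(start, end, left_expand, right_expand, sequence_length):
    """Expand a seqlet's coordinates so it covers every position of the
    final aggregate; return None if the expansion runs off either edge."""
    new_start = start - left_expand
    new_end = end + right_expand
    if new_start < 0 or new_end > sequence_length:
        return None  # seqlet is skipped, as with the 39 seqlets above
    return new_start, new_end
```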

I think you can alleviate this by either supplying importance scores for wider regions or reducing the size of the seqlets; basically, sliding_window_size+flank_to_add must be a good bit smaller than the length of the full sequence, as in the sketch below. However, if this is not possible (e.g. your sequences are short because you're using PBM or SELEX data or something like that), I can modify the code to expand the seqlets as far as possible without ever discarding them.
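For example, shrinking the seqlets might look roughly like this (a sketch: the call signature follows the snippet in the traceback above, and the numeric values are placeholders, not recommendations):

```python
from modisco.tfmodisco_workflow.workflow import TfModiscoWorkflow

# Sketch: keep sliding_window_size + flank_to_add well below the length
# of the regions you feed in. The values below are placeholders.
workflow = TfModiscoWorkflow(
    sliding_window_size=15,  # seqlet core width
    flank_to_add=5)          # flank appended on each side of a seqlet

results = workflow(
    task_names=['task0'],
    contrib_scores={'task0': scores},             # arrays from the snippet
    hypothetical_contribs={'task0': hyp_scores},  # in the traceback above
    one_hot=exon)
```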

s6juncheng commented 6 years ago

Hi @AvantiShri, thanks for elaborating. The issue arises because many of the motifs sit at the edge of the sequence. I'm wondering whether N-padding the sequence would solve the problem.

AvantiShri commented 6 years ago

Hi @s6juncheng,

Padding the importance score tracks with zeros (which I guess would be the array equivalent of padding with Ns) would indeed likely get rid of the error, but it's a less-than-ideal solution since I don't think you'd want to include the zeros when you are doing the averaging. However, it's worth trying to get an initial set of results, and if the zero padding isn't cutting it, let me know and I can look into modifying the code.
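Concretely, the zero padding could look something like this (a sketch assuming scores, hyp_scores, and exon are arrays of shape (num_examples, length, 4); adjust if yours are stored as lists of per-sequence 2D arrays):

```python
import numpy as np

# Sketch: pad every track with `pad` positions of zeros on both sides,
# the array equivalent of padding the sequence with Ns.
pad = 10  # placeholder; pick something at least as large as the flank

pad_spec = ((0, 0), (pad, pad), (0, 0))  # pad only the length axis
scores_padded = np.pad(scores, pad_spec, mode='constant')
hyp_scores_padded = np.pad(hyp_scores, pad_spec, mode='constant')
exon_padded = np.pad(exon, pad_spec, mode='constant')  # all-zero rows = N
```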