Open s6juncheng opened 6 years ago
Hi @s6juncheng,
You'll notice that it says "skipped 39 seqlets". A seqlet is skipped when, after the network tries to expand the seqlet on either side, the seqlet coordinates end up going off the sequence. Can you give me a bit more context such as the lengths of the regions you are running MoDISco on?
The original size of the seqlets is given by sliding_window_size + flank_to_add (these are arguments to modisco.tfmodisco_workflow.workflow.TfModiscoWorkflow). These seqlets are clustered, and then the clusters are supplied to the aggregator. The aggregator greedily aligns the seqlets and averages them. One issue that comes up here is that seqlets will often only partially overlap, so how do you take an average when not all seqlets overlap all positions? The aggregator solves this by expanding the seqlets as needed so that every seqlet overlaps with every position in the final aggregation. Because you supply the original importance score track data, the aggregator is normally able to do this expansion without issue. The only exception is when expanding a seqlet requires going outside the edge of the provided sequence data. In this situation, the aggregator just skips the seqlet. This is usually ok, as very few seqlets get discarded. Unfortunately, in your case it looks like all the seqlets wind up being discarded during this expansion step (it says "Skipped 39 seqlets" and there were only 39 seqlets in that cluster).
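To make the skipping behaviour concrete, here is a minimal sketch of the expand-or-skip logic described above. It is not the actual MoDISco aggregator code: the function names (expand_seqlet, expand_all) and the representation of a seqlet as a (start, end) pair are assumptions made purely for illustration.

```python
from typing import List, Optional, Tuple

Seqlet = Tuple[int, int]  # hypothetical representation: (start, end), end exclusive


def expand_seqlet(start: int, end: int, left: int, right: int,
                  seq_len: int) -> Optional[Seqlet]:
    """Expand a seqlet by `left`/`right` positions; return None if it leaves the sequence."""
    new_start = start - left
    new_end = end + right
    if new_start < 0 or new_end > seq_len:
        return None  # expansion runs off the edge, so the seqlet would be skipped
    return (new_start, new_end)


def expand_all(seqlets: List[Seqlet], left: int, right: int,
               seq_len: int) -> Tuple[List[Seqlet], int]:
    """Expand every seqlet, keeping those that fit and counting the skipped ones."""
    kept: List[Seqlet] = []
    skipped = 0
    for start, end in seqlets:
        expanded = expand_seqlet(start, end, left, right, seq_len)
        if expanded is None:
            skipped += 1
        else:
            kept.append(expanded)
    return kept, skipped


# Two seqlets sit at the edges of a 50 bp sequence, one sits in the middle.
kept, skipped = expand_all([(0, 21), (35, 50), (10, 31)], left=5, right=5, seq_len=50)
print(kept, "Skipped", skipped, "seqlets")
```

When every seqlet in a cluster sits near an edge, as in this issue, the kept list ends up empty and no motif comes out of the aggregation.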
I think you can alleviate this by either supplying importance scores for wider regions, or reducing the size of the seqlets. Basically, sliding_window_size + flank_to_add must be a good bit smaller than the size of the full sequence. However, if this is not possible (i.e. your sequences are small because you're using PBM or SELEX data or something like that), I can modify the code to just expand the seqlets as far as possible without ever discarding them.
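As a quick sanity check before running the workflow, something like the following can show how much room a seqlet has to expand. This is only a sketch: expansion_headroom is a hypothetical helper, not part of MoDISco, and it takes the seqlet size to be sliding_window_size + flank_to_add as stated above.

```python
def expansion_headroom(seq_len: int, sliding_window_size: int, flank_to_add: int) -> int:
    """Positions left over once a seqlet of the stated size is placed in the sequence."""
    seqlet_size = sliding_window_size + flank_to_add  # size as described in this thread
    return seq_len - seqlet_size


# Example numbers are made up; substitute your own region length and settings.
headroom = expansion_headroom(seq_len=40, sliding_window_size=21, flank_to_add=10)
print("headroom:", headroom)
```

If the headroom is small relative to the seqlet size, most seqlets will hit an edge as soon as the aggregator tries to expand them.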
Hi @AvantiShri, thanks for elaborating. The issue arises because many of the motifs are on the edge of the sequence. I'm wondering whether N-padding the sequence will solve the problem.
Hi @s6juncheng,
Padding the importance score tracks with zeros (which I guess would be the array equivalent of padding with Ns) would indeed likely get rid of the error, but it's a less-than-ideal solution since I don't think you'd want to include the zeros when you are doing the averaging. However, it's worth trying to get an initial set of results, and if the zero padding isn't cutting it, let me know and I can look into modifying the code.
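For reference, zero padding along the sequence axis can be done with numpy.pad. The sketch below assumes the tracks are arrays of shape (num_sequences, seq_len, 4); pad_tracks is a hypothetical helper, and the same padding would need to be applied to every track passed to the workflow (importance scores and one-hot sequences alike) so that they stay aligned.

```python
import numpy as np


def pad_tracks(tracks: np.ndarray, pad: int) -> np.ndarray:
    """Zero-pad `pad` positions on both sides of the sequence axis (axis 1)."""
    # No padding on the sample axis or the base axis, `pad` zeros on each side of the length axis.
    pad_width = ((0, 0), (pad, pad), (0, 0))
    return np.pad(tracks, pad_width, mode="constant", constant_values=0)


# Toy data: 2 sequences of length 10 with 4 channels (A, C, G, T).
scores = np.ones((2, 10, 4))
padded = pad_tracks(scores, pad=5)
print(padded.shape)
```

As noted above, the padded zeros will take part in the averaging, so motifs near the edges may look diluted at their flanks.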
Hi @AvantiShri
When debugging, len(motifs) is 0. cluster_to_seqlets looks normal, but seqlet_aggregator gives an empty list.