Open thomasopsomer opened 5 days ago
Thanks for the contribution. Would you mind undoing the formatting changes? It's making it challenging to see what the actual contribution is.
Ah yes sorry it was ruff stuff I removed the formatting
My recollection is that the number of gapped k-mers that could be considered for a sequence is not data-dependant. The only filtering being done here is to make sure that each considered gkmer is a valid gkmer. Why wouldn't you just reduce max_entries
to a value more appropriate for your data? It is supposed to be much smaller than the number of potential gkmers.
I went back to see the data that caused the bug, and I understand better now:
I used a sliding_window_size
of 20 and flank_size
of 10, which creates to seqlets of length 40. However when computing the affinity matrix (in seqlets_to_patterns
), it keeps only the positions of the topn
(=20 by default) best positions considering their attribution. By default the max gap size is 15. For sequence of 40 there could be gaps of more than 15 and so it could produce much less gkmers than max_entries
...
Yes it's possible get ride of the bug by decreasing flank_size
. Or by tweaking the parametersmax_entries
, top_n
or max_gap
in cosine_similarity_from_seqlets
, however these parameters are not exposed to the main TFMoDISco method.
So finally I'm wondering if there is much to do ^^. Maybe raising an Error explaining what parameters to change in order to avoid the bug ?
Hello @jmschrei,
I had issues in the function that computes scores for the gapped kmers:
_extract_gkmers
in_seqlet_to_gkmers
. It seems that if there is less gapped kmers extracted than the parametermax_entries
there is a bug. It's just a quick fix that ensure we don't have size issues but I didn't really think this a lot :)