jmschrei / tfmodisco-lite

A lite implementation of tfmodisco, a motif discovery algorithm for genomics experiments.
MIT License
56 stars 16 forks source link

Fix bug in gapped kmer #55

Open thomasopsomer opened 5 days ago

thomasopsomer commented 5 days ago

Hello @jmschrei,

I had issues in the function that computes scores for the gapped kmers: _extract_gkmers in _seqlet_to_gkmers. It seems that if there is less gapped kmers extracted than the parameter max_entries there is a bug. It's just a quick fix that ensure we don't have size issues but I didn't really think this a lot :)

jmschrei commented 5 days ago

Thanks for the contribution. Would you mind undoing the formatting changes? It's making it challenging to see what the actual contribution is.

thomasopsomer commented 4 days ago

Ah yes sorry it was ruff stuff I removed the formatting

jmschrei commented 3 days ago

My recollection is that the number of gapped k-mers that could be considered for a sequence is not data-dependant. The only filtering being done here is to make sure that each considered gkmer is a valid gkmer. Why wouldn't you just reduce max_entries to a value more appropriate for your data? It is supposed to be much smaller than the number of potential gkmers.

thomasopsomer commented 7 hours ago

I went back to see the data that caused the bug, and I understand better now:

I used a sliding_window_size of 20 and flank_size of 10, which creates to seqlets of length 40. However when computing the affinity matrix (in seqlets_to_patterns), it keeps only the positions of the topn (=20 by default) best positions considering their attribution. By default the max gap size is 15. For sequence of 40 there could be gaps of more than 15 and so it could produce much less gkmers than max_entries...

Yes it's possible get ride of the bug by decreasing flank_size. Or by tweaking the parametersmax_entries, top_n or max_gap in cosine_similarity_from_seqlets, however these parameters are not exposed to the main TFMoDISco method.

So finally I'm wondering if there is much to do ^^. Maybe raising an Error explaining what parameters to change in order to avoid the bug ?