Presented the ChIP-seq peaks vs flanks comparison at the Friday ENCODE call today. Jeff and Jacob suggested that HashingVectorizer is not a good comparator for my method, since I want a baseline that takes proximity into account, such as locality-sensitive hashing or random projection.
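A minimal sketch of what that proximity-aware baseline could look like, per my reading of the suggestion: explicit k-mer counts followed by a Johnson-Lindenstrauss-style sparse random projection, which approximately preserves pairwise distances, whereas HashingVectorizer's hash collisions map unrelated k-mers onto the same feature. Sequences, k=6, and n_components are placeholders.

```python
import random
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.random_projection import SparseRandomProjection

random.seed(0)
# Placeholder regions; real inputs would be peak/flank sequences.
seqs = ["".join(random.choices("ACGT", k=500)) for _ in range(50)]

# Explicit 6-mer counts: no hash collisions, unlike HashingVectorizer.
counts = CountVectorizer(analyzer="char", ngram_range=(6, 6)).fit_transform(seqs)

# A sparse random projection approximately preserves pairwise
# distances between the count vectors (JL lemma), so regions with
# similar k-mer profiles stay close in the projected space.
proj = SparseRandomProjection(n_components=64, random_state=0).fit_transform(counts)
```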
Jeff had a bunch of suggestions:
experiment with increasing K
experiment with increasing K while also changing the stride to something other than 1; Jeff says this scheme is equivalent to a Kth-order Markov model (see the tokenization sketch after this list)
experiment with adding a supervised component to the objective. One candidate for supervision: minimize the difference between pairwise distances computed by a gapped k-mer kernel and the Euclidean distances between the embedded k-mers (see the loss sketch after this list)
Jeff suggested doing a ton of experiments and seeing which actually lead to more useful representations; at this stage I should do much more exploration and less exploitation.
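A minimal sketch of the K/stride tokenization Jeff described (the function name and example sequence are mine, not from the call):

```python
def kmer_tokens(seq, k, stride=1):
    """Extract k-mers from seq with a configurable stride.
    With stride=1 every position contributes a k-mer, which is the
    case Jeff connects to a Kth-order Markov model."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

# stride > 1 thins the tokens: k=4 with stride=2 on a 10-mer.
print(kmer_tokens("ACGTACGTAC", k=4, stride=2))  # ['ACGT', 'GTAC', 'ACGT', 'GTAC']
```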
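And a rough numpy sketch of the supervised distance-matching term; D_gkm stands in for precomputed gapped k-mer kernel distances, which the notes don't pin down:

```python
import numpy as np

def distance_matching_loss(emb, D_gkm):
    """MSE between pairwise Euclidean distances of embedded k-mers
    and the corresponding gapped k-mer kernel distances.
    emb: (n, d) embedding matrix; D_gkm: (n, n) distance matrix."""
    diff = emb[:, None, :] - emb[None, :, :]
    D_emb = np.sqrt((diff ** 2).sum(-1))   # pairwise Euclidean distances
    return ((D_emb - D_gkm) ** 2).mean()   # penalize mismatch
```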
This suggests another possible measure of validation:
Can I predict open vs closed chromatin on the within-cell-type task, with one chromosome held out?
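A minimal sketch of that validation, assuming a per-region embedding matrix X, open/closed labels y, and a parallel chrom array (all names hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def heldout_chrom_auc(X, y, chrom, held_out="chr21"):
    """Train on all chromosomes except `held_out`, test on it."""
    test = (chrom == held_out)
    clf = LogisticRegression(max_iter=1000).fit(X[~test], y[~test])
    return roc_auc_score(y[test], clf.predict_proba(X[test])[:, 1])
```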
Ran the random projection comparison: unless I'm mishandling the counts -> projection step, my embedding features generally work better.
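For the record, the counts -> projection treatment I'd consider "proper" looks roughly like this; the L2 normalization before projecting is my assumption, not something settled on the call:

```python
import scipy.sparse as sp
from sklearn.preprocessing import normalize
from sklearn.random_projection import GaussianRandomProjection

# Stand-in for a (regions x k-mers) count matrix; real counts assumed.
counts = sp.random(200, 4096, density=0.01, random_state=0)

# L2-normalize each region's counts before projecting, so the random
# projection approximately preserves cosine geometry rather than
# letting high-count regions dominate the distances.
X = normalize(counts, norm="l2")
X_proj = GaussianRandomProjection(n_components=128, random_state=0).fit_transform(X)
```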