Presented the ChIP-seq peaks vs flanks comparison at the Friday ENCODE call today. Jeff and Jacob suggested that HashingVectorizer is not a good comparator for my method, since I want a baseline that takes proximity into account, such as locality-sensitive hashing or random projection.
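A minimal sketch of what that proximity-aware baseline could look like, per my reading of the suggestion: explicit k-mer counts followed by a Johnson-Lindenstrauss-style sparse random projection, which approximately preserves pairwise distances, whereas HashingVectorizer's hash collisions map unrelated k-mers onto the same feature. Sequences, k=6, and n_components are placeholders.

```python
import random
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.random_projection import SparseRandomProjection

random.seed(0)
# Placeholder regions; real inputs would be peak/flank sequences.
seqs = ["".join(random.choices("ACGT", k=500)) for _ in range(50)]

# Explicit 6-mer counts: no hash collisions, unlike HashingVectorizer.
counts = CountVectorizer(analyzer="char", ngram_range=(6, 6)).fit_transform(seqs)

# A sparse random projection approximately preserves pairwise
# distances between the count vectors (JL lemma), so regions with
# similar k-mer profiles stay close in the projected space.
proj = SparseRandomProjection(n_components=64, random_state=0).fit_transform(counts)
```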
Jeff had a bunch of suggestions:
experiment with increasing K
experiment with increasing K while also changing the stride to something other than 1; Jeff says this scheme is equivalent to a Kth-order Markov model (see the tokenization sketch after this list)
experiment with adding a supervised component to the objective. One candidate for supervision: minimize the difference between pairwise distances computed by a gapped k-mer kernel and the Euclidean distances between the embedded k-mers (see the loss sketch after this list)
Jeff suggested doing a ton of experiments and seeing which actually lead to more useful representations; at this stage I should do much more exploration and less exploitation.
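A minimal sketch of the K/stride tokenization Jeff described (the function name and example sequence are mine, not from the call):

```python
def kmer_tokens(seq, k, stride=1):
    """Extract k-mers from seq with a configurable stride.
    With stride=1 every position contributes a k-mer, which is the
    case Jeff connects to a Kth-order Markov model."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

# stride > 1 thins the tokens: k=4 with stride=2 on a 10-mer.
print(kmer_tokens("ACGTACGTAC", k=4, stride=2))  # ['ACGT', 'GTAC', 'ACGT', 'GTAC']
```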
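And a rough numpy sketch of the supervised distance-matching term; D_gkm stands in for precomputed gapped k-mer kernel distances, which the notes don't pin down:

```python
import numpy as np

def distance_matching_loss(emb, D_gkm):
    """MSE between pairwise Euclidean distances of embedded k-mers
    and the corresponding gapped k-mer kernel distances.
    emb: (n, d) embedding matrix; D_gkm: (n, n) distance matrix."""
    diff = emb[:, None, :] - emb[None, :, :]
    D_emb = np.sqrt((diff ** 2).sum(-1))   # pairwise Euclidean distances
    return ((D_emb - D_gkm) ** 2).mean()   # penalize mismatch
```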
This suggests another possible measure of validation:
Can I predict open vs closed chromatin on the within-cell-type task, with one chromosome held out?
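A minimal sketch of that validation, assuming a per-region embedding matrix X, open/closed labels y, and a parallel chrom array (all names hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def heldout_chrom_auc(X, y, chrom, held_out="chr21"):
    """Train on all chromosomes except `held_out`, test on it."""
    test = (chrom == held_out)
    clf = LogisticRegression(max_iter=1000).fit(X[~test], y[~test])
    return roc_auc_score(y[test], clf.predict_proba(X[test])[:, 1])
```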
Ran the random projection comparison: unless I'm mishandling the counts -> projection step, my embedding features generally work better.
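For the record, the counts -> projection treatment I'd consider "proper" looks roughly like this; the L2 normalization before projecting is my assumption, not something settled on the call:

```python
import scipy.sparse as sp
from sklearn.preprocessing import normalize
from sklearn.random_projection import GaussianRandomProjection

# Stand-in for a (regions x k-mers) count matrix; real counts assumed.
counts = sp.random(200, 4096, density=0.01, random_state=0)

# L2-normalize each region's counts before projecting, so the random
# projection approximately preserves cosine geometry rather than
# letting high-count regions dominate the distances.
X = normalize(counts, norm="l2")
X_proj = GaussianRandomProjection(n_components=128, random_state=0).fit_transform(X)
```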