Re-focus to start embedding ATAC-seq sub-peaks

lzamparo commented 6 years ago

Based on a discussion with C today, I'm going to shift away from embedding SELEX-seq probes and towards embedding shorter windows of ATAC-seq peaks.

To that end, I need to do several things to prepare the data set for embedding:

- The embedding code needs to take input sequences and turn them into sentences of k-mers. Currently I have an atlas of regions, but not the sequences that underlie them. So, I need to turn my atlas peaks into sequences to be parsed.
- I need to write code to extract 50bp sub-windows from within a given peak, and to compute the corresponding average coverage score for each.
- I need to hack the data prep code to withhold an entire chromosome, or an entire cell type, for testing.
- I need to integrate the GC-content bias correction from Basenji (gcapc, in R; apparently also in a Basenji Python script).
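A minimal sketch of the first three items, to pin down what they mean. Everything here is an assumption rather than the repo's actual code: the function names (`sub_windows`, `kmer_sentence`, `split_by_chrom`), half-open coordinates, non-overlapping windows, k = 6, and `chr8` as the held-out chromosome are all placeholders.

```python
from typing import Iterator, List, Sequence, Tuple

# A peak record is (chrom, start, end) with half-open coordinates (assumed).
Peak = Tuple[str, int, int]


def sub_windows(start: int, end: int, width: int = 50,
                stride: int = 50) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) windows of fixed width that fit
    entirely inside the peak; a trailing partial window is dropped."""
    for w_start in range(start, end - width + 1, stride):
        yield (w_start, w_start + width)


def kmer_sentence(seq: str, k: int = 6, stride: int = 1) -> List[str]:
    """Turn a DNA sequence into a 'sentence' of overlapping k-mers."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def split_by_chrom(peaks: Sequence[Peak],
                   held_out: str = "chr8") -> Tuple[List[Peak], List[Peak]]:
    """Withhold every peak on one chromosome for testing."""
    train = [p for p in peaks if p[0] != held_out]
    test = [p for p in peaks if p[0] == held_out]
    return train, test
```

For example, a peak spanning 100 to 230 gives the windows (100, 150) and (150, 200); the last 30bp are dropped because they do not fill a whole window. Holding out a cell type instead would be the same filter applied to a cell-type field.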

lzamparo commented 6 years ago

Further to this list, I need to write and compare embedding methods:

- A word2vec embedder, which induces a sub-peak embedding from the embeddings of its constituent k-mers
- A vanilla CNN embedding on one-hot encoded sequence
- A VAE embedding based on k-mers (or k-mer priors)
- An LSTM embedding based on sequence
- An LSTM embedding based on k-mers
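For the first method, one simple way to induce a sub-peak embedding from k-mer embeddings is to average the vectors of the k-mers in the sub-peak. This is an assumption about the pooling step, not something decided in this thread, and the lookup table below is a toy stand-in for vectors learned by a trained word2vec model. The name `mean_kmer_embedding` is hypothetical.

```python
from typing import Dict, List, Sequence


def mean_kmer_embedding(kmers: Sequence[str],
                        vectors: Dict[str, List[float]],
                        dim: int) -> List[float]:
    """Average the vectors of all k-mers that appear in the lookup table.

    K-mers missing from the table (e.g. those containing N) are skipped.
    Returns the zero vector if no k-mer is found.
    """
    total = [0.0] * dim
    n_found = 0
    for kmer in kmers:
        vec = vectors.get(kmer)
        if vec is None:
            continue
        for i, value in enumerate(vec):
            total[i] += value
        n_found += 1
    if n_found == 0:
        return total
    return [t / n_found for t in total]


# Toy 2-dimensional vectors, purely illustrative.
toy_vectors = {"ACG": [1.0, 0.0], "CGT": [0.0, 1.0]}
embedding = mean_kmer_embedding(["ACG", "CGT", "NNN"], toy_vectors, dim=2)
```

Mean pooling ignores k-mer order, which is part of what the CNN and LSTM variants in the list would capture instead.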

Need to fix a different issue (TODO: #13 #9 ??)

lzamparo commented 6 years ago

Progress update here:

- Embedding code needs to take input sequences and make them into sentences of k-mers. Currently I have an atlas of regions, but not any sequence regions that underlie them. So, I need to turn my atlas peaks into sequences to be parsed.
- ~~I need to write code to extract sub-windows of 50bp from within a given peak.~~
- ~~I need to hack the data prep code to withhold an entire chr for testing, or entire cell-type for testing.~~

The three items above are done. Scoring sub-peaks by calculating the average library-adjusted coverage is a work in progress. I'm not sure I have the data for GM12878 anyhow.
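A rough sketch of the scoring step that is still in progress, under stated assumptions: per-base coverage for the peak is already available as a list, and "library-adjusted" means scaling by total library size (here, per million reads). The actual adjustment may differ; `mean_adjusted_coverage` and its parameters are hypothetical names.

```python
from typing import Sequence


def mean_adjusted_coverage(coverage: Sequence[float], peak_start: int,
                           win_start: int, win_end: int,
                           library_size: int, scale: float = 1e6) -> float:
    """Mean per-base coverage over a half-open sub-window, scaled to
    reads per `scale` reads in the library (assumed normalisation).

    `coverage[i]` is the coverage at genomic position peak_start + i.
    """
    lo = win_start - peak_start
    hi = win_end - peak_start
    if not (0 <= lo < hi <= len(coverage)):
        raise ValueError("sub-window must lie inside the peak")
    if library_size <= 0:
        raise ValueError("library_size must be positive")
    window = coverage[lo:hi]
    return (sum(window) / len(window)) * (scale / library_size)


# Toy example: a 4bp 'peak' starting at position 100, 2M-read library.
score = mean_adjusted_coverage([2, 4, 6, 8], peak_start=100,
                               win_start=101, win_end=103,
                               library_size=2_000_000)
```

Here the window covers the values 4 and 6 (mean 5), and scaling by 1e6 / 2e6 gives a score of 2.5.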

lzamparo / embedding

Re-focus to start embedding ATAC-seq sub-peaks #16