lzamparo / embedding

Learning semantic embeddings for TF binding preferences directly from sequence

Re-focus to start embedding ATAC-seq sub-peaks #16

Open lzamparo opened 6 years ago

lzamparo commented 6 years ago

Based on a discussion with C today, I'm going to shift away from embedding SELEX-seq probes and towards embedding shorter windows of ATAC-seq peaks.

To that end, I need to do several things to prepare the data set for embedding:

  1. The embedding code needs to take input sequences and turn them into sentences of k-mers. Currently I have an atlas of regions, but not the sequences that underlie them, so I need to turn my atlas peaks into sequences to be parsed.
  2. I need to write code to extract 50 bp sub-windows from within a given peak, and to compute the corresponding average coverage score.
  3. I need to hack the data prep code to withhold an entire chromosome, or an entire cell type, for testing.
  4. I need to integrate the GC-content bias correction from Basenji (gcapc, in R; apparently also in a Basenji Python script).
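
A minimal sketch of items 1 to 3, for illustration only. The function names, the k-mer size (`k=6`), the window step (`step=25`), and the held-out chromosome (`chr8`) are all assumptions and do not come from the repository. Only the 50 bp window width is taken from the list above.

```python
def kmer_sentence(seq, k=6, stride=1):
    """Split a DNA sequence into k-mers, the 'words' of one sentence."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def sub_windows(peak_seq, coverage, width=50, step=25):
    """Yield (offset, window sequence, mean coverage) for each window in a peak.

    `coverage` holds one value per base and must match `peak_seq` in length.
    Peaks shorter than `width` yield nothing.
    """
    if len(peak_seq) != len(coverage):
        raise ValueError("sequence and coverage must have the same length")
    for start in range(0, len(peak_seq) - width + 1, step):
        window_cov = coverage[start:start + width]
        yield start, peak_seq[start:start + width], sum(window_cov) / width


def split_by_chromosome(records, test_chrom="chr8"):
    """Partition records of the form (chrom, ...) into train and test sets."""
    train = [r for r in records if r[0] != test_chrom]
    test = [r for r in records if r[0] == test_chrom]
    return train, test
```

The same partitioning idea would apply to holding out a cell type: key on the cell-type field of each record instead of the chromosome.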

lzamparo commented 6 years ago

Further to this list, I need to write and compare several embedding methods:

  1. The word2vec embedder, which induces a sub-peak embedding from the embeddings of its constituent k-mers
  2. A vanilla CNN embedding on one-hot encoded sequence
  3. A VAE embedding based on k-mers (or k-mer priors)
  4. An LSTM embedding based on sequence
  5. An LSTM embedding based on k-mers
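
As a rough illustration of the inputs that methods 1 and 2 would consume, here is a minimal sketch in plain Python. It assumes that the sub-peak embedding in method 1 is the unweighted mean of its k-mer vectors, which is one common choice and not necessarily what the repository does. `one_hot`, `mean_kmer_embedding`, and the `kmer_vectors` lookup table are hypothetical names.

```python
ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a DNA sequence as a len(seq) x 4 matrix (columns A, C, G, T).

    Bases outside ACGT, such as N, become all-zero rows.
    """
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET]
            for base in seq.upper()]


def mean_kmer_embedding(kmers, kmer_vectors):
    """Average the vectors of the k-mers present in `kmer_vectors`.

    `kmer_vectors` maps k-mer string -> list of floats (for example,
    vectors exported from a trained word2vec model). Unknown k-mers
    are skipped.
    """
    known = [kmer_vectors[k] for k in kmers if k in kmer_vectors]
    if not known:
        raise ValueError("no k-mer in the window has an embedding")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]
```

The one-hot matrix is the usual input to a CNN or a sequence-level LSTM, while the k-mer list feeds the k-mer based variants.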

Need to fix a different issue (TODO: #13 #9 ??)

lzamparo commented 6 years ago

Progress update here:

  1. The embedding code needs to take input sequences and turn them into sentences of k-mers. Currently I have an atlas of regions, but not the sequences that underlie them, so I need to turn my atlas peaks into sequences to be parsed.
  2. I need to write code to extract 50 bp sub-windows from within a given peak.
  3. I need to hack the data prep code to withhold an entire chromosome, or an entire cell type, for testing.

The three items above are done. Scoring sub-peaks by their average library-adjusted coverage is a work in progress. I'm not sure I have the data for GM12878 anyhow.
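
One possible reading of "library-adjusted coverage", sketched here under the assumption that it means scaling the mean per-base coverage of a window by the total read count of the library (a counts-per-million style normalization). That interpretation, the function name, and the `scale` parameter are assumptions and are not confirmed anywhere in this thread.

```python
def library_adjusted_mean(window_counts, library_size, scale=1_000_000):
    """Mean per-base coverage of a window, rescaled to `scale` reads.

    `window_counts` holds one raw coverage value per base in the window;
    `library_size` is the total number of reads in that library.
    """
    if library_size <= 0:
        raise ValueError("library_size must be positive")
    if not window_counts:
        raise ValueError("window_counts must not be empty")
    mean_cov = sum(window_counts) / len(window_counts)
    return mean_cov * scale / library_size
```

Scaling this way makes scores comparable across libraries of different depth, but it does not address GC-content bias, which is a separate correction.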