lzamparo / embedding

Learning semantic embeddings for TF binding preferences directly from sequence

Atlas QC work #17

Open lzamparo opened 6 years ago

lzamparo commented 6 years ago

Still need to do a bunch of work on the atlas; some peaks seem spurious.

lzamparo commented 6 years ago

I can repurpose code from the safe harbour project to do this, at least for the bw track generation.

lzamparo commented 6 years ago

Done. I still need to embed the work from the Google Sheet here. The upshot is that many peaks that look spurious still show up, and they can pass even a very stringent IDR filter. Possible next steps to reduce the number of spurious peaks:

  1. Tighten the threshold on meta-bam peak calling; it might be allowing too many regions that do not contain real peaks.
  2. I might have to apply a post-hoc heuristic for rejecting low-coverage peaks. I can look at the pan-celltype distribution of normalized peak heights genome-wide and at the per-celltype distributions of normalized peak heights. Hopefully some modality shows up in those distributions, and sensible thresholds emerge from it.
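
One way the second idea could be prototyped is sketched below. It is only a sketch under assumptions: it swaps in Otsu's method (pick the cut that maximizes between-class variance) as a stand-in for "find the valley between two modes", it assumes peak heights are already normalized and log-transformed, and the name `otsu_threshold` and the synthetic data are hypothetical, not code from this repo.

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Return the cut that maximizes between-class variance (Otsu's method).

    `values` is assumed to be a 1-D array of log-scale normalized peak
    heights (hypothetical input; not the atlas's actual data format).
    """
    values = np.asarray(values, dtype=float)
    counts, edges = np.histogram(values, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    total = counts.sum()

    # For a cut after bin k: weight of the low class and its cumulative mass.
    w_low = np.cumsum(counts) / total
    w_high = 1.0 - w_low
    cum_mass = np.cumsum(counts * centers) / total
    mu_total = cum_mass[-1]

    # Between-class variance, defined only where both classes are non-empty.
    valid = (w_low > 0) & (w_high > 0)
    between = np.zeros_like(w_low)
    between[valid] = (
        (mu_total * w_low[valid] - cum_mass[valid]) ** 2
        / (w_low[valid] * w_high[valid])
    )

    # Bins 0..k form the low class, so the cut is the upper edge of bin k.
    k = int(np.argmax(between))
    return edges[k + 1]

# Synthetic stand-in for a bimodal height distribution: a low-coverage
# mode (candidate spurious peaks) and a higher mode (candidate real peaks).
rng = np.random.default_rng(0)
log_heights = np.concatenate([
    rng.normal(0.0, 0.5, 3000),
    rng.normal(4.0, 0.5, 7000),
])
cut = otsu_threshold(log_heights)
kept = log_heights[log_heights > cut]
print(f"threshold={cut:.2f}, kept {kept.size} of {log_heights.size} peaks")
```

The same function could be run once on the pooled pan-celltype heights and once per celltype to see whether the cuts agree. If a distribution turns out to be unimodal, Otsu's method still returns a cut, so the histogram should be inspected before trusting the threshold.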

lzamparo commented 6 years ago

| Previous # peaks, IDR tournament (0.01) | New # peaks, IDR tournament (0.003) | Conservative # peaks, longest reproducible list (0.01) | Conservative # peaks, longest reproducible list (0.003) | Celltype |
|---|---|---|---|---|
| 49759 | 41848 | 49056 | 41525 | B cells |
| 70710 | 59102 | 55060 | 48651 | CD4+ T cells |
| 70569 | 59514 | 64248 | 55644 | CD8+ T cells |
| 53975 | 45137 | 48195 | 41404 | CLP |
| 180499 | 151685 | 143887 | 123415 | CMP |
| 21214 | 18300 | 19444 | 16983 | Erythroblasts |
| 144310 | 122212 | 117925 | 101505 | GMP |
| 154618 | 134102 | 142287 | 126323 | HSC |
| 109981 | 94705 | 98020 | 85327 | LMPP |
| 141581 | 118702 | 132164 | 113193 | MEP |
| 55818 | 48647 | 51187 | 45740 | Monocytes |
| 137018 | 117803 | 118257 | 103617 | MPP |
| 67788 | 57087 | 51818 | 45691 | NK cells |
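
To make the effect of the tighter cutoff easier to compare across celltypes, the counts above can be turned into the fraction of peaks removed. This is a small sketch, assuming the values in parentheses in the column headers are IDR thresholds; the names `rows` and `fraction_removed` are made up for illustration, and the numbers are copied from the table.

```python
# (celltype, tournament @0.01, tournament @0.003,
#  conservative @0.01, conservative @0.003), copied from the table above.
rows = [
    ("B cells", 49759, 41848, 49056, 41525),
    ("CD4+ T cells", 70710, 59102, 55060, 48651),
    ("CD8+ T cells", 70569, 59514, 64248, 55644),
    ("CLP", 53975, 45137, 48195, 41404),
    ("CMP", 180499, 151685, 143887, 123415),
    ("Erythroblasts", 21214, 18300, 19444, 16983),
    ("GMP", 144310, 122212, 117925, 101505),
    ("HSC", 154618, 134102, 142287, 126323),
    ("LMPP", 109981, 94705, 98020, 85327),
    ("MEP", 141581, 118702, 132164, 113193),
    ("Monocytes", 55818, 48647, 51187, 45740),
    ("MPP", 137018, 117803, 118257, 103617),
    ("NK cells", 67788, 57087, 51818, 45691),
]

def fraction_removed(before, after):
    """Fraction of peaks lost when moving from the looser to the tighter cutoff."""
    return (before - after) / before

print(f"{'Celltype':<14} {'tournament':>11} {'conservative':>13}")
for celltype, t_old, t_new, c_old, c_new in rows:
    print(
        f"{celltype:<14} "
        f"{fraction_removed(t_old, t_new):>11.1%} "
        f"{fraction_removed(c_old, c_new):>13.1%}"
    )
```

If the fractions come out similar across celltypes, that would suggest the tighter IDR cutoff trims a roughly constant share of peaks rather than targeting the celltypes where spurious peaks are concentrated.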