kundajelab / ChromDragoNN

Code for the paper "Integrating regulatory DNA sequence and gene expression to predict genome-wide chromatin accessibility across cellular contexts"
MIT License
44 stars, 11 forks

what's the reason behind these filter and padding schemes? #2

Closed: ccchang0111 closed this issue 5 years ago

ccchang0111 commented 5 years ago

Hello Surag,

Thanks for the great paper and the code! I really appreciate researchers sharing their code along with their work.

I see that the code uses filters of size (3, 1), (7, 1), or (1, 1). If I understand correctly, the sliding window along the 1000 bp axis has size 1, and the sliding window along the one-hot encoded axis has size 3, 7, or 1. Is that right? If so, why use a window size of only 1 along the sequence? I would have expected the opposite: a window of size 3 or larger scanning across the 1000 bp, with the same done for each one-hot encoded base.

Also, what is the reason to treat the one-hot encoded sequence as a 2D matrix? Is there an advantage over using Conv1D with 4 channels?

In Fig. 1 of your paper, the layers start out wide and short and gradually become long and skinny. What is the reason for this transformation? Why not just deepen the channels and shrink the width?

Thank you again for the insight,

Andrew

suragnair commented 5 years ago

Hi Andrew

  1. The sliding window runs along the 1000 bp sequence, so the 3, 7, or 1 in the filter size is the window length along the sequence, not along the one-hot axis.
  2. There's no advantage over using Conv1D; the two are effectively the same.
  3. Yes, that is indeed what is happening. The width is decreasing (it is 1000 initially and much less towards the end) and the number of channels is increasing (4 initially, then 48, 64, 200). Since the data is 1-dimensional, the channel depth is actually the height of the boxes in the figure.
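
To make point 2 concrete, here is a minimal pure-Python sketch (not the repository's code) that checks the equivalence on a toy example. It assumes the four one-hot rows are treated as input channels and that the 2D view simply adds a trailing axis of width 1, so a (3, 1) filter slides only along the sequence. The helper names `one_hot`, `conv1d`, and `conv2d`, the toy sequence, and the random weights are all hypothetical; in PyTorch terms the comparison corresponds roughly to `nn.Conv1d` with 4 input channels versus `nn.Conv2d` with a `(k, 1)` kernel.

```python
import random


def one_hot(seq):
    """Encode a DNA string as 4 channels (A, C, G, T), each of length len(seq)."""
    return [[1.0 if base == b else 0.0 for base in seq] for b in "ACGT"]


def conv1d(x, w):
    """Valid, stride-1 cross-correlation with one output filter.

    x: [C][L] input, w: [C][K] filter -> list of length L - K + 1.
    """
    C, L, K = len(x), len(x[0]), len(w[0])
    return [
        sum(x[c][i + k] * w[c][k] for c in range(C) for k in range(K))
        for i in range(L - K + 1)
    ]


def conv2d(x, w):
    """Valid, stride-1 2D cross-correlation with one output filter.

    x: [C][H][W] input, w: [C][KH][KW] filter -> [H - KH + 1][W - KW + 1].
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    KH, KW = len(w[0]), len(w[0][0])
    return [
        [
            sum(
                x[c][i + a][j + b] * w[c][a][b]
                for c in range(C)
                for a in range(KH)
                for b in range(KW)
            )
            for j in range(W - KW + 1)
        ]
        for i in range(H - KH + 1)
    ]


random.seed(0)
seq = "ACGTTGCAAC"  # toy sequence, 10 bp instead of 1000 bp
K = 3               # window length along the sequence

x1d = one_hot(seq)                                                   # [4][10]
w1d = [[random.uniform(-1, 1) for _ in range(K)] for _ in range(4)]  # [4][3]

# 2D view of the same data: add a trailing axis of width 1,
# so the filter has shape (3, 1) per channel.
x2d = [[[v] for v in channel] for channel in x1d]                    # [4][10][1]
w2d = [[[v] for v in channel] for channel in w1d]                    # [4][3][1]

out1d = conv1d(x1d, w1d)
out2d = [row[0] for row in conv2d(x2d, w2d)]  # drop the width-1 axis

same = all(abs(a - b) < 1e-12 for a, b in zip(out1d, out2d))
print(len(out1d), same)  # prints: 8 True
```

Because the second spatial axis has width 1 and the filter's second dimension is also 1, the 2D filter has nowhere to move except along the sequence, which is why the two formulations give the same output.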

ccchang0111 commented 5 years ago

Got it. Thanks for the clarification!