r3fang / SnapATAC

Analysis Pipeline for Single Cell ATAC-seq
GNU General Public License v3.0
300 stars 125 forks

Peak normalization #137

Open qingnanl opened 4 years ago

qingnanl commented 4 years ago

Dear developer, I used runMACS to get the aggregated peaks for each cluster, and my question is whether the peaks are normalized by the number of cells or the total reads in the cluster. For example, if I want to show differential peak-to-gene connections and the cell numbers differ substantially among clusters, the overall peak values already differ substantially. I am therefore concerned that the comparison may be biased (at least in visualization). Would it be possible to normalize after the peaks are aggregated?

Thanks for creating such a great package: it is efficient, versatile and well documented!
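To make the request concrete, here is a minimal sketch of one common form of post-aggregation normalization: scaling each cluster's per-peak read counts to counts per million (CPM) using the total reads in that cluster. This is not part of SnapATAC; the function name `counts_per_million`, the peak coordinate, and all counts are hypothetical and only illustrate the arithmetic.

```python
def counts_per_million(peak_counts, total_reads):
    """Scale raw per-peak read counts to counts per million (CPM)."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    scale = 1_000_000 / total_reads
    return {peak: count * scale for peak, count in peak_counts.items()}


# Hypothetical aggregated counts for the same peak in two clusters of different size.
large_cluster = {"chr1:1000-1500": 400}  # assumed 2,000,000 reads in this cluster
small_cluster = {"chr1:1000-1500": 100}  # assumed 500,000 reads in this cluster

large_cpm = counts_per_million(large_cluster, total_reads=2_000_000)
small_cpm = counts_per_million(small_cluster, total_reads=500_000)

# Raw counts differ 4-fold, but the depth-normalized values are equal.
print(large_cpm["chr1:1000-1500"], small_cpm["chr1:1000-1500"])  # 200.0 200.0
```

Scaling of this kind only makes signal values comparable for visualization; it does not change which peaks are called in each cluster.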

r3fang commented 4 years ago

Hello,

Thank you! Unfortunately, the peak calling is not normalized by cell number or read depth. As a result, a cluster with more cells will likely yield more peaks, so the total number of peaks may not be directly comparable between clusters. The reason is that "minor" clusters do not have enough reads/cells for robust peak calling. In my opinion, the easiest solution is to sequence more cells to saturate the peak calling, so that the signal is robust enough to identify the full set of peaks for each cluster.

On the other hand, if you believe the difference in peak number between clusters is biologically relevant rather than a technical artifact, you can perform a down-sampling analysis to demonstrate that a lower number of cells does not reduce the peak number.
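The cell-selection step of such a down-sampling analysis could be sketched as follows. This is only an illustration, not SnapATAC code: the function `downsample_clusters`, the cluster names, and the barcode labels are all hypothetical. It assumes that equalizing the number of cells per cluster is the desired form of down-sampling, and it defaults to the size of the smallest cluster.

```python
import random


def downsample_clusters(cluster_to_barcodes, n_cells=None, seed=0):
    """Randomly keep the same number of cells (barcodes) in every cluster."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    if n_cells is None:
        # Default: match the smallest cluster.
        n_cells = min(len(b) for b in cluster_to_barcodes.values())
    sampled = {}
    for cluster, barcodes in cluster_to_barcodes.items():
        if len(barcodes) < n_cells:
            raise ValueError(f"{cluster} has fewer than {n_cells} cells")
        # Sample without replacement, then sort for a stable output order.
        sampled[cluster] = sorted(rng.sample(list(barcodes), n_cells))
    return sampled


# Hypothetical clusters with very different cell numbers.
clusters = {
    "cluster_1": [f"cell_{i}" for i in range(1000)],
    "cluster_2": [f"cell_{i}" for i in range(1000, 1150)],
}

equalized = downsample_clusters(clusters)
print({name: len(cells) for name, cells in equalized.items()})
# {'cluster_1': 150, 'cluster_2': 150}
```

Peaks would then be re-called on the selected cells of each cluster and the peak numbers compared. Repeating the selection with several seeds gives a sense of how much the result depends on which cells were drawn.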

qingnanl commented 4 years ago

Thanks for the reply. I think downsampling may be a good idea. Thanks!