Binarization needed for the count matrix?

Yoshi-MutoLab commented 5 years ago

Hi, my understanding is that a chromatin region has only two states, open or closed (1 or 0), in most areas. Different levels of read counts might largely reflect amplification bias rather than true biology. Do you think we need to binarize the count matrix before TF-IDF/SVD processing? Thanks! Yoshi

timoast commented 5 years ago

I think this will depend on how the data is processed to generate counts. For 10x data processed using cellranger, PCR duplicates are collapsed, so any counts >1 should be due to >1 Tn5 integration events within the region. If you have used other methods that don't collapse PCR duplicates, it probably makes more sense to binarize the counts. Signac has a function called BinarizeCounts to do this (https://satijalab.org/signac/reference/BinarizeCounts.html).
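To make the operation concrete, here is a minimal Python sketch of what binarizing a count matrix means: every nonzero entry becomes 1. It is an illustration only, not Signac's BinarizeCounts implementation (which is an R function); the helper name `binarize_counts` and the toy matrix are hypothetical.

```python
def binarize_counts(matrix):
    """Return a copy of a peak-by-cell count matrix with every nonzero entry set to 1."""
    return [[1 if count > 0 else 0 for count in row] for row in matrix]


# Hypothetical toy matrix: rows are peaks, columns are cells.
counts = [
    [0, 2, 1],
    [3, 0, 0],
]

binary = binarize_counts(counts)
print(binary)  # [[0, 1, 1], [1, 0, 0]]
```

The original matrix is left untouched, so the raw and binarized versions can be compared downstream.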

Yoshi-MutoLab commented 5 years ago

Thanks Tim, I agree. Several fragments from different sites could fall within one peak detected by CellRangerATAC. I really appreciate your suggestion. Yoshi

Puriney commented 4 years ago

I suggest (always) binarizing the fragment count matrix, because a peak is a genomic region that tends to be open in a subgroup of cells. Here, I always assume that duplicated reads have been collapsed to fragment counts, i.e., Tn5 cut points.

(sc)ATAC-seq is different from ChIP-seq. Although both use MACS2 for peak calling, the ATAC-seq signal indicates that a region is either open or closed, so it is a discrete variable, as ymuto802 said. A higher ChIP-seq signal, by contrast, suggests higher affinity, so that signal is a continuous variable.
A peak should have >1 Tn5 cut, otherwise it cannot be called a peak. Peak calling is not done per cell; instead, the scATAC-seq data is treated as bulk ATAC-seq. The 10X pipeline first runs MACS2 on the aggregated fragments to determine peaks, and then counts the fragments within each peak for each cell. As a side note, the Greenleaf lab uses a two-step peak-calling approach to identify peaks in rare cells: determine peaks for each coarsely clustered group, and then merge those peaks. In sum, a peak is not short, so it is not surprising to see >1 fragment in a peak in a single cell.
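The second step described above (counting fragments per peak per cell, once peaks have been called on the aggregated data) can be sketched roughly as follows. This is an assumption-laden illustration, not the 10X or Signac code: the function name, the tuple layout, and the toy coordinates are hypothetical, and it counts whole-fragment overlaps, whereas some pipelines count Tn5 cut sites instead.

```python
from collections import defaultdict


def count_fragments_in_peaks(fragments, peaks):
    """Count, per (peak, cell), the fragments that overlap the peak.

    fragments: iterable of (chrom, start, end, cell_barcode), half-open coordinates.
    peaks: list of (chrom, start, end) called on the aggregated (pseudo-bulk) data.
    """
    counts = defaultdict(int)
    for chrom, start, end, cell in fragments:
        for peak in peaks:
            p_chrom, p_start, p_end = peak
            # Two half-open intervals overlap if each starts before the other ends.
            if chrom == p_chrom and start < p_end and end > p_start:
                counts[(peak, cell)] += 1
    return dict(counts)


# Hypothetical toy data: one 1 kb peak and four fragments from two cells.
peaks = [("chr1", 1000, 2000)]
fragments = [
    ("chr1", 1050, 1200, "cellA"),
    ("chr1", 1600, 1850, "cellA"),  # second distinct fragment in the same peak
    ("chr1", 1900, 2100, "cellB"),  # partially overlaps the peak
    ("chr1", 5000, 5200, "cellB"),  # outside any peak, not counted
]

peak_counts = count_fragments_in_peaks(fragments, peaks)
print(peak_counts)
```

In this toy case cellA ends up with a count of 2 for a single peak, which is the situation the paragraph above describes.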

timoast commented 4 years ago

I don't agree, for the following reasons:

The PCR duplicates are collapsed, so any counts >1 are unlikely to be due to technical artefacts. The motivation for collapsing further is therefore unclear.
If the genome is diploid (as in human, mouse, and many other organisms), there are 2 genome copies for most cell types, and so there can be real accessibility counts >1 for any given locus.
A given peak region can be wide enough to encompass multiple DNA fragments from the same cell, corresponding to a real count >1.
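As a rough illustration of the first point, the sketch below collapses PCR duplicates and shows that a count >1 can survive when it comes from distinct fragments. It assumes, for illustration only, that a duplicate is a fragment with identical chromosome, start, end, and cell barcode; real pipelines may define duplicates differently, and the function name and data are hypothetical.

```python
def collapse_duplicates(fragments):
    """Keep one copy of each (chrom, start, end, cell_barcode) fragment, preserving order."""
    seen = set()
    unique = []
    for frag in fragments:
        if frag not in seen:
            seen.add(frag)
            unique.append(frag)
    return unique


# Hypothetical reads from one cell that all fall within the same peak.
reads = [
    ("chr1", 1050, 1200, "cellA"),
    ("chr1", 1050, 1200, "cellA"),  # same coordinates and barcode: treated as a PCR duplicate
    ("chr1", 1600, 1850, "cellA"),  # different coordinates: a distinct fragment
]

unique_fragments = collapse_duplicates(reads)
print(len(unique_fragments))  # 2
```

After collapsing, the cell still has two fragments in the peak, so the remaining count of 2 reflects two separate fragments rather than amplification.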

Puriney commented 4 years ago

My motivation is to treat all peaks fairly in a diploid genome.

Take an example. In a given cell, if a 1kb-wide peak (peak-A) has a higher fragment count than another 1kb-wide peak (peak-B), does that mean peak-A is more "open" than peak-B? I don't think so. To me, both counts support the regions being open, and I would treat peak-A and peak-B the same. This is different from UMIs per gene. In addition, I expect peak length and the number of fragments a peak contains to be strongly positively correlated.

(image from 10x)

A peak should have >1 Tn5 cuts or >1 fragments.

When converting scATAC signals to gene activity without binarizing counts, one harm I can anticipate is that it will inflate the signal for being open/positive, or produce inflated z-scores, as shown in the UMAP.

However, there are cases where comparing counts as a continuous variable feels natural. For example, in heterogeneous bulk ATAC-seq, if peak-A has a higher number of fragments than peak-B, I would interpret that as more cells tending to be open at peak-A than at peak-B.
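This bulk-style reading can be expressed with a binarized single-cell matrix: the mean of each row is the fraction of cells in which the peak is open. A minimal sketch, with a hypothetical function name and toy values:

```python
def fraction_of_cells_open(binary_matrix):
    """For each peak (row), return the fraction of cells (columns) with a nonzero entry."""
    return [sum(row) / len(row) for row in binary_matrix]


# Hypothetical binarized matrix: rows are peaks, columns are four cells.
binary = [
    [1, 1, 1, 0],  # peak-A: open in 3 of 4 cells
    [1, 0, 0, 0],  # peak-B: open in 1 of 4 cells
]

open_fraction = fraction_of_cells_open(binary)
print(open_fraction)  # [0.75, 0.25]
```

Under this view, a higher aggregate signal at peak-A reflects more cells being open there, even though each individual cell contributes only 0 or 1.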

Another case is cancer. For a given cell, if peak-A is present in multiple copies and all of them are open, whereas peak-B has only one copy, I would hesitate to binarize the fragment counts. Binarizing would then be unfair to peak-A.

stuart-lab / signac

Binarization needed for the count matrix? #25