stuart-lab / signac

R toolkit for the analysis of single-cell chromatin data
https://stuartlab.org/signac/

FeatureMatrix() should count cut sites? #1369

Open nesetozel opened 1 year ago

nesetozel commented 1 year ago

Dear Tim,

I noticed that when I generated a new peak/barcode matrix with an expanded peak set using Signac, I got significantly lower nCount_ATAC values compared to the original matrix I had from CellRanger. I was very confused until I found this: https://github.com/stuart-lab/signac/issues/1119

First, I was wondering whether there was a reason for this choice (counting fragments rather than cut sites)? It probably doesn't make much difference for anything downstream of LSI, but cut sites still seem like a better proxy for accessibility to me, so why wouldn't we want to count a fragment twice for a peak it falls completely within, as opposed to once for a fragment whose other end lies outside the peak? Would it be possible to add this as an option for the function in a future release?

I also wanted to mention that I've been using Signac for quite a while and didn't know about this until now. Since many of your users probably get their data from CellRanger, it would be good for the documentation to state this behavior (and the fact that it differs from CellRanger) more clearly; right now it isn't obvious.
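To make the difference concrete, here is a minimal R sketch of the two schemes (not Signac's actual implementation; the peak and fragment coordinates are made up, and the fragment-counting rule is my understanding from #1119):

```r
# Minimal sketch contrasting the two counting schemes for a single peak.
# A "fragment" deposits its two Tn5 cut sites at its start and end positions.

peak <- c(start = 1000, end = 1500)

# Three toy fragments: both ends inside the peak, one end inside,
# and neither end inside (but the fragment still spans the peak).
frags <- data.frame(
  start = c(1100,  900,  800),
  end   = c(1300, 1200, 1700)
)

in_peak <- function(pos) pos >= peak["start"] & pos <= peak["end"]

# Fragment counting (FeatureMatrix() behavior as I understand from #1119):
# +1 for any fragment that overlaps the peak at all.
fragment_counts <- with(frags, as.integer(start <= peak["end"] & end >= peak["start"]))

# Cut-site counting (CellRanger-style): +1 for each fragment end inside the
# peak, so a fragment fully contained in the peak contributes 2.
cutsite_counts <- with(frags, as.integer(in_peak(start)) + as.integer(in_peak(end)))

cbind(frags, fragment = fragment_counts, cutsites = cutsite_counts)
#   start  end fragment cutsites
# 1  1100 1300        1        2
# 2   900 1200        1        1
# 3   800 1700        1        0
```

The third fragment is the spanning case discussed further down the thread: it overlaps the peak but deposits no cut site inside it.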

Thank you, Neset

timoast commented 1 year ago

Hi, I agree the documentation could be clearer here. Going forward, we would like to expose a parameter so that users can choose the counting method. There are other approaches that may be better, for example paired insertion counting: https://www.biorxiv.org/content/10.1101/2022.04.20.488960v1
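As I read the preprint, paired insertion counting (PIC) assigns each fragment a count of 1 if at least one of its two insertion sites falls within the region, and 0 otherwise. A rough sketch of that rule (my reading, not the authors' code), reusing the toy peak and fragments from the sketch above:

```r
# Rough sketch of paired insertion counting (PIC): a fragment counts once if
# at least one of its two insertion sites lies within the region.

peak  <- c(start = 1000, end = 1500)
frags <- data.frame(start = c(1100, 900, 800), end = c(1300, 1200, 1700))

in_peak <- function(pos) pos >= peak["start"] & pos <= peak["end"]

pic_counts <- with(frags, as.integer(in_peak(start) | in_peak(end)))
pic_counts
# [1] 1 1 0  -- both-ends-in and one-end-in fragments each count once;
#             a fragment spanning the peak with both ends outside is not counted.
```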

nesetozel commented 1 year ago

Thanks Tim, that would be very useful! In the meantime, I'll look into ArchR to do this.

PIC seems to address the most obvious issue with counting fragments: the rare long fragments whose ends both fall outside a peak, which arguably shouldn't be counted at all. But to be honest, it's still not clear to me why this "artifact of depleted odd numbers" is a problem when counting insertions, or why it's even considered an artifact. The paper does argue that the data are easier to model as Poisson when most counts are 1s rather than 2s (which is what you get when counting insertions). I can see how that might improve some downstream applications, but I'm still worried that the aggregate counts will be skewed, especially for anything that involves pseudobulking the data.
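For anyone following along, here is a toy simulation of the depletion itself (purely illustrative assumptions about fragment positions, sizes, and per-cell depth; not taken from the paper): when most fragments fall entirely within a peak, each contributes 2 insertions, so per-cell totals under insertion counting come out predominantly even.

```r
# Toy simulation of the "depleted odd numbers" effect under insertion counting.
# Assumptions are illustrative: one peak, ~2 fragments per cell, and fragment
# coordinates chosen so most fragments lie entirely inside the peak.
set.seed(1)

peak    <- c(start = 1000, end = 1500)
n_cells <- 5000

totals <- replicate(n_cells, {
  n_frag <- rpois(1, lambda = 2)                # fragments per cell in this peak
  if (n_frag == 0) return(0L)
  starts <- sample(950:1350, n_frag, replace = TRUE)
  ends   <- starts + sample(50:150, n_frag, replace = TRUE)
  # cut-site counting: each fragment end inside the peak adds 1
  sum((starts >= peak["start"] & starts <= peak["end"]) +
      (ends   >= peak["start"] & ends   <= peak["end"]))
})

table(totals)  # even totals (2, 4, ...) dominate; odd totals are depleted
```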