r3fang / SnapATAC

Analysis Pipeline for Single Cell ATAC-seq
GNU General Public License v3.0
305 stars 126 forks source link

Converting pmat to Cicero CDS #42

Closed Austin-s-h closed 5 years ago

Austin-s-h commented 5 years ago

Hello, I'm sorry to bother you again. I'm having some trouble exporting the GRanges object of my peak matrix (pmat) from my snap object to a format usable in Cicero.

From their documentation, they want a 3-column file in the following format. chr1_start_end | cellBarcode | readsInPeak

I know that this data is available in the object, but I'm not sure how to compile it all correctly while maintaining the cell barcodes. If you could help me in retrieving this information that would be very helpful. Additionally, I think this would be good to include in some of your examples as this may help in integrating other tools with yours.

Thank you!

r3fang commented 5 years ago

Hi there,

barcode x.sp@barcode peak position x.sp@peak SnapATAC does not provide readsInPeak, can you tell me a bit more about this column?

I am happy to get SnapATAC compatible with other toolset if it useful, for instance, chromVAR has been included. Will take sometime to test Cicero.

Best, -R

Austin-s-h commented 5 years ago

Following https://cole-trapnell-lab.github.io/cicero-release/docs/#loading-your-data I need a way to retain the information so that there are multiple instances of the peak (from the consensus x.sp@peak) labeled with which cell and the reads (from BAMs or the snap object?) that that individual cell had in that peak.

So... chr1_peak1 | cell1 | 1 chr1_peak1 | cell 2 | 0 chr1_peak2 | cell1 | 6 ...etc

So while x.sp@barcode provides the list of barcodes in my object, it is formatted in such a way that it is associated with all of the peaks from that cell. Does this make sense?

r3fang commented 5 years ago

Yes, you are right. It is the peak i and cell j and number of reads in cell j that overlap with peak i.

It will take some work to re-formulate a snap object as the input for Cicero. Currently it is not what I am planning to do in the short term. Sorry.

Austin-s-h commented 5 years ago

Well, I believe I have found a solution. Feel free to implement this within your package (I don't know how github works enough to do it myself).

#Turn peak Grange into a dataframe
df <- data.frame(seqnames=seqnames(peaks.gr),
                 starts=start(peaks.gr)-1,
                 ends=end(peaks.gr),
                 names=c(rep(".", length(peaks.gr))),
                 scores=c(rep(".", length(peaks.gr))),
                 strands=strand(peaks.gr))

#Turn sparse matrix into matrix, then transpose and turn to dataframe
peak.matrix <- as(object = x.sp@pmat, Class = 'matrix')
peak.dataframe <- as.data.frame(t(peak.matrix))

#assign column names corresponding to the cell barcodes.
colnames(peak.dataframe) <- x.sp@barcode

#peek at it.
peak.dataframe[0:5,0:5]

#Assign rownames (they'll be lost), and a new column as a factor of the peaks in Cicero format.
rownames(peak.dataframe) <- paste0(df$seqnames,"_",df$starts,"_",df$ends)
peak.dataframe$peak <- factor(rownames(peak.dataframe))

#check to make sure that it is a factor.
peak.dataframe[0:5,ncol(peak.dataframe)]

#Now, convert this to long format suitable for cicero input

library(tidyr)
peak.long <- gather(peak.dataframe, key = cell, reads, colnames(peak.dataframe)[1]:colnames(peak.dataframe)[ncol(peak.dataframe)-1], factor_key = TRUE)
head(peak.long)
tail(peak.long)

I have to run some sanity checks for my assumptions that everything is kept in the correct order, but I think that this should work :)

r3fang commented 5 years ago

Many thanks for sharing! I will check it out and integrate into the pipeline when I get some time. Again. thanks for sharing

alfonsosaera commented 4 years ago

Hi,

I am interested in this too. @r3fang, did you already included in SnapATAC?

Thanks!