mckellardw / scMuscle

The Cornell Single-Cell Muscle Project (scMuscle) aims to collect, analyze and provide to the research community skeletal muscle transcriptomic data

about raw counts without normalization #5

Closed (xflicsu closed this issue 2 years ago)

xflicsu commented 2 years ago

Hello @mckellardw! Thanks for sharing the large integrated dataset of muscle scRNA-seq. It is a very valuable reference for further analysis. I downloaded the raw Seurat object and found that the matrix was normalized. How can I get the raw counts for each cell? Thanks for your help!

mckellardw commented 2 years ago

The data is saved in a Seurat object. The best way to get raw counts is to pull them with the GetAssayData() function:

counts <- GetAssayData(
     scMuscle.seurat,
     assay='RNA',
     slot='counts'
)

This will give you a sparse matrix containing the raw counts, stored in the variable counts here.

xflicsu commented 2 years ago

Thanks for your response! I followed your suggestion but may not have gotten the raw counts, as shown below. First, the only Seurat object I get is named "scMuscle.slim.seurat". Second, the read counts may have been normalized to ln(cp10k+1) with the NormalizeData function. Is there any other way to get the raw counts?

By the way, which version of the transcriptome reference did you use for the scRNA-seq analysis? I also used "Mus_musculus.GRCm38.93.gtf" to calculate gene expression, but I found that some gene names did not overlap with the scMuscle gene names.

#######################

library("Seurat")
load("scMuscle_mm10_slim_v1-1.RData")
counts <- GetAssayData(
     scMuscle.slim.seurat,
     assay='RNA',
     slot='counts'
)
head(counts)
6 x 365011 sparse Matrix of class "dgCMatrix"
   [[ suppressing 34 column names ‘Uninjured_WT_1_AAACCTGAGATAGTCA’, ‘Uninjured_WT_1_AAACCTGAGCGTAGTG’, ‘Uninjured_WT_1_AAACCTGAGCTTATCG’ ... ]]

Xkr4   .         . . . . . . . . .         . . .         . .         .
Sox17  .         . . . . . . . . .         . . .         . .         .
Mrpl15 .         . . . . . . . . 0.9897291 . . 0.9665585 . 0.9271997 .
Lypla1 .         . . . . . . . . .         . . .         . 0.9070385 .
Tcea1  0.9638307 . . . . . . . . 0.9888475 . . .         . 0.9557054 .
Gm6104 .         . . . . . . . . .         . . .         . .         .
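
A quick way to confirm that a matrix like the one above holds non-integer values is to inspect its stored entries directly. The following is a minimal sketch, not code from the scMuscle repo: the helper name is_integer_valued and the toy matrix are illustrative assumptions, with values copied from the Mrpl15 and Tcea1 rows of the printed output.

```r
library(Matrix)

# Hypothetical helper: TRUE when every stored (non-zero) entry of a
# sparse dgCMatrix is a whole number, within a small tolerance.
is_integer_valued <- function(m, tol = 1e-8) {
  x <- m@x  # stored non-zero values of the sparse matrix
  all(abs(x - round(x)) < tol)
}

# Toy stand-in for the matrix printed above (not the real dataset).
toy <- sparseMatrix(
  i = c(1, 1, 2),
  j = c(1, 3, 2),
  x = c(0.9897291, 0.9665585, 0.9638307),
  dims = c(2, 3),
  dimnames = list(c("Mrpl15", "Tcea1"), c("cell1", "cell2", "cell3"))
)

# Same sparsity pattern, but with whole-number values.
toy_int <- toy
toy_int@x <- round(toy_int@x)

is_integer_valued(toy)      # FALSE
is_integer_valued(toy_int)  # TRUE
```

The same helper could be applied to the counts object returned by GetAssayData(), since that is also a dgCMatrix.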

mckellardw commented 2 years ago

Yes, sorry, scMuscle.slim.seurat doesn't contain any of the nearest-neighbor graphs, to save disk space and speed up downloads. Otherwise, it is the same data I used in my analysis.

The reason the counts are not integers is that I used SoupX to remove ambient RNA signal from these data. This step is important for skeletal muscle samples because of the tissue dissociation methods used to prepare samples. The output count matrices were generated with the automated workflow described in the linked repo, and you can find the code I used in R_scripts/scMuscle_github_v1.R.
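
If a downstream tool requires whole-number counts, the corrected values can be rounded, although this does not recover the original raw counts. Below is a minimal sketch of stochastic rounding on a sparse matrix. It is not the workflow used for scMuscle; the function name round_stochastic and the toy matrix are assumptions for illustration. As far as I know, SoupX's adjustCounts() also has a roundToInt argument, which is likely the simpler route when re-running the correction yourself.

```r
library(Matrix)

# Hypothetical helper: each stored value x becomes ceiling(x) with
# probability x - floor(x), and floor(x) otherwise.
round_stochastic <- function(m) {
  frac <- m@x - floor(m@x)
  m@x <- floor(m@x) + rbinom(length(m@x), size = 1, prob = frac)
  drop0(m)  # remove entries that were rounded down to zero
}

# Toy matrix of ambient-corrected values (illustrative, not real data).
toy_corrected <- sparseMatrix(
  i = c(1, 2, 2),
  j = c(1, 1, 2),
  x = c(0.5, 2.0, 3.25),
  dims = c(2, 2)
)

set.seed(1)  # make the random rounding reproducible
toy_rounded <- round_stochastic(toy_corrected)
```

Stochastic rounding keeps the expected value of each entry unchanged, whereas plain round() would systematically push values such as 0.96 up to 1.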

For the non-overlapping gene names, it may be that you are using GENCODE annotations (which tend to contain more genes), while we used UCSC annotations. I uploaded the .gtf file I used to generate the count matrices, but another user has also had issues with non-overlapping genes. If you try to re-analyze your data with these annotations, do you mind telling me which version of cellranger you used to align?
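
To see how far two annotations diverge, the gene names of the two matrices can be compared with base R set operations. This is a minimal sketch with made-up gene vectors: in practice genes_mine and genes_scmuscle would be the rownames() of the two count matrices, and "HypotheticalGeneA" is a placeholder, not a real gene.

```r
# Illustrative gene-name vectors (assumptions, not the real annotations).
genes_mine <- c("Xkr4", "Sox17", "Mrpl15", "Gm6104", "HypotheticalGeneA")
genes_scmuscle <- c("Xkr4", "Sox17", "Mrpl15", "Lypla1", "Tcea1", "Gm6104")

shared <- intersect(genes_mine, genes_scmuscle)         # present in both
only_mine <- setdiff(genes_mine, genes_scmuscle)        # missing from scMuscle
only_scmuscle <- setdiff(genes_scmuscle, genes_mine)    # missing from my data

cat(length(shared), "shared;",
    length(only_mine), "only in mine;",
    length(only_scmuscle), "only in scMuscle\n")
```

Subsetting both matrices to the shared names before merging is one simple option, at the cost of dropping the genes unique to either annotation.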

xflicsu commented 2 years ago

Thank you for this detailed description. You used cellranger 3.1.0, as described in the methods of your paper. I am trying to combine your data with my own datasets. Since the scMuscle dataset is huge, could you provide the raw counts or a Seurat object for each source separately? The raw counts may also be useful to others for further analysis. Thanks again!

mckellardw commented 2 years ago

Unfortunately, I have had to clear up space on our server, and since we published I no longer have these data on hand. I have provided all of the SRA download info in scMuscle/supplemental_data/sample_metadata_SupFile1.csv; if you would like to regenerate the raw count matrices, the information you need should all be there. Otherwise, I'm happy to help you use SoupX to clean up your count matrices (the code I used is in this repo).