Download single cell datasets

stemangiola commented 3 years ago

If you dig into their GitHub repository, you will likely find the data already summarised for you in the form of R scripts for reproducibility, so you can avoid 95% of the work of getting raw single-cell data.

Used by this article: https://www.pnas.org/content/118/22/e2100293118

Segerstolpe A, Palasantza A, Eliasson P, Andersson EM, Andreasson AC, Sun X, Picelli S, Sabirsh A, Clausen M, Bjursell MK, Smith DM, Kasper M, Ammala C, Sandberg R. Single-Cell Transcriptome Profiling of Human Pancreatic Islets in Health and Type 2 Diabetes. Cell Metab. 2016; 24(4):593–607.

[ ] RNA counts (matrix with genes for rows and cells for columns, and integer transcript abundance as values; you also need the cell annotation dataset, to link cells with subjects).
[ ] Cluster counts (matrix with categories for rows and subjects for columns, and integer count as values)

Delile J, Rayon T, Melchionda M, Edwards A, Briscoe J, Sagner A. Single cell transcriptomics reveals spatial and temporal dynamics of gene expression in the developing mouse spinal cord. Development. 2019. https://doi.org/10.1242/dev.173807.

[x] RNA counts
[x] Cluster counts

Used by this article: https://www.pnas.org/content/118/22/e2100293118

M. Sade-Feldman et al., Defining T cell states associated with response to checkpoint immunotherapy in melanoma.

[ ] RNA counts
[ ] Cluster counts

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE120575

[x] RNA counts
[x] Cluster counts

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE122043

[x] RNA counts
[x] Cluster counts

K. Gupta et al., Single-cell analysis reveals a hair follicle dermal niche molecular differentiation trajectory that begins prior to morphogenesis. Dev. Cell 48, 17–31 (2019).

[ ] RNA counts
[ ] Cluster counts

X. Fan et al., Single cell and open chromatin analysis reveals molecular origin of epidermal cells of the skin. Dev. Cell 47, 21–37 (2018).

[ ] RNA counts
[ ] Cluster counts

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE102086

[x] RNA counts
[ ] Cluster counts

https://ndownloader.figshare.com/files/22927382

[ ] RNA counts
[ ] Cluster counts

R. L. Chua et al., Covid-19 severity correlates with airway epithelium–immune cell interactions identified by single-cell analysis. Nat. Biotechnol. 38, 970–979 (2020).

[ ] RNA counts
[ ] Cluster counts

M. Liao et al., Single-cell landscape of bronchoalveolar immune cells in patients with covid-19. Nat. Med. 26, 842–844 (2020)

[ ] RNA counts
[ ] Cluster counts

https://cells.ucsc.edu/covid19-balf/nCoV.rds

[x] RNA counts
[x] Cluster counts

https://singlecell.broadinstitute.org/single_cell/study/SCP263/aging-mouse-brain#/

[ ] RNA counts
[ ] Cluster counts

M. Ximerakis et al., Single-cell transcriptomic profiling of the aging mouse brain. Nat. Neurosci. 22, 1696–1708 (2019).

[ ] RNA counts
[ ] Cluster counts

Used by this article: https://www.biorxiv.org/content/10.1101/2020.12.14.422688v1.full

https://singlecell.broadinstitute.org/single_cell/study/SCP259

[x] RNA counts
[x] Cluster counts

https://github.com/zhangzlab/covid_balf

[ ] RNA counts
[ ] Cluster counts

https://singlecell.broadinstitute.org/single_cell/study/SCP44

[ ] RNA counts
[ ] Cluster counts

stemangiola commented 3 years ago

Hello @Kirito-Ma ,

this is the rough method to look for data

1) briefly read the article
2) look for a GitHub repository
3) read the README of the GitHub repository
4) see if there is code to reproduce the analyses
5) look at that code and see if there is data that includes transcript abundance
6) if yes, download the code and files, unzip, and take what you need
7) if not, look in the paper for data availability
8) go to that repository and try to understand what type of data it is (e.g. raw counts)
9) paste the link in this issue, so we start to see how much data is available

stemangiola commented 3 years ago

What data do we need?

1) a table with gene IDs as row names, cell IDs as column names, and transcript abundance (> 0, < 10000) as values.
2) a table with cell IDs as row names, and columns for cell type (e.g. T cell), sample ID, and factor of interest (e.g. healthy vs cancer, knock-out vs wild-type).
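As a minimal sketch of the two tables described above, in base R. All gene, cell and sample IDs and all values are invented for illustration, and the object names are hypothetical.

```r
# Table 1: genes in rows, cells in columns, integer transcript abundance as values.
counts <- matrix(
  c(1L, 3L, 12L,
    5L, 2L,  1L),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("GeneA", "GeneB"),             # gene IDs as row names
    c("cell_1", "cell_2", "cell_3")  # cell IDs as column names
  )
)

# Table 2: one row per cell, with cell type, sample ID and factor of interest.
cell_annotation <- data.frame(
  cell_type = c("T cell", "B cell", "T cell"),
  sample    = c("sample_1", "sample_1", "sample_2"),
  condition = c("healthy", "healthy", "cancer"),
  row.names = c("cell_1", "cell_2", "cell_3")
)

# Every cell in the count matrix should have exactly one annotation row.
stopifnot(setequal(colnames(counts), rownames(cell_annotation)))
```

The final check is the useful part: if the cell IDs of the two tables do not match, the data cannot link cells to samples.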

To contribute to another study, find data with this shape:

1) a table with sample IDs as row names, cell types (e.g. T cell) as column names, and counts (> 0, < 1000) as values (these are the number of cells in a sample for a cell type).
2) a table with sample IDs as row names, and the factor of interest as a column.
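When a study only provides a per-cell annotation, the two sample-level tables above can usually be derived from it. The following is a sketch under that assumption, with made-up IDs and hypothetical column names (`sample`, `cell_type`, `condition`).

```r
# Per-cell annotation (hypothetical IDs and labels).
cell_annotation <- data.frame(
  cell      = paste0("cell_", 1:6),
  sample    = c("sample_1", "sample_1", "sample_1", "sample_2", "sample_2", "sample_2"),
  cell_type = c("T cell", "T cell", "B cell", "T cell", "B cell", "B cell"),
  condition = c("healthy", "healthy", "healthy", "cancer", "cancer", "cancer")
)

# Table 1: samples in rows, cell types in columns, number of cells as values.
cluster_counts <- table(cell_annotation$sample, cell_annotation$cell_type)

# Table 2: one row per sample with its factor of interest.
sample_annotation <- unique(cell_annotation[, c("sample", "condition")])
rownames(sample_annotation) <- sample_annotation$sample

print(cluster_counts)
print(sample_annotation)
```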

stemangiola commented 3 years ago

useful databases

You are interested in those that have a sample design, i.e. where they test differences between conditions.

[ ] CellMarker: a manually curated resource of cell markers in human and mouse
[ ] scRNAseq Bioconductor package: gene-level counts for a collection of public scRNA-seq datasets, provided as SingleCellExperiment objects with cell- and gene-level metadata.
[ ] human cell atlas database
[ ] EMBL-EBI atlas
[ ] [PanglaoDB](https://panglaodb.se/) is a database for the scientific community interested in exploration of single cell RNA sequencing experiments from mouse and human. We collect and integrate data from multiple studies and present them through a unified framework.
[ ] scRNASeqDB database, which contains 36 human single cell gene expression data sets collected from Gene Expression Omnibus (GEO)
[ ] JingleBells: a repository of standardized single cell RNA-Seq datasets for analysis and visualization at the single cell level.
[ ] Broad single cell portal
[ ] The conquer (consistent quantification of external rna-seq data) repository is developed by Charlotte Soneson and Mark D Robinson at the University of Zurich, Switzerland. It is implemented in shiny and provides access to consistently processed public single-cell RNA-seq data sets.
[ ] A curated database reveals trends in single cell transcriptomics Valentine Svensson, Eduardo da Veiga Beltrame bioRxiv 742304; doi: https://doi.org/10.1101/742304

stemangiola commented 3 years ago

Hello @Kirito-Ma you downloaded 3 datasets but you ticked just one, could you update the ticks?

Kirito-Ma commented 3 years ago

Hi Stemangiola/Single_cell_outliers, I have updated my ticks. Thanks for reminding me.


stemangiola commented 3 years ago

[ ] https://advances.sciencemag.org/content/7/31/eabh2169

stemangiola commented 3 years ago

@ZijieGA this is a good one for you https://satijalab.org/seurat/articles/multimodal_reference_mapping.html

We load the reference (https://atlas.fredhutch.org/data/nygc/multimodal/pbmc_multimodal.h5seurat)

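# Install SeuratDisk from GitHub (needed once), then load the packages
remotes::install_github("mojaveazure/seurat-disk")
library(Seurat)
library(SeuratDisk)
# Load the downloaded reference; adjust the path to where the file was saved
reference <- LoadH5Seurat("../../pbmc_multimodal.h5seurat")
# Plot the reference UMAP, coloured by the level-2 cell type annotation
DimPlot(object = reference, reduction = "wnn.umap", group.by = "celltype.l2", label = TRUE, label.size = 3, repel = TRUE) + NoLegend()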

stemangiola commented 3 years ago

@Kirito-Ma @ZijieGA Please fill in https://docs.google.com/spreadsheets/d/1En7-UV0k0laDiIfjFkdn7dggyR7jIk3WH8QgXaMOZF0/edit#gid=0

(this is a huge database of single-cell studies, just for your knowledge https://docs.google.com/spreadsheets/d/17Z5j_Oxd21IEyQ1qZ_vXq9FBpG4YrFAq9naEifBEuFw/edit?usp=sharing)

Here are some human blood datasets that I added to your spreadsheet in another tab, "available".

stemangiola commented 3 years ago

Bring datasets to a common format

sample | cell_type | cell_cluster | dataset_id
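One way to get there is a small helper that maps each dataset's own column names onto the shared ones. This is only a sketch: the helper `to_common_format`, the two toy datasets and their column names are all hypothetical.

```r
# Two datasets with differently named columns (all names and values are made up).
dataset_a <- data.frame(
  biosample_id = c("s1", "s1", "s2"),
  cell_label   = c("T cell", "B cell", "T cell"),
  cluster      = c(0L, 1L, 0L)
)
dataset_b <- data.frame(
  donor          = c("d1", "d2"),
  annotation     = c("monocyte", "T cell"),
  seurat_cluster = c(3L, 1L)
)

# Map the columns of one dataset onto the shared schema and tag its origin.
to_common_format <- function(data, sample, cell_type, cell_cluster, dataset_id) {
  data.frame(
    sample       = as.character(data[[sample]]),
    cell_type    = as.character(data[[cell_type]]),
    cell_cluster = as.character(data[[cell_cluster]]),
    dataset_id   = dataset_id,
    stringsAsFactors = FALSE
  )
}

common <- rbind(
  to_common_format(dataset_a, "biosample_id", "cell_label", "cluster", "dataset_a"),
  to_common_format(dataset_b, "donor", "annotation", "seurat_cluster", "dataset_b")
)
print(common)
```

Storing `cell_cluster` as character avoids clashes between datasets that number their clusters and datasets that name them.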

ZijieGA commented 3 years ago

Hi, @stemangiola, just a few questions about the cell types. I found some datasets that contain T cells, B cells, plasma cells, mast cells, myeloid leukocytes and glial cells. I wonder if these cell types are too general, since we attempt to use a novel method to identify cell types. Should I try to find datasets with a more specific cell type annotation, for example one that distinguishes CD8+ T, CD4, effector T helper cells, CD16 monocytes, etc.?

stemangiola commented 3 years ago

Those are also helpful.


ZijieGA commented 3 years ago

Hi, when importing a dataset of considerable size (a few hundred MB), is there a convenient way to preview the data frame? For example, I have a txt.gz raw count file, and it takes forever to preview if I use

library(readr)
log_normalized_matrix_012020_txt <- read_table2("project/SCP256/data/log_normalized_matrix_012020.txt.gz")
View(log_normalized_matrix_012020_txt)

stemangiola commented 3 years ago

> Hi, when importing a dataset with a considerable size (a few hundreds Mb), is there any convenient way to preview the data frame? For example I have a txt.gz raw count data, it will take forever to preview if I use

For tables, you could go to the terminal and type


head path_to_very_big_table

stemangiola commented 3 years ago

Actually, if it is compressed, you can run this in the terminal

less path_to_very_big_table

And press "q" to exit the view.
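The same kind of preview can be done from within R by reading only the first lines or rows. Below is a minimal base R sketch; the temporary file and its content are made up so that the example is self-contained. With readr, the `n_max` argument serves the same purpose.

```r
# Write a small gzip-compressed, tab-separated table to a temporary file
# (toy content, standing in for a very big table).
path <- tempfile(fileext = ".txt.gz")
connection <- gzfile(path, "w")
writeLines(c("GENE\tcell_1\tcell_2", "GeneA\t0\t3", "GeneB\t5\t1", "GeneC\t2\t0"), connection)
close(connection)

# Peek at the first raw lines without parsing the whole file.
connection <- gzfile(path, "r")
first_lines <- readLines(connection, n = 2)
close(connection)
print(first_lines)

# Parse only the first rows into a data frame.
preview <- read.delim(path, nrows = 2)
print(preview)
```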

ZijieGA commented 3 years ago

create seurat object failed

I tried to create a Seurat object from a gigantic dataset (1.9 GB uncompressed), which in turn gives me a Seurat object with zero features and a small size (a few MB). I can confirm that the matrix contains the features. It seems the Seurat object is not produced properly?

`> CreateSeuratObject(counts = counts, min.cells = 3, min.genes = 200, project = "SCP256")
> SCP256seurat <- CreateSeuratObject(counts = counts, min.cells = 3, min.genes = 200, project = "SCP256")
Warning message:
In storage.mode(from) <- "double" : NAs introduced by coercion`

stemangiola commented 3 years ago

> In storage.mode(from) <- "double" : NAs introduced by coercion

What does Google say?
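In general, "NAs introduced by coercion" appears when text that is not a number is converted to numeric. One possible cause here (an assumption, the thread does not confirm it) is that the counts object still carries non-numeric content such as a gene name column. A toy sketch of how to locate such columns:

```r
# A count table read from text, where the gene names ended up as a regular column
# (toy data; the real cause in the thread is not known).
counts_table <- data.frame(
  GENE   = c("GeneA", "GeneB"),
  cell_1 = c("0", "5"),
  cell_2 = c("3", "1"),
  stringsAsFactors = FALSE
)

# Coercing everything, including the GENE column, triggers the same warning
# (suppressed here so that the example runs quietly).
as_text <- as.matrix(counts_table)
coerced <- suppressWarnings(
  matrix(as.numeric(as_text), nrow = nrow(as_text), dimnames = dimnames(as_text))
)

# Columns that turned into NA point at the non-numeric content.
columns_with_na <- colnames(coerced)[colSums(is.na(coerced)) > 0]
print(columns_with_na)
```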

ZijieGA commented 3 years ago

Update on datasets

I have found 2 useful datasets with cell type annotations:

`> bc_cells

# A Seurat-tibble abstraction: 100,064 × 6
# Features=29733 | Active assay=originalexp | Assays=originalexp
   cell                     orig.ident nCount_originale… nFeature_origin… Sample    Barcode  
   <chr>                    <fct>                  <dbl>            <int> <chr>     <chr>    
 1 CID3586_AAGACCTCAGCATGAG CID3586                 859.              564 project/… CID3586_…
 2 CID3586_AAGGTTCGTAGTACCT CID3586                 619.              276 project/… CID3586_…
 3 CID3586_ACCAGTAGTTGTGGCC CID3586                 444.              169 project/… CID3586_…
 4 CID3586_ACCCACTAGATGTCGG CID3586                 450.              182 project/… CID3586_…
 5 CID3586_ACTGATGGTCAACTGT CID3586                 589.              269 project/… CID3586_…
 6 CID3586_ACTTGTTAGGGAAACA CID3586                 597.              256 project/… CID3586_…
 7 CID3586_AGCAGCCTCCCTCTTT CID3586                 441.              169 project/… CID3586_…
 8 CID3586_AGCTTGATCGGCGCTA CID3586                 669.              302 project/… CID3586_…
 9 CID3586_ATCATCTAGGGATACC CID3586                 607.              272 project/… CID3586_…
10 CID3586_ATGGGAGAGGAGCGAG CID3586                 717.              330 project/… CID3586_…
# … with 100,054 more rows`

another one:

`> rc_cells
# A Seurat-tibble abstraction: 39,391 × 6
# Features=60627 | Active assay=originalexp | Assays=originalexp
   cell                 orig.ident    nCount_originale… nFeature_origina… Sample     Barcode 
   <chr>                <fct>                     <dbl>             <int> <chr>      <chr>   
 1 AAACCTGAGAATAGGG.p55 SeuratProject              1432               771 project/r… AAACCTG…
 2 AAACCTGAGGCTAGGT.p55 SeuratProject              1797               865 project/r… AAACCTG…
 3 AAACCTGCACTGTGTA.p55 SeuratProject              2071               987 project/r… AAACCTG…
 4 AAACCTGCAGTCCTTC.p55 SeuratProject               682               368 project/r… AAACCTG…
 5 AAACCTGGTAAATGTG.p55 SeuratProject              2915              1191 project/r… AAACCTG…
 6 AAACCTGGTACCGAGA.p55 SeuratProject              2933              1185 project/r… AAACCTG…
 7 AAACCTGGTGTGAAAT.p55 SeuratProject              4012              1314 project/r… AAACCTG…
 8 AAACCTGTCAGATAAG.p55 SeuratProject              2025               876 project/r… AAACCTG…
 9 AAACCTGTCCTGCTTG.p55 SeuratProject              1739               846 project/r… AAACCTG…
10 AAACCTGTCGCAAGCC.p55 SeuratProject              1197               640 project/r… AAACCTG…
# … with 39,381 more rows`

I will update the spreadsheet as well. I just wonder how to integrate the cell type annotation with the Seurat object... Thanks

stemangiola commented 3 years ago

> I just wonder how to integrate the cell type annotation with the Seurat object... Thanks

If the cell_type annotation is within a table, you can do

counts %>% left_join(annotation_table, by="cell")

Make sure the cell IDs of the two tables coincide.
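To illustrate the point about matching cell IDs, here is a sketch with toy tables. It uses base R `merge(..., all.x = TRUE)` as a stand-in for `left_join` so that it runs without extra packages; the table names, column names and values are hypothetical.

```r
# Per-cell table and annotation table (toy data, hypothetical IDs).
cells <- data.frame(
  cell   = c("cell_1", "cell_2", "cell_3"),
  nCount = c(859, 619, 444)
)
annotation_table <- data.frame(
  cell      = c("cell_1", "cell_2", "cell_4"),
  cell_type = c("T cell", "B cell", "monocyte")
)

# Cells without a matching annotation would get NA after a left join,
# so it is worth listing them first.
unmatched <- setdiff(cells$cell, annotation_table$cell)
print(unmatched)

# Base R equivalent of a left join on the "cell" column.
annotated <- merge(cells, annotation_table, by = "cell", all.x = TRUE)
print(annotated)
```

A non-empty `unmatched` often means that the two tables format the same IDs differently, for example with a sample prefix or suffix in only one of them.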

stemangiola commented 2 years ago

Hello @ZijieGA @Kirito-Ma please don't use abbreviations, or words that are not in the English dictionary (except for IDs), in variable names or file/directory names

For example, could you change the file

SCP1039_bc_cells

with whatever bc means?

Thanks

ZijieGA commented 2 years ago

> Hello @ZijieGA @Kirito-Ma please don't use for variable names or file/directory names abbreviations, or words that are not in the English dictionary (except for IDs)
>
> For example, could you change the file
> SCP1039_bc_cells
> with whatever bc means?
>
> Thanks

changed

ZijieGA commented 2 years ago

Hi, I have found a useful dataset along with clustering and cell type files. However, the data frame of the matrix does not contain cell names (e.g. barcodes) but a series of numbers: 0, 1, 2...2171.

> SCP1244
# A tibble: 45,895 × 2,171
   GENE         `0`   `1`   `2`   `3`   `4`   `5`   `6`    `7`   `8`   `9`  `10`  `11`  `12`
   <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 AL356585.2     0     0     0     0     0     0     0   0        0     0     0     0     0
 2 CU638689.1     0     0     0     0     0     0     0   0        0     0     0     0     0
 3 CU638689.2     0     0     0     0     0     0     0   0        0     0     0     0     0
 4 CU634019.1     0     0     0     0     0     0     0 492.       0     0     0     0     0
 5 CU633906.1     0     0     0     0     0     0     0   8.83     0     0     0     0     0
 6 FP236241.1     0     0     0     0     0     0     0 467.       0     0     0     0     0
 7 CU634019.7     0     0     0     0     0     0     0   0        0     0     0     0     0
 8 FP236383.…     0     0     0     0     0     0     0   0        0     0     0     0     0
 9 FP236383.4     0     0     0     0     0     0     0   0        0     0     0     0     0
10 FP671120.9     0     0     0     0     0     0     0   0        0     0     0     0     0
# … with 45,885 more rows, and 2,157 more variables: 13 <dbl>, 14 <dbl>, 15 <dbl>,
#   16 <dbl>, 17 <dbl>, 18 <dbl>, 19 <dbl>, 20 <dbl>, 21 <dbl>, 22 <dbl>, 23 <dbl>,
#   24 <dbl>, 25 <dbl>, 26 <dbl>, 27 <dbl>, 28 <dbl>, 29 <dbl>, 30 <dbl>, 31 <dbl>,
#   32 <dbl>, 33 <dbl>, 34 <dbl>, 35 <dbl>, 36 <dbl>, 37 <dbl>, 38 <dbl>, 39 <dbl>,
#   40 <dbl>, 41 <dbl>, 42 <dbl>, 43 <dbl>, 44 <dbl>, 45 <dbl>, 46 <dbl>, 47 <dbl>,
#   48 <dbl>, 49 <dbl>, 50 <dbl>, 51 <dbl>, 52 <dbl>, 53 <dbl>, 54 <dbl>, 55 <dbl>,
#   56 <dbl>, 57 <dbl>, 58 <dbl>, 59 <dbl>, 60 <dbl>, 61 <dbl>, 62 <dbl>, 63 <dbl>, …

whereas the meta looks like

> meta
# A tibble: 2,171 × 6
   NAME  biosample_id `cluster dominant cell type` `supercluster for L… X          Y        
   <chr> <chr>        <chr>                        <chr>                <chr>      <chr>    
 1 TYPE  group        group                        group                numeric    numeric  
 2 0     01115149-TC  prostate cancer cell         prostate cancer      -9.846066… 15.95937…
 3 1     01115149-TC  plasmablast                  B lineage            -5.832184… -11.4710…
 4 2     01115149-TC  prostate cancer cell         prostate cancer      -9.804745… 15.91849…
 5 3     01115149-TC  prostate cancer cell         prostate cancer      -9.771316… 15.88396…
 6 4     01115149-TC  CD4+ T cell                  NK/T                 6.4992606… -2.97978…
 7 5     01115149-TC  CD4+ T cell                  NK/T                 7.4751656… -1.72243…
 8 6     01115149-TC  CD8+ CXCR4+ T cell           NK/T                 5.0847661… -5.53409…
 9 7     01115149-TC  CD8+ CXCR4+ T cell           NK/T                 5.2157208… -5.70426…
10 8     01115149-TC  CD4+ T cell                  NK/T                 8.1365253… -1.66811…
# … with 2,161 more rows

It seems that in this case I cannot left_join the two datasets. Is there any easy way to combine both? Once I combine the two, it can be uploaded and ready to use.

stemangiola commented 2 years ago

> It seems that in this case I cannot leftjoin the two dataset. Is there any easy way to combine both? Once I combine the two, it can be uploaded and ready to use.

Integer numbers are fine as IDs if they are unique. I see that the NAME column holds integer numbers and that the column names of counts are integer numbers. Why can't you left_join them?
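A sketch of such a join on toy data shaped like the printout above: the count table is reshaped to one row per gene and cell, the descriptive `TYPE` row of the metadata is dropped, and the two are joined on the cell ID. The values and the simplified `cell_type` column name are made up, and base R `merge` stands in for `left_join`.

```r
# Wide count table with integer-like cell IDs as column names (toy data).
counts_wide <- data.frame(
  GENE = c("GeneA", "GeneB"),
  `0`  = c(0, 8.83),
  `1`  = c(492, 0),
  check.names = FALSE
)

# Metadata whose first row describes column types rather than a cell.
meta <- data.frame(
  NAME         = c("TYPE", "0", "1"),
  biosample_id = c("group", "sample_1", "sample_1"),
  cell_type    = c("group", "T cell", "B cell")
)
meta <- meta[meta$NAME != "TYPE", ]

# Long format: one row per gene-cell pair, with the cell ID kept as character
# so that it has the same type as meta$NAME.
cell_ids <- setdiff(colnames(counts_wide), "GENE")
counts_long <- data.frame(
  GENE  = rep(counts_wide$GENE, times = length(cell_ids)),
  NAME  = rep(cell_ids, each = nrow(counts_wide)),
  count = unlist(counts_wide[cell_ids], use.names = FALSE)
)

joined <- merge(counts_long, meta, by = "NAME", all.x = TRUE)
print(joined)
```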

ZijieGA commented 2 years ago

> > It seems that in this case I cannot leftjoin the two dataset. Is there any easy way to combine both? Once I combine the two, it can be uploaded and ready to use.
>
> Integer numbers are fine IDs if they are unique. I see NAME column if an integer number and column names of counts are integer numbers. Why can't you left_join them?

In this matrix, cells are in columns, whereas in the datasets processed previously they are in rows. I tried using t(x) to switch the columns and rows, but it did not work out.

stemangiola commented 2 years ago

> In this matrix, cells are in columns whereas the dataset processed previously are in rows. I tried using t(x) to switch the column and rows but did not work out

You have to use CreateSeuratObject from a matrix
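A possible way to get from the tibble to such a matrix, sketched on toy data: move the gene names to row names and keep only the numeric columns, so that the result is a numeric genes x cells matrix. The commented last line assumes Seurat is installed and is not run here.

```r
# Count table with genes in a GENE column and one column per cell (toy data).
counts_table <- data.frame(
  GENE = c("GeneA", "GeneB", "GeneC"),
  `0`  = c(0, 2, 5),
  `1`  = c(1, 0, 7),
  check.names = FALSE
)

# Move gene names to row names and keep only the numeric columns.
counts_matrix <- as.matrix(counts_table[, setdiff(colnames(counts_table), "GENE")])
rownames(counts_matrix) <- counts_table$GENE

# t() now behaves as expected if cells are needed in rows.
cells_by_genes <- t(counts_matrix)
print(cells_by_genes)

# With Seurat installed, the genes x cells matrix could then be passed on, e.g.
# seurat_object <- Seurat::CreateSeuratObject(counts = counts_matrix, project = "SCP1244")
```

Calling `t()` directly on a table that still contains the `GENE` column yields a character matrix, which may be why the earlier attempt did not work.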

stemangiola commented 2 years ago

@ZijieGA a new interesting dataset for you

[ ] https://genome.cshlp.org/content/early/2021/09/21/gr.273300.120

stemangiola commented 2 years ago

For the record

breast single-cell and spatial GSE176078

stemangiola / single_cell_outliers