Open stemangiola opened 3 years ago
Hello @Kirito-Ma ,
this is the rough method to look for data:
1) briefly read the article
2) look for a GitHub repository
3) read the README of the GitHub repository
4) see if there is code to reproduce the analyses
5) look at that code and see if there is data that includes transcript abundance
6) if yes, download the code and files, unzip, and take what you need
7) if not, look in the paper for the data availability statement
8) go to that repository, and try to understand what type of data it is (e.g. raw counts)
9) paste the link in this issue, so we start to see how much data is available
What data do we need?
1) a table with gene ID as row names, cell ID as column names, and transcript abundance (> 0, < 10000) as values
2) a table with cell ID as row names, and cell type (e.g. T cell), sample ID, and factor of interest (e.g. healthy vs cancer, knock-out vs wild-type) as columns
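A minimal sketch of what these two tables could look like in R, with a quick sanity check. All names and values here are made up for illustration, `check_single_cell_tables` is a hypothetical helper, and the check reads "(> 0, < 10000)" as a rough expected range that still allows zeros, which is an assumption.

```r
# Toy stand-ins for the two tables (all names and values are made up).
counts <- matrix(
  c(0, 3, 12,
    5, 0, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("GENE_A", "GENE_B"), c("cell_1", "cell_2", "cell_3"))
)

cell_metadata <- data.frame(
  cell_type = c("T cell", "B cell", "T cell"),
  sample_id = c("sample_1", "sample_1", "sample_2"),
  condition = c("healthy", "healthy", "cancer"),
  row.names = c("cell_1", "cell_2", "cell_3")
)

# Hypothetical helper: TRUE when the two tables fit the shape described above.
# Assumption: abundances are non-negative and below `max_abundance`.
check_single_cell_tables <- function(counts, cell_metadata, max_abundance = 10000) {
  is.numeric(counts) &&
    !anyNA(counts) &&
    all(counts >= 0 & counts < max_abundance) &&
    setequal(colnames(counts), rownames(cell_metadata)) &&
    all(c("cell_type", "sample_id") %in% colnames(cell_metadata))
}

check_single_cell_tables(counts, cell_metadata)
```

The key point of the check is that every cell ID in the abundance table has a matching row in the cell annotation table.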
To contribute to another study, find data with this shape:
1) a table with sample ID as row names, cell type (e.g. T cell) as column names, and counts (> 0, < 1000) as values (these are the number of cells of each cell type in a sample)
2) a table with sample ID as row names and factor of interest as a column
You are interested in studies that have a sample design in which they test differences between conditions.
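When a study only provides per-cell annotation, the two sample-level tables above can usually be derived from it. This is a sketch in base R under that assumption; the column names `cell_id`, `sample_id`, `cell_type` and `condition` and all values are invented for the example.

```r
# Toy per-cell annotation (all names and values are made up).
cell_metadata <- data.frame(
  cell_id = paste0("cell_", 1:6),
  sample_id = c("sample_1", "sample_1", "sample_1", "sample_2", "sample_2", "sample_2"),
  cell_type = c("T cell", "T cell", "B cell", "T cell", "B cell", "B cell"),
  condition = c(rep("healthy", 3), rep("cancer", 3)),
  stringsAsFactors = FALSE
)

# Table 1: samples as rows, cell types as columns, number of cells as values.
cell_counts <- as.data.frame.matrix(
  table(cell_metadata$sample_id, cell_metadata$cell_type)
)

# Table 2: samples as rows, factor of interest as a column.
sample_design <- unique(cell_metadata[, c("sample_id", "condition")])
rownames(sample_design) <- sample_design$sample_id
sample_design$sample_id <- NULL

cell_counts
sample_design
```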
[ ] CellMarker: a manually curated resource of cell markers in human and mouse
[ ] scRNAseq Bioconductor package: gene-level counts for a collection of public scRNA-seq datasets, provided as SingleCellExperiment objects with cell- and gene-level metadata.
[ ] EMBL-EBI atlas
[ ] [PanglaoDB](https://panglaodb.se/) is a database for the scientific community interested in exploration of single cell RNA sequencing experiments from mouse and human. We collect and integrate data from multiple studies and present them through a unified framework.
[ ] scRNASeqDB database, which contains 36 human single cell gene expression data sets collected from Gene Expression Omnibus (GEO)
[ ] JingleBell: A repository of standardized single cell RNA-Seq datasets for analysis and visualization at the single cell level.
[ ] The conquer (consistent quantification of external rna-seq data) repository is developed by Charlotte Soneson and Mark D Robinson at the University of Zurich, Switzerland. It is implemented in shiny and provides access to consistently processed public single-cell RNA-seq data sets.
[ ] A curated database reveals trends in single cell transcriptomics Valentine Svensson, Eduardo da Veiga Beltrame bioRxiv 742304; doi: https://doi.org/10.1101/742304
Hello @Kirito-Ma you downloaded 3 datasets but you ticked just one, could you update the ticks?
Hi Stemangiola/Single_cell_outliers, I have updated my ticks. Thanks for reminding me.
@ZijieGA this is a good one for you https://satijalab.org/seurat/articles/multimodal_reference_mapping.html
We load the reference (https://atlas.fredhutch.org/data/nygc/multimodal/pbmc_multimodal.h5seurat)
remotes::install_github("mojaveazure/seurat-disk")
library(Seurat)
library(SeuratDisk)
reference <- LoadH5Seurat("../../pbmc_multimodal.h5seurat")
DimPlot(object = reference, reduction = "wnn.umap", group.by = "celltype.l2", label = TRUE, label.size = 3, repel = TRUE) + NoLegend()
@Kirito-Ma @ZijieGA Please compile https://docs.google.com/spreadsheets/d/1En7-UV0k0laDiIfjFkdn7dggyR7jIk3WH8QgXaMOZF0/edit#gid=0
(this is a huge database of single-cell studies, just for your knowledge https://docs.google.com/spreadsheets/d/17Z5j_Oxd21IEyQ1qZ_vXq9FBpG4YrFAq9naEifBEuFw/edit?usp=sharing)
Here are some human blood datasets I added to your spreadsheet in another tab, "available"
Bring datasets to a common format
sample | cell_type | cell_cluster | dataset_id
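One way to get there is to map each dataset's own column names onto these four columns and then stack the results. The sketch below is only an illustration: `to_common_format` is a hypothetical helper, and the input column names and values are invented.

```r
# Hypothetical helper: pick and rename the relevant columns of one dataset's
# cell-level metadata so every dataset ends up with the same four columns.
to_common_format <- function(metadata, dataset_id,
                             sample_column, cell_type_column, cell_cluster_column) {
  data.frame(
    sample = as.character(metadata[[sample_column]]),
    cell_type = as.character(metadata[[cell_type_column]]),
    cell_cluster = as.character(metadata[[cell_cluster_column]]),
    dataset_id = dataset_id,
    stringsAsFactors = FALSE
  )
}

# Two toy datasets that name their columns differently (made-up values).
metadata_a <- data.frame(
  biosample_id = c("s1", "s1"),
  annotation = c("T cell", "B cell"),
  cluster = c(1, 2)
)
metadata_b <- data.frame(
  Sample = "p55",
  celltype = "monocyte",
  seurat_clusters = "0"
)

# Stack the harmonised tables into one.
common <- rbind(
  to_common_format(metadata_a, "dataset_a", "biosample_id", "annotation", "cluster"),
  to_common_format(metadata_b, "dataset_b", "Sample", "celltype", "seurat_clusters")
)
common
```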
Hi @stemangiola, just a few questions about the cell types. I found some datasets that contain T cells, B cells, plasma cells, mast cells, myeloid leukocytes and glial cells. I wonder if these cell types are too general, since we are trying to use a novel method to identify cell types. Should I try to find datasets with a more specific cell type annotation, for example ones that distinguish CD8+ T, CD4, effector T helper cells, CD16 monocytes, etc.?
Those are also helpful.
Hi, when importing a dataset of considerable size (a few hundred MB), is there a convenient way to preview the data frame? For example, I have a txt.gz count matrix, and it takes forever to preview if I use
library(readr)
log_normalized_matrix_012020_txt <- read_table2("project/SCP256/data/log_normalized_matrix_012020.txt.gz")
View(log_normalized_matrix_012020_txt)
For tables, you could go to the terminal and type
head path_to_very_big_table
Actually, if it is compressed, you can run this in the terminal (or use zless if less does not decompress the file on your system)
less path_to_very_big_table
And press "q" to exit the view
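The same idea also works from within R: read only the first few lines or rows instead of the whole table. This is a sketch in base R; the small temporary file stands in for the real one, and its contents are made up.

```r
# Write a small gzip-compressed, tab-separated table to a temporary file so
# the example is self-contained; in practice `path` points at the real file.
path <- tempfile(fileext = ".txt.gz")
connection <- gzfile(path, "w")
writeLines(
  c("GENE\tcell_1\tcell_2", "GENE_A\t0\t1.5", "GENE_B\t2\t0", "GENE_C\t0\t0"),
  connection
)
close(connection)

# Look at the first raw lines without parsing the table.
connection <- gzfile(path, "r")
first_lines <- readLines(connection, n = 2)
close(connection)
first_lines

# Or parse only the first rows into a data frame.
preview <- read.delim(path, nrows = 2)
preview
```

With readr, `read_tsv(path, n_max = 10)` limits the number of rows in a similar way.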
I tried to create a Seurat object from a gigantic dataset (1.9 GB uncompressed), but the resulting Seurat object has zero features and is small in size (a few MB). I can confirm that the matrix contains the features. It seems the Seurat object is not produced properly?
`> SCP256seurat <- CreateSeuratObject(counts = counts, min.cells = 3, min.genes = 200, project = "SCP256")
Warning message:
In storage.mode(from) <- "double" : NAs introduced by coercion`
What does Google say?
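One possible cause of this warning, offered here as an assumption rather than a diagnosis, is that the counts object still carries a non-numeric column (for example gene names read in as an ordinary column), so the whole matrix becomes text. The sketch below uses a made-up toy table to show how to spot and fix that in base R.

```r
# Toy table as it might come out of a text file: gene names sit in an
# ordinary column next to the numeric values (made-up data).
counts_table <- data.frame(
  GENE = c("GENE_A", "GENE_B"),
  cell_1 = c(0, 3),
  cell_2 = c(5, 0),
  stringsAsFactors = FALSE
)

# as.matrix() on mixed columns gives a character matrix; forcing that to
# numeric is what yields "NAs introduced by coercion".
character_matrix <- as.matrix(counts_table)
is.character(character_matrix)

# Find the columns that are not numeric.
non_numeric_columns <- names(counts_table)[
  !vapply(counts_table, is.numeric, logical(1))
]
non_numeric_columns

# Move gene names to row names and keep only the numeric columns.
numeric_columns <- setdiff(names(counts_table), non_numeric_columns)
counts_matrix <- as.matrix(counts_table[, numeric_columns])
rownames(counts_matrix) <- counts_table$GENE
counts_matrix
```

It is also worth checking the installed Seurat version: from version 3 onwards the filter argument is `min.features`, and `min.genes` is the older name.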
I have found 2 useful datasets with cell type annotations:
`> bc_cells
# A Seurat-tibble abstraction: 100,064 × 6
# Features=29733 | Active assay=originalexp | Assays=originalexp
cell orig.ident nCount_originale… nFeature_origin… Sample Barcode
<chr> <fct> <dbl> <int> <chr> <chr>
1 CID3586_AAGACCTCAGCATGAG CID3586 859. 564 project/… CID3586_…
2 CID3586_AAGGTTCGTAGTACCT CID3586 619. 276 project/… CID3586_…
3 CID3586_ACCAGTAGTTGTGGCC CID3586 444. 169 project/… CID3586_…
4 CID3586_ACCCACTAGATGTCGG CID3586 450. 182 project/… CID3586_…
5 CID3586_ACTGATGGTCAACTGT CID3586 589. 269 project/… CID3586_…
6 CID3586_ACTTGTTAGGGAAACA CID3586 597. 256 project/… CID3586_…
7 CID3586_AGCAGCCTCCCTCTTT CID3586 441. 169 project/… CID3586_…
8 CID3586_AGCTTGATCGGCGCTA CID3586 669. 302 project/… CID3586_…
9 CID3586_ATCATCTAGGGATACC CID3586 607. 272 project/… CID3586_…
10 CID3586_ATGGGAGAGGAGCGAG CID3586 717. 330 project/… CID3586_…
# … with 100,054 more rows`
another one:
`> rc_cells
# A Seurat-tibble abstraction: 39,391 × 6
# Features=60627 | Active assay=originalexp | Assays=originalexp
cell orig.ident nCount_originale… nFeature_origina… Sample Barcode
<chr> <fct> <dbl> <int> <chr> <chr>
1 AAACCTGAGAATAGGG.p55 SeuratProject 1432 771 project/r… AAACCTG…
2 AAACCTGAGGCTAGGT.p55 SeuratProject 1797 865 project/r… AAACCTG…
3 AAACCTGCACTGTGTA.p55 SeuratProject 2071 987 project/r… AAACCTG…
4 AAACCTGCAGTCCTTC.p55 SeuratProject 682 368 project/r… AAACCTG…
5 AAACCTGGTAAATGTG.p55 SeuratProject 2915 1191 project/r… AAACCTG…
6 AAACCTGGTACCGAGA.p55 SeuratProject 2933 1185 project/r… AAACCTG…
7 AAACCTGGTGTGAAAT.p55 SeuratProject 4012 1314 project/r… AAACCTG…
8 AAACCTGTCAGATAAG.p55 SeuratProject 2025 876 project/r… AAACCTG…
9 AAACCTGTCCTGCTTG.p55 SeuratProject 1739 846 project/r… AAACCTG…
10 AAACCTGTCGCAAGCC.p55 SeuratProject 1197 640 project/r… AAACCTG…
# … with 39,381 more rows`
I will update the spreadsheet as well. I just wonder how to integrate the cell type annotation with the Seurat object... Thanks
If the cell_type annotation is in a table, you can do
counts %>% left_join(annotation_table, by="cell")
Make sure the cell IDs of the two tables coincide.
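A minimal sketch of the join, with a check that the cell IDs coincide before joining. It uses plain tibbles with made-up values rather than the actual objects in this thread, and assumes both tables have a `cell` column.

```r
library(dplyr)

# Toy per-cell table and annotation table (made-up values).
cells <- tibble(
  cell = c("cell_1", "cell_2", "cell_3"),
  sample = c("s1", "s1", "s2")
)
annotation_table <- tibble(
  cell = c("cell_1", "cell_2", "cell_4"),
  cell_type = c("T cell", "B cell", "monocyte")
)

# Cells that have no match in the annotation table; ideally zero rows.
cells_without_annotation <- anti_join(cells, annotation_table, by = "cell")
cells_without_annotation

# left_join keeps every cell; unmatched cells get NA as cell_type.
annotated_cells <- left_join(cells, annotation_table, by = "cell")
annotated_cells
```

If `cells_without_annotation` is not empty, the IDs probably differ in formatting (a prefix or suffix, for instance) and need to be harmonised before joining.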
Hello @ZijieGA @Kirito-Ma, for variable names and file/directory names, please don't use abbreviations or words that are not in the English dictionary (except for IDs)
For example, could you rename the file
SCP1039_bc_cells
replacing bc with whatever it stands for?
Thanks
changed
Hi, I have found a useful dataset along with clustering and cell type files. However, the data frame of the matrix does not contain cell names (e.g. barcodes) but a series of numbers: 0, 1, 2 ... 2171.
> SCP1244
# A tibble: 45,895 × 2,171
GENE `0` `1` `2` `3` `4` `5` `6` `7` `8` `9` `10` `11` `12`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AL356585.2 0 0 0 0 0 0 0 0 0 0 0 0 0
2 CU638689.1 0 0 0 0 0 0 0 0 0 0 0 0 0
3 CU638689.2 0 0 0 0 0 0 0 0 0 0 0 0 0
4 CU634019.1 0 0 0 0 0 0 0 492. 0 0 0 0 0
5 CU633906.1 0 0 0 0 0 0 0 8.83 0 0 0 0 0
6 FP236241.1 0 0 0 0 0 0 0 467. 0 0 0 0 0
7 CU634019.7 0 0 0 0 0 0 0 0 0 0 0 0 0
8 FP236383.… 0 0 0 0 0 0 0 0 0 0 0 0 0
9 FP236383.4 0 0 0 0 0 0 0 0 0 0 0 0 0
10 FP671120.9 0 0 0 0 0 0 0 0 0 0 0 0 0
# … with 45,885 more rows, and 2,157 more variables: 13 <dbl>, 14 <dbl>, 15 <dbl>,
# 16 <dbl>, 17 <dbl>, 18 <dbl>, 19 <dbl>, 20 <dbl>, 21 <dbl>, 22 <dbl>, 23 <dbl>,
# 24 <dbl>, 25 <dbl>, 26 <dbl>, 27 <dbl>, 28 <dbl>, 29 <dbl>, 30 <dbl>, 31 <dbl>,
# 32 <dbl>, 33 <dbl>, 34 <dbl>, 35 <dbl>, 36 <dbl>, 37 <dbl>, 38 <dbl>, 39 <dbl>,
# 40 <dbl>, 41 <dbl>, 42 <dbl>, 43 <dbl>, 44 <dbl>, 45 <dbl>, 46 <dbl>, 47 <dbl>,
# 48 <dbl>, 49 <dbl>, 50 <dbl>, 51 <dbl>, 52 <dbl>, 53 <dbl>, 54 <dbl>, 55 <dbl>,
# 56 <dbl>, 57 <dbl>, 58 <dbl>, 59 <dbl>, 60 <dbl>, 61 <dbl>, 62 <dbl>, 63 <dbl>, …
whereas the meta looks like
> meta
# A tibble: 2,171 × 6
NAME biosample_id `cluster dominant cell type` `supercluster for L… X Y
<chr> <chr> <chr> <chr> <chr> <chr>
1 TYPE group group group numeric numeric
2 0 01115149-TC prostate cancer cell prostate cancer -9.846066… 15.95937…
3 1 01115149-TC plasmablast B lineage -5.832184… -11.4710…
4 2 01115149-TC prostate cancer cell prostate cancer -9.804745… 15.91849…
5 3 01115149-TC prostate cancer cell prostate cancer -9.771316… 15.88396…
6 4 01115149-TC CD4+ T cell NK/T 6.4992606… -2.97978…
7 5 01115149-TC CD4+ T cell NK/T 7.4751656… -1.72243…
8 6 01115149-TC CD8+ CXCR4+ T cell NK/T 5.0847661… -5.53409…
9 7 01115149-TC CD8+ CXCR4+ T cell NK/T 5.2157208… -5.70426…
10 8 01115149-TC CD4+ T cell NK/T 8.1365253… -1.66811…
# … with 2,161 more rows
It seems that in this case I cannot left_join the two datasets. Is there an easy way to combine them? Once I combine the two, the data can be uploaded and is ready to use.
Integer numbers are fine as IDs if they are unique. I see that the NAME column holds integer numbers and the column names of counts are integer numbers. Why can't you left_join them?
In this matrix, cells are in columns, whereas in the datasets processed previously they are in rows. I tried using t(x) to swap the columns and rows, but it did not work.
You have to use CreateSeuratObject from a matrix
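A sketch of how that could look for a table shaped like the one above. `CreateSeuratObject` expects features as rows and cells as columns, which is already the orientation here, so no transpose should be needed; what matters is turning the table into a numeric matrix with genes as row names. The toy values and the shortened `cell_type` column name are assumptions, and the Seurat call is left as a comment so the block runs without Seurat installed.

```r
# Toy versions of the expression table and the metadata (made-up subset).
expression_table <- data.frame(
  GENE = c("CU634019.1", "CU633906.1"),
  `0` = c(0, 0),
  `1` = c(492, 8.83),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
meta <- data.frame(
  NAME = c("TYPE", "0", "1"),
  biosample_id = c("group", "01115149-TC", "01115149-TC"),
  cell_type = c("group", "prostate cancer cell", "plasmablast"),
  stringsAsFactors = FALSE
)

# Genes become row names; the remaining columns form a numeric matrix.
# Calling t() on the full table would give a character matrix, because of
# the GENE column.
counts_matrix <- as.matrix(expression_table[, -1])
rownames(counts_matrix) <- expression_table$GENE

# Drop the "TYPE" header row, index by cell ID, and match the column order.
cell_metadata <- meta[meta$NAME != "TYPE", ]
rownames(cell_metadata) <- cell_metadata$NAME
cell_metadata <- cell_metadata[colnames(counts_matrix), ]

# The cell IDs of the two objects must coincide.
stopifnot(identical(colnames(counts_matrix), rownames(cell_metadata)))

# With Seurat installed, the object could then be created with:
# seurat_object <- Seurat::CreateSeuratObject(
#   counts = counts_matrix, meta.data = cell_metadata
# )
```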
@ZijieGA a new interesting dataset for you
For the record
breast single-cell and spatial GSE176078
If you dig into their GitHub repository, you will likely find the data already summarised for you in the form of R scripts for reproducibility, so you can avoid 95% of the work of processing raw single-cell data.
Used by this article: https://www.pnas.org/content/118/22/e2100293118
Segerstolpe A, Palasantza A, Eliasson P, Andersson EM, Andreasson AC, Sun X, Picelli S, Sabirsh A, Clausen M, Bjursell MK, Smith DM, Kasper M, Ammala C, Sandberg R. Single-Cell Transcriptome Profiling of Human Pancreatic Islets in Health and Type 2 Diabetes. Cell Metab. 2016; 24(4):593–607.
Delile J, Rayon T, Melchionda M, Edwards A, Briscoe J, Sagner A. Single cell transcriptomics reveals spatial and temporal dynamics of gene expression in the developing mouse spinal cord. Development. 2019. https://doi.org/10.1242/dev.173807.
Used by this article: https://www.pnas.org/content/118/22/e2100293118
M. Sade-Feldman et al., Defining T cell states associated with response to checkpoint immunotherapy in melanoma.
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE120575
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE122043
K. Gupta et al., Single-cell analysis reveals a hair follicle dermal niche molecular differentiation trajectory that begins prior to morphogenesis. Dev. Cell 48, 17–31 (2019).
X. Fan et al., Single cell and open chromatin analysis reveals molecular origin of epidermal cells of the skin. Dev. Cell 47, 21–37 (2018).
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE102086
https://ndownloader.figshare.com/files/22927382
R. L. Chua et al., Covid-19 severity correlates with airway epithelium–immune cell interactions identified by single-cell analysis. Nat. Biotechnol. 38, 970–979 (2020).
M. Liao et al., Single-cell landscape of bronchoalveolar immune cells in patients with covid-19. Nat. Med. 26, 842–844 (2020)
https://cells.ucsc.edu/covid19-balf/nCoV.rds
https://singlecell.broadinstitute.org/single_cell/study/SCP263/aging-mouse-brain#/
M. Ximerakis et al., Single-cell transcriptomic profiling of the aging mouse brain. Nat. Neurosci. 22, 1696–1708 (2019).
Used by this article: https://www.biorxiv.org/content/10.1101/2020.12.14.422688v1.full
https://singlecell.broadinstitute.org/single_cell/study/SCP259
https://github.com/zhangzlab/covid_balf
https://singlecell.broadinstitute.org/single_cell/study/SCP44