Datasets - Githubissues

alex-d13 commented 3 years ago

Current status of datasets: https://docs.google.com/document/d/1a8uu0-GclIa9yy2Hs_AkoSJnMEOHLZkQdmCzwWlueoQ/edit#

Test-cases suggested by Francesca:

1. Mouse data: this would be a completely new application that we should develop as soon as possible, and there are already bulk RNA-seq + FACS data for independent validation, which we could start in Innsbruck next September/October. In this respect, the Tabula Muris data seem a great dataset to start with, as they include both Smart-seq2 and 10x data (correct?). GregorSturm, what do you think about the quality/resolution of the available annotation? Are the raw data easily accessible?
2. Human TIL data: Gregor has already done quite some work on the Zemin Zhang dataset, and a collaborator of mine could help us with the validation. It would be a nice way to test the ability of the tools to disentangle closely related cell types. 10x data are also available for building the signatures (see Szabo et al. and Cano-Gomez et al.).
3. Human lung-cancer data: raw data from the Maynard study are readily available, and Gregor has already done quite some work on the annotation.
4. Human glioblastoma (or glioma or brain cancer) data: some Smart-seq data are available (see Table 1 in this paper -- Alex, could you check for raw data availability?), and I have a collaborator who could help us with the validation.

alex-d13 commented 3 years ago

Regarding point 4: I looked at the datasets, and only one is completely available (Darmanis et al., Cell Reports: http://gbmseq.org/). I am not sure whether its annotation is detailed enough. For the other three:

Tirosh et al., Science: FASTQ files available, no annotation (https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP062392&o=acc_s%3Aa)
Venteicher et al., Science: I did not find access to the raw data (https://www.ncbi.nlm.nih.gov/bioproject/352580)
Neftel et al., Cell: they seem to have issues with uploading their data? (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE131928)

grst commented 3 years ago

Here's the dataset with ERCC spike-ins that we discussed today (Travaglini et al., 2020):

healthy lung from lung tumor patient
3 patients
matched 10x sequencing available
9.4k cells (Smart-seq2)
paper doi:10.1038/s41586-020-2922-4
cells are sorted into epithelial (EPCAM+), endothelial/immune (CD31+CD45+) and stromal (EPCAM−CD31−CD45−) fractions

I will send you a download link to the preprocessed files via email.

grst commented 3 years ago

Tabula Sapiens:
https://tabula-sapiens-portal.ds.czbiohub.org/whereisthedata
https://figshare.com/articles/dataset/Tabula_Sapiens_release_1_0/14267219

omnideconv / SimBu

Datasets #1