WEHI has separated out human and mouse data and made files available for loading into R.
Unfortunately they use data from MSigDB 5.2; 6.1 appears to be the latest.
Of course the Broad invented a new format, the .gmt format.
I'll be using Bioconductor's GSEABase to see if I can explore this data.
Note that Bioconductor wants rJava, and you'll need to run R CMD javareconf from the terminal to get rJava to recognize Homebrew's JVM.
Also, apparently GSEABase can't read .gmt files. I found read.gmt in clusterProfiler though. Maybe that will work?
...and sure, it does work, but .gmt files don't carry much information: just the gene set name and a list of gene IDs. It is useful to run queries over the set names though, e.g.
library(clusterProfiler) # read.gmt
library(tidyverse)       # tibble, dplyr, stringr
c7 <- as_tibble(read.gmt("/Users/hammer/Desktop/c7.all.v6.1.entrez.gmt"))
c7 %>% distinct(ont) %>% count() # 4872
c7 %>% filter(str_detect(ont, 'TCELL') | str_detect(ont, 'CD3') | str_detect(ont, 'CD4') | str_detect(ont, 'CD8')) %>% distinct(ont) %>% count() # 1736
c7 %>% filter(str_detect(ont, 'TCELL')) %>% distinct(ont) %>% count() # 1504
c7 %>% filter(str_detect(ont, 'CD3')) %>% distinct(ont) %>% count() # 60
c7 %>% filter(str_detect(ont, 'CD4')) %>% distinct(ont) %>% count() # 1024
c7 %>% filter(str_detect(ont, 'CD8')) %>% distinct(ont) %>% count() # 598
I think what I want is "ZIPped MSigDB v6.1 file set" from http://software.broadinstitute.org/gsea/downloads.jsp, which contains msigdb_v6.1.xml, which I believe has the metadata about the experiments. I shall try to parse it with xml2, after first exploring from the CLI with xmlstarlet.
Ugh of course this XML file is just a table and could have been a TSV:
$ xml el -a ~/Desktop/msigdb_v6.1.xml
MSIGDB/GENESET/@STANDARD_NAME
MSIGDB/GENESET/@SYSTEMATIC_NAME
MSIGDB/GENESET/@HISTORICAL_NAMES
MSIGDB/GENESET/@ORGANISM
MSIGDB/GENESET/@PMID
MSIGDB/GENESET/@AUTHORS
MSIGDB/GENESET/@GEOID
MSIGDB/GENESET/@EXACT_SOURCE
MSIGDB/GENESET/@GENESET_LISTING_URL
MSIGDB/GENESET/@EXTERNAL_DETAILS_URL
MSIGDB/GENESET/@CHIP
MSIGDB/GENESET/@CATEGORY_CODE
MSIGDB/GENESET/@SUB_CATEGORY_CODE
MSIGDB/GENESET/@CONTRIBUTOR
MSIGDB/GENESET/@CONTRIBUTOR_ORG
MSIGDB/GENESET/@DESCRIPTION_BRIEF
MSIGDB/GENESET/@DESCRIPTION_FULL
MSIGDB/GENESET/@TAGS
MSIGDB/GENESET/@MEMBERS
MSIGDB/GENESET/@MEMBERS_SYMBOLIZED
MSIGDB/GENESET/@MEMBERS_EZID
MSIGDB/GENESET/@MEMBERS_MAPPING
MSIGDB/GENESET/@FOUNDER_NAMES
MSIGDB/GENESET/@REFINEMENT_DATASETS
MSIGDB/GENESET/@VALIDATION_DATASETS
CATEGORY_CODE is C7 for ImmuneSigDB; to confirm:
$ xml sel -t -v "count(.//GENESET[contains(@CATEGORY_CODE,'C7')])" ~/Desktop/msigdb_v6.1.xml
4872
So, in R:
library(xml2)
library(tidyverse)
x <- read_xml("~/Desktop/msigdb_v6.1.xml")
c7 <- xml_find_all(x, "//GENESET[contains(@CATEGORY_CODE,'C7')]")
as_tibble(xml_attr(c7, "ORGANISM")) %>% count(value) # Homo sapiens 1888, Mus musculus 2984
as_tibble(xml_attr(c7, "CONTRIBUTOR")) %>% count(value) # Jernej Godec 4872
as_tibble(xml_attr(c7, "PMID")) %>% distinct(value) %>% count() # 363
as_tibble(xml_attr(c7, "STANDARD_NAME")) %>% filter(str_detect(value, 'TCELL') | str_detect(value, 'CD3') | str_detect(value, 'CD4') | str_detect(value, 'CD8')) %>% count() # 1736
# TRULY HORRIFYING
library(janitor) # clean_names
library(skimr)   # skim
c7.tibble <- tibble(lapply(lapply(as_list(c7), attributes), as_tibble)) %>% clean_names() %>% unnest(lapply_lapply_as_list_c7_attributes_as_tibble) %>% clean_names()
skim(c7.tibble)
c7.tibble <- c7.tibble %>% select(-c(category_code, contributor, contributor_org, external_details_url, founder_names, geneset_listing_url, historical_names, refinement_datasets, sub_category_code, tags, validation_datasets))
Now we can finally ask interesting questions:
c7.tibble %>% filter(str_detect(standard_name, 'TCELL') | str_detect(standard_name, 'CD3') | str_detect(standard_name, 'CD4') | str_detect(standard_name, 'CD8')) %>% group_by(organism) %>% count() # Homo sapiens 612, Mus musculus 1124
# Human T cell publications
c7.tibble %>% filter(str_detect(standard_name, 'TCELL') | str_detect(standard_name, 'CD3') | str_detect(standard_name, 'CD4') | str_detect(standard_name, 'CD8')) %>% filter(organism == 'Homo sapiens') %>% group_by(pmid) %>% count() %>% arrange(desc(n))
pmid          n
14607935     90
20620947     66
15789058     48
16474395     48
15210650     24
22434910     20
15928199     18
16423401     18
19464196     18
19568420     16
For future efforts, here's a TSV of the MSigDB 6.1 XML metadata with only the C7 (i.e. ImmuneSigDB) gene sets: c7.tsv.zip
When loading, be sure to use col_types = cols(.default = "c") to suppress issues with parsing the "None" PMID for a few gene sets (cf. https://github.com/tidyverse/readr/issues/148#issuecomment-142428658).
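A minimal sketch of that round trip, assuming the attachment unzips to a file named c7.tsv (file and object names are placeholders; the tibble is the one built above):
# write the C7 metadata out, then read it back with every column forced to character
c7.tibble %>% write_tsv("c7.tsv")
c7.meta <- read_tsv("c7.tsv", col_types = cols(.default = "c"))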
Decided to take a look at the data from the top PMID turned up in https://github.com/hammerlab/t-cell-review/issues/4#issuecomment-383351198, 14607935.
The PubMed page links to the associated GEO datasets (nice!). GDS1290 is just metadata, while GSE2770 is legit data. What's more, GEO2R will generate R code to import this data into R!
Had to BiocManager::install("GEOquery") first, then noticed that line 58 of the generated script forgot to add a # before the "set group names" inline comment (weird). All of the data gets shoved into a gnarly object called gset of class ExpressionSet. Can biobroom help? We'll have to find out later; I need to go work out.
So, with biobroom:
library(biobroom)
gset_tidy <- tidy(gset)
gset_tidy %>% distinct(sample) %>% count()
We've lost a lot of metadata, and a lot of replicates; only GSM60348 to GSM60381 made it. I think we're going to have to tidy this thing ourselves.
After reviewing the GEO data model and how to work with S4 objects, I have a better handle on this data; gset_raw <- getGEO("GSE2770", GSEMatrix = TRUE, getGPL = FALSE) pulls down the data into a list of 3 ExpressionSets, one element for each platform: GPL96, GPL97, GPL8300. There are 34 samples from each platform.
To tidy up this data:
library(GEOquery)
library(lubridate) # as.duration
gset_raw <- getGEO("GSE2770", GSEMatrix = TRUE, getGPL = FALSE)
GSE2770_tidy_full <- gset_raw %>%
map(~ tidy(.x, addPheno = TRUE)) %>%
bind_rows()
GSE2770_tidy <- GSE2770_tidy_full %>%
select(gene, sample, title, value) %>%
separate(title, c("cells", "treatment", "time", "replicate", "platform"), sep="[_\\(\\)]") %>%
mutate(platform = replace_na(platform, "U95Av2")) %>%
select(-cells) %>%
mutate(replicate = map_chr(replicate, ~ str_split(.x, boundary("word"))[[1]][2])) %>%
mutate(time = as.duration(time))
Note that GSM60381 has platform NA after the first call to separate; we hand-code U95Av2 based on examination of the GEO record. Another solution would be to pass the platform through during the bind_rows() call (see the sketch below). We could also split treatment and replicate further. For now, let's move on.
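Roughly what that alternative would look like, assuming getGEO() returns a named list whose element names contain the GPL accession (e.g. "GSE2770-GPL96_series_matrix.txt.gz"); the recode() mapping is mine, from the platform/accession pairs noted above:
GSE2770_tidy_alt <- gset_raw %>%
map(~ tidy(.x, addPheno = TRUE)) %>%
bind_rows(.id = "series_file") %>% # .id keeps the list element names as a column
mutate(gpl = str_extract(series_file, "GPL\\d+"), platform = recode(gpl, GPL96 = "U133A", GPL97 = "U133B", GPL8300 = "U95Av2"))
This trades the hand-coded replace_na() for trusting the element names.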
The most important next steps: understand gene and value.
Some packages that may help
Annotations for the 3 platforms from this paper
Reading the ImmuneSigDB paper, they "mapped Affymetrix probe set identifiers to human gene symbols using the Collapse Dataset tool (max probe algorithm) of the GSEA program". In other words, add a column for the HUGO gene symbol, group on it, and keep the max value within each group.
Maybe the fastest way to map, after the install/load dance:
library(hgu95av2.db) # GPL8300 / U95Av2
library(hgu133a.db)  # GPL96 / U133A
library(hgu133b.db)  # GPL97 / U133B
gpl8300_tib <- as_tibble(as.data.frame(hgu95av2SYMBOL))
gpl96_tib <- as_tibble(as.data.frame(hgu133aSYMBOL))
gpl97_tib <- as_tibble(as.data.frame(hgu133bSYMBOL))
probe2symbol <- bind_rows("U95Av2" = gpl8300_tib, "U133A" = gpl96_tib, "U133B" = gpl97_tib, .id = "platform")
GSE2770_tidy_symbols <- GSE2770_tidy %>% left_join(probe2symbol, by = c("platform", "gene" = "probe_id"))
Here's another way to do it using biomaRt; each platform requires 5+ minutes to return results from the web service, so I don't love this solution. I could easily reduce code duplication by iterating over a list of platforms but decided to be explicit (a sketch of the iterated version follows the code below).
library(biomaRt)
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
searchAttributes(ensembl, "affy") # affy_hg_u95av2, affy_hg_u133a, affy_hg_u133b
gpl8300_probes <- GSE2770_tidy %>% filter(platform == "U95Av2") %>% distinct(gene) %>% .$gene
gpl8300_genes <- getBM(attributes = c('affy_hg_u95av2', 'hgnc_symbol'), filters = 'affy_hg_u95av2', values = gpl8300_probes, mart = ensembl)
as_tibble(gpl8300_genes) %>% distinct(hgnc_symbol) %>% count() # 9,707 genes vs. 13,105 probes
as_tibble(as.data.frame(hgu95av2SYMBOL[gpl8300_probes])) %>% skim() # 8,585 genes vs. 11,460 probes
gpl96_probes <- GSE2770_tidy %>% filter(platform == "U133A") %>% distinct(gene) %>% .$gene
gpl96_genes <- getBM(attributes = c('affy_hg_u133a', 'hgnc_symbol'), filters = 'affy_hg_u133a', values = gpl96_probes, mart = ensembl)
# 22,737 probes, 13,926 gene symbols
gpl97_probes <- GSE2770_tidy %>% filter(platform == "U133B") %>% distinct(gene) %>% .$gene
gpl97_genes <- getBM(attributes = c('affy_hg_u133b', 'hgnc_symbol'), filters = 'affy_hg_u133b', values = gpl97_probes, mart = ensembl)
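For reference, the iterated version I decided against might look roughly like this (probe_maps and the unified column names are my own choices):
affy_attrs <- c(U95Av2 = "affy_hg_u95av2", U133A = "affy_hg_u133a", U133B = "affy_hg_u133b")
probe_maps <- imap(affy_attrs, function(attr, plat) {
  probes <- GSE2770_tidy %>% filter(platform == plat) %>% distinct(gene) %>% pull(gene)
  getBM(attributes = c(attr, "hgnc_symbol"), filters = attr, values = probes, mart = ensembl) %>%
    set_names(c("probe_id", "hgnc_symbol")) %>% # unify column names across platforms
    as_tibble()
})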
Since some genes have multiple probes mapped to them, we need to take the max value from all probes mapped to a gene to be compatible with the ImmuneSigDB analysis.
# Find genes with most probes
GSE2770_tidy_symbols %>%
group_by(symbol) %>%
summarise(n_probes = n_distinct(gene)) %>%
arrange(desc(n_probes))
# Max probe "algorithm"
GSE2770_tidy_maxprobe <- GSE2770_tidy_symbols %>%
filter(!is.na(symbol)) %>%
group_by(sample, symbol) %>%
slice(which.max(value))
Before I bother with normalization, I thought it would be fun to make some plots, inspired by Figure 1 of the paper!
ifng <- GSE2770_tidy_maxprobe %>% filter(symbol == "IFNG")
ggplot(ifng, aes(x = as.duration(time)/dhours(1), y = value, color = treatment)) + geom_point() + theme_bw()
# repeat for IL4, T-bet (can't find in data! tried TBX21, TBLYM, TBET), GATA3
# mygene.info "reporter" field doesn't have probes for TBX21 in U95A, U133A, or U133B!
ImmuneSigDB normalized CEL files with justRMA from affy. Given that the CEL files are available for this paper, I should probably repeat this analysis with that data.
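I haven't run it yet, but per-platform normalization would presumably look something like this with affy's justRMA (u133a_cels is a placeholder for one platform's file list):
library(affy)
# background-correct, quantile-normalize, and summarize straight to an ExpressionSet
u133a_eset <- justRMA(filenames = u133a_cels, celfile.path = "GSE2770/cels", compress = TRUE)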
More information on why I couldn't find TBX21 in the previous analysis:
More information on the top 10 T cell gene signatures turned up in https://github.com/hammerlab/t-cell-review/issues/4#issuecomment-383351198:
Okay time to work with CEL files!
library(GEOquery)
library(fs)   # dir_ls, path_filter, path
library(affy) # ReadAffy
files.df <- getGEOSuppFiles("GSE2770") # puts file(s) into GSE2770/
files <- rownames(files.df) # just one .tar file
untar(files[1], exdir = "GSE2770/cels") # CEL files still .gz
cels <- path_filter(dir_ls("GSE2770/cels"), glob = "*.gz")
ab <- ReadAffy(filenames = cels, compress = TRUE, verbose = TRUE) # Cel file GSE2770/cels/GSM60699.CEL.gz does not seem to have the correct dimensions (crap)
Okay, I am going to have to read and normalize the CEL files from the 3 different platforms separately. It's not obvious that differential expression analysis across platforms is a good idea, cf. Cross-study validation and combined analysis of gene expression microarray data (2008)
To get the .chip files, I'll need an FTP client in R. httr doesn't do FTP, ropensci/ftp is still half-baked, so I could use download.file() from base R or RCurl. Actually, I think curl is best!
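A one-liner sketch with the curl package (the destination file name is my choice):
library(curl)
curl_download("ftp://ftp.broadinstitute.org/pub/gsea/annotations/HG_U133A.chip", destfile = "HG_U133A.chip")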
Let's just get the CEL files used to generate the GSE2770_IL12_AND_TGFB_ACT_VS_ACT_CD4_TCELL_6H_DN gene set.
First, I'll need to figure out which samples to use as input. I'm going to try working with the phenotype data from GEO. This involves interacting w/ ExpressionSet and AnnotatedDataFrame objects from Bioconductor, sadly.
GSE2770 <- getGEO("GSE2770") # There are 3 elements in this list, 1 for each platform
pGSE2770 <- GSE2770 %>%
purrr::map(~ as(pData(.x), "data.frame")) %>%
bind_rows() %>%
as_tibble()
pGSE2770tidy <- pGSE2770 %>%
select(platform_id, geo_accession, supplementary_file, title) %>%
separate(title, c("cells", "treatment", "time", "replicate", "platform"), sep="[_\\(\\)]") %>%
select(-cells) %>%
mutate(replicate = map_chr(replicate, ~ str_split(.x, boundary("word"))[[1]][2])) %>%
mutate(time = as.duration(time))
# get just the U133A cels
cels <- pGSE2770tidy %>%
filter(time == duration("6h") & (treatment %in% c("antiCD3+antiCD28+IL12+TGFbeta", "antiCD3+antiCD28")) & platform == "U133A") %>%
.$supplementary_file
ab <- ReadAffy(filenames=path("GSE2770/cels", cels, ext = "CEL.gz"), compress=TRUE, verbose=TRUE)
library(limma)
eset <- rma(ab)
treatments <- factor(c(1,2,2,1),labels=c("untreated", "treated")) # weird ordering of files
design <- model.matrix(~treatments)
fit <- lmFit(eset, design)
fit2 <- eBayes(fit)
gns <- AnnotationDbi::select(hgu133a.db, featureNames(eset), c("ENSEMBL","SYMBOL","GENENAME")) # AnnotationDbi's select, not dplyr's
gns <- gns[!duplicated(gns[,1]),]
fit2$genes <- gns
results <- topTable(fit2, number=Inf)
# Not sure if this is UP or DOWN but it doesn't intersect much with either one!
results %>%
filter(!is.na(SYMBOL)) %>%
arrange(desc(logFC)) %>%
head(200) %>%
arrange(SYMBOL) %>%
.$SYMBOL
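To quantify the overlap, a rough sketch assuming the gene-symbol flavor of the C7 .gmt (I'm guessing the file is c7.all.v6.1.symbols.gmt and that a matching _UP set exists alongside the _DN one):
c7_sym <- as_tibble(read.gmt("c7.all.v6.1.symbols.gmt"))
dn <- c7_sym %>% filter(ont == "GSE2770_IL12_AND_TGFB_ACT_VS_ACT_CD4_TCELL_6H_DN") %>% pull(gene)
up <- c7_sym %>% filter(ont == "GSE2770_IL12_AND_TGFB_ACT_VS_ACT_CD4_TCELL_6H_UP") %>% pull(gene)
top200 <- results %>% filter(!is.na(SYMBOL)) %>% arrange(desc(logFC)) %>% head(200) %>% pull(SYMBOL)
length(intersect(top200, dn)); length(intersect(top200, up))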
Okay let's try to figure out why there's no overlap.
library(curl)
U133A.chip <- read_tsv(curl("ftp://ftp.broadinstitute.org/pub/gsea/annotations/HG_U133A.chip"))
U133A.chip <- U133A.chip %>%
select(-X4) %>%
clean_names() %>%
rowwise() %>%
mutate(first_gene_symbol = str_trim(str_split(gene_symbol, "///")[[1]][1]))
eset_tib <- as_tibble(exprs(eset), rownames = "probe_set_id")
eset_tib_genes <- eset_tib %>% left_join(U133A.chip, by = "probe_set_id")
# Remove probesets that don't map to a gene
eset_tib_genes %>% filter(is.na(first_gene_symbol)) %>% count() # 0
eset_tib_genes %>% filter(first_gene_symbol == "---") %>% count() # 1,109
eset_tib_genes <- eset_tib_genes %>% filter(first_gene_symbol != "---")
# Collapse probesets with max probe algorithm
eset_tib_genes_only <- eset_tib_genes %>%
select(-c(probe_set_id, gene_symbol, gene_title)) %>%
group_by(first_gene_symbol) %>%
summarise_all(max)
# Make a new ExpressionSet object
eset_genes <- eset_tib_genes_only %>%
as.data.frame() %>%
column_to_rownames(var="first_gene_symbol") %>%
data.matrix() %>%
ExpressionSet()
# Do DGE
fit_genes <- lmFit(eset_genes, design)
fit2_genes <- eBayes(fit_genes)
results_genes <- topTable(fit2_genes, number=Inf)
results_genes_tib <- as_tibble(results_genes, rownames = "gene")
results_genes_tib %>%
arrange(desc(logFC)) %>%
head(200) %>%
arrange(gene) %>%
.$gene
Downloading MSigDB XML files from the CLI:
curl -d "j_username=jeff.hammerbacher%40gmail.com&j_password=password" http://software.broadinstitute.org/gsea/j_spring_security_check -c - | curl -b @- -O http://software.broadinstitute.org/gsea/msigdb/download_file.jsp?filePath=/resources/msigdb/6.2/msigdb_v6.2.xml
cf. curl manpage:
-d, --data
(HTTP) Sends the specified data in a POST request to the HTTP server, in the same way that a browser does when a user has filled in an HTML form and presses the submit button. This will cause curl to pass the data to the server using the content-type application/x-www-form-urlencoded. Compare to -F, --form.
--data-raw is almost the same but does not have a special interpretation of the @ character. To post data purely binary, you should instead use the --data-binary option. To URL-encode the value of a form field you may use --data-urlencode.
If any of these options is used more than once on the same command line, the data pieces specified will be merged together with a separating &-symbol. Thus, using '-d name=daniel -d skill=lousy' would generate a post chunk that looks like 'name=daniel&skill=lousy'.
If you start the data with the letter @, the rest should be a file name to read the data from, or - if you want curl to read the data from stdin. Multiple files can also be specified. Posting data from a file named 'foobar' would thus be done with -d, --data @foobar. When --data is told to read from a file like that, carriage returns and newlines will be stripped out. If you don't want the @ character to have a special interpretation use --data-raw instead.
See also --data-binary and --data-urlencode and --data-raw. This option overrides -F, --form and -I, --head and -T, --upload-file.
-c, --cookie-jar
(HTTP) Specify to which file you want curl to write all cookies after a completed operation. Curl writes all cookies from its in-memory cookie storage to the given file at the end of operations. If no cookies are known, no data will be written. The file will be written using the Netscape cookie file format. If you set the file name to a single dash, "-", the cookies will be written to stdout.
This command line option will activate the cookie engine that makes curl record and use cookies. Another way to activate it is to use the -b, --cookie option.
If the cookie jar can't be created or written to, the whole curl operation won't fail or even report an error clearly. Using -v, --verbose will get a warning displayed, but that is the only visible feedback you get about this possibly lethal situation.
If this option is used several times, the last specified file name will be used.
-b, --cookie
(HTTP) Pass the data to the HTTP server in the Cookie header. It is supposedly the data previously received from the server in a "Set-Cookie:" line. The data should be in the format "NAME1=VALUE1; NAME2=VALUE2".
If no '=' symbol is used in the argument, it is instead treated as a filename to read previously stored cookies from. This option also activates the cookie engine which will make curl record incoming cookies, which may be handy if you're using this in combination with the -L, --location option or do multiple URL transfers on the same invoke. If the file name is exactly a minus ("-"), curl will instead read the contents from stdin.
The file format of the file to read cookies from should be plain HTTP headers (Set-Cookie style) or the Netscape/Mozilla cookie file format.
The file specified with -b, --cookie is only used as input. No cookies will be written to the file. To store cookies, use the -c, --cookie-jar option.
Exercise caution if you are using this option and multiple transfers may occur. If you use the NAME1=VALUE1; format, or in a file use the Set-Cookie format and don't specify a domain, then the cookie is sent for any domain (even after redirects are followed) and cannot be modified by a server-set cookie. If the cookie engine is enabled and a server sets a cookie of the same name then both will be sent on a future transfer to that server, likely not what you intended. To address these issues set a domain in Set-Cookie (doing that will include sub domains) or use the Netscape format.
If this option is used several times, the last one will be used.
Users very often want to both read cookies from a file and write updated cookies back to a file, so using both -b, --cookie and -c, --cookie-jar in the same command line is common.
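For completeness, roughly the same login-then-download dance from R with httr (this leans on httr reusing cookies for a host within a session, and on the same form fields as the curl command above):
library(httr)
POST("http://software.broadinstitute.org/gsea/j_spring_security_check",
     body = list(j_username = "jeff.hammerbacher@gmail.com", j_password = "password"), encode = "form")
GET("http://software.broadinstitute.org/gsea/msigdb/download_file.jsp?filePath=/resources/msigdb/6.2/msigdb_v6.2.xml",
    write_disk("msigdb_v6.2.xml", overwrite = TRUE))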
From the MCP-counter paper:
MCP datasets and tumor datasets from Affymetrix Human Genome U133 Plus 2.0, Human Genome 133A, and HuGene 1.0 ST arrays were normalized using the frozen robust multiarray average (fRMA) method, implemented in the fRMA R package (version 1.18.0)
Looks like fRMA (the frma package) will allow us to compare across array platforms.
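A minimal sketch of what that might look like, assuming the U133A AffyBatch ab from above and that the frozen-vector package for the platform (e.g. hgu133afrmavecs) is installed:
library(frma)
# frozen RMA: normalize against pre-estimated reference distributions so separately
# processed batches land on a comparable scale
eset_frma <- frma(ab)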
Also known as MSigDB C7. Paper