hammerlab / t-cell-data

https://tcelldata.hammerlab.org

Exploring ImmuneSigDB #4

Open hammer opened 6 years ago

hammer commented 6 years ago

Also known as MSigDB C7. Paper

hammer commented 6 years ago

WEHI has separated out human and mouse data and made files available for loading into R.

Unfortunately they use data from MSigDB 5.2; 6.1 appears to be the latest.

Of course the Broad invented a new format, the .gmt format.

hammer commented 6 years ago

I'll be using Bioconductor's GSEABase to see if I can explore this data.

Note that Bioconductor wants rJava, and you'll need to R CMD javareconf from the terminal to get rJava to recognize Homebrew's JVM.

Also, I couldn't get GSEABase to read .gmt files. I found read.gmt in clusterProfiler though. Maybe that will work?

...and sure, it does work, but .gmt files do not have enough information: just the experiment name and a list of gene IDs. It is useful to run queries over experiment names though, e.g.

library(clusterProfiler) # read.gmt()
library(tidyverse)       # tibble, dplyr, stringr

c7 <- as_tibble(read.gmt("/Users/hammer/Desktop/c7.all.v6.1.entrez.gmt"))
c7 %>% distinct(ont) %>% count() # 4872
c7 %>% filter(str_detect(ont, 'TCELL') | str_detect(ont, 'CD3') | str_detect(ont, 'CD4') | str_detect(ont, 'CD8')) %>% distinct(ont) %>% count() # 1736
c7 %>% filter(str_detect(ont, 'TCELL')) %>% distinct(ont) %>% count() # 1504
c7 %>% filter(str_detect(ont, 'CD3')) %>% distinct(ont) %>% count() # 60
c7 %>% filter(str_detect(ont, 'CD4')) %>% distinct(ont) %>% count() # 1024
c7 %>% filter(str_detect(ont, 'CD8')) %>% distinct(ont) %>% count() # 598
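
For reference, each line of a .gmt file is just the gene set name, a description field, and a tab-separated list of gene IDs. An illustrative (made-up) line:

GSE1234_NAIVE_VS_MEMORY_CD4_TCELL_UP	na	3558	3565	50943
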
hammer commented 6 years ago

I think what I want is "ZIPped MSigDB v6.1 file set" from http://software.broadinstitute.org/gsea/downloads.jsp, which contains msigdb_v6.1.xml, which I believe has the metadata about the experiments. I shall try to parse with xml2, after first exploring from the CLI w/ xmlstarlet.

Ugh of course this XML file is just a table and could have been a TSV:

$ xml el -a ~/Desktop/msigdb_v6.1.xml
MSIGDB/GENESET/@STANDARD_NAME
MSIGDB/GENESET/@SYSTEMATIC_NAME
MSIGDB/GENESET/@HISTORICAL_NAMES
MSIGDB/GENESET/@ORGANISM
MSIGDB/GENESET/@PMID
MSIGDB/GENESET/@AUTHORS
MSIGDB/GENESET/@GEOID
MSIGDB/GENESET/@EXACT_SOURCE
MSIGDB/GENESET/@GENESET_LISTING_URL
MSIGDB/GENESET/@EXTERNAL_DETAILS_URL
MSIGDB/GENESET/@CHIP
MSIGDB/GENESET/@CATEGORY_CODE
MSIGDB/GENESET/@SUB_CATEGORY_CODE
MSIGDB/GENESET/@CONTRIBUTOR
MSIGDB/GENESET/@CONTRIBUTOR_ORG
MSIGDB/GENESET/@DESCRIPTION_BRIEF
MSIGDB/GENESET/@DESCRIPTION_FULL
MSIGDB/GENESET/@TAGS
MSIGDB/GENESET/@MEMBERS
MSIGDB/GENESET/@MEMBERS_SYMBOLIZED
MSIGDB/GENESET/@MEMBERS_EZID
MSIGDB/GENESET/@MEMBERS_MAPPING
MSIGDB/GENESET/@FOUNDER_NAMES
MSIGDB/GENESET/@REFINEMENT_DATASETS
MSIGDB/GENESET/@VALIDATION_DATASETS

MSIGDB/GENESET/@CATEGORY_CODE is C7 for ImmuneSigDB; to confirm:

$ xml sel -t -v "count(.//GENESET[contains(@CATEGORY_CODE,'C7')])" ~/Desktop/msigdb_v6.1.xml 
4872

So, in R:

library(xml2)
library(tidyverse) # tibble, dplyr, stringr
library(janitor)   # clean_names()
library(skimr)     # skim()

x <- read_xml("~/Desktop/msigdb_v6.1.xml")
c7 <- xml_find_all(x, "//GENESET[contains(@CATEGORY_CODE,'C7')]")
as_tibble(xml_attr(c7, "ORGANISM")) %>% group_by(value) %>% count(value) # Homo sapiens 1888, Mus musculus 2984
as_tibble(xml_attr(c7, "CONTRIBUTOR")) %>% group_by(value) %>% count(value) # Jernej Godec 4872
as_tibble(xml_attr(c7, "PMID")) %>% distinct(value) %>% count() # 363
as_tibble(xml_attr(c7, "STANDARD_NAME")) %>% filter(str_detect(value, 'TCELL') | str_detect(value, 'CD3') | str_detect(value, 'CD4') | str_detect(value, 'CD8')) %>% count() # 1736
# TRULY HORRIFYING: pull each GENESET's attribute list into a one-row tibble, then unnest
c7.tibble <- tibble(lapply(lapply(as_list(c7), attributes), as_tibble)) %>%
  clean_names() %>%
  unnest(lapply_lapply_as_list_c7_attributes_as_tibble) %>%
  clean_names()
skim(c7.tibble)
c7.tibble <- c7.tibble %>%
  select(-c(category_code, contributor, contributor_org, external_details_url,
            founder_names, geneset_listing_url, historical_names, refinement_datasets,
            sub_category_code, tags, validation_datasets))

Now we can finally ask interesting questions:

c7.tibble %>% filter(str_detect(standard_name, 'TCELL') | str_detect(standard_name, 'CD3') | str_detect(standard_name, 'CD4') | str_detect(standard_name, 'CD8')) %>% group_by(organism) %>% count() # Homo sapiens 612, Mus musculus 1124
# Human T cell publications
c7.tibble %>% filter(str_detect(standard_name, 'TCELL') | str_detect(standard_name, 'CD3') | str_detect(standard_name, 'CD4') | str_detect(standard_name, 'CD8')) %>% filter(organism == 'Homo sapiens') %>% group_by(pmid) %>% count() %>% arrange(desc(n))
pmid        n
14607935   90
20620947   66
15789058   48
16474395   48
15210650   24
22434910   20
15928199   18
16423401   18
19464196   18
19568420   16

hammer commented 6 years ago

For future efforts, here's a TSV of the MSigDB 6.1 XML metadata with only the C7 (i.e. ImmuneSigDB) gene sets: c7.tsv.zip

When loading, be sure to use col_types = cols(.default = "c") to avoid parse failures on the "None" PMID that a few gene sets have (cf. https://github.com/tidyverse/readr/issues/148#issuecomment-142428658).
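
That is, something like this (a minimal sketch, assuming the unzipped file is named c7.tsv):

library(readr)

# Read every column as character so the "None" PMIDs don't trip the type guesser
c7_meta <- read_tsv("c7.tsv", col_types = cols(.default = "c"))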

hammer commented 6 years ago

Decided to take a look at the data from the top PMID turned up in https://github.com/hammerlab/t-cell-review/issues/4#issuecomment-383351198, 14607935.

The PubMed page links to the associated GEO datasets (nice!). GDS1290 is just metadata, while GSE2770 is legit data. What's more, GEO2R will generate the R code to import this data!

Had to BiocManager::install("GEOquery") first, then noticed that line 58 of the generated script forgot to put a # before the "set group names" inline comment (weird). All of the data gets shoved into a gnarly object called gset of class ExpressionSet. Can biobroom help? We'll have to find out later; I need to go work out.

hammer commented 6 years ago

So, w/ biobroom:

library(biobroom) # tidy() methods for Bioconductor objects
library(dplyr)

gset_tidy <- tidy(gset)
gset_tidy %>% distinct(sample) %>% count()

We've lost a lot of metadata, and a lot of replicates; only GSM60348 to GSM60381 made it. I think we're going to have to tidy this thing ourselves.

hammer commented 6 years ago

After reviewing the GEO data model and how to work with S4 objects, I have a better handle on this data; gset_raw <- getGEO("GSE2770", GSEMatrix = TRUE, getGPL = FALSE) pulls down the data into a list of 3 ExpressionSets, one element for each platform: GPL96, GPL97, GPL8300. There are 34 samples from each platform.
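
A quick sanity check on that structure (a sketch; exprs() comes from Biobase):

library(Biobase) # exprs()
library(purrr)

length(gset_raw)                     # 3 ExpressionSets
map_int(gset_raw, ~ ncol(exprs(.x))) # 34 samples per platform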

To tidy up this data:

library(GEOquery)
library(tidyverse) # purrr, dplyr, tidyr, stringr
library(biobroom)  # tidy() for ExpressionSets
library(lubridate) # as.duration()

gset_raw <- getGEO("GSE2770", GSEMatrix = TRUE, getGPL = FALSE)

GSE2770_tidy_full <- gset_raw %>%
  map(~ tidy(.x, addPheno = TRUE)) %>%
  bind_rows()

GSE2770_tidy <- GSE2770_tidy_full %>%
  select(gene, sample, title, value) %>%
  separate(title, c("cells", "treatment", "time", "replicate", "platform"), sep = "[_\\(\\)]") %>%
  mutate(platform = replace_na(platform, "U95Av2")) %>%
  select(-cells) %>%
  mutate(replicate = map_chr(replicate, ~ str_split(.x, boundary("word"))[[1]][2])) %>%
  mutate(time = as.duration(time))

Note that GSM60381 has platform NA after the first call to separate; we hand-code U95Av2 based on examination of the GEO record. Another solution: pass platform during the bind_rows() call, as sketched below. We could go further and split treatment and replicate into their components. For now, let's move on.
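
That alternative might look like this (a sketch; the element order in set_names() is an assumption that should be checked against names(gset_raw)):

library(purrr)

GSE2770_tidy_full <- gset_raw %>%
  set_names(c("GPL96", "GPL97", "GPL8300")) %>% # assumed order of the list elements
  map(~ tidy(.x, addPheno = TRUE)) %>%
  bind_rows(.id = "platform_id")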

The most important next steps: understand gene and value.

Gene

Some packages that may help

Annotations for the 3 platforms from this paper

Reading the ImmuneSigDB paper, I see that they

mapped Affymetrix probe set identifiers to human gene symbols using the Collapse Dataset tool (max probe algorithm) of the GSEA program

In other words, add a column for the HUGO gene symbol, group on it, and keep the max value within each group.

Maybe the fastest way to map, after the install/load dance:

library(hgu95av2.db) # GPL8300 (HG-U95Av2)
library(hgu133a.db)  # GPL96 (HG-U133A)
library(hgu133b.db)  # GPL97 (HG-U133B)

gpl8300_tib <- as_tibble(as.data.frame(hgu95av2SYMBOL))
gpl96_tib <- as_tibble(as.data.frame(hgu133aSYMBOL))
gpl97_tib <- as_tibble(as.data.frame(hgu133bSYMBOL))
# .id = "platform" records the names as a column so the join below works
probe2symbol <- bind_rows("U95Av2" = gpl8300_tib, "U133A" = gpl96_tib, "U133B" = gpl97_tib, .id = "platform")
GSE2770_tidy_symbols <- GSE2770_tidy %>% left_join(probe2symbol, by = c("platform", "gene" = "probe_id"))

Here's another way to do it using biomaRt; each platform requires 5+ minutes to return the results from the web service, so I don't love this solution. I could easily reduce code duplication w/ iteration through a list of platforms but decided to be explicit.

library(biomaRt)

ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
searchAttributes(ensembl, "affy") # affy_hg_u95av2, affy_hg_u133a, affy_hg_u133b

gpl8300_probes <- GSE2770_tidy %>% filter(platform == "U95Av2") %>% distinct(gene) %>% .$gene
gpl8300_genes <- getBM(attributes = c('affy_hg_u95av2', 'hgnc_symbol'), filters = 'affy_hg_u95av2', values = gpl8300_probes, mart = ensembl)
as_tibble(gpl8300_genes) %>% distinct(hgnc_symbol) %>% count() # 9,707 genes vs. 13,105 probes
as_tibble(as.data.frame(hgu95av2SYMBOL[gpl8300_probes])) %>% skim # 8,585 genes vs. 11,460 probes

gpl96_probes <- GSE2770_tidy %>% filter(platform == "U133A") %>% distinct(gene) %>% .$gene
gpl96_genes <- getBM(attributes = c('affy_hg_u133a', 'hgnc_symbol'), filters = 'affy_hg_u133a', values = gpl96_probes, mart = ensembl)
# 22,737 probes, 13,926 gene symbols

gpl97_probes <- GSE2770_tidy %>% filter(platform == "U133B") %>% distinct(gene) %>% .$gene
gpl97_genes <- getBM(attributes = c('affy_hg_u133b', 'hgnc_symbol'), filters = 'affy_hg_u133b', values = gpl97_probes, mart = ensembl)

Value

Since some genes have multiple probes mapped to them, we need to take the max value from all probes mapped to a gene to be compatible with the ImmuneSigDB analysis.

# Find genes with most probes
GSE2770_tidy_symbols %>% 
  group_by(symbol) %>%
  summarise(n_probes = n_distinct(gene)) %>% 
  arrange(desc(n_probes))

# Max probe "algorithm"
GSE2770_tidy_maxprobe <- GSE2770_tidy_symbols %>% 
  filter(!is.na(symbol)) %>%
  group_by(sample, symbol) %>%
  slice(which.max(value))

Before I bother with normalization, I thought it would be fun to make some plots, inspired by Figure 1 of the paper!

ifng <- GSE2770_tidy_maxprobe %>% filter(symbol == "IFNG")
ggplot(ifng, aes(x = as.duration(time)/dhours(1), y = value, color = treatment)) + geom_point() + theme_bw()

# repeat for IL4, T-bet (can't find in data! tried TBX21, TBLYM, TBET), GATA3
# mygene.info "reporter" field doesn't have probes for TBX21 in U95A, U133A, or U133B!

(plot: IFNG expression over time, colored by treatment)

ImmuneSigDB normalized CEL files with justRMA from affy. Given that the CEL files are available for this paper, I should probably repeat this analysis with that data.
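
That step would look roughly like this (a sketch; the per-platform directory is hypothetical, since the CEL files would first need to be split by chip type):

library(affy)

# justRMA() reads CEL files and returns an RMA-normalized ExpressionSet;
# all files must come from the same chip type
eset_u133a <- justRMA(celfile.path = "GSE2770/cels/U133A", compress = TRUE)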

hammer commented 6 years ago

More information on why I couldn't find TBX21 in the previous analysis:

hammer commented 6 years ago

More information on the top 10 T cell gene signatures turned up in https://github.com/hammerlab/t-cell-review/issues/4#issuecomment-383351198:

  1. 14607935: GSE2770, Identification of novel genes regulated by IL-12, IL-4, or TGF-beta during the early polarization of CD4+ lymphocytes
  2. 20620947: GSE18017
  3. 15789058: GSE22886 (IRIS!)
  4. 16474395: GSE3982
  5. 15210650: GSE1460
  6. 22434910: GSE36476
  7. 15928199
  8. 16423401
  9. 19464196
  10. 19568420: GSE11103, another Alex Abbas paper
hammer commented 6 years ago

Okay time to work with CEL files!

library(GEOquery)
library(fs)   # dir_ls(), path_filter()
library(affy) # ReadAffy()

files.df <- getGEOSuppFiles("GSE2770") # puts file(s) into GSE2770/
files <- rownames(files.df) # just one .tar file
untar(files[1], exdir = "GSE2770/cels") # CEL files still .gz
cels <- path_filter(dir_ls("GSE2770/cels"), glob = "*.gz")
ab <- ReadAffy(filenames = cels, compress = TRUE, verbose = TRUE) # Cel file GSE2770/cels/GSM60699.CEL.gz does not seem to have the correct dimensions (crap)
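
The dimension complaint is presumably the mixed platforms; a sketch for checking each CEL's chip type with read.celfile.header() from the affyio package:

library(affyio) # read.celfile.header()
library(purrr)

# cdfName identifies the chip type, so mixed-platform file sets are easy to spot
map_chr(cels, ~ read.celfile.header(as.character(.x))$cdfName)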

Okay, I am going to have to read and normalize the CEL files from the 3 different platforms separately. It's not obvious that differential expression analysis across platforms is a good idea; cf. Cross-study validation and combined analysis of gene expression microarray data (2008).

To get the .chip files, I'll need an FTP client in R. httr doesn't do FTP and ropensci/ftp is still half-baked, so I could use download.file() from base R or RCurl. Actually, I think the curl package is best!
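
For example, a sketch with curl_download() (the FTP path is the one used in the next comment):

library(curl)

curl_download("ftp://ftp.broadinstitute.org/pub/gsea/annotations/HG_U133A.chip",
              destfile = "HG_U133A.chip")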

hammer commented 6 years ago

Let's just get the CEL files used to generate the GSE2770_IL12_AND_TGFB_ACT_VS_ACT_CD4_TCELL_6H_DN gene set.

First, I'll need to figure out which samples to use as input. I'm going to try working with the phenotype data from GEO. This involves interacting w/ ExpressionSet and AnnotatedDataFrame objects from Bioconductor, sadly.

GSE2770 <- getGEO("GSE2770") # There are 3 elements in this list, 1 for each platform
pGSE2770 <- GSE2770 %>%
  purrr::map(~ as(pData(.x), "data.frame")) %>%
  bind_rows() %>%
  as_tibble()

pGSE2770tidy <- pGSE2770 %>%
  select(platform_id, geo_accession, supplementary_file, title) %>%
  separate(title, c("cells", "treatment", "time", "replicate", "platform"), sep="[_\\(\\)]") %>%
  select(-cells) %>%
  mutate(replicate = map_chr(replicate, ~ str_split(.x, boundary("word"))[[1]][2])) %>%
  mutate(time = as.duration(time))

# get just the U133A cels
cels <- pGSE2770tidy %>%
  filter(time == duration("6h") & (treatment %in% c("antiCD3+antiCD28+IL12+TGFbeta", "antiCD3+antiCD28")) & platform == "U133A") %>%
  .$supplementary_file

ab <- ReadAffy(filenames = path("GSE2770/cels", cels, ext = "CEL.gz"), compress = TRUE, verbose = TRUE)
eset <- rma(ab)

library(limma)
library(hgu133a.db)
treatments <- factor(c(1, 2, 2, 1), labels = c("untreated", "treated")) # weird ordering of files
design <- model.matrix(~treatments)
fit <- lmFit(eset, design)
fit2 <- eBayes(fit)
# namespace the call: dplyr::select masks AnnotationDbi::select
gns <- AnnotationDbi::select(hgu133a.db, featureNames(eset), c("ENSEMBL", "SYMBOL", "GENENAME"))
gns <- gns[!duplicated(gns[, 1]), ]
fit2$genes <- gns
results <- topTable(fit2, number = Inf)

# Not sure if this is UP or DOWN but it doesn't intersect much with either one!
results %>%
  filter(!is.na(SYMBOL)) %>%
  arrange(desc(logFC)) %>%
  head(200) %>%
  arrange(SYMBOL) %>%
  .$SYMBOL
hammer commented 6 years ago

Okay let's try to figure out why there's no overlap.

  1. get the Broad's chip files
  2. do the maxprobe thing prior to the fit thing
  3. sanity check the linear model for a few genes
  4. try to compute mutual information in R to see if I can reproduce the GSEA gene sets that way

Starting with (1), grab the Broad's .chip file for U133A:

library(curl) # curl() opens an FTP connection for read_tsv()

U133A.chip <- read_tsv(curl("ftp://ftp.broadinstitute.org/pub/gsea/annotations/HG_U133A.chip"))
U133A.chip <- U133A.chip %>%
  select(-X4) %>% # drop the unnamed fourth column
  clean_names() %>%
  rowwise() %>%
  mutate(first_gene_symbol = str_trim(str_split(gene_symbol, "///")[[1]][1]))

eset_tib <- as_tibble(exprs(eset), rownames = "probe_set_id")
eset_tib_genes <- eset_tib %>% left_join(U133A.chip, by = "probe_set_id")

# Remove probesets that don't map to a gene
eset_tib_genes %>% filter(is.na(first_gene_symbol)) %>% count() # 0
eset_tib_genes %>% filter(first_gene_symbol == "---") %>% count() # 1,109
eset_tib_genes <- eset_tib_genes %>% filter(first_gene_symbol != "---")

# Collapse probesets with max probe algorithm
eset_tib_genes_only <- eset_tib_genes %>%
  select(-c(probe_set_id, gene_symbol, gene_title)) %>%
  group_by(first_gene_symbol) %>%
  summarise_all(max)

# Make a new ExpressionSet object
eset_genes <- eset_tib_genes_only %>%
  as.data.frame() %>%
  column_to_rownames(var="first_gene_symbol") %>%
  data.matrix() %>%
  ExpressionSet()

# Do DGE
fit_genes <- lmFit(eset_genes, design)
fit2_genes <- eBayes(fit_genes)
results_genes <- topTable(fit2_genes, number=Inf)
results_genes_tib <- as_tibble(results_genes, rownames = "gene")

results_genes_tib %>%
  arrange(desc(logFC)) %>%
  head(200) %>%
  arrange(gene) %>%
  .$gene
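
To quantify the overlap against the published set, something like this might work (a sketch, assuming the gene-symbol variant of the C7 .gmt, c7.all.v6.1.symbols.gmt, read with read.gmt() as before):

c7_sym <- as_tibble(read.gmt("c7.all.v6.1.symbols.gmt"))
dn_set <- c7_sym %>%
  filter(ont == "GSE2770_IL12_AND_TGFB_ACT_VS_ACT_CD4_TCELL_6H_DN") %>%
  pull(gene)

# arrange(logFC) puts the most down-regulated genes first; flip to desc(logFC)
# if the contrast turns out to be oriented the other way
top200_dn <- results_genes_tib %>% arrange(logFC) %>% head(200) %>% pull(gene)
length(intersect(top200_dn, dn_set))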
hammer commented 6 years ago

Downloading MSigDB XML files from the CLI:

curl -d "j_username=jeff.hammerbacher%40gmail.com&j_password=password" http://software.broadinstitute.org/gsea/j_spring_security_check -c - | curl -b @- -O http://software.broadinstitute.org/gsea/msigdb/download_file.jsp?filePath=/resources/msigdb/6.2/msigdb_v6.2.xml

cf. curl manpage:

-d, --data

(HTTP) Sends the specified data in a POST request to the HTTP server, in the same way that a browser does when a user has filled in an HTML form and presses the submit button. This will cause curl to pass the data to the server using the content-type application/x-www-form-urlencoded. Compare to -F, --form.

--data-raw is almost the same but does not have a special interpretation of the @ character. To post data purely binary, you should instead use the --data-binary option. To URL-encode the value of a form field you may use --data-urlencode.

If any of these options is used more than once on the same command line, the data pieces specified will be merged together with a separating &-symbol. Thus, using '-d name=daniel -d skill=lousy' would generate a post chunk that looks like 'name=daniel&skill=lousy'.

If you start the data with the letter @, the rest should be a file name to read the data from, or - if you want curl to read the data from stdin. Multiple files can also be specified. Posting data from a file named 'foobar' would thus be done with -d, --data @foobar. When --data is told to read from a file like that, carriage returns and newlines will be stripped out. If you don't want the @ character to have a special interpretation use --data-raw instead.

See also --data-binary and --data-urlencode and --data-raw. This option overrides -F, --form and -I, --head and -T, --upload-file.

-c, --cookie-jar

(HTTP) Specify to which file you want curl to write all cookies after a completed operation. Curl writes all cookies from its in-memory cookie storage to the given file at the end of operations. If no cookies are known, no data will be written. The file will be written using the Netscape cookie file format. If you set the file name to a single dash, "-", the cookies will be written to stdout.

This command line option will activate the cookie engine that makes curl record and use cookies. Another way to activate it is to use the -b, --cookie option.

If the cookie jar can't be created or written to, the whole curl operation won't fail or even report an error clearly. Using -v, --verbose will get a warning displayed, but that is the only visible feedback you get about this possibly lethal situation.

If this option is used several times, the last specified file name will be used.

-b, --cookie

(HTTP) Pass the data to the HTTP server in the Cookie header. It is supposedly the data previously received from the server in a "Set-Cookie:" line. The data should be in the format "NAME1=VALUE1; NAME2=VALUE2".

If no '=' symbol is used in the argument, it is instead treated as a filename to read previously stored cookies from. This option also activates the cookie engine which will make curl record incoming cookies, which may be handy if you're using this in combination with the -L, --location option or do multiple URL transfers on the same invoke. If the file name is exactly a minus ("-"), curl will instead read the contents from stdin.

The file format of the file to read cookies from should be plain HTTP headers (Set-Cookie style) or the Netscape/Mozilla cookie file format.

The file specified with -b, --cookie is only used as input. No cookies will be written to the file. To store cookies, use the -c, --cookie-jar option.

Exercise caution if you are using this option and multiple transfers may occur. If you use the NAME1=VALUE1; format, or in a file use the Set-Cookie format and don't specify a domain, then the cookie is sent for any domain (even after redirects are followed) and cannot be modified by a server-set cookie. If the cookie engine is enabled and a server sets a cookie of the same name then both will be sent on a future transfer to that server, likely not what you intended. To address these issues set a domain in Set-Cookie (doing that will include sub domains) or use the Netscape format.

If this option is used several times, the last one will be used.

Users very often want to both read cookies from a file and write updated cookies back to a file, so using both -b, --cookie and -c, --cookie-jar in the same command line is common.

hammer commented 6 years ago

From the MCP-counter paper:

MCP datasets and tumor datasets from Affymetrix Human Genome U133 Plus 2.0, Human Genome 133A, and HuGene 1.0 ST arrays were normalized using the frozen robust multiarray average (fRMA) method, implemented in the fRMA R package (version 1.18.0)

Looks like frma will allow us to compare across array platforms.
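
Usage would be roughly this (a sketch, assuming an AffyBatch like ab from ReadAffy() above and that frozen vectors are available for the platform):

library(frma)

# frma() is a drop-in replacement for rma(): it normalizes each array against
# frozen reference distributions, so values are comparable across batches
eset_frma <- frma(ab)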