Clue.io Touchstone dataset

komalsrathi commented 2 years ago

Hi,

How would I go about reducing the LINCS search space to the Touchstone set of compounds (https://clue.io/connectopedia/the_touchstone_dataset)? The idea is to cut down the drug signatures that are queried up front to only well-characterized compounds.

Do you think something like this would make sense?

eh <- ExperimentHub()
lincs <- eh[["EH3226"]]

qSig_output <- qSig(query = query, gess_method="LINCS", refdb = lincs)
qSig_output <- gess_lincs(qSig = qSig_output, sortby = "NCS", tau = TRUE, workers = 1)
qSig_output <- result(qSig_output)

# touchstone dataset
data('clue_moa_list')
touchstone_data <- unlist(clue_moa_list)

# filter drugs
qSig_output <- qSig_output %>% filter(pert %in% touchstone_data)

Thanks!

yduan004 commented 2 years ago

Yes, it makes sense to me. You can subset the LINCS GESS result to compounds you are interested in.

Thanks, Yuzhu
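
For reference, the post hoc filter described above can be sketched in base R on a toy result table. The `pert` column name follows the snippet in the question; the table contents and the `touchstone_drugs` vector are made up for illustration and are not real GESS output.

```r
# Toy stand-in for result(gess_lincs(...)); values are illustrative only
gess_res <- data.frame(
  pert = c("vorinostat", "BRD-A06641369", "roquinimex", "trichostatin-a"),
  cell = c("SKB", "A549", "A549", "PC3"),
  NCS  = c(1.8, 0.4, -0.2, 1.5),
  stringsAsFactors = FALSE
)

# Hypothetical Touchstone compound names (in practice, e.g. unlist(clue_moa_list))
touchstone_drugs <- c("vorinostat", "trichostatin-a", "roquinimex")

# Keep only rows whose perturbagen is a Touchstone compound
touchstone_res <- gess_res[gess_res$pert %in% touchstone_drugs, ]
print(touchstone_res)
```

Filtering after the search narrows the result table but does not shorten the search itself, since every reference signature has already been scored by that point.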

komalsrathi commented 2 years ago

Thanks, I'll close this now and reopen it if I run into any issues.

komalsrathi commented 2 years ago

@yduan004

Hi, I would like to reopen this. Searching the full LINCS database takes a really long time. Referring to the manual and looking at dtlink_db_clue_sti, can you please tell me whether eh[["EH3228"]] includes the Touchstone database from clue.io? It looks like a much smaller file than the LINCS database, i.e. eh[["EH3226"]].

brendangongol commented 2 years ago

Hi Komalsrathi,

The eh[["EH3228"]] database contains drug-target interaction annotations. To minimize noise, this database is filtered such that drugs with more than 100 distinct targets are removed. Similarly, targets with more than 100 annotated drugs are excluded. Therefore, the drugs included in this database overlap with Touchstone compounds, but it also contains non-Touchstone compounds. Regarding the use of Touchstone compounds for signature searching, I would recommend using the more recent version of the database, which is available here: query(eh, c("signatureSearchData", "lincs2")); eh[["EH7297"]]. This database contains only Touchstone treatments in the LINCS database.

Regards, Brendan
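
As an illustration of the kind of filter described above, here is a minimal base R sketch on a toy drug-target table. The function name `filter_promiscuous`, the column names, and the choice to apply both thresholds to the original counts in one pass are assumptions made for this example; the actual procedure used to build EH3228 may differ.

```r
# Toy drug-target annotation table; names are illustrative only
dt <- data.frame(
  drug   = c(rep("drugA", 3), rep("drugB", 2), "drugC"),
  target = c("T1", "T2", "T3", "T1", "T2", "T1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop drugs with too many targets and
# targets with too many drugs (counts taken on the unfiltered table)
filter_promiscuous <- function(dt, max_targets = 100, max_drugs = 100) {
  dt <- unique(dt[, c("drug", "target")])
  n_targets <- table(dt$drug)    # distinct targets per drug
  n_drugs   <- table(dt$target)  # distinct drugs per target
  keep_drug   <- names(n_targets)[n_targets <= max_targets]
  keep_target <- names(n_drugs)[n_drugs <= max_drugs]
  dt[dt$drug %in% keep_drug & dt$target %in% keep_target, , drop = FALSE]
}

# Small thresholds so the effect is visible on the toy table
filtered <- filter_promiscuous(dt, max_targets = 2, max_drugs = 2)
print(filtered)
```

With the thresholds set to 2, drugA (three targets) and T1 (three drugs) are dropped, leaving only the drugB/T2 pair.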

komalsrathi commented 2 years ago

Based on the description, I would assume LINCS2 (EH7297) to be much smaller than LINCS (EH3226), since it only contains the Touchstone data in the LINCS db and should therefore be a subset of LINCS. But when I look at the cache files, it is the opposite.

> lincs
                                                                         EH3226 
"/Users/rathik/Library/Caches/org.R-project.R/R/ExperimentHub/74d257f281e_3242" 

> lincs2
                                                                           EH7297 
"/Users/rathik/Library/Caches/org.R-project.R/R/ExperimentHub/143d979e018ae_7347" 

> dtlink_db_clue_sti
                                                                          EH3228 
"/Users/rathik/Library/Caches/org.R-project.R/R/ExperimentHub/74d24d68ff91_3244" 

# size of lincs
du -sh /Users/rathik/Library/Caches/org.R-project.R/R/ExperimentHub/74d257f281e_3242
2.3G    /Users/rathik/Library/Caches/org.R-project.R/R/ExperimentHub/74d257f281e_3242

# size of lincs2
du -sh /Users/rathik/Library/Caches/org.R-project.R/R/ExperimentHub/143d979e018ae_7347
5.8G    /Users/rathik/Library/Caches/org.R-project.R/R/ExperimentHub/143d979e018ae_7347

# size of dtlink_db_clue_sti
du -sh /Users/rathik/Library/Caches/org.R-project.R/R/ExperimentHub/74d24d68ff91_3244
5.4M    /Users/rathik/Library/Caches/org.R-project.R/R/ExperimentHub/74d24d68ff91_3244

brendangongol commented 2 years ago

Hi Komalsrathi,

Yes, that is correct. The new release of the LINCS database contains a more comprehensive selection of drugs across different treatments and cell types. Therefore, although we have filtered the database (eh[["EH7297"]]) to contain only touchstone treatments, it is larger than the previous version. This is the database I recommend using for signatureSearching.

Regards, Brendan

komalsrathi commented 2 years ago

Thanks a lot for the explanation, I'll use EH7297 going forward.

brendangongol commented 2 years ago

Hi komalsrathi,

Great! Thanks for reaching out!

Regards, Brendan

yduan004 commented 2 years ago

Hi @komalsrathi, the signatureSearchData vignette has a detailed description of the LINCS and LINCS2 databases and how many signature entries they contain. The 2429 drugs in the Touchstone database are just a subset of them. The dtlink_db_clue_sti is not a signature database; it contains drug-target annotation tables obtained from the DrugBank, CLUE and STITCH databases.

If you do not want to spend a lot of time searching the full LINCS/LINCS2 database (LINCS2 is much larger than LINCS) and only want to search a subset of signatures of interest in the reference database to reduce the run time, you could try running the following code and modify it to suit your research purpose. Basically, there is a ref_trts argument in the gess_* functions that allows users to search a subset of the refdb.

I hope the following code makes sense to you and is what you asked for. If you have any questions, please feel free to contact us.

library(signatureSearch); library(ExperimentHub); library(HDF5Array)
eh <- ExperimentHub()
lincs_h5 <- eh[['EH3226']] 
lincs_db <- SummarizedExperiment(HDF5Array(lincs_h5, name="assay"))
rownames(lincs_db) <- HDF5Array(lincs_h5, name="rownames")
colnames(lincs_db) <- HDF5Array(lincs_h5, name="colnames")
dim(lincs_db) # 12328 x 45956
## Get a subset of the treatments of interest from the large reference database to reduce search time
## here I randomly sampled 100 treatments as a demonstration
ref_trts <- sample(colnames(lincs_db), 100)
head(ref_trts) # [1] "BRD-A06641369__A549__trt_cp" "roquinimex__A549__trt_cp" "BRD-K84447176__PC3__trt_cp" ...
## generate a query signature as a demonstration
query_mat <- as.matrix(assay(lincs_db[,"vorinostat__SKB__trt_cp"]))
query <- as.numeric(query_mat); names(query) <- rownames(query_mat)
upset <- head(names(query[order(-query)]), 150)
head(upset)
downset <- tail(names(query[order(-query)]), 150)
head(downset)
## Search a subset of the LINCS database by using `ref_trts` argument
qsig_demo <- qSig(query=list(upset=upset, downset=downset), gess_method="LINCS", 
                  refdb="lincs")
res <- gess_lincs(qsig_demo, sortby="NCS", tau=FALSE, ref_trts=ref_trts)
result(res)

## The above process also applies to the LINCS2 database
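
Instead of a random sample, `ref_trts` can be chosen by drug name or cell type, because the treatment labels shown above follow a `drug__cell__type` pattern. The sketch below works on a few toy labels so that it runs without downloading the database; `drugs_of_interest` and the PC3 restriction are hypothetical choices for illustration.

```r
# Toy treatment labels in the drug__cell__type form shown above
trts <- c("vorinostat__SKB__trt_cp", "vorinostat__PC3__trt_cp",
          "roquinimex__A549__trt_cp", "BRD-K84447176__PC3__trt_cp")

# Split each label into its three fields
parts <- do.call(rbind, strsplit(trts, "__", fixed = TRUE))
trt_tab <- data.frame(trt = trts, drug = parts[, 1], cell = parts[, 2],
                      type = parts[, 3], stringsAsFactors = FALSE)

# Hypothetical selection: chosen drugs, restricted to one cell type
drugs_of_interest <- c("vorinostat", "roquinimex")
ref_trts <- trt_tab$trt[trt_tab$drug %in% drugs_of_interest &
                          trt_tab$cell == "PC3"]
print(ref_trts)
```

With the real database, `trts` would be `colnames(lincs_db)`, and the resulting vector can be passed to the `ref_trts` argument as in the code above.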

komalsrathi commented 2 years ago

@yduan004 Thanks for the explanation, looks like I needed to use the ref_trts parameter.

I am currently getting the touchstone compounds like this:

data('clue_moa_list')
touchstone_data <- unlist(clue_moa_list)

Does this look right? If not, where can I download the Touchstone database?

komalsrathi commented 2 years ago

Reopening as I am still unsure how to subset the EH3226 db to Touchstone compounds only. The signatureSearch package comes with clue_moa_list, but I am not sure if that is the correct db, as it has 2384 drugs instead of the 2429 mentioned in the comment above.

I am thinking something like this but unsure if this is correct:

# get LINCS data
library(signatureSearch); library(ExperimentHub); library(HDF5Array)
eh <- ExperimentHub()
lincs <- eh[["EH3226"]]
lincs_db <- SummarizedExperiment(HDF5Array(lincs, name="assay"))
rownames(lincs_db) <- HDF5Array(lincs, name="rownames")
colnames(lincs_db) <- HDF5Array(lincs, name="colnames")
ref_drugs <- gsub("__.*", "", colnames(lincs_db)) # strip cell type and treatment type so names can be matched with touchstone compounds

# MOA terms to drug name mappings obtained from Touchstone database at CLUE website
data('clue_moa_list')
touchstone_data <- unique(unlist(clue_moa_list)) # drug names across all MOA terms
ref_trts <- colnames(lincs_db)[ref_drugs %in% touchstone_data]

yduan004 commented 2 years ago

Hi @komalsrathi ,

The clue_moa_list contains only Touchstone drugs with MOA annotations, so the number is slightly less than 2429. You could obtain the full Touchstone drug table by downloading it from the CLUE website, and then subset the refdb columns using your code above.
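
A minimal sketch of that matching step is given below. It assumes the downloaded table is tab-delimited and has a drug name column called `name`; both are assumptions, so check the header of the real file. A toy table is written to a temporary file here so the example runs on its own.

```r
# Write a toy stand-in for the downloaded Touchstone table (tab-delimited).
# The column name "name" is an assumption; check the real file's header.
tmp <- tempfile(fileext = ".txt")
writeLines(c("name\ttype", "Vorinostat\tCP", "roquinimex\tCP",
             "madeupdrug\tCP"), tmp)
touchstone_tab <- read.delim(tmp, stringsAsFactors = FALSE)

# Toy reference column names in the drug__cell__type form
ref_cols <- c("vorinostat__SKB__trt_cp", "roquinimex__A549__trt_cp",
              "BRD-K84447176__PC3__trt_cp")
ref_drugs <- sub("__.*$", "", ref_cols)

# Compare case-insensitively so capitalisation differences do not drop drugs
is_touchstone <- tolower(ref_drugs) %in% tolower(touchstone_tab$name)
ref_trts <- ref_cols[is_touchstone]

# Touchstone names with no matching reference column are worth inspecting
unmatched <- setdiff(tolower(touchstone_tab$name), tolower(ref_drugs))
print(ref_trts)
print(unmatched)
```

Looking at `unmatched` is a quick way to spot naming differences between the downloaded table and the reference database before running the search.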

komalsrathi commented 2 years ago

Thank you!

girke-lab / signatureSearch

Clue.io Touchstone dataset #10