Closed komalsrathi closed 2 years ago
Yes, it makes sense to me. You can subset the LINCS GESS result to compounds you are interested in.
Thanks, Yuzhu
Thanks, I'll close this now and open if I face any issues.
@yduan004
Hi, I would like to reopen this. Searching the full LINCS space takes a really long time. Referring to the manual and looking at dtlink_db_clue_sti, can you please tell me whether eh[["EH3228"]] includes the touchstone database from clue.io? It looks like a much smaller file compared to the LINCS database, i.e. eh[["EH3226"]].
Hi Komalsrathi,
The eh[["EH3228"]] database contains drug-target interaction annotations. To minimize noise, it is filtered so that drugs with more than 100 distinct targets are removed; similarly, targets with more than 100 annotated drugs are excluded. The drugs in this database therefore overlap with the touchstone compounds, but it also contains non-touchstone compounds. For signature searching with touchstone compounds, I recommend the more recent version of the database, which is available here: query(eh, c("signatureSearchData", "lincs2")); eh[["EH7297"]]. This database contains only touchstone treatments from the LINCS database.
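The filtering rule described above can be illustrated on a toy table. This is only a minimal sketch in base R, not the code used to build EH3228: the column names `drug` and `target`, the helper `filter_promiscuous`, and the choice to compute both counts on the unfiltered table are all assumptions made for illustration.

```r
# Toy drug-target table; column names `drug` and `target` are assumptions.
dt <- data.frame(
  drug   = c("drugA", "drugA", "drugA", "drugB", "drugB", "drugC"),
  target = c("T1", "T2", "T3", "T1", "T2", "T1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop drugs with too many distinct targets and
# targets with too many annotated drugs. Both counts are taken on the
# de-duplicated input table (an assumption about the order of filtering).
filter_promiscuous <- function(dt, max_targets = 100, max_drugs = 100) {
  dt <- unique(dt[, c("drug", "target")])
  n_targets <- table(dt$drug)    # distinct targets per drug
  n_drugs   <- table(dt$target)  # distinct drugs per target
  keep_drugs   <- names(n_targets)[as.vector(n_targets) <= max_targets]
  keep_targets <- names(n_drugs)[as.vector(n_drugs) <= max_drugs]
  dt[dt$drug %in% keep_drugs & dt$target %in% keep_targets, , drop = FALSE]
}

# Small thresholds so the effect is visible on the toy data
filtered <- filter_promiscuous(dt, max_targets = 2, max_drugs = 2)
print(filtered)
```

With the thresholds lowered to 2, drugA (three targets) and T1 (three drugs) are both removed, which mirrors how highly promiscuous drugs and targets would be dropped at the real threshold of 100.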
Regards, Brendan
Based on the description, I would assume LINCS2 (EH7297) to be much smaller than LINCS (EH3226): since it contains only touchstone data from the LINCS db, it should be a subset of LINCS. But when I look at the cache files, it is the opposite.
> lincs
EH3226
"/Users/rathik/Library/Caches/org.R-project.R/R/ExperimentHub/74d257f281e_3242"
> lincs2
EH7297
"/Users/rathik/Library/Caches/org.R-project.R/R/ExperimentHub/143d979e018ae_7347"
> dtlink_db_clue_sti
EH3228
"/Users/rathik/Library/Caches/org.R-project.R/R/ExperimentHub/74d24d68ff91_3244"
# size of lincs
du -sh /Users/rathik/Library/Caches/org.R-project.R/R/ExperimentHub/74d257f281e_3242
2.3G /Users/rathik/Library/Caches/org.R-project.R/R/ExperimentHub/74d257f281e_3242
# size of lincs2
du -sh /Users/rathik/Library/Caches/org.R-project.R/R/ExperimentHub/143d979e018ae_7347
5.8G /Users/rathik/Library/Caches/org.R-project.R/R/ExperimentHub/143d979e018ae_7347
# size of dtlink_db_clue_sti
du -sh /Users/rathik/Library/Caches/org.R-project.R/R/ExperimentHub/74d24d68ff91_3244
5.4M /Users/rathik/Library/Caches/org.R-project.R/R/ExperimentHub/74d24d68ff91_3244
Hi Komalsrathi,
Yes, that is correct. The new release of the LINCS database contains a more comprehensive selection of drugs across different treatments and cell types. Therefore, although we have filtered the database (eh[["EH7297"]]) to contain only touchstone treatments, it is larger than the previous version. This is the database I recommend using for signatureSearching.
Regards, Brendan
Thanks a lot for the explanation, I'll use EH7297 going forward.
Hi komalsrathi,
Great! Thanks for reaching out!
Regards, Brendan
Hi @komalsrathi,
The signatureSearchData vignette has a detailed description of the LINCS and LINCS2 databases and how many signature entries they contain. The 2429 drugs in the Touchstone database are just a subset of them. dtlink_db_clue_sti is not a signature database; it contains drug-target annotation tables obtained from the DrugBank, CLUE and STITCH databases. If you do not want to spend a lot of time searching the full LINCS/LINCS2 database (LINCS2 is much larger still than LINCS) and only want to search a subset of the signatures of interest in the reference database to reduce the run time, you could try the following code and modify it to suit your research purpose. Basically, the gess_* functions have a ref_trts argument that allows users to search a subset of the refdb.
I hope the following code makes sense to you and is what you ask for. If you have any questions, please feel free to contact us.
library(signatureSearch); library(ExperimentHub); library(HDF5Array)
eh <- ExperimentHub()
lincs_h5 <- eh[['EH3226']]
lincs_db <- SummarizedExperiment(HDF5Array(lincs_h5, name="assay"))
rownames(lincs_db) <- HDF5Array(lincs_h5, name="rownames")
colnames(lincs_db) <- HDF5Array(lincs_h5, name="colnames")
dim(lincs_db) # 12328 x 45956
## Get a subset of the treatments of interest from the large reference database to reduce search time
## here I randomly sampled 100 treatments as a demonstration
ref_trts <- sample(colnames(lincs_db), 100)
head(ref_trts) # [1] "BRD-A06641369__A549__trt_cp" "roquinimex__A549__trt_cp" "BRD-K84447176__PC3__trt_cp" ...
## generate a query signature as a demonstration
query_mat <- as.matrix(assay(lincs_db[,"vorinostat__SKB__trt_cp"]))
query <- as.numeric(query_mat); names(query) <- rownames(query_mat)
upset <- head(names(query[order(-query)]), 150)
head(upset)
downset <- tail(names(query[order(-query)]), 150)
head(downset)
## Search a subset of the LINCS database by using `ref_trts` argument
qsig_demo <- qSig(query=list(upset=upset, downset=downset), gess_method="LINCS",
refdb="lincs")
res <- gess_lincs(qsig_demo, sortby="NCS", tau=FALSE, ref_trts=ref_trts)
result(res)
## The above process also applies to the LINCS2 database
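The way the demo above derives the up and down gene sets can be seen more easily on a toy vector. This is a minimal base R sketch with made-up gene names and scores; it takes the top and bottom 2 genes instead of 150.

```r
# Made-up expression scores for five hypothetical genes
query <- c(g1 = 2.1, g2 = -0.5, g3 = 3.0, g4 = -1.8, g5 = 0.2)

# Rank gene names from most up-regulated to most down-regulated
ranked <- names(query[order(-query)])

# Top of the ranking gives the up set, bottom gives the down set
upset   <- head(ranked, 2)
downset <- tail(ranked, 2)

print(upset)
print(downset)
```

The same pattern with `150` in place of `2` gives the label sets passed to `qSig()` in the code above.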
@yduan004 Thanks for the explanation, looks like I needed to use the ref_trts parameter.
I am currently getting the touchstone compounds like this:
data('clue_moa_list')
touchstone_data <- unlist(clue_moa_list)
Does this look right? If not, where can I download the Touchstone database?
Reopening as I am still unsure how to subset the EH3226 db to touchstone compounds only. The signatureSearch package comes with clue_moa_list, but I am not sure if that is the correct db, as it has 2384 drugs instead of the 2429 suggested in the comment above.
I am thinking something like this but unsure if this is correct:
# load packages and get LINCS data
library(signatureSearch); library(ExperimentHub); library(HDF5Array)
eh <- ExperimentHub()
lincs <- eh[["EH3226"]]
lincs_db <- SummarizedExperiment(HDF5Array(lincs, name="assay"))
rownames(lincs_db) <- HDF5Array(lincs, name="rownames")
colnames(lincs_db) <- HDF5Array(lincs, name="colnames")
# strip everything after the drug name so it can be matched with touchstone compounds
drug_names <- gsub("__.*", "", colnames(lincs_db))
# MOA terms to drug name mappings obtained from Touchstone database at CLUE website
data('clue_moa_list')
touchstone_data <- unique(unlist(clue_moa_list))
# keep only the reference treatments whose drug name is a touchstone compound
ref_trts <- colnames(lincs_db)[drug_names %in% touchstone_data]
Hi @komalsrathi ,
clue_moa_list contains only the Touchstone drugs that have MOA annotations, so the number is slightly less than 2429. You can obtain the full Touchstone drug table by downloading it from the CLUE website, as shown in the following image, and then subset the refdb columns using your code above.
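Once a Touchstone table has been downloaded, the matching step could look roughly like the sketch below. It uses toy data so it runs on its own: the `touchstone` data frame stands in for the downloaded table, its column name `Name` is an assumption (check the actual file header), and the case-insensitive comparison is a design choice rather than something the package requires.

```r
# Toy stand-in for colnames(lincs_db); real IDs follow "drug__cell__trt_cp"
ref_cols <- c("vorinostat__SKB__trt_cp",
              "BRD-A06641369__A549__trt_cp",
              "roquinimex__A549__trt_cp",
              "Vorinostat__PC3__trt_cp")

# Toy stand-in for the downloaded Touchstone table; `Name` is an assumed column
touchstone <- data.frame(Name = c("vorinostat", "Roquinimex", "sirolimus"),
                         stringsAsFactors = FALSE)

# Drug name is everything before the first "__"
drug_of <- sub("__.*$", "", ref_cols)

# Case-insensitive match, so capitalization differences do not drop a drug
keep <- tolower(drug_of) %in% tolower(touchstone$Name)
ref_trts <- ref_cols[keep]

# Touchstone names with no matching reference treatment, worth inspecting
unmatched <- setdiff(tolower(touchstone$Name), tolower(drug_of))

print(ref_trts)
print(unmatched)
```

The resulting `ref_trts` vector is what would be passed to the `ref_trts` argument of the gess_* functions. Checking `unmatched` is a quick way to spot naming differences between the Touchstone table and the reference database.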
Thank you!
Hi,
How would I go about reducing the LINCS search space to the touchstone set of compounds (https://clue.io/connectopedia/the_touchstone_dataset)? The idea is to cut down the drug signatures that are queried up front to only well-characterized compounds.
Do you think something like this would make sense?
Thanks!