Query gene sets with no shared gene identifiers with reference database

enricoferrero commented 4 years ago

Hi @yduan004,

It's me again :)

I have a large set of queries, and some of them contain a gene set that shares no gene identifiers with the reference database, e.g.:

# this works
> query1 <- list(upset=c("23645", "5290"), downset=c("54957", "2767"))
> qsig1 <- qSig(query = query1, gess_method = "LINCS", refdb = cmap)
2 / 2 genes in up set share identifiers with reference database
2 / 2 genes in down set share identifiers with reference database

# this doesn't work
> query2 <- list(upset=c("23645", "5290"), downset=c("7117"))
> qsig2 <- qSig(query = query2, gess_method = "LINCS", refdb = cmap)
2 / 2 genes in up set share identifiers with reference database
0 / 1 genes in down set share identifiers with reference database
Error in qSig(query = query2, gess_method = "LINCS", refdb = cmap) : 
  downset shares zero idenifiers with reference database, 
               please set downset of 'qsig' slot as NULL

I followed the suggestion in the error message, but that gives me another error:

> query2$downset <- NULL
> qsig2 <- qSig(query = query2, gess_method = "LINCS", refdb = cmap)
Error in query[[2]] : subscript out of bounds
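
The `subscript out of bounds` error is consistent with how base R handles `NULL` assignment: `query2$downset <- NULL` deletes the element instead of storing a `NULL`, so the list has length one and `query[[2]]` no longer exists. The sketch below only demonstrates that base R behavior. Whether this version of qSig() accepts the second form was not tested in this thread, so treat that as an assumption.

```r
query2 <- list(upset = c("23645", "5290"), downset = c("7117"))

# Assigning NULL with $ (or [[ ]]) removes the element entirely.
dropped <- query2
dropped$downset <- NULL
length(dropped)      # 1
names(dropped)       # "upset"

# Single-bracket assignment of list(NULL) keeps a slot that holds NULL.
kept <- query2
kept["downset"] <- list(NULL)
length(kept)         # 2
is.null(kept[[2]])   # TRUE
```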

I even tried a few different things, such as setting it to NA, an empty character vector or an empty string, but nothing works:

> query2$downset <- NA
> qsig2 <- qSig(query = query2, gess_method = "LINCS", refdb = cmap)
Error in qSig(query = query2, gess_method = "LINCS", refdb = cmap) : 
  downset of 'qsig' slot needs to be ID character vector or NULL

> query2$downset <- character()
> qsig2 <- qSig(query = query2, gess_method = "LINCS", refdb = cmap)
2 / 2 genes in up set share identifiers with reference database
0 / 0 genes in down set share identifiers with reference database
Error in qSig(query = query2, gess_method = "LINCS", refdb = cmap) : 
  downset shares zero idenifiers with reference database, 
               please set downset of 'qsig' slot as NULL

> query2$downset <- ""
> qsig2 <- qSig(query = query2, gess_method = "LINCS", refdb = cmap)
2 / 2 genes in up set share identifiers with reference database
0 / 1 genes in down set share identifiers with reference database
Error in qSig(query = query2, gess_method = "LINCS", refdb = cmap) : 
  downset shares zero idenifiers with reference database, 
               please set downset of 'qsig' slot as NULL

So, how should I format gene sets that don't share gene identifiers with the reference database?

Ideally, this would be handled gracefully by the qSig() or gess_lincs() functions, which should recognize that there are no identifiers in common and act accordingly, without throwing an error.

To know how many identifiers are in common, I first have to get all the identifiers in the reference database, remove from my query the ones that are not in it, and then set the empty gene sets to NULL, which unfortunately doesn't work, as described above.
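
The pre-filtering workaround described above could be sketched in base R roughly as follows. `filter_query` is a hypothetical helper, not part of signatureSearch, and `ref_ids` is a made-up stand-in for the gene identifiers of the reference database; how to extract those from `cmap` is not shown in this thread.

```r
# Hypothetical helper, not part of signatureSearch: keep only the IDs found
# in `ref_ids` and store NULL for any gene set that ends up empty.
filter_query <- function(query, ref_ids) {
  out <- vector("list", length(query))
  names(out) <- names(query)
  for (nm in names(query)) {
    shared <- intersect(as.character(query[[nm]]), ref_ids)
    # out[nm] <- list(...) keeps the slot even when the value is NULL
    out[nm] <- list(if (length(shared) > 0) shared else NULL)
  }
  out
}

# Stand-in for the reference database's gene IDs (assumed values).
ref_ids <- c("23645", "5290", "54957", "2767")

query2 <- list(upset = c("23645", "5290"), downset = c("7117"))
filtered <- filter_query(query2, ref_ids)
str(filtered)
```

The result still has two elements, so positional access such as `filtered[[2]]` returns `NULL` rather than failing.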

Thank you!

yduan004 commented 4 years ago

Oh, it seems there is a bug in the qSig function. I will fix it and let you know as soon as possible. Thanks for pointing it out!

yduan004 commented 4 years ago

Hi Enrico,

I have fixed the qSig function. It now issues a warning when the upset or downset shares zero identifiers with the reference database and automatically sets that gene set to NULL, so you don't need to check for this yourself. I hope it works well for your cases.
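
As an illustration of the behavior described here (warn and fall back to NULL instead of stopping), a minimal sketch might look like the following. `check_set` and `ref_ids` are hypothetical names used only for this example, and the warning text is invented. This is not the actual qSig() source.

```r
# Hypothetical illustration of the described behavior; not the qSig() source.
check_set <- function(ids, ref_ids, label) {
  shared <- intersect(ids, ref_ids)
  message(length(shared), " / ", length(ids), " genes in ", label,
          " share identifiers with reference database")
  if (length(shared) == 0) {
    # Warn instead of stopping, then hand back NULL for this gene set.
    warning(label, " shares zero identifiers with reference database, ",
            "setting it to NULL", call. = FALSE)
    return(NULL)
  }
  shared
}

ref_ids <- c("23645", "5290", "54957", "2767")  # assumed stand-in IDs
up   <- check_set(c("23645", "5290"), ref_ids, "up set")
down <- check_set(c("7117"), ref_ids, "down set")  # warns, returns NULL
```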

I have pushed the update to this GitHub repository as well as to Bioconductor's development branch. The new version, 1.1.5, should be available from Bioconductor's development branch in one or two days, and you can install it from there then. If you want the update immediately, you can install the newest version from this GitHub repository.

If you have any further problems, please let me know.

Thanks for your hard work and patience!

Yuzhu

enricoferrero commented 4 years ago

Thank you @yduan004! I installed the latest version from GitHub and I can confirm it works. It's a much better user experience as I can now provide any gene sets and it's all handled gracefully by the function.

girke-lab / signatureSearch, issue #3