Error in vapply(seq_along(esout), function(i) unlist(strsplit(as.character(names(esout)[i])

enricoferrero commented 4 years ago

Hi Yuzhu,

This is not really an issue, more of a request for help.

I created a new reference database from the LINCS data on GEO because I wanted to include both GSE92742 and GSE70138 and keep the genetic perturbations. I then filtered this and ended up with an HDF5 file of 166K signatures/columns and the same number of genes/rows as the 'lincs' object.

I inspected the object and from what I can see it looks exactly like the 'lincs' object, just with more signatures.

However, when I run the gess_lincs() function, I get this error and warning:

Error in vapply(seq_along(esout), function(i) unlist(strsplit(as.character(names(esout)[i]),  : 
  values must be length 3,
 but FUN(X[[1]]) result is length 1
In addition: Warning message:
In .lincsScores(esout = ESout, upset = upset, downset = downset,  :
  QueryDB and tauRefDB differ by 100% of their entries. Accurate tau computation requires close to 0% divergence.

Have you by any chance seen this before and would you know what might be causing it?

Thank you! Enrico

yduan004 commented 4 years ago

Hi Enrico,

I think the error is probably caused by the column names of the matrix not being in the expected format. The column names should have this structure: (drug)__(cell)__(factor), e.g. sirolimus__MCF7__trt_cp. This format is flexible enough to encode most perturbation types of biological samples. For example, gene knockdown or overexpression treatments can be specified by assigning the ID of the affected gene to drug, and ko or ov to factor, respectively. A knockdown treatment would look like this: P53__MCF7__ko. For more details, please refer to the help file of the build_custom_db function in the signatureSearch package.

Could you please try to change the column names of your matrix/data.frame to the proper format and try again? If you have any further problems, please feel free to ask me.
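A quick way to spot offending names before running the search is sketched below. It is only an illustration in base R: check_sig_names is a hypothetical helper, not a signatureSearch function, and the example names are made up. It assumes the three fields are separated by a double underscore, as described above, and would be run on the column names of the matrix before it is written to HDF5.

```r
# Hypothetical helper, not part of signatureSearch: return the names that do
# not split into exactly three non-empty parts on the "__" separator.
check_sig_names <- function(nms) {
  parts <- strsplit(nms, "__", fixed = TRUE)
  ok <- vapply(parts, function(p) length(p) == 3L && all(nzchar(p)), logical(1))
  nms[!ok]
}

# Made-up example names; the last one uses single underscores and lacks a factor.
example_names <- c("sirolimus__MCF7__trt_cp", "P53__MCF7__ko", "sirolimus_MCF7")
bad_names <- check_sig_names(example_names)
print(bad_names)
```

An empty result means every name has the three expected fields. A name that yields a single field after splitting would be consistent with the "result is length 1" part of the error message.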

Thanks, Yuzhu

enricoferrero commented 4 years ago

Thanks Yuzhu!

I can confirm that formatting the column names as you suggested gets rid of that error. I still get the following warnings though:

1: In .lincsScores(esout = ESout, upset = upset, downset = downset,  ... :
  QueryDB and tauRefDB differ by 32% of their entries. Accurate tau computation requires close to 0% divergence. 

2: In sign(ncs_query_list[[x]]) * 100/ncol(tmpDF) * rowSums(abs(tmpDF) <  ... :
  longer object length is not a multiple of shorter object length

I also note that I have a few NAs in the Tau column of the result tibble of the gessResult object. Should I worry about these warnings?

More generally, I find the required format for column names to be too restrictive, and I'm wondering whether you are considering any alternatives.

In my case, I would like to test my queries against multiple signatures obtained with the same perturbagen in the same cell line, but with different time points and doses. With the current naming scheme, I have no way to distinguish between such signatures. Indeed, I have multiple entries with the same perturbagen__cell__pert_type name in the final HDF5 object (which luckily doesn't seem to be a problem).

Perhaps one option could be to stick to the original CMap/LINCS sig_id and then use an annotation object to recover all the information about the perturbations. Another could be to also include the dose and time in the required naming structure.

yduan004 commented 4 years ago

Hi Enrico,

You can ignore the first warning message. Accurate Tau scores are only available for the pre-built lincs database, because the tauRefDB was computed from the lincs db; any other reference database will raise this warning.

The second warning message may be raised because your reference database has duplicated column names, since you have multiple time and dose samples for the same pert and cell line. One workaround is to add the time and dose information to the perturbation name, for example sirolimus_24h_10um__MCF7__trt_cp and sirolimus_24h_20um__MCF7__trt_cp. This way, the pert column of the GESS result table shows the specific sample, including dose and time, instead of just a compound name.
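As a rough sketch of that renaming step in base R: the annotation table and its column names (pert, time, dose, cell, type) are assumptions for illustration only, not an object provided by signatureSearch. The idea is to build the names from per-signature metadata and confirm they are unique before assigning them as column names.

```r
# Hypothetical per-signature annotation; all column names are assumptions.
anno <- data.frame(
  pert = c("sirolimus", "sirolimus", "sirolimus"),
  time = c("24h", "24h", "24h"),
  dose = c("10um", "20um", "10um"),
  cell = c("MCF7", "MCF7", "PC3"),
  type = "trt_cp",
  stringsAsFactors = FALSE
)

# Original scheme: pert__cell__type, which collides for repeated treatments.
old_names <- paste(anno$pert, anno$cell, anno$type, sep = "__")

# Workaround: fold time and dose into the first field with single underscores,
# so that "__" still separates exactly three fields.
first_field <- paste(anno$pert, anno$time, anno$dose, sep = "_")
new_names <- paste(first_field, anno$cell, anno$type, sep = "__")

print(anyDuplicated(old_names) > 0)   # are there collisions in the old names?
print(anyDuplicated(new_names) == 0)  # are the new names unique?
print(new_names)
```

The separator used inside the first field only needs to avoid the double underscore; a perturbation name that itself contains "__" would need to be cleaned first.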

I am sorry that you need to rely on this workaround for now. As a long-term solution, I will improve how the package organizes and handles metadata. Thanks for bringing it up and giving me valuable suggestions!

Thanks, Yuzhu

enricoferrero commented 4 years ago

Thanks Yuzhu! I will try that and report back. Can I still rely on the Tau calculation though? I find it a useful metric but if it's not reliable when calculated on a different database then I might just set tau = FALSE as it's considerably faster.

yduan004 commented 4 years ago

Hi Enrico,

Thanks for trying! Since QueryDB and tauRefDB differ by 32% of their entries, I don't think the Tau scores are accurate enough. You could set tau = FALSE to make the search faster. If you really want Tau scores, you would need to generate your own tauRefDB. That is very time-consuming, because your reference database has many more entries and you need to run a GESS search for each of them. If you have any questions, feel free to contact me.

Best, Yuzhu

enricoferrero commented 4 years ago

OK, I can confirm a format such as perturbagen|dose|time__cell__pert_type works.

I also think the second warning I reported above:

2: In sign(ncs_query_list[[x]]) * 100/ncol(tmpDF) * rowSums(abs(tmpDF) <  ... :
  longer object length is not a multiple of shorter object length

applies to the Tau calculation, not to the fact that there are duplicate column names.

Thanks again for considering another framework for the naming structure - in the meanwhile the workaround appears to be working!

P.S.: I have another problem but I'll open a separate issue for that.

yduan004 commented 4 years ago

Yes, you are right, the warning information comes from the Tau calculation.

Some Tau scores are NA because the lincsEnrich function (not exported) defines a parameter minTauRefSize = 500. If the number of signatures in Qref that match the cell line of signature r (the TauRefSize column in the GESS result) is less than 500, Tau is set to NA, since the sample is deemed too small to compute a meaningful Tau score.
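To get a feel for which cell lines are likely to produce NA Tau values, one could count reference signatures per cell line from the column names. The sketch below is base R only and does not call lincsEnrich; the reference names are made up, and it assumes that TauRefSize corresponds to the number of reference signatures sharing a cell line, with the threshold of 500 taken from the explanation above.

```r
# Threshold quoted above for minTauRefSize.
min_tau_ref_size <- 500L

# Made-up reference column names: 600 MCF7 signatures and 40 PC3 signatures.
ref_names <- c(
  paste0("drug", 1:600, "__MCF7__trt_cp"),
  paste0("drug", 1:40, "__PC3__trt_cp")
)

# The cell line is the second "__"-separated field of each name.
cells <- vapply(strsplit(ref_names, "__", fixed = TRUE),
                function(p) p[2], character(1))

# Count signatures per cell line and list the ones below the threshold.
sigs_per_cell <- table(cells)
small_cells <- names(sigs_per_cell)[sigs_per_cell < min_tau_ref_size]
print(sigs_per_cell)
print(small_cells)
```

Under that assumption, results for the cell lines listed in small_cells would be expected to carry NA in the Tau column.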

Thanks, Yuzhu

girke-lab / signatureSearch

Error in vapply(seq_along(esout), function(i) unlist(strsplit(as.character(names(esout)[i]) #2