waldronlab / MultiAssayExperiment

Bioconductor package for management of multi-assay data
https://waldronlab.io/MultiAssayExperiment/

wideFormat Variable Identifiability #233

Closed DarioS closed 6 years ago

DarioS commented 6 years ago

wideFormat identifies the variables in the generated table by concatenating identifiers, by default. To do further analysis or plotting, string splitting may be necessary to be able to identify which variable and which experiment is represented by a particular column. Perhaps this information could be made automatically available. To illustrate my idea:

> wideTable
DataFrame with 5 rows and 5 columns
   primary         class RNA_Gene1 RNA_Gene_Z Protein_Gene1
  <factor>   <character> <integer>  <integer>     <integer>
1 Sample 1     Recovered         1          6            10
2 Sample 2     Recovered         2          7             9
3 Sample 3     Recovered         3          8             8
4 Sample 4 Non-responder         4          9             7
5 Sample 5 Non-responder         5         10             6
> attr(wideTable, "varInfo")
  column   table variable
1      1    <NA>     <NA>
2      2 colData    class
3      3     RNA    Gene1
4      4     RNA   Gene_Z
5      5 Protein    Gene1

Also, some genes in popular databases like GENCODE Genes have underscores in their symbols, such as HOTTIP_1, Hammerhead_HH9, HOTAIR_2. This makes it harder to identify the table and variable of each column from the wide format identifiers.
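The ambiguity can be sketched in base R. The first three column names come from the illustration above; `RNA_HOTAIR_2` is a hypothetical column added here to stand in for a GENCODE symbol that contains an underscore.

```r
# Column names as they would look with "_" as the delimiter.
# "RNA_HOTAIR_2" is a hypothetical column for the symbol HOTAIR_2.
wideNames <- c("RNA_Gene1", "RNA_Gene_Z", "Protein_Gene1", "RNA_HOTAIR_2")

# Naive split on every underscore: the number of pieces varies per column,
# so the pieces cannot be read off as (experiment, variable) directly.
pieces <- strsplit(wideNames, "_", fixed = TRUE)
lengths(pieces)

# Splitting only at the first underscore recovers these names, but only
# because none of the experiment names here contains an underscore itself.
experiment <- sub("_.*$", "", wideNames)
variable <- sub("^[^_]*_", "", wideNames)
data.frame(column = wideNames, table = experiment, variable = variable)
```

The first-underscore workaround breaks as soon as an experiment name also contains an underscore, which is why metadata stored alongside the table is more robust than parsing the names.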

LiNk-NY commented 6 years ago

Hi Dario, @DarioS Thanks for the suggestion. It is possible to use mcols for the DataFrame to provide this data. I will look into it.

Regards, Marcel

lwaldron commented 6 years ago

Agreed, no harm and some utility in storing this in mcols. For the wideFormat delimiter, would it make sense to expose the collapse argument?

wideFormat(object, collapse="_", ...) ?
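As a rough illustration of why a configurable delimiter helps, the sketch below uses a hypothetical helper, `makeWideNames()`, which is not part of MultiAssayExperiment. It only assumes that the names are built by pasting the experiment and feature names together with the chosen `collapse` string.

```r
# Hypothetical helper (not part of MultiAssayExperiment): build wide-format
# column names from experiment and feature names with a chosen delimiter.
makeWideNames <- function(experiment, feature, collapse = "_") {
  paste(experiment, feature, sep = collapse)
}

experiment <- c("RNA", "RNA", "Protein")
feature <- c("Gene1", "Gene_Z", "Gene1")

# A delimiter that does not occur in any identifier keeps the split unambiguous.
wideNames <- makeWideNames(experiment, feature, collapse = "|")
parts <- do.call(rbind, strsplit(wideNames, "|", fixed = TRUE))
colnames(parts) <- c("table", "variable")
parts
```

The caller still has to pick a delimiter that is absent from every identifier, so this complements the metadata approach rather than replacing it.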

LiNk-NY commented 6 years ago

I've added the feature you talked about. Regards, Marcel

DarioS commented 6 years ago

Thank you. This will be useful. The third column seems unnecessary or incorrect because it's all NA. The example

# Load the package first: DataFrame() must be available before it is called.
library(MultiAssayExperiment)

Names <- c("John", "Sam", "Ed")
Symbols <- c("BRAF", "NRAS", "PTEN")
RNA <- matrix(1:9, ncol = 3)
rownames(RNA) <- Symbols
colnames(RNA) <- Names
protein <- matrix(1:4, ncol = 2)
rownames(protein) <- Symbols[1:2]
colnames(protein) <- Names[1:2]
clinical <- DataFrame(status = c("Alive", "Alive", "Dead"), row.names = Names)

test <- MultiAssayExperiment(list(RNA = RNA, Protein = protein), clinical)
testTable <- wideFormat(test, colDataCols = "status")

obtains the result

> mcols(testTable)
DataFrame with 7 rows and 3 columns
   sourceName  rowname  colname
  <character> <factor> <factor>
1 colDataRows       NA       NA
2 colDataCols       NA       NA
3         RNA     BRAF       NA
4         RNA     NRAS       NA
5         RNA     PTEN       NA
6     Protein     BRAF       NA
7     Protein     NRAS       NA

Should mcols(testTable)[2, 3] be status or should there be no colname column?

LiNk-NY commented 6 years ago

Hi Dario, @DarioS This column is for cases where replicates are present. I will update the hard-coded value in the code.

lwaldron commented 6 years ago

Why not fill the colname column whether or not there are replicates present?

LiNk-NY commented 6 years ago

Yes, that's what the update does. Update: After re-reading this, I'm not quite sure what you meant :sweat_smile:

DarioS commented 6 years ago

Using the example presented previously, I find that the latest version drops the colname column rather than it containing values.

> mcols(testTable)
DataFrame with 7 rows and 2 columns
   sourceName  rowname
  <character> <factor>
1 colDataRows       NA
2 colDataCols       NA
3         RNA     BRAF
4         RNA     NRAS
5         RNA     PTEN
6     Protein     BRAF
7     Protein     NRAS

LiNk-NY commented 6 years ago

Hi Dario, @DarioS

That's because there are no replicates present.

I didn't quite catch @lwaldron 's point about filling the column in the way that he was thinking about it.

The name is split into 3 pieces when replicates are present: assay_rowname_colname

When no replicates are present, the name looks like the following: assay_rowname

These "pieces" correspond to the number of columns in mcols(testTable).
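A minimal sketch of that correspondence, written so that the result always has three columns, is shown below. `splitWideNames()` is a hypothetical parser, not the package's implementation, and it assumes that "_" never occurs inside an assay, row or column name.

```r
# Hypothetical parser: split names into at most three pieces and pad with NA
# so the result always has sourceName, rowname and colname columns.
# Assumes "_" does not occur inside any assay, row or column name.
splitWideNames <- function(wideNames, collapse = "_") {
  pieces <- strsplit(wideNames, collapse, fixed = TRUE)
  padded <- lapply(pieces, function(x) c(x, rep(NA_character_, 3))[1:3])
  out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
  names(out) <- c("sourceName", "rowname", "colname")
  out
}

# Without replicates: two pieces, so colname is NA.
splitWideNames(c("RNA_BRAF", "Protein_NRAS"))
# With replicates: three pieces ("John.rep2" is a made-up assay column name).
splitWideNames(c("RNA_BRAF_John", "RNA_BRAF_John.rep2"))
```

Padding with NA keeps the number of metadata columns constant whether or not replicates are present.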

-M

lwaldron commented 6 years ago

My point was, even if colname is no longer needed to disambiguate when there are no replicates present, why not always have sourceName rowname colname in mcols(testTable)? Seems like a stricter contract, no harm, and additional information, unless there are technical challenges to doing it.

LiNk-NY commented 6 years ago

Yes, I was previously keeping all columns in the resulting mcols.

Considering that most of the time there are few replicate samples, it may be better to include this column only when replicates are present.

lwaldron commented 6 years ago

IIRC, previously you were keeping all columns, but the colnames column was filled with NA unless there was at least one replicate present.

LiNk-NY commented 6 years ago

That's right but it seems like most datasets will not need that column due to low frequency of replicates.

lwaldron commented 6 years ago

My point was that there may be other reasons for wanting to know the colnames besides disambiguating replicates. Just as an example, in TCGA the assay colnames are barcodes that could tell you about tissue types, sequencing centers, and batches in which the sample was assayed. Who knows how users might use the colnames data, but recovering them is non-trivial once they're dropped. I'm not strongly arguing for keeping the colnames, just asking why not keep them around?

DarioS commented 6 years ago

Both of those implementations are workable, but having a constant number of columns makes interacting with the table easier. For example, to extract the feature name of each column:

feature <- mcols(wideTable)[, "rowname"]
feature[is.na(feature)] <- mcols(wideTable)[is.na(feature), "colname"]

If the colname column weren't always present in the column metadata, the code would be:

feature <- mcols(wideTable)[, "rowname"]
# Braces keep the else attached to the if when run at top level.
if ("colname" %in% colnames(mcols(wideTable))) {
  feature[is.na(feature)] <- mcols(wideTable)[is.na(feature), "colname"]
} else {
  feature[is.na(feature)] <- colnames(wideTable)[is.na(feature)]
}

lwaldron commented 6 years ago

It's cyclomatic complexity! It does seem like a good argument to keep the contract strict by providing the colnames column and always placing the ExperimentList colnames in them, whether or not there are any replicates.

LiNk-NY commented 6 years ago

Hi Levi, @lwaldron I understood that you want to populate the colname column regardless of presence of replicates. This can't happen when multiple "primary" entries populate the particular wideFormat column.

The colname in mcols can only be used to disambiguate when there are replicates because it will correspond to a single "primary" row (1 to 1 relationship).

Otherwise, there is no way to document the "colname" column since each row in the wideFormat DataFrame will indicate a "primary" sample (1 to many relationship).

Thus, the corresponding mcol annotation would have to point to 5 different samples and this is not possible in a "tidy" way anyway.

> wideDF[1:5, 1:3]
       primary   COAD_RNASeq2GeneNorm-20160128_A1BG   COAD_RNASeq2GeneNorm-20160128_A1CF
   <character>                            <numeric>                            <numeric>
1 TCGA-3L-AA1B                                   NA                                   NA
2 TCGA-4N-A93T                                   NA                                   NA
3 TCGA-4T-AA8H                                   NA                                   NA
4 TCGA-5M-AAT6                                   NA                                   NA
5 TCGA-A6-2671                               25.418                              10.8359
> mcols(wideDF[1:5, 1:3])
DataFrame with 3 rows and 3 columns
                     sourceName  rowname  colname
                    <character> <factor> <factor>
1                   colDataRows       NA       NA
2 COAD_RNASeq2GeneNorm-20160128     A1BG       NA
3 COAD_RNASeq2GeneNorm-20160128     A1CF       NA
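The one-to-many relationship described above can be sketched with toy data in base R. Every identifier below is invented for illustration, and `sampleMapToy` only imitates the idea of a sample map; it is not produced by the package.

```r
# Toy assay with one feature and three assay columns (invented barcodes).
rna <- matrix(c(5.1, 6.3, 4.8), nrow = 1,
              dimnames = list("BRAF", c("John-01A", "Sam-01A", "Ed-01A")))

# Toy sample map: which assay column belongs to which primary sample.
sampleMapToy <- data.frame(primary = c("John", "Sam", "Ed"),
                           colname = colnames(rna),
                           stringsAsFactors = FALSE)

# One wide-format column, RNA_BRAF, holds one value per primary sample...
wide <- data.frame(primary = sampleMapToy$primary,
                   RNA_BRAF = unname(rna["BRAF", sampleMapToy$colname]))

# ...so its values come from as many assay columns as there are samples,
# and no single colname can annotate the wide column as a whole.
contributing <- sampleMapToy$colname[match(wide$primary, sampleMapToy$primary)]
length(unique(contributing))
```

Only when a replicate is given a wide column of its own does that column trace back to exactly one assay column, which is the case where a single colname entry is meaningful.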

I have implemented the permanent solution of always having a "colname" column in mcols(wideDF). I will push this soon.

Regards, Marcel

DarioS commented 6 years ago

I didn't previously realise that there are complexities which restrict the contents of the column names information. However, for my example above, the row names are now also all NA. It seems like the presence of clinical data has caused too much of the metadata to be removed in the latest design.

> mcols(testTable)
DataFrame with 7 rows and 3 columns
      sourceName  rowname  colname
     <character> <factor> <factor>
1    colDataRows       NA       NA
2    colDataCols       NA       NA
3     RNA...BRAF       NA       NA
4     RNA...NRAS       NA       NA
5     RNA...PTEN       NA       NA
6 Protein...BRAF       NA       NA
7 Protein...NRAS       NA       NA

LiNk-NY commented 6 years ago

Hi Dario, @DarioS

Thanks for checking this, it's actually a bug with check.names. I will fix it shortly.

It should look like this:

> mcols(testTable)
DataFrame with 7 rows and 3 columns
   sourceName  rowname  colname
  <character> <factor> <factor>
1 colDataRows       NA       NA
2 colDataCols       NA       NA
3         RNA     BRAF       NA
4         RNA     NRAS       NA
5         RNA     PTEN       NA
6     Protein     BRAF       NA
7     Protein     NRAS       NA
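One way names such as `RNA...BRAF` can arise is sketched below with base R's `make.names()`, which is the sanitising applied when `check.names = TRUE`. The "///" delimiter is an assumption made purely for illustration; the thread does not say what the package uses internally.

```r
# Assumption for illustration only: the assay and row names are joined
# internally with a delimiter that is not valid in an R name ("///" here).
internalNames <- c("RNA///BRAF", "Protein///NRAS")

# make.names() replaces every invalid character with a dot.
mangled <- make.names(internalNames)
mangled

# Splitting on the original delimiter no longer finds anything to split on,
# so the row name piece is lost and would surface as NA.
lengths(strsplit(mangled, "///", fixed = TRUE))
lengths(strsplit(internalNames, "///", fixed = TRUE))
```

Under this assumption, splitting the names before they are sanitised, or sanitising only the final column names, would keep the rowname piece intact.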

Regards, Marcel