Closed DarioS closed 6 years ago
Hi Dario, @DarioS
Thanks for the suggestion. It is possible to use mcols for the DataFrame to provide this data.
I will look into it.
Regards, Marcel
Agreed, no harm and some utility in storing this in mcols. For the wideFormat delimiter, would it make sense to expose the collapse argument, as in wideFormat(object, collapse="_", ...)?
I've added the feature you talked about. Regards, Marcel
Thank you. This will be useful. The third column seems unnecessary or incorrect because it's all NA. The example
# Load the package first: DataFrame() below is only available once it is attached
library(MultiAssayExperiment)
Names <- c("John", "Sam", "Ed")
Symbols <- c("BRAF", "NRAS", "PTEN")
# RNA: 3 genes x 3 samples
RNA <- matrix(1:9, ncol = 3)
rownames(RNA) <- Symbols
colnames(RNA) <- Names
# Protein: 2 genes x 2 samples
protein <- matrix(1:4, ncol = 2)
rownames(protein) <- Symbols[1:2]
colnames(protein) <- Names[1:2]
clinical <- DataFrame(status = c("Alive", "Alive", "Dead"), row.names = Names)
test <- MultiAssayExperiment(list(RNA = RNA, Protein = protein), clinical)
testTable <- wideFormat(test, colDataCols = "status")
obtains the result
> mcols(testTable)
DataFrame with 7 rows and 3 columns
   sourceName  rowname  colname
  <character> <factor> <factor>
1 colDataRows       NA       NA
2 colDataCols       NA       NA
3         RNA     BRAF       NA
4         RNA     NRAS       NA
5         RNA     PTEN       NA
6     Protein     BRAF       NA
7     Protein     NRAS       NA
Should mcols(testTable)[2, 3] be status or should there be no colname column?
Hi Dario, @DarioS
This column is for cases where replicates are present. I will update the hard-coded value in the code.
Why not fill the colname column whether or not there are replicates present?
Yes, that's what the update does. Update: After re-reading this, I'm not quite sure what you meant :sweat_smile:
Using the example presented previously, I find that the latest version drops the colname column rather than it containing values.
> mcols(testTable)
DataFrame with 7 rows and 2 columns
   sourceName  rowname
  <character> <factor>
1 colDataRows       NA
2 colDataCols       NA
3         RNA     BRAF
4         RNA     NRAS
5         RNA     PTEN
6     Protein     BRAF
7     Protein     NRAS
Hi Dario, @DarioS
That's because there are no replicates present.
I didn't quite catch @lwaldron 's point about filling the column in the way that he was thinking about it.
The name is split into 3 pieces when replicates are present:
assay_rowname_colname
When no replicates are present, the name looks like the following:
assay_rowname
These "pieces" correspond to the number of columns in mcols(testTable).
-M
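As a rough illustration of the naming scheme described above, here is a minimal base R sketch. The helper makeWideName is hypothetical and is not part of MultiAssayExperiment; it only assumes that the pieces are joined with a collapse string and that the colname piece is skipped when it is NA.

```r
# Hypothetical helper (not a MultiAssayExperiment function): joins the
# pieces of a wide-format column name, skipping any piece that is NA.
makeWideName <- function(assay, rowname, colname = NA, collapse = "_") {
  pieces <- c(assay, rowname, colname)
  paste(pieces[!is.na(pieces)], collapse = collapse)
}

# No replicates: two pieces (assay_rowname)
twoPieces <- makeWideName("RNA", "BRAF")

# Replicates present: three pieces (assay_rowname_colname);
# "John.rep2" is a made-up replicate column name
threePieces <- makeWideName("RNA", "BRAF", "John.rep2")

print(twoPieces)
print(threePieces)
```

Under these assumptions the number of pieces in the name matches the number of populated columns one would expect in the column metadata.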
My point was, even if colname is no longer needed to disambiguate when there are no replicates present, why not always have sourceName, rowname and colname in mcols(testTable)? Seems like a stricter contract, no harm, and additional information, unless there are technical challenges to doing it.
Yes, I was previously keeping all columns in the resulting mcols.
Considering that most of the time there are few replicate samples, it may be better to fill in this column in the case that they are present.
IIRC, previously you were keeping all columns, but the colnames column was filled with NA unless there was at least one replicate present.
That's right but it seems like most datasets will not need that column due to low frequency of replicates.
My point was that there may be other reasons for wanting to know the colnames besides disambiguating replicates. Just as an example, in TCGA the assay colnames are barcodes that could tell you about tissue types, sequencing centers, and batches in which the sample was assayed. Who knows how users might use the colnames data, but recovering them is non-trivial once they're dropped. I'm not strongly arguing for keeping the colnames, just asking why not keep them around?
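To make the barcode point concrete, here is a small base R sketch of pulling fields out of a TCGA-style barcode. The barcode below is illustrative only (its trailing fields are made up), and the field positions assume the usual dash-separated layout of project, tissue source site, participant, sample and vial, portion and analyte, plate, and center.

```r
# Illustrative barcode only; the trailing fields are invented for this sketch.
barcode <- "TCGA-A6-2671-01A-01R-1410-07"

# Split on the literal dash separator
fields <- strsplit(barcode, "-", fixed = TRUE)[[1]]

# The first three fields identify the participant (the "primary" ID)
participant <- paste(fields[1:3], collapse = "-")

# The first two characters of the fourth field are the sample type code
sampleType <- substr(fields[4], 1, 2)

# The last field is the center code
center <- fields[7]

print(c(participant = participant, sampleType = sampleType, center = center))
```

None of this can be recovered from the wide table once only the participant-level identifier is kept.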
Both of those implementations are workable, but having a constant number of columns makes interacting with the table easier. For example, to extract the feature name of each column:
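# as.character() avoids assigning values that are not existing factor levels
feature <- as.character(mcols(wideTable)[, "rowname"])
feature[is.na(feature)] <- as.character(mcols(wideTable)[is.na(feature), "colname"])
If the colname column weren't always present in the column metadata, the code would be:
feature <- as.character(mcols(wideTable)[, "rowname"])
missing <- is.na(feature)
# Braces keep the else attached to the if when run at top level
if ("colname" %in% colnames(mcols(wideTable))) {
  feature[missing] <- as.character(mcols(wideTable)[missing, "colname"])
} else {
  feature[missing] <- colnames(wideTable)[missing]
}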
It's cyclomatic complexity! It does seem like a good argument to keep the contract strict by providing the colname column and always placing the ExperimentList colnames in it, whether or not there are any replicates.
Hi Levi, @lwaldron
I understood that you want to populate the colname column regardless of the presence of replicates. This can't happen when multiple "primary" entries populate the particular wideFormat column.
The colname in mcols can only be used to disambiguate when there are replicates because it will correspond to a single "primary" row (1 to 1 relationship).
Otherwise, there is no way to document the "colname" column since each row in the wideFormat DataFrame will indicate a "primary" sample (1 to many relationship).
Thus, the corresponding mcol annotation would have to point to 5 different samples and this is not possible in a "tidy" way anyway.
> wideDF[1:5, 1:3]
       primary COAD_RNASeq2GeneNorm-20160128_A1BG COAD_RNASeq2GeneNorm-20160128_A1CF
   <character>                          <numeric>                          <numeric>
1 TCGA-3L-AA1B                                 NA                                 NA
2 TCGA-4N-A93T                                 NA                                 NA
3 TCGA-4T-AA8H                                 NA                                 NA
4 TCGA-5M-AAT6                                 NA                                 NA
5 TCGA-A6-2671                             25.418                            10.8359
> mcols(wideDF[1:5, 1:3])
DataFrame with 5 rows and 3 columns
                     sourceName  rowname  colname
                    <character> <factor> <factor>
1                   colDataRows       NA       NA
2 COAD_RNASeq2GeneNorm-20160128     A1BG       NA
3 COAD_RNASeq2GeneNorm-20160128     A1CF       NA
I have implemented the permanent solution of always having a "colname" column in mcols(wideDF).
I will push this soon.
Regards, Marcel
I didn't previously realise that there are complexities which restrict the contents of the column names information. However, for my example above, the row names are now also all NA. It seems like the presence of clinical data has caused too much of the metadata to be removed in the latest design.
> mcols(testTable)
DataFrame with 7 rows and 3 columns
      sourceName  rowname  colname
     <character> <factor> <factor>
1    colDataRows       NA       NA
2    colDataCols       NA       NA
3     RNA...BRAF       NA       NA
4     RNA...NRAS       NA       NA
5     RNA...PTEN       NA       NA
6 Protein...BRAF       NA       NA
7 Protein...NRAS       NA       NA
Hi Dario, @DarioS
Thanks for checking this. It's actually a bug with check.names.
I will fix it shortly.
It should look like this:
> mcols(testTable)
DataFrame with 7 rows and 3 columns
   sourceName  rowname  colname
  <character> <factor> <factor>
1 colDataRows       NA       NA
2 colDataCols       NA       NA
3         RNA     BRAF       NA
4         RNA     NRAS       NA
5         RNA     PTEN       NA
6     Protein     BRAF       NA
7     Protein     NRAS       NA
Regards, Marcel
By default, wideFormat identifies the variables in the generated table by concatenating identifiers. For further analysis or plotting, string splitting may be necessary to identify which variable and which experiment a particular column represents. Perhaps this information could be made automatically available. To illustrate my idea:
Also, some genes in popular databases like GENCODE Genes have underscores in their symbols, such as HOTTIP_1, Hammerhead_HH9, HOTAIR_2. This makes it harder to identify the table and variable of each column from the wide format identifiers.
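The underscore problem can be shown with a short base R sketch. The column names and the lookup table below are made-up data that only mimic the sourceName and rowname layout discussed in this thread; a plain data.frame stands in for the column metadata.

```r
# A wide-format style name built from assay "RNA" and gene "HOTTIP_1"
wideName <- "RNA_HOTTIP_1"

# Splitting on the delimiter cannot tell the gene's own underscore apart
# from the separator, so three pieces come back instead of two.
pieces <- strsplit(wideName, "_", fixed = TRUE)[[1]]
print(pieces)

# Hypothetical lookup table keyed by the wide column name
meta <- data.frame(
  sourceName = c("RNA", "RNA"),
  rowname = c("HOTTIP_1", "BRAF"),
  row.names = c("RNA_HOTTIP_1", "RNA_BRAF"),
  stringsAsFactors = FALSE
)

# With the metadata available, no string splitting is needed
gene <- meta[wideName, "rowname"]
assay <- meta[wideName, "sourceName"]
print(c(assay = assay, gene = gene))
```

Looking the pieces up rather than parsing them out of the name avoids any dependence on which delimiter was used.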