Closed lgatto closed 3 years ago
I personally like the joinSpectraData,Spectra
as it tells exactly what we want to do - we want to add (columns) to the spectraData
and leave the mz
and intensity
values untouched.
And if possible, I would also implement that only for Spectra
, and not in the backend. Implementing a backend from scratch turns out to be anyway quite extensive and I would not like to require even more functions to be implemented. The joinSpectraData,Spectra
could e.g. use the $<-
from the backend to add the columns.
I would also implement that only for
Spectra
I am not sure. I think we would need a default method for all backends that store their spectraData
in memory (currently all but MsBackendSqlDb
, but we will need backend-specific implementations when the spectraData
lives on disk.
A temporary solution is available in the joinSpectraData
package.
What I learned from implementing the MsBackendMassbankSql
was that we have already a very large number of methods to implement (if the backend is not inheriting from e.g. MsBackendDataFrame
). This was my reasoning why I would add new methods to the backend only as a last possibility. I thought that having already the $<-,MsBackend
would be sufficient. It would not be ideal because one would have to add columns one by one, but it should work.
I totally agree that priority should to implement methods at the Spectra
.
Adding columns by columns isn't practical in practice. However, if I understand your suggestion, would be to implement to wrap the 'add one column at at time' approach in a joinSpectraData,Spectra,DataFrame
method. This might actually be a good solution, given that we don't have a merge,DataFrame,DataFrame
yet. I'll have a go at this.
Not sure if a column by column would be feasible, given that initialising a new column
spectraData(x)[["foo"]] <- 1
is very slow.
On the other hand, a quick and dirty x@backend@spectraData[["bar"]] <- 1
is instantaneous :-)
Update Ok, what I think I need here is not $<-
, but [[<-
, so that I can use a string to refer to the variable name:
x[["foo"]]
x[["foo"]] <- 1
Adding columns by columns isn't practical in practice. However, if I understand your suggestion, would be to implement to wrap the 'add one column at at time' approach in a joinSpectraData,Spectra,DataFrame method. This might actually be a good solution, given that we don't have a merge,DataFrame,DataFrame yet. I'll have a go at this.
That's exactly what I meant.
Yes, spectraData(x)[["foo"]] <- 1
will be slow because it replaces the whole spectra data.
Regarding the [[<-
, I guess it should be possible to use the same code as $<-
internally (or vice versa?).
We could then have a default implementation for MsBackend
so that this will not have to be implemented for all backends.
Do you want to implement the [[<-,MsBackend
or should I give it a try @lgatto ?
The problem is that I can't use $<-
programmatically, I need [[<-
. I won't have time today, so feel tree to do it. The implementation of the two can be the same.
OK, I'm on it. PR will follow soon.
Adding new spectra variables is surprisingly difficult. Here's a detailed and reproducible example of why it is difficult and what we need:
The test data
Easy case
It is straightforward to add a single new column of the right length.
Less easy
We now have data for some rows only. Here, I need to initialise the column with the correct type and default value.
Not easy
Here's some real data to be merged:
spectrumId
orMS.GF.RawScore
variable is like the example above.sequence
, I want to preserve theCharacterList
class (relevant for later):Looks easy but...
Adding one column at a time isn't really an option, so let's try to merge tables. I'm going to use the
merge
(dplyr::left_join()
would also apply).Even though this looks good, the type of the
sequence
variable has been changed fromCharacterList
tolist
, which isn't desirable.spectraData(ms)
,id
and the resultingx
are all of classDFrame
, and even though there is an dedicatedmerge
method, is doesn't do what we need because it converts theDFrame
s todata.frame
s:Update: we can coerce the
list
columns back to what we want withConclusions
merge
that preserves the column types (see the question on the bioc-devel list and this Github issue).joinSpectraData()
function that updates thespectraData()
and handles specific backends, rather than replying on manually `spectraData(x) <- x.For point 1, may be first thing would be to ask on the Bioconductor devel list or slack, as I may be missing something.