Open lgatto opened 7 years ago
Also, as a side-effect, I think that this new implementation is simpler and will lead to simplifications in the code.
I agree that the current implementation is very limited. I really like your idea to support multiple pranges
but I don't like using mcols(x@aa)
for it. It would break the line-by-line relation of sequences and their metadata. Also the use of .get_pranges_indices
seems error-prone.
pranges
is a central feature in Pbase
that's why I think it earns an own slot and we could turn the pranges
slot into an environment
or a list
. This would be similar to your suggestion but without the need for .get_pranges_indices
.
BTW: Do we need an CPU R style guide? I personally don't like the Hadley's snake_case. In general we use the Bioc Coding Style, or?
@lgatto what kind of code simplification do you mean? Regardless of an own pranges
slot or as additional element in mcols(x@aa)
we need to ensure the order and naming of the (Compressed)IRangesList
. The only simplification I could think of would be simpler subsetting operators (because we don't have to subset x@aa
and x@pranges
but IMHO that is negligible).
The code simplifications is that all is contained into a single DataFrame
. This is a well-defined/written data structure maintained by Bioc-core. Having the AAStringSet
's metadata columns and and the pranges
as columns takes care of many of the validity implicitly and addition of new pranges
. For example, adding a new pranges
IRangesList will always match the number of row/proteins. And, as you say, subsetting becomes trivial. We still need to make sure names match, of course, but that's easy too:
stopifnot(all.equal(names(newPranges), names(object))
object@mcols <- cbind(object@mcols, DataFrame(newPranges))
## or, of more efficient
object@mcols[, "newPrangeName"] <- newPranges
I agree that pranges
are important; it's the most important feature of the class/package. But I don't think this needs to be reflected in the implementation. Having an easy way (code) to discriminate the pranges
columns and aa
's metadata exposes this importance.
I think that the simple .get_pranges_indices
is actually quite good. It's simple (that's very good) and I believe the class of the column is a good way to define it it's a valid prange
- could you think of a case where a sequence metadata would be a IRangesList
other than describing ranges for each sequence? This is the definition of a prange
. It would also be possible to have out own PRangesList
class that is a re-branded IRangesList
to avoid to rely on the above assumption.
I don't understand this
It would break the line-by-line relation of sequences and their metadata.
on the contrary, it enforces it by design for the sequence metadata and pranges
. Each pranges
have their own metadata, of course.
Also, I don't agree with the suggestion of having them in a list
or environment
. A DataFrame
is the prefect data structure for this. We could have 2 DataFrames
, with the same row names, but then, why not combine them?
I understand your point and fully agree that using robust and heavily tested bioc classes is better than inventing our own data structure.
In my understanding a (Annotated)DataFrame
is a table like structure and each row correspond to a single sequence (that was my intention of line-by-line relation). Adding a (Compressed)IRangesList
as a DataFrame
column would create a row (the corresponding IRangesList
element) with multiple entries (IRanges
).
But in fact a data.frame
is just a named list
so here is no real point to argue about. I am fine with your suggestion.
I am confused when you say
Adding a
(Compressed)IRangesList
as aDataFrame
column would create a row (the correspondingIRangesList
element) with multiple entries (IRanges
).
Here is how I see it: adding an IRangesList
to the DataFrame
corresponds to adding a columns (an item to the list
, as you point out, with the right length), which, from our point of view, adds an IRanges
to each row/protein. This exactly match your line-by-line relation with each row corresponding to a single sequence.
proteins | mcols | IRangesList |
IRangesList |
---|---|---|---|
Sequence_1 | .... | IRranges _1 |
IRranges _1 |
Sequence_2 | .... | IRranges _2 |
IRranges _2 |
Sequence_n | .... | IRranges _n |
IRranges _n |
Sorry for causing confusion, I meant:
proteins | mcols | IRangesList |
IRangesList |
---|---|---|---|
Sequence_1 | .... | IRranges _1 |
IRranges _1 |
- Range1 1 | - Range1 1 | ||
- Range1 2 | - Range1 2 | ||
- Range1 n | - Range1 n | ||
Sequence_2 | .... | IRranges _2 |
IRranges _2 |
Sequence_n | .... | IRranges _n |
IRranges _n |
But that is not important. It was just my own confusion about the representation of a tree-like list
structure and a table-like data.frame
.
After thinking twice I totally agree with your approach.
good that you did all the discussion already :) I like the idea very much of putting the IRangesList
into the mcols
DataFrame
- no more checking that the mapping protein to ranges fits.
Did you check the code already in? Did you test adding additional pranges?
Going to push soon. Before adding code to add pranges
, I'll make sure that the old code base works.
I will also document here other updates related to the new Proteins
implementation:
pranges
, acols
, pcols
, ... pranges
pranges
to access/use then: pranges
, proteotypic
, ... (see #40)See also #41 for more details about the API
As an info: since https://github.com/ComputationalProteomicsUnit/Pbase/commit/12749cdfae727e9000294fb20b82a46323c1ae6c the Proteins,EnsDb
method fetches proteins and puts the protein domains into the mcols(aa(x))
. Works nicely. Like that.
The issue
Currently, we don't do a good job supporting multiple types of
pranges
. We have on one side an example werepranges
store peptides along proteins. With the newEnsDb
access to protein data, one can easily add other information along the protein sequences, such as protein domain (see thePbase-with-ensembldb
vignette). But, there is no good way to combine these different ranges into a singleProteins
object.Suggested solution
Make the current
pranges
(Compressed)IRangesList
a column ofmcols(x@aa)
. This then allows to easily add additionalIRangesList
(basically extendingmcols(x@aa)
oracols(x)
).Suggested implementation
Example
Expected new functionality
Adding ranges
Any comments/suggestions @ComputationalProteomicsUnit/developer-pbase